Statistical learning models of early phonetic acquisition struggle with child-centered audio data (2022)

In this work, we trained a self-supervised representation learning model (Contrastive Predictive Coding) on:

  1. child-centered long-form recordings (acquired with child-worn microphones)

  2. audiobooks (clean read-speech commonly used in self-supervised representation learning)

  3. simulated long-forms, that is, audiobooks contaminated with additive noise and reverberation.

Below, you’ll find some of the audio samples that the model is exposed to during training.

Child-centered long-form recordings

Audio samples extracted from child-centered long-form recordings:

Female speech (storytelling):
Male speech (storytelling):
Female speech (with cries):
Male speech (in the car):
Female speech (conversation):
Male speech (conversation):


Audiobooks

Audio samples extracted from audiobooks:

Female speech (reading):
Female speech (reading):

Simulated long-forms

The same audiobook samples, contaminated with additive noise and reverberation to simulate the challenging acoustic conditions found in long-forms:

Contaminated female speech (reading):
Contaminated female speech (reading):
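
To make the contamination step concrete, here is a minimal sketch of how clean speech could be reverberated and then mixed with noise at a chosen signal-to-noise ratio. It is an illustration under stated assumptions, not the pipeline used in this work: the function names, the synthetic impulse response, the 0.3 s decay time, the 5 dB SNR, and the order of operations (reverberation first, then noise) are all hypothetical choices, and random arrays stand in for real audio.

```python
import numpy as np


def add_reverb(speech, impulse_response):
    """Convolve speech with a room impulse response, keeping the input length."""
    return np.convolve(speech, impulse_response)[: len(speech)]


def add_noise(speech, noise, snr_db):
    """Mix noise into speech so that the speech-to-noise power ratio is snr_db."""
    noise = np.resize(noise, len(speech))  # loop or trim the noise to match
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


def synthetic_impulse_response(rt60, sample_rate, rng):
    """Exponentially decaying white noise: a crude stand-in for a measured
    room impulse response (assumption, for illustration only)."""
    n = int(rt60 * sample_rate)
    t = np.arange(n) / sample_rate
    decay = np.exp(-np.log(1000.0) * t / rt60)  # amplitude falls 60 dB at rt60
    ir = rng.standard_normal(n) * decay
    ir[0] = 1.0  # direct path
    return ir / np.sqrt(np.sum(ir ** 2))  # unit energy


rng = np.random.default_rng(0)
sample_rate = 16000

# Placeholders: in practice these would be waveforms loaded from audio files.
clean = rng.standard_normal(sample_rate)       # 1 s of "audiobook" speech
noise = rng.standard_normal(sample_rate // 2)  # 0.5 s of background noise

ir = synthetic_impulse_response(rt60=0.3, sample_rate=sample_rate, rng=rng)
reverberant = add_reverb(clean, ir)
simulated = add_noise(reverberant, noise, snr_db=5.0)
```

Here the SNR is measured against the reverberant speech rather than the dry signal. A real setup would more likely draw from recorded noise and measured impulse responses instead of the synthetic ones above.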