Statistical learning models of early phonetic acquisition struggle with child-centered audio data (2022)
In this work, we trained a self-supervised representation learning model (Contrastive Predictive Coding) on:
child-centered long-form recordings (acquired with child-worn microphones)
audiobooks (clean read-speech commonly used in self-supervised representation learning)
simulated long-form recordings, i.e., audiobooks contaminated with additive noise and reverberation.
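The third condition, contaminating clean read speech with additive noise and reverberation, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the function name `contaminate`, the target SNR, and the use of a generic room impulse response are assumptions for the example.

```python
import numpy as np
from scipy.signal import fftconvolve

def contaminate(clean, noise, rir, snr_db=10.0):
    """Add noise at a target SNR, then apply reverberation via RIR convolution.

    clean, noise: 1-D float arrays at the same sample rate.
    rir: room impulse response (1-D float array).
    """
    # Tile or trim the noise to match the clean signal's length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[: len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    noise = noise * np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    noisy = clean + noise
    # Reverberation: convolve with the room impulse response, keep original length.
    return fftconvolve(noisy, rir)[: len(clean)]
```

In practice, the noise and impulse responses would be drawn from recorded environmental noise and measured room acoustics so that the simulated recordings approximate the conditions of child-worn microphones.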
Below, you’ll find some audio samples that the model is exposed to during training.
Child-centered long-form recordings
Audio samples extracted from child-centered long-form recordings:
Audiobooks
Audio samples extracted from audiobooks:
Simulated long-form recordings
The same audio samples contaminated with additive noise and reverberation to simulate the challenging acoustic conditions found in long-form recordings: