BabAR: from phoneme recognition to developmental measures of young children's speech production

1. What is BabAR?

Understanding how children learn to speak requires analyzing the sounds they produce, but doing this by hand is extremely time-consuming, limiting most studies to small samples. BabAR (Babbling Automatic Recognition) is an open-source tool that automatically transcribes young children’s vocalizations into International Phonetic Alphabet (IPA) symbols, the standard notation used by linguists to represent speech sounds.

Analyzing a child’s speech from a daylong recording is a two-step process. First, we need to figure out when the child is speaking: daylong recordings contain hours of audio in which the child is mostly silent and other people (parents, siblings) are talking. For this, we use VTC (Voice Type Classifier), a neural network that scans the audio and detects the segments where the target child is vocalizing. Second, once we have those segments, BabAR takes over and transcribes them into IPA phonemes.
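The two-step pipeline can be sketched as follows. This is an illustrative mock-up, not VTC's or BabAR's actual API: the function names, segment format, and the placeholder `transcribe` logic are stand-ins (VTC's "KCHI" label for the target child is real, as are its other voice-type labels).

```python
def detect_child_segments(segments):
    """Step 1 (VTC stand-in): keep only segments attributed to the
    target child, which VTC labels 'KCHI'."""
    return [(start, end) for start, end, speaker in segments if speaker == "KCHI"]

def transcribe(segment):
    """Step 2 (BabAR stand-in): map a child segment to an IPA string.
    Very short segments often yield an empty prediction."""
    start, end = segment
    return "ɛ" if end - start >= 0.2 else ""

# Mock diarization output for a stretch of a daylong recording:
# (start_sec, end_sec, voice type), using VTC's voice-type labels.
diarization = [
    (0.0, 1.4, "FEM"),    # female adult (e.g., mother)
    (2.1, 2.5, "KCHI"),   # target child vocalization
    (3.0, 3.1, "KCHI"),   # very short target-child segment
    (4.0, 5.2, "OCH"),    # other child (e.g., sibling)
]

child_segments = detect_child_segments(diarization)
ipa = [transcribe(seg) for seg in child_segments]
print(child_segments)  # [(2.1, 2.5), (3.0, 3.1)]
print(ipa)             # ['ɛ', ''] — the 0.1 s clip yields an empty prediction
```

Note that adult and sibling speech never reaches the transcription step: whatever VTC attributes to other voice types is filtered out before BabAR runs.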

Paper: Lavechin, M., Bergelson, E., & Levy, R. (2026). BabAR: from phoneme recognition to developmental measures of young children’s speech production. Submitted to Interspeech.

Code: If you’d like to run BabAR on your own data, check out the GitHub repository.

2. BabAR in action

Here’s a video example of BabAR’s predictions on a recording collected using a child-worn microphone:

3. Listening to development

Below are vocalizations randomly sampled from the same child in the SEEDLingS dataset, recorded at three ages: 6, 12, and 17 months. For each sample, we show BabAR’s predicted IPA transcription. We recommend using headphones, as some sounds can be quite faint.

A few things to note: BabAR sometimes returns an empty prediction. This can happen when VTC detects very short segments that don’t contain enough acoustic information for BabAR to identify any phoneme, or when the child produces non-speech vocalizations (cries, laughs, vegetative sounds) that do not map onto phonemic categories. Additionally, VTC can occasionally misattribute an older sibling’s speech to the target child. The last sample at 17 months (m ɑ m i j u k ɛ n f a ɪ n d ɔ ɹ r d, i.e. “mommy you can find…”) is an example of this: a sibling’s utterance incorrectly detected as the target child’s.

6 months | 12 months | 17 months
ɛ ɛ ɪ ɛ | (empty) | (empty)
(empty) | ɪ | d ʒ æ
(empty) | (empty) | m ʌ t u t i m ʌ t u t
ɛ | i ɡ o ʊ | m a b ɔ
j æ | m o | ə ɪ
(empty) | ʌ | (empty)
ɛ | ʌ h æ | h a f a d a
ə ɛ | ɛ ʊ d ʊ ʃ m | m a t e w a t ʃ
(empty) | ʊ | ɡ a
(empty) | d u d u ɡ ʌ | m ɑ m i j u k ɛ n f a ɪ n d ɔ ɹ r d

4. Acknowledgments

We acknowledge funding from the Simons Foundation International (034070-00033) and the National Institutes of Health (NIH, grant number DP5-OD019812). We gratefully acknowledge PhonBank, funded by NIH-NICHD grant R01-HD051698, and thank the data contributors whose corpora made this research possible. This work was performed using HPC resources from GENCI-IDRIS (Grant 2025-A0181011829).