Publications
2025
- From perception to production: how acoustic invariance facilitates articulatory learning in a self-supervised vocal imitation model. Marvin Lavechin and Thomas Hueber. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
Human infants face a formidable challenge in speech acquisition: mapping extremely variable acoustic inputs into appropriate articulatory movements without explicit instruction. We present a computational model that addresses the acoustic-to-articulatory mapping problem through self-supervised learning. Our model comprises a feature extractor that transforms speech into latent representations, an inverse model that maps these representations to articulatory parameters, and a synthesizer that generates speech outputs. Experiments conducted in both single- and multi-speaker settings reveal that intermediate layers of a pre-trained wav2vec 2.0 model provide optimal representations for articulatory learning, significantly outperforming MFCC features. These representations enable our model to learn articulatory trajectories that correlate with human patterns, discriminate between places of articulation, and produce intelligible speech. Critical to successful articulatory learning are representations that balance phonetic discriminability with speaker invariance – precisely the characteristics of self-supervised representation learning models. Our findings provide computational evidence consistent with developmental theories proposing that perceptual learning of phonetic categories guides articulatory development, offering insights into how infants might acquire speech production capabilities despite the complex mapping problem they face.
- Simulating Early Phonetic and Word Learning Without Linguistic Categories. Marvin Lavechin, Maureen Seyssel, Hadrien Titeux, and 4 more authors. Developmental Science, 2025.
Before they even talk, infants become sensitive to the speech sounds of their native language and recognize the auditory form of an increasing number of words. Traditionally, these early perceptual changes are attributed to an emerging knowledge of linguistic categories such as phonemes or words. However, there is growing skepticism surrounding this interpretation due to limited evidence of category knowledge in infants. Previous modeling work has shown that a distributional learning algorithm could reproduce perceptual changes in infants’ early phonetic learning without acquiring phonetic categories. Taking this inquiry further, we propose that linguistic categories may not be needed for early word learning. We introduce STELA, a predictive coding algorithm designed to extract statistical patterns from continuous raw speech data. Our findings demonstrate that STELA can reproduce some developmental patterns of phonetic and word form learning without relying on linguistic categories such as phonemes or words, and without requiring explicit word segmentation. Through an analysis of the learned representations, we show evidence that linguistic categories may emerge as an end product of learning rather than being prerequisites during early language acquisition.
- Simulating prenatal language exposure in computational models: An exploration study. María Andrea Cruz Blandón, Nayeli Gonzalez-Gomez, Marvin Lavechin, and 1 more author. Cognition, 2025.
Researchers have hypothesized that infant language learning starts from the third trimester of pregnancy. This is supported by studies with fetuses and newborns showing discrimination/preference for their native language. Jointly with empirical research, initial computational modeling studies have investigated whether learning language patterns from speech input benefits from auditory prenatal language exposure (PLE), showing some advantages for prior adaptation to speech-like patterns. However, these modeling studies have not modeled prenatal speech input in an ecologically representative manner regarding quality or quantity. This study describes an ecologically representative framework for modeling PLE for full-term and preterm infants. The approach is based on empirical estimates of the amount of prenatal speech input together with a model of speech signal attenuation from the external air to the fetus’ auditory system. Using this framework, we conduct language learning simulations with computational models that learn from acoustic speech input in an unsupervised manner. We compare the effects of PLE to standard learning from only postnatal input on various early language phenomena. The results show how incorporating PLE can affect models’ learning outcomes, including differences between full-term and preterm conditions. Moreover, PLE duration might influence model behavior, depending on the linguistic capability being tested. While the inclusion of PLE did not improve the compatibility of the tested models with empirical infant data, our study highlights the relevance of PLE as a factor in modeling studies. Moreover, it provides a basic framework for modeling the prenatal period in future computational studies.
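The paper's full attenuation model is not reproduced here, but its core ingredient can be sketched: speech reaching the fetus from the external air is strongly low-pass filtered. Below is a minimal illustration; the ~400 Hz cutoff and Butterworth filter are assumptions for the sake of the example, not the paper's parameters.

```python
# Minimal sketch: simulating womb-like attenuation of external speech with a
# low-pass filter. The ~400 Hz cutoff and filter order are illustrative
# assumptions, not the attenuation model used in the paper.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def simulate_prenatal_speech(waveform: np.ndarray, sample_rate: int,
                             cutoff_hz: float = 400.0, order: int = 4) -> np.ndarray:
    """Attenuate high frequencies to mimic transmission to the fetal ear."""
    sos = butter(order, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, waveform)

# Example: filter one second of white noise sampled at 16 kHz.
audio = np.random.randn(16000)
prenatal_audio = simulate_prenatal_speech(audio, sample_rate=16000)
```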
2024
- Modeling the initial state of early phonetic learning in infants. Maxime Poli, Thomas Schatz, Emmanuel Dupoux, and 1 more author. Language Development Research, 2024.
What are the necessary conditions to acquire language? Do infants rely on simple statistical mechanisms, or do they come pre-wired with innate capabilities allowing them to learn their native language(s)? Previous modeling studies have shown that unsupervised learning algorithms could reproduce some aspects of infant phonetic learning. Despite these successes, algorithms still fail to reproduce the learning trajectories observed in infants. Here, we advocate that this failure is partly due to a wrong initial state. Contrary to infants, unsupervised learning algorithms start with little to no prior knowledge of speech sounds. In this work, we propose a modeling approach to investigate the relative contribution of innate factors and language experience in infant speech perception. Our approach allows us to investigate theories hypothesizing a more significant role of innate factors, offering new modeling opportunities for studying infant language acquisition.
- Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline. Alejandrina Cristia, Lucas Gautheron, Zixing Zhang, and 8 more authors. Behavior Research Methods, 2024.
Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children’s language input (typically speech from adults) and children’s language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-/Spanish-, and Quechua-/Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, the intraclass correlation coefficient attributed to child identity [Child ICC] was below 50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.
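The quantity at stake, Child ICC, is the share of a metric's variance attributable to child identity across repeated recordings. The paper estimates it with mixed-effects models; a simplified balanced-design ANOVA version (ICC(1)) conveys what is being measured.

```python
# Sketch of a one-way ICC(1) estimate: the share of variance in a metric
# (e.g., AWC) attributable to child identity across repeated recordings.
# The paper uses mixed-effects models; this balanced-design ANOVA version
# is a simplified stand-in for illustration.
import numpy as np

def child_icc(scores: np.ndarray) -> float:
    """scores: array of shape (n_children, k_recordings)."""
    n, k = scores.shape
    grand_mean = scores.mean()
    child_means = scores.mean(axis=1)
    ms_between = k * np.sum((child_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - child_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Example: 20 children, 4 recordings each; a true child effect plus noise.
rng = np.random.default_rng(0)
data = rng.normal(0, 1, (20, 1)) + rng.normal(0, 1, (20, 4))
print(f"Child ICC: {child_icc(data):.2f}")
```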
- Decode, move and speak! Self-supervised learning of speech units, gestures, and sound relationships using vocal imitation. Marc-Antoine Georges, Marvin Lavechin, Jean-Luc Schwartz, and 1 more author. Computational Linguistics, 2024.
Speech learning encompasses mastering a complex motor system to produce speech sounds from articulatory gestures while simultaneously uncovering discrete units that provide entry to the linguistic system. Remarkably, children acquire these associations between speech sounds, articulatory gestures, and linguistic units in a weakly supervised manner, without the need for explicit labeling of auditory inputs or access to target articulatory gestures. This study uses self-supervised deep learning to investigate the respective roles of sounds, gestures, and linguistic units in speech acquisition and control. In a first experiment, we analyzed the quantized representations learned by vector-quantized variational autoencoders (VQ-VAE) from ground truth acoustic and articulatory data using ABX tests. We show an interesting complementarity between acoustic and articulatory modalities that may help in the discovery of phonemes. In a second experiment, we introduce a computational agent that repeats auditory speech inputs by controlling a virtual vocal apparatus. This agent integrates an articulatory synthesizer capable of reproducing diverse speech stimuli from interpretable parameters, along with two internal models implementing the articulatory-to-acoustic (forward) and acoustic-to-articulatory (inverse) mapping, respectively. Additionally, two inductive biases are used to regularize the ill-posed acoustic-to-articulatory inverse mapping. In line with the first experiment, we explore the complementarity between the auditory input and the articulatory parameters inferred by the agent. We also evaluate the impact of discretizing auditory inputs using VQ-VAE. While the majority of the agent’s productions are intelligible (according to perceptual evaluations), our analysis highlights inconsistencies in the underlying articulatory trajectories. In particular, we show that the agent’s productions only partially reproduce the complementarity between the auditory and articulatory modalities observed in humans.
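For readers unfamiliar with ABX tests, here is a minimal sketch of the scoring logic. It uses mean-pooled cosine distance for simplicity; machine-ABX pipelines typically aggregate frame-wise distances with DTW instead.

```python
# Minimal sketch of an ABX discrimination test over learned representations.
# Each token is a fixed-size vector (e.g., a mean-pooled sequence of frames).
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_score(a_tokens, b_tokens, x_tokens) -> float:
    """Fraction of (A, B, X) triplets where X (same category as A, but a
    distinct token) is closer to A than to B. 1.0 = perfect, 0.5 = chance."""
    correct, total = 0, 0
    for a in a_tokens:
        for b in b_tokens:
            for x in x_tokens:
                correct += cosine_distance(a, x) < cosine_distance(b, x)
                total += 1
    return correct / total

# Toy example: two well-separated categories in a 16-d representation space.
rng = np.random.default_rng(0)
cat_a = [rng.normal(0.0, 1.0, 16) + 2.0 for _ in range(5)]
cat_b = [rng.normal(0.0, 1.0, 16) - 2.0 for _ in range(5)]
print(abx_score(cat_a[:3], cat_b, cat_a[3:]))  # X drawn from A, held out
```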
- Modeling early phonetic acquisition from child-centered audio data. Marvin Lavechin, Maureen Seyssel, Marianne Métais, and 5 more authors. Cognition, 2024.
Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement and the circumstances in which it takes place remain unclear. This paper presents the first attempt to study perceptual attunement using ecological child-centered audio data. We show that a simple prediction algorithm exhibits perceptual attunement when applied on unrealistic clean audio-book data, but fails to do so when applied on ecologically-valid child-centered data. In the latter scenario, perceptual attunement only emerges when the prediction mechanism is supplemented with inductive biases that force the algorithm to focus exclusively on speech segments while learning speaker-, pitch-, and room-invariant representations. We argue these biases are plausible given previous research on infants and non-human animals. More generally, we show that what our model learns and how it develops through exposure to speech depends exquisitely on the details of the input signal. By doing so, we illustrate the importance of considering ecologically valid input data when modeling language acquisition.
2023
- Measuring language development from child-centered recordings. Yaya Sy, William Havard, Marvin Lavechin, and 2 more authors. In Interspeech, 2023.
Standard ways to measure child language development from spontaneous corpora rely on detailed linguistic descriptions of a language as well as exhaustive transcriptions of the child’s speech, which today can only be done through costly human labor. We tackle both issues by proposing (1) a new language development metric (based on entropy) that does not require linguistic knowledge other than having a corpus of text in the language in question to train a language model, and (2) a method to derive this metric directly from speech based on a smaller text-speech parallel corpus. Here, we present descriptive results on an open archive including data from six English-learning children as a proof of concept. We show that our entropy metric captures a gradual convergence of children’s speech towards adults’ speech as a function of age, and that it correlates moderately with lexical and morphosyntactic measures derived from morphologically-parsed transcriptions. The source code of the experiments is released at https://github.com/yaya-sy/EntropyBasedCLDMetrics
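The metric is simple to illustrate: score each child utterance by its cross-entropy under a language model trained on adult text. The sketch below keeps itself self-contained with an add-one-smoothed character bigram model; the paper uses a stronger LM, and the repository above has the actual implementation.

```python
# Sketch of an entropy-based development metric: score each child utterance
# by its per-character cross-entropy under a language model trained on adult
# text. A smoothed character bigram model stands in for the paper's LM.
import math
from collections import Counter

def train_bigram(text: str):
    text = "^" + text  # '^' marks utterance start
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    return bigrams, unigrams, len(set(text))

def cross_entropy(utterance: str, model) -> float:
    """Average negative log2-probability per character (bits/char)."""
    bigrams, unigrams, vocab_size = model
    utterance = "^" + utterance
    nll = 0.0
    for prev, char in zip(utterance, utterance[1:]):
        p = (bigrams[(prev, char)] + 1) / (unigrams[prev] + vocab_size)
        nll -= math.log2(p)
    return nll / (len(utterance) - 1)

model = train_bigram("the cat sat on the mat and the dog ran away")
# Lower entropy = closer to the adult language model.
print(cross_entropy("the cat ran", model))
```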
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models. Marvin Lavechin, Yaya Sy, Hadrien Titeux, and 5 more authors. Interspeech, 2023.
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children’s language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
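The lexical probing logic common to benchmarks of this kind can be sketched generically: the model scores a real word against a matched nonword, and accuracy is the fraction of pairs where the real word wins. In the sketch below, `score_fn` is a placeholder for any model's pseudo-log-probability scorer, not the benchmark's actual implementation.

```python
# Generic "spot-the-word" scoring sketch: a model assigns a score to a real
# word and a matched nonword; accuracy counts how often the real word wins.
from typing import Callable, Iterable, Tuple

def lexical_accuracy(pairs: Iterable[Tuple[str, str]],
                     score_fn: Callable[[str], float]) -> float:
    results = [score_fn(word) > score_fn(nonword) for word, nonword in pairs]
    return sum(results) / len(results)

# Toy scorer (illustration only): prefers strings of frequent English letters.
toy_score = lambda s: sum(s.count(c) for c in "etaoin") / max(len(s), 1)
print(lexical_accuracy([("teen", "tzzn"), ("train", "trzzn")], toy_score))
```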
- Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation. Marvin Lavechin, Marianne Métais, Hadrien Titeux, and 7 more authors. ASRU, 2023.
Most automatic speech processing systems are sensitive to the acoustic environment, with degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a pipeline to simulate audio segments recorded in noisy and reverberant conditions. We then use the simulated audio to jointly train the Brouhaha model for voice activity detection, signal-to-noise ratio estimation, and C50 room acoustics prediction. We show how the predicted SNR and C50 values can be used to investigate and help diagnose errors made by automatic speech processing tools (such as pyannote.audio for speaker diarization or OpenAI’s Whisper for automatic speech recognition). Both our pipeline and a pretrained model are open source and shared with the speech community.
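C50, the clarity index the model learns to predict, has a standard room-acoustics definition: the ratio of early (first 50 ms after the direct sound) to late impulse-response energy, in decibels. The sketch below shows the target quantity computed from a room impulse response; it is not the Brouhaha codebase itself.

```python
# Sketch: C50 (speech clarity) from a room impulse response — the ratio of
# early (first 50 ms) to late energy in dB. Brouhaha predicts this quantity
# directly from noisy speech; this only illustrates what the target means.
import numpy as np

def c50(impulse_response: np.ndarray, sample_rate: int) -> float:
    onset = int(np.argmax(np.abs(impulse_response)))  # direct-path arrival
    split = onset + int(0.050 * sample_rate)          # 50 ms after onset
    energy = impulse_response ** 2
    early = energy[onset:split].sum()
    late = energy[split:].sum()
    return 10.0 * np.log10(early / late)

# Example: synthetic exponentially decaying impulse response at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
rir = np.random.randn(sr) * np.exp(-t / 0.1)  # ~0.1 s decay constant
print(f"C50 = {c50(rir, sr):.1f} dB")
```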
- Realistic and broad-scope learning simulations: first results and challenges. Maureen Seyssel, Marvin Lavechin, and Emmanuel Dupoux. Journal of Child Language, 2023.
There is a current 'theory crisis' in language acquisition research, resulting from fragmentation both at the level of the approaches and the linguistic level studied. We identify a need for integrative approaches that go beyond these limitations, and propose to analyse the strengths and weaknesses of current theoretical approaches to language acquisition. In particular, we advocate that language learning simulations, if they integrate realistic input and multiple levels of language, have the potential to contribute significantly to our understanding of language acquisition. We then review recent results obtained through such language learning simulations. Finally, we propose some guidelines for the community to build better simulations.
- ProsAudit, a prosodic benchmark for self-supervised speech models. Maureen Seyssel, Marvin Lavechin, Hadrien Titeux, and 6 more authors. Interspeech, 2023.
We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of training data size, with models trained on more data performing better on the two subtasks.
2022
- Probing phoneme, language and speaker information in unsupervised speech representations. Maureen Seyssel, Marvin Lavechin, Yossi Adi, and 2 more authors. Interspeech, 2022.
Unsupervised models of representations based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried out on the discrete units from a downstream clustering model. However, although there is no effect of the number of target clusters on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
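The probing methodology generalizes beyond this paper; here is a minimal sketch with placeholder features (in practice these would be frozen, e.g. mean-pooled CPC representations per utterance) and a logistic-regression probe. High probe accuracy indicates the property is linearly decodable from the representations.

```python
# Sketch of the probing methodology: freeze the representation model, then
# train a light classifier to predict a property (here "language") from the
# frozen features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder features standing in for frozen CPC representations.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))
labels = rng.integers(0, 2, size=1000)  # e.g., 0 = English, 1 = French

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
print(f"Probe accuracy: {probe.score(x_te, y_te):.2f}")  # ~0.5 on random data
```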
- Reverse engineering language acquisition with child-centered long-form recordings. Marvin Lavechin, Maureen Seyssel, Lucas Gautheron, and 2 more authors. Annual Review of Linguistics, 2022.
Language use in everyday life can be studied using lightweight, wearable recorders that collect long-form recordings, that is, audio (including speech) over whole days. The hardware and software underlying this technique are increasingly accessible and inexpensive, and these data are revolutionizing the language acquisition field. We first place this technique into the broader context of the current ways of studying both the input being received by children and children’s own language production, laying out the main advantages and drawbacks of long-form recordings. We then go on to argue that a unique advantage of long-form recordings is that they can fuel realistic models of early language acquisition that use speech to represent children’s input and/or to establish production benchmarks. To enable the field to make the most of this unique empirical and conceptual contribution, we outline what this reverse engineering approach from long-form recordings entails, why it is useful, and how to evaluate success.
2021
- A thorough evaluation of the Language Environment Analysis (LENA) system. Alejandrina Cristia, Marvin Lavechin, Camila Scaff, and 5 more authors. Behavior Research Methods, 2021.
In the previous decade, dozens of studies involving thousands of children across several research disciplines have made use of a combined daylong audio-recorder and automated algorithmic analysis called the LENA® system, which aims to assess children’s language environment. While the system’s prevalence in the language acquisition domain is steadily growing, there are only scattered validation efforts, on only some of its key characteristics. Here, we assess the LENA® system’s accuracy across all of its key measures: speaker classification, Child Vocalization Counts (CVC), Conversational Turn Counts (CTC), and Adult Word Counts (AWC). Our assessment is based on manual annotation of clips that have been randomly or periodically sampled out of daylong recordings, collected from (a) populations similar to the system’s original training data (North American English-learning children aged 3-36 months), (b) children learning another dialect of English (UK), and (c) slightly older children growing up in a different linguistic and socio-cultural setting (Tsimane’ learners in rural Bolivia). We find reasonably high accuracy in some measures (AWC, CVC), with more problematic levels of performance in others (CTC, precision of male adults and other children). Statistical analyses do not support the view that performance is worse for children who are dissimilar from the LENA® original training set. Whether LENA® results are accurate enough for a given research, educational, or clinical application depends largely on the specifics at hand. We therefore conclude with a set of recommendations to help researchers make this determination for their goals.
- ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Okko Räsänen, Shreyas Seshadri, Marvin Lavechin, and 2 more authors. Behavior Research Methods, 2021.
Recordings captured by wearable microphones are a standard method for investigating young children’s language environments. A key measure to quantify from such data is the amount of speech present in children’s home environments. To this end, the LENA recorder and software—a popular system for measuring linguistic input—estimates the number of adult words that children may hear over the course of a recording. However, word count estimation is challenging to do in a language-independent manner; the relationship between observable acoustic patterns and language-specific lexical entities is far from uniform across human languages. In this paper, we ask whether some alternative linguistic units, namely phone(me)s or syllables, could be measured instead of, or in parallel with, words in order to achieve improved cross-linguistic applicability and comparability of an automated system for measuring child language input. We discuss the advantages and disadvantages of measuring different units from theoretical and technical points of view. We also investigate the practical applicability of measuring such units using a novel system called Automatic LInguistic unit Count Estimator (ALICE) together with audio from seven child-centered daylong audio corpora from diverse cultural and linguistic environments. We show that language-independent measurement of phoneme counts is somewhat more accurate than syllables or words, but all three are highly correlated with human annotations on the same data. We share an open-source implementation of ALICE for use by the language research community, enabling automatic phoneme, syllable, and word count estimation from child-centered audio recordings.
- ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track. Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, and 5 more authors. arXiv preprint arXiv:2107.06546, 2021.
We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.
2020
- An open-source voice type classifier for child-centered daylong recordings. Marvin Lavechin, Ruben Bousbib, Hervé Bredin, and 2 more authors. Interspeech, 2020.
Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring effects of remediation. In this paper, we present our approach to designing an open-source neural network to classify audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora which sum up to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
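A schematic PyTorch stand-in for this architecture is shown below, with a plain Conv1d replacing the SincNet frontend for brevity; layer sizes are illustrative and not the published configuration.

```python
# Schematic stand-in for the voice type classifier: a waveform frontend
# followed by recurrent layers and per-frame multi-label outputs. The real
# model uses SincNet filters; a plain Conv1d replaces them here for brevity.
import torch
import torch.nn as nn

class VoiceTypeClassifier(nn.Module):
    CLASSES = ["key_child", "other_child", "male_adult", "female_adult"]

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Waveform -> frame-level features (stride sets the frame rate).
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=251, stride=160), nn.ReLU(),
        )
        self.recurrent = nn.LSTM(80, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        # One sigmoid per class: labels may overlap (e.g., overlapping speech).
        self.head = nn.Linear(2 * hidden, len(self.CLASSES))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> scores: (batch, frames, 4)
        feats = self.frontend(waveform.unsqueeze(1)).transpose(1, 2)
        out, _ = self.recurrent(feats)
        return torch.sigmoid(self.head(out))

model = VoiceTypeClassifier()
scores = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(scores.shape)  # torch.Size([2, frames, 4])
```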
- pyannote.audio: neural building blocks for speaker diarization. Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, and 7 more authors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.
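A hedged usage sketch: the pyannote.audio API has evolved across versions, and the snippet below reflects the later 2.x-style pretrained-pipeline interface rather than the version described in this paper; the pretrained model may also require a Hugging Face access token.

```python
# Hedged usage sketch for pyannote.audio's pretrained diarization pipeline
# (2.x-style interface; details vary by version).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("meeting.wav")

# Iterate over speech turns with their speaker labels.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```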
- End-to-end domain-adversarial voice activity detection. Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, and 2 more authors. Interspeech, 2020.
Voice activity detection is the task of detecting speech regions in a given audio stream or recording. First, we design a neural network combining trainable filters and recurrent layers to tackle voice activity detection directly from the waveform. Experiments on the challenging DIHARD dataset show that the proposed end-to-end model reaches state-of-the-art performance and outperforms a variant where trainable filters are replaced by standard cepstral coefficients. Our second contribution aims at making the proposed voice activity detection model robust to domain mismatch. To that end, a domain classification branch is added to the network and trained in an adversarial manner. The same DIHARD dataset, drawn from 11 different domains, is used for evaluation under two scenarios. In the in-domain scenario, where the training and test sets cover the exact same domains, we show that the domain-adversarial approach does not degrade performance of the proposed end-to-end model. In the out-domain scenario, where the test domain is different from the training domains, it brings a relative improvement of more than 10%. Finally, our last contribution is the provision of a fully reproducible open-source pipeline that can be easily adapted to other datasets.
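Adversarial domain branches of this kind typically rest on a gradient reversal layer (GRL), in the formulation of Ganin et al.; the abstract does not spell out the exact construction used here, so the sketch below shows the standard one: the domain head is trained normally, but the gradient flowing back into the shared encoder is negated, pushing it toward domain-invariant features.

```python
# Sketch of the standard gradient reversal layer (GRL) used in
# domain-adversarial training: identity in the forward pass, negated
# (scaled) gradient in the backward pass.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha: float):
        ctx.alpha = alpha
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None  # reversed gradient

def grad_reverse(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    return GradientReversal.apply(x, alpha)

# Usage: shared VAD features feed both heads; only the domain head's
# gradient is reversed before reaching the encoder.
features = torch.randn(8, 128, requires_grad=True)
domain_logits = torch.nn.Linear(128, 11)(grad_reverse(features))  # 11 domains
```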
- Speaker detection in the wild: Lessons learned from JSALT 2019. Paola García, Jesus Villalba, Hervé Bredin, and 8 more authors. Odyssey, 2020.
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that an effective diarization improves detection, and not having a diarization stage impoverishes the performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of previous stages in the final speaker detection. In this paper, we show partial results for speaker diarization to have a better understanding of the problem and we present the final results for speaker detection.
- Longform recordings: Opportunities and challenges. Lucas Gautheron, Marvin Lavechin, Rachid Riad, and 2 more authors. In LIFT 2020: 2èmes journées scientifiques du Groupement de Recherche "Linguistique informatique, formelle et de terrain", 2020.
Technological advances have enabled the development of lightweight, wearable recorders that collect audio (including speech) lasting up to a whole day. We provide a general description of the technique and lay out the advantages and drawbacks when using this methodology. Field linguists may gain a uniquely naturalistic viewpoint of language use as people go about their everyday activities. However, due to their duration, noisiness, and likelihood of containing sensitive information, long-form recordings remain difficult to annotate manually. Open-source tools improve reproducibility and ease-of-use for researchers, to which end speech technologists can contribute. Additionally, new approaches to human and automated annotation make the study of speech in long-form recordings increasingly feasible and promising.