Publications
2025
- From perception to production: how acoustic invariance facilitates articulatory learning in a self-supervised vocal imitation model. Marvin Lavechin and Thomas Hueber. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
Human infants face a formidable challenge in speech acquisition: mapping extremely variable acoustic inputs into appropriate articulatory movements without explicit instruction. We present a computational model that addresses the acoustic-to-articulatory mapping problem through self-supervised learning. Our model comprises a feature extractor that transforms speech into latent representations, an inverse model that maps these representations to articulatory parameters, and a synthesizer that generates speech outputs. Experiments conducted in both single- and multi-speaker settings reveal that intermediate layers of a pre-trained wav2vec 2.0 model provide optimal representations for articulatory learning, significantly outperforming MFCC features. These representations enable our model to learn articulatory trajectories that correlate with human patterns, discriminate between places of articulation, and produce intelligible speech. Critical to successful articulatory learning are representations that balance phonetic discriminability with speaker invariance – precisely the characteristics of self-supervised representation learning models. Our findings provide computational evidence consistent with developmental theories proposing that perceptual learning of phonetic categories guides articulatory development, offering insights into how infants might acquire speech production capabilities despite the complex mapping problem they face.
- Simulating Early Phonetic and Word Learning Without Linguistic Categories. Marvin Lavechin, Maureen Seyssel, Hadrien Titeux, and 4 more authors. Developmental Science, 2025.
Before they even talk, infants become sensitive to the speech sounds of their native language and recognize the auditory form of an increasing number of words. Traditionally, these early perceptual changes are attributed to an emerging knowledge of linguistic categories such as phonemes or words. However, there is growing skepticism surrounding this interpretation due to limited evidence of category knowledge in infants. Previous modeling work has shown that a distributional learning algorithm could reproduce perceptual changes in infants’ early phonetic learning without acquiring phonetic categories. Taking this inquiry further, we propose that linguistic categories may not be needed for early word learning. We introduce STELA, a predictive coding algorithm designed to extract statistical patterns from continuous raw speech data. Our findings demonstrate that STELA can reproduce some developmental patterns of phonetic and word form learning without relying on linguistic categories such as phonemes or words, and without requiring explicit word segmentation. Through an analysis of the learned representations, we show evidence that linguistic categories may emerge as an end product of learning rather than being prerequisites during early language acquisition.
- Simulating prenatal language exposure in computational models: An exploration study. María Andrea Cruz Blandón, Nayeli Gonzalez-Gomez, Marvin Lavechin, and 1 more author. Cognition, 2025.
Researchers have hypothesized that infant language learning starts from the third trimester of pregnancy. This is supported by studies with fetuses and newborns showing discrimination/preference for their native language. Jointly with empirical research, initial computational modeling studies have investigated whether learning language patterns from speech input benefits from auditory prenatal language exposure (PLE), showing some advantages for prior adaptation to speech-like patterns. However, these modeling studies have not modeled prenatal speech input in an ecologically representative manner regarding quality or quantity. This study describes an ecologically representative framework for modeling PLE for full-term and preterm infants. The approach is based on empirical estimates of the amount of prenatal speech input together with a model of speech signal attenuation from the external air to the fetus’ auditory system. Using this framework, we conduct language learning simulations with computational models that learn from acoustic speech input in an unsupervised manner. We compare the effects of PLE to standard learning from only postnatal input on various early language phenomena. The results show how incorporating PLE can affect models’ learning outcomes, including differences between full-term and preterm conditions. Moreover, PLE duration might influence model behavior, depending on the linguistic capability being tested. While the inclusion of PLE did not improve the compatibility of the tested models with empirical infant data, our study highlights the relevance of PLE as a factor in modeling studies. Moreover, it provides a basic framework for modeling the prenatal period in future computational studies.
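The paper's full attenuation model is not reproduced here, but its core ingredient can be sketched: speech reaching the fetus from the external air is strongly low-pass filtered. Below is a minimal illustration; the ~400 Hz cutoff and Butterworth filter are assumptions for the sake of the example, not the paper's parameters.

```python
# Minimal sketch: simulating womb-like attenuation of external speech with a
# low-pass filter. The ~400 Hz cutoff and filter order are illustrative
# assumptions, not the attenuation model used in the paper.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def simulate_prenatal_speech(waveform: np.ndarray, sample_rate: int,
                             cutoff_hz: float = 400.0, order: int = 4) -> np.ndarray:
    """Attenuate high frequencies to mimic transmission to the fetal ear."""
    sos = butter(order, cutoff_hz, btype="low", fs=sample_rate, output="sos")
    return sosfiltfilt(sos, waveform)

# Example: filter one second of white noise sampled at 16 kHz.
audio = np.random.randn(16000)
prenatal_audio = simulate_prenatal_speech(audio, sample_rate=16000)
```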
2024
- Modeling the initial state of early phonetic learning in infants. Maxime Poli, Thomas Schatz, Emmanuel Dupoux, and 1 more author. Language Development Research, 2024.
What are the necessary conditions to acquire language? Do infants rely on simple statistical mechanisms, or do they come pre-wired with innate capabilities allowing them to learn their native language(s)? Previous modeling studies have shown that unsupervised learning algorithms could reproduce some aspects of infant phonetic learning. Despite these successes, algorithms still fail to reproduce the learning trajectories observed in infants. Here, we advocate that this failure is partly due to a wrong initial state. Contrary to infants, unsupervised learning algorithms start with little to no prior knowledge of speech sounds. In this work, we propose a modeling approach to investigate the relative contribution of innate factors and language experience in infant speech perception. Our approach allows us to investigate theories hypothesizing a more significant role of innate factors, offering new modeling opportunities for studying infant language acquisition.
- Establishing the reliability of metrics extracted from long-form recordings using LENA and the ACLEW pipeline. Alejandrina Cristia, Lucas Gautheron, Zixing Zhang, and 8 more authors. Behavior Research Methods, 2024.
Long-form audio recordings are increasingly used to study individual variation, group differences, and many other topics in theoretical and applied fields of developmental science, particularly for the description of children’s language input (typically speech from adults) and children’s language output (ranging from babble to sentences). The proprietary LENA software has been available for over a decade, and with it, users have come to rely on derived metrics like adult word count (AWC) and child vocalization counts (CVC), which have also more recently been derived using an open-source alternative, the ACLEW pipeline. Yet, there is relatively little work assessing the reliability of long-form metrics in terms of the stability of individual differences across time. Filling this gap, we analyzed eight spoken-language datasets: four from North American English-learning infants, and one each from British English-, French-, American English-/Spanish-, and Quechua-/Spanish-learning infants. The audio data were analyzed using two types of processing software: LENA and the ACLEW open-source pipeline. When all corpora were included, we found relatively low to moderate reliability (across multiple recordings, the intraclass correlation coefficient attributed to child identity [Child ICC] was below 50% for most metrics). There were few differences between the two pipelines. Exploratory analyses suggested some differences as a function of child age and corpora. These findings suggest that, while reliability is likely sufficient for various group-level analyses, caution is needed when using either LENA or ACLEW tools to study individual variation. We also encourage improvement of extant tools, specifically targeting accurate measurement of individual variation.
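The quantity at stake, Child ICC, is the share of a metric's variance attributable to child identity across repeated recordings. The paper estimates it with mixed-effects models; a simplified balanced-design ANOVA version (ICC(1)) conveys what is being measured.

```python
# Sketch of a one-way ICC(1) estimate: the share of variance in a metric
# (e.g., AWC) attributable to child identity across repeated recordings.
# The paper uses mixed-effects models; this balanced-design ANOVA version
# is a simplified stand-in for illustration.
import numpy as np

def child_icc(scores: np.ndarray) -> float:
    """scores: array of shape (n_children, k_recordings)."""
    n, k = scores.shape
    grand_mean = scores.mean()
    child_means = scores.mean(axis=1)
    ms_between = k * np.sum((child_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - child_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Example: 20 children, 4 recordings each; a true child effect plus noise.
rng = np.random.default_rng(0)
data = rng.normal(0, 1, (20, 1)) + rng.normal(0, 1, (20, 4))
print(f"Child ICC: {child_icc(data):.2f}")
```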
- Decode, move and speak! Self-supervised learning of speech units, gestures, and sound relationships using vocal imitation. Marc-Antoine Georges, Marvin Lavechin, Jean-Luc Schwartz, and 1 more author. Computational Linguistics, 2024.
Speech learning encompasses mastering a complex motor system to produce speech sounds from articulatory gestures while simultaneously uncovering discrete units that provide entry to the linguistic system. Remarkably, children acquire these associations between speech sounds, articulatory gestures, and linguistic units in a weakly supervised manner, without the need for explicit labeling of auditory inputs or access to target articulatory gestures. This study uses self-supervised deep learning to investigate the respective roles of sounds, gestures, and linguistic units in speech acquisition and control. In a first experiment, we analyzed the quantized representations learned by vector-quantized variational autoencoders (VQ-VAE) from ground truth acoustic and articulatory data using ABX tests. We show an interesting complementarity between acoustic and articulatory modalities that may help in the discovery of phonemes. In a second experiment, we introduce a computational agent that repeats auditory speech inputs by controlling a virtual vocal apparatus. This agent integrates an articulatory synthesizer capable of reproducing diverse speech stimuli from interpretable parameters, along with two internal models implementing the articulatory-to-acoustic (forward) and acoustic-to-articulatory (inverse) mapping, respectively. Additionally, two inductive biases are used to regularize the ill-posed acoustic-to-articulatory inverse mapping. In line with the first experiment, we explore the complementarity between the auditory input and the articulatory parameters inferred by the agent. We also evaluate the impact of discretizing auditory inputs using VQ-VAE. While the majority of the agent’s productions are intelligible (according to perceptual evaluations), our analysis highlights inconsistencies in the underlying articulatory trajectories. In particular, we show that the agent’s productions only partially reproduce the complementarity between the auditory and articulatory modalities observed in humans.
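For readers unfamiliar with ABX tests, here is a minimal sketch of the scoring logic. It uses mean-pooled cosine distance for simplicity; machine-ABX pipelines typically aggregate frame-wise distances with DTW instead.

```python
# Minimal sketch of an ABX discrimination test over learned representations.
# Each token is a fixed-size vector (e.g., a mean-pooled sequence of frames).
import numpy as np

def cosine_distance(u: np.ndarray, v: np.ndarray) -> float:
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def abx_score(a_tokens, b_tokens, x_tokens) -> float:
    """Fraction of (A, B, X) triplets where X (same category as A, but a
    distinct token) is closer to A than to B. 1.0 = perfect, 0.5 = chance."""
    correct, total = 0, 0
    for a in a_tokens:
        for b in b_tokens:
            for x in x_tokens:
                correct += cosine_distance(a, x) < cosine_distance(b, x)
                total += 1
    return correct / total

# Toy example: two well-separated categories in a 16-d representation space.
rng = np.random.default_rng(0)
cat_a = [rng.normal(0.0, 1.0, 16) + 2.0 for _ in range(5)]
cat_b = [rng.normal(0.0, 1.0, 16) - 2.0 for _ in range(5)]
print(abx_score(cat_a[:3], cat_b, cat_a[3:]))  # X drawn from A, held out
```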
- Modeling early phonetic acquisition from child-centered audio data. Marvin Lavechin, Maureen Seyssel, Marianne Métais, and 5 more authors. Cognition, 2024.
Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement and the circumstances in which it takes place remain unclear. This paper presents the first attempt to study perceptual attunement using ecological child-centered audio data. We show that a simple prediction algorithm exhibits perceptual attunement when applied on unrealistic clean audio-book data, but fails to do so when applied on ecologically-valid child-centered data. In the latter scenario, perceptual attunement only emerges when the prediction mechanism is supplemented with inductive biases that force the algorithm to focus exclusively on speech segments while learning speaker-, pitch-, and room-invariant representations. We argue these biases are plausible given previous research on infants and non-human animals. More generally, we show that what our model learns and how it develops through exposure to speech depends exquisitely on the details of the input signal. By doing so, we illustrate the importance of considering ecologically valid input data when modeling language acquisition.
2023
- Measuring language development from child-centered recordings. Yaya Sy, William Havard, Marvin Lavechin, and 2 more authors. In Interspeech, 2023.
Standard ways to measure child language development from spontaneous corpora rely on detailed linguistic descriptions of a language as well as exhaustive transcriptions of the child’s speech, which today can only be done through costly human labor. We tackle both issues by proposing (1) a new language development metric (based on entropy) that does not require linguistic knowledge other than having a corpus of text in the language in question to train a language model, and (2) a method to derive this metric directly from speech based on a smaller text-speech parallel corpus. Here, we present descriptive results on an open archive including data from six English-learning children as a proof of concept. We show that our entropy metric captures a gradual convergence of children’s speech towards adults’ speech as a function of age, and that it correlates moderately with lexical and morphosyntactic measures derived from morphologically-parsed transcriptions. The source code of the experiments is released at https://github.com/yaya-sy/EntropyBasedCLDMetrics
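The metric is simple to illustrate: score each child utterance by its cross-entropy under a language model trained on adult text. The sketch below keeps itself self-contained with an add-one-smoothed character bigram model; the paper uses a stronger LM, and the repository above has the actual implementation.

```python
# Sketch of an entropy-based development metric: score each child utterance
# by its per-character cross-entropy under a language model trained on adult
# text. A smoothed character bigram model stands in for the paper's LM.
import math
from collections import Counter

def train_bigram(text: str):
    text = "^" + text  # '^' marks utterance start
    bigrams = Counter(zip(text, text[1:]))
    unigrams = Counter(text[:-1])
    return bigrams, unigrams, len(set(text))

def cross_entropy(utterance: str, model) -> float:
    """Average negative log2-probability per character (bits/char)."""
    bigrams, unigrams, vocab_size = model
    utterance = "^" + utterance
    nll = 0.0
    for prev, char in zip(utterance, utterance[1:]):
        p = (bigrams[(prev, char)] + 1) / (unigrams[prev] + vocab_size)
        nll -= math.log2(p)
    return nll / (len(utterance) - 1)

model = train_bigram("the cat sat on the mat and the dog ran away")
# Lower entropy = closer to the adult language model.
print(cross_entropy("the cat ran", model))
```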
- BabySLM: language-acquisition-friendly benchmark of self-supervised spoken language models. Marvin Lavechin, Yaya Sy, Hadrien Titeux, and 5 more authors. Interspeech, 2023.
Self-supervised techniques for learning speech representations have been shown to develop linguistic competence from exposure to speech without the need for human labels. In order to fully realize the potential of these approaches and further our understanding of how infants learn language, simulations must closely emulate real-life situations by training on developmentally plausible corpora and benchmarking against appropriate test sets. To this end, we propose a language-acquisition-friendly benchmark to probe spoken language models at the lexical and syntactic levels, both of which are compatible with the vocabulary typical of children’s language experiences. This paper introduces the benchmark and summarizes a range of experiments showing its usefulness. In addition, we highlight two exciting challenges that need to be addressed for further progress: bridging the gap between text and speech and between clean speech and in-the-wild speech.
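The lexical probing logic common to benchmarks of this kind can be sketched generically: the model scores a real word against a matched nonword, and accuracy is the fraction of pairs where the real word wins. In the sketch below, `score_fn` is a placeholder for any model's pseudo-log-probability scorer, not the benchmark's actual implementation.

```python
# Generic "spot-the-word" scoring sketch: a model assigns a score to a real
# word and a matched nonword; accuracy counts how often the real word wins.
from typing import Callable, Iterable, Tuple

def lexical_accuracy(pairs: Iterable[Tuple[str, str]],
                     score_fn: Callable[[str], float]) -> float:
    results = [score_fn(word) > score_fn(nonword) for word, nonword in pairs]
    return sum(results) / len(results)

# Toy scorer (illustration only): prefers strings of frequent English letters.
toy_score = lambda s: sum(s.count(c) for c in "etaoin") / max(len(s), 1)
print(lexical_accuracy([("teen", "tzzn"), ("train", "trzzn")], toy_score))
```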
- Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation. Marvin Lavechin, Marianne Métais, Hadrien Titeux, and 7 more authors. ASRU, 2023.
Most automatic speech processing systems are sensitive to the acoustic environment, with degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a pipeline to simulate audio segments recorded in noisy and reverberant conditions. We then use the simulated audio to jointly train the Brouhaha model for voice activity detection, signal-to-noise ratio estimation, and C50 room acoustics prediction. We show how the predicted SNR and C50 values can be used to investigate and help diagnose errors made by automatic speech processing tools (such as pyannote.audio for speaker diarization or OpenAI’s Whisper for automatic speech recognition). Both our pipeline and a pretrained model are open source and shared with the speech community.
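C50, the clarity index the model learns to predict, has a standard room-acoustics definition: the ratio of early (first 50 ms after the direct sound) to late impulse-response energy, in decibels. The sketch below shows the target quantity computed from a room impulse response; it is not the Brouhaha codebase itself.

```python
# Sketch: C50 (speech clarity) from a room impulse response — the ratio of
# early (first 50 ms) to late energy in dB. Brouhaha predicts this quantity
# directly from noisy speech; this only illustrates what the target means.
import numpy as np

def c50(impulse_response: np.ndarray, sample_rate: int) -> float:
    onset = int(np.argmax(np.abs(impulse_response)))  # direct-path arrival
    split = onset + int(0.050 * sample_rate)          # 50 ms after onset
    energy = impulse_response ** 2
    early = energy[onset:split].sum()
    late = energy[split:].sum()
    return 10.0 * np.log10(early / late)

# Example: synthetic exponentially decaying impulse response at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
rir = np.random.randn(sr) * np.exp(-t / 0.1)  # ~0.1 s decay constant
print(f"C50 = {c50(rir, sr):.1f} dB")
```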
- Realistic and broad-scope learning simulations: first results and challenges. Maureen Seyssel, Marvin Lavechin, and Emmanuel Dupoux. Journal of Child Language, 2023.
There is a current 'theory crisis' in language acquisition research, resulting from fragmentation both at the level of the approaches and the linguistic level studied. We identify a need for integrative approaches that go beyond these limitations, and propose to analyse the strengths and weaknesses of current theoretical approaches to language acquisition. In particular, we advocate that language learning simulations, if they integrate realistic input and multiple levels of language, have the potential to contribute significantly to our understanding of language acquisition. We then review recent results obtained through such language learning simulations. Finally, we propose some guidelines for the community to build better simulations.
- ProsAudit, a prosodic benchmark for self-supervised speech models. Maureen Seyssel, Marvin Lavechin, Hadrien Titeux, and 6 more authors. Interspeech, 2023.
We present ProsAudit, a benchmark in English to assess structural prosodic knowledge in self-supervised learning (SSL) speech models. It consists of two subtasks, their corresponding metrics, and an evaluation dataset. In the protosyntax task, the model must correctly identify strong versus weak prosodic boundaries. In the lexical task, the model needs to correctly distinguish between pauses inserted between words and within words. We also provide human evaluation scores on this benchmark. We evaluated a series of SSL models and found that they were all able to perform above chance on both tasks, even when evaluated on an unseen language. However, non-native models performed significantly worse than native ones on the lexical task, highlighting the importance of lexical knowledge in this task. We also found a clear effect of training data size, with models trained on more data performing better on the two subtasks.
2022
- Probing phoneme, language and speaker information in unsupervised speech representations. Maureen Seyssel, Marvin Lavechin, Yossi Adi, and 2 more authors. Interspeech, 2022.
Unsupervised models of representations based on Contrastive Predictive Coding (CPC) [1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried out on the discrete units from a downstream clustering model. However, although there is no effect of the number of target clusters on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
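The probing methodology generalizes beyond this paper; here is a minimal sketch with placeholder features (in practice these would be frozen, e.g. mean-pooled CPC representations per utterance) and a logistic-regression probe. High probe accuracy indicates the property is linearly decodable from the representations.

```python
# Sketch of the probing methodology: freeze the representation model, then
# train a light classifier to predict a property (here "language") from the
# frozen features.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Placeholder features standing in for frozen CPC representations.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 256))
labels = rng.integers(0, 2, size=1000)  # e.g., 0 = English, 1 = French

x_tr, x_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
print(f"Probe accuracy: {probe.score(x_te, y_te):.2f}")  # ~0.5 on random data
```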
- Reverse engineering language acquisition with child-centered long-form recordings. Marvin Lavechin, Maureen Seyssel, Lucas Gautheron, and 2 more authors. Annual Review of Linguistics, 2022.
Language use in everyday life can be studied using lightweight, wearable recorders that collect long-form recordings, that is, audio (including speech) over whole days. The hardware and software underlying this technique are increasingly accessible and inexpensive, and these data are revolutionizing the language acquisition field. We first place this technique into the broader context of the current ways of studying both the input being received by children and children’s own language production, laying out the main advantages and drawbacks of long-form recordings. We then go on to argue that a unique advantage of long-form recordings is that they can fuel realistic models of early language acquisition that use speech to represent children’s input and/or to establish production benchmarks. To enable the field to make the most of this unique empirical and conceptual contribution, we outline what this reverse engineering approach from long-form recordings entails, why it is useful, and how to evaluate success.
2021
- A thorough evaluation of the Language Environment Analysis (LENA) system. Alejandrina Cristia, Marvin Lavechin, Camila Scaff, and 5 more authors. Behavior Research Methods, 2021.
In the previous decade, dozens of studies involving thousands of children across several research disciplines have made use of a combined daylong audio-recorder and automated algorithmic analysis called the LENA® system, which aims to assess children’s language environment. While the system’s prevalence in the language acquisition domain is steadily growing, there are only scattered validation efforts, on only some of its key characteristics. Here, we assess the LENA® system’s accuracy across all of its key measures: speaker classification, Child Vocalization Counts (CVC), Conversational Turn Counts (CTC), and Adult Word Counts (AWC). Our assessment is based on manual annotation of clips that have been randomly or periodically sampled out of daylong recordings, collected from (a) populations similar to the system’s original training data (North American English-learning children aged 3-36 months), (b) children learning another dialect of English (UK), and (c) slightly older children growing up in a different linguistic and socio-cultural setting (Tsimane’ learners in rural Bolivia). We find reasonably high accuracy in some measures (AWC, CVC), with more problematic levels of performance in others (CTC, precision of male adults and other children). Statistical analyses do not support the view that performance is worse for children who are dissimilar from the LENA® original training set. Whether LENA® results are accurate enough for a given research, educational, or clinical application depends largely on the specifics at hand. We therefore conclude with a set of recommendations to help researchers make this determination for their goals.
- ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Okko Räsänen, Shreyas Seshadri, Marvin Lavechin, and 2 more authors. Behavior Research Methods, 2021.
Recordings captured by wearable microphones are a standard method for investigating young children’s language environments. A key measure to quantify from such data is the amount of speech present in children’s home environments. To this end, the LENA recorder and software—a popular system for measuring linguistic input—estimates the number of adult words that children may hear over the course of a recording. However, word count estimation is challenging to do in a language-independent manner; the relationship between observable acoustic patterns and language-specific lexical entities is far from uniform across human languages. In this paper, we ask whether some alternative linguistic units, namely phone(me)s or syllables, could be measured instead of, or in parallel with, words in order to achieve improved cross-linguistic applicability and comparability of an automated system for measuring child language input. We discuss the advantages and disadvantages of measuring different units from theoretical and technical points of view. We also investigate the practical applicability of measuring such units using a novel system called Automatic LInguistic unit Count Estimator (ALICE) together with audio from seven child-centered daylong audio corpora from diverse cultural and linguistic environments. We show that language-independent measurement of phoneme counts is somewhat more accurate than syllables or words, but all three are highly correlated with human annotations on the same data. We share an open-source implementation of ALICE for use by the language research community, enabling automatic phoneme, syllable, and word count estimation from child-centered audio recordings.
- ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track. Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristia, and 5 more authors. arXiv preprint arXiv:2107.06546, 2021.
We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.
2020
- An open-source voice type classifier for child-centered daylong recordings. Marvin Lavechin, Ruben Bousbib, Hervé Bredin, and 2 more authors. Interspeech, 2020.
Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring effects of remediation. In this paper, we present our approach to designing an open-source neural network to classify audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora which sum up to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
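A schematic PyTorch stand-in for this architecture is shown below, with a plain Conv1d replacing the SincNet frontend for brevity; layer sizes are illustrative and not the published configuration.

```python
# Schematic stand-in for the voice type classifier: a waveform frontend
# followed by recurrent layers and per-frame multi-label outputs. The real
# model uses SincNet filters; a plain Conv1d replaces them here for brevity.
import torch
import torch.nn as nn

class VoiceTypeClassifier(nn.Module):
    CLASSES = ["key_child", "other_child", "male_adult", "female_adult"]

    def __init__(self, hidden: int = 128):
        super().__init__()
        # Waveform -> frame-level features (stride sets the frame rate).
        self.frontend = nn.Sequential(
            nn.Conv1d(1, 80, kernel_size=251, stride=160), nn.ReLU(),
        )
        self.recurrent = nn.LSTM(80, hidden, num_layers=2,
                                 batch_first=True, bidirectional=True)
        # One sigmoid per class: labels may overlap (e.g., overlapping speech).
        self.head = nn.Linear(2 * hidden, len(self.CLASSES))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> scores: (batch, frames, 4)
        feats = self.frontend(waveform.unsqueeze(1)).transpose(1, 2)
        out, _ = self.recurrent(feats)
        return torch.sigmoid(self.head(out))

model = VoiceTypeClassifier()
scores = model(torch.randn(2, 16000))  # two 1-second clips at 16 kHz
print(scores.shape)  # torch.Size([2, frames, 4])
```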
- pyannote.audio: neural building blocks for speaker diarization. Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, and 7 more authors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020.
We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.
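A hedged usage sketch: the pyannote.audio API has evolved across versions, and the snippet below reflects the later 2.x-style pretrained-pipeline interface rather than the version described in this paper; the pretrained model may also require a Hugging Face access token.

```python
# Hedged usage sketch for pyannote.audio's pretrained diarization pipeline
# (2.x-style interface; details vary by version).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("meeting.wav")

# Iterate over speech turns with their speaker labels.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```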
- End-to-end domain-adversarial voice activity detection. Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, and 2 more authors. Interspeech, 2020.
Voice activity detection is the task of detecting speech regions in a given audio stream or recording. First, we design a neural network combining trainable filters and recurrent layers to tackle voice activity detection directly from the waveform. Experiments on the challenging DIHARD dataset show that the proposed end-to-end model reaches state-of-the-art performance and outperforms a variant where trainable filters are replaced by standard cepstral coefficients. Our second contribution aims at making the proposed voice activity detection model robust to domain mismatch. To that end, a domain classification branch is added to the network and trained in an adversarial manner. The same DIHARD dataset, drawn from 11 different domains, is used for evaluation under two scenarios. In the in-domain scenario, where the training and test sets cover the exact same domains, we show that the domain-adversarial approach does not degrade performance of the proposed end-to-end model. In the out-domain scenario, where the test domain is different from the training domains, it brings a relative improvement of more than 10%. Finally, our last contribution is the provision of a fully reproducible open-source pipeline that can be easily adapted to other datasets.
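Adversarial domain branches of this kind typically rest on a gradient reversal layer (GRL), in the formulation of Ganin et al.; the abstract does not spell out the exact construction used here, so the sketch below shows the standard one: the domain head is trained normally, but the gradient flowing back into the shared encoder is negated, pushing it toward domain-invariant features.

```python
# Sketch of the standard gradient reversal layer (GRL) used in
# domain-adversarial training: identity in the forward pass, negated
# (scaled) gradient in the backward pass.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha: float):
        ctx.alpha = alpha
        return x.view_as(x)  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None  # reversed gradient

def grad_reverse(x: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    return GradientReversal.apply(x, alpha)

# Usage: shared VAD features feed both heads; only the domain head's
# gradient is reversed before reaching the encoder.
features = torch.randn(8, 128, requires_grad=True)
domain_logits = torch.nn.Linear(128, 11)(grad_reverse(features))  # 11 domains
```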
- Speaker detection in the wild: Lessons learned from JSALT 2019. Paola García, Jesus Villalba, Hervé Bredin, and 8 more authors. Odyssey, 2020.
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker detection, but our first finding was that an effective diarization improves detection, and not having a diarization stage impoverishes the performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering, and the overall impact of previous stages in the final speaker detection. In this paper, we show partial results for speaker diarization to have a better understanding of the problem and we present the final results for speaker detection.
- Longform recordings: Opportunities and challenges. Lucas Gautheron, Marvin Lavechin, Rachid Riad, and 2 more authors. In LIFT 2020: 2èmes journées scientifiques du Groupement de Recherche "Linguistique informatique, formelle et de terrain", 2020.
Technological advances have enabled the development of lightweight, wearable recorders that collect audio (including speech) lasting up to a whole day. We provide a general description of the technique and lay out the advantages and drawbacks when using this methodology. Field linguists may gain a uniquely naturalistic viewpoint of language use as people go about their everyday activities. However, due to their duration, noisiness, and likelihood of containing sensitive information, long-form recordings remain difficult to annotate manually. Open-source tools improve reproducibility and ease-of-use for researchers, to which end speech technologists can contribute. Additionally, new approaches to human and automated annotation make the study of speech in long-form recordings increasingly feasible and promising.