- Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation. Marvin Lavechin, Marianne Métais, Hadrien Titeux, and 7 more authors. Submission to ICASSP 2023
Most automatic speech processing systems are sensitive to the acoustic environment, with degraded performance when applied to noisy or reverberant speech. But how can one tell whether speech is noisy or reverberant? We propose Brouhaha, a pipeline to simulate audio segments recorded in noisy and reverberant conditions. We then use the simulated audio to jointly train the Brouhaha model for voice activity detection, signal-to-noise ratio estimation, and C50 room acoustics prediction. We show how the predicted SNR and C50 values can be used to investigate and help diagnose errors made by automatic speech processing tools (such as pyannote.audio for speaker diarization or OpenAI’s Whisper for automatic speech recognition). Both our pipeline and a pretrained model are open source and shared with the speech community.
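The two acoustic quantities named in the abstract above have standard textbook definitions, which the following minimal sketch illustrates. It is not the Brouhaha pipeline itself: the function names `snr_db` and `c50_db` are hypothetical, and the sketch assumes that speech and noise are available as separate sample sequences and that the room impulse response starts at the arrival of the direct sound.

```python
import math


def snr_db(speech, noise):
    """Speech-to-noise ratio in dB: ratio of mean speech power to mean noise power."""
    p_speech = sum(x * x for x in speech) / len(speech)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(p_speech / p_noise)


def c50_db(impulse_response, sample_rate):
    """Clarity index C50 in dB: energy in the first 50 ms of the impulse
    response divided by the energy after 50 ms.

    Assumes the impulse response begins at the direct-path arrival.
    """
    split = int(round(0.050 * sample_rate))  # number of samples in 50 ms
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    return 10.0 * math.log10(early / late)


# Toy signals, invented purely for illustration.
toy_speech = [2.0] * 100
toy_noise = [1.0] * 100
toy_rir = [1.0] * 50 + [0.1] * 50  # 100 ms long at a 1 kHz sample rate

toy_snr = snr_db(toy_speech, toy_noise)
toy_c50 = c50_db(toy_rir, sample_rate=1000)
```

A higher C50 means that more of the energy arrives early, which corresponds to a less reverberant recording.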
- Statistical learning models of early phonetic acquisition struggle with child-centered audio data. Marvin Lavechin, Maureen Seyssel, Marianne Métais, and 5 more authors. Mar 2023
Infants learn their native language(s) at an amazing speed. Before they even talk, their perception adapts to the language(s) they hear. However, the mechanisms responsible for this perceptual attunement remain unclear. The currently dominant explanation for perceptual attunement posits that infants apply a statistical learning mechanism that consists of learning regularities from the speech stream they hear, and that may be found in learning across domains and species. Critically, the feasibility of the statistical learning hypothesis has only been demonstrated with computational models on unrealistic and simplified input. This paper presents the first attempt to study perceptual attunement with 2,000 hours of ecological child-centered recordings in American English and Metropolitan French. We show that, when applied to ecologically valid data, generic learning mechanisms develop a language-relevant perceptual space but fail to show evidence for perceptual attunement. It is only when supplemented with inductive biases, in the form of data filtering, sampling, and augmentation mechanisms, that computational models show a significant attunement to the language they have been exposed to. As inductive biases are necessary for our model to become attuned to its native language, we reflect on whether similar inductive biases may shape early phonetic learning in infants. More generally, we show that what our model learns, and how it develops through exposure to speech, depends exquisitely on details of the input signal. By doing so, we illustrate the importance of considering ecologically valid input data when modeling language acquisition.
- Reverse engineering language acquisition with child-centered long-form recordings. Marvin Lavechin, Maureen Seyssel, Lucas Gautheron, and 2 more authors. Annual Review of Linguistics, Mar 2022
Language use in everyday life can be studied using lightweight, wearable recorders that collect long-form recordings, that is, audio (including speech) over whole days. The hardware and software underlying this technique are increasingly accessible and inexpensive, and these data are revolutionizing the language acquisition field. We first place this technique into the broader context of the current ways of studying both the input being received by children and children’s own language production, laying out the main advantages and drawbacks of long-form recordings. We then go on to argue that a unique advantage of long-form recordings is that they can fuel realistic models of early language acquisition that use speech to represent children’s input and/or to establish production benchmarks. To enable the field to make the most of this unique empirical and conceptual contribution, we outline what this reverse engineering approach from long-form recordings entails, why it is useful, and how to evaluate success.
- Probing phoneme, language and speaker information in unsupervised speech representations. Maureen Seyssel, Marvin Lavechin, Yossi Adi, and 2 more authors. Interspeech, Mar 2022
Unsupervised models of representations based on Contrastive Predictive Coding (CPC) are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitative and quantitative tools, we find that both gender and phone class information are present in both types of models. Language information, however, is very salient in the bilingual model only, suggesting CPC models learn to discriminate languages when trained on multiple languages. Some language information can also be retrieved from monolingual models, but it is more diffused across all features. These patterns hold when analyses are carried out on the discrete units from a downstream clustering model. However, although there is no effect of the number of target clusters on phone class and language information, more gender information is encoded with more clusters. Finally, we find that there is some cost to being exposed to two languages on a downstream phoneme discrimination task.
- A thorough evaluation of the Language Environment Analysis (LENA) system. Alejandrina Cristia, Marvin Lavechin, Camila Scaff, and 5 more authors. Behavior Research Methods, Mar 2021
In the previous decade, dozens of studies involving thousands of children across several research disciplines have made use of a combined daylong audio-recorder and automated algorithmic analysis called the LENA® system, which aims to assess children’s language environment. While the system’s prevalence in the language acquisition domain is steadily growing, there are only scattered validation efforts, on only some of its key characteristics. Here, we assess the LENA® system’s accuracy across all of its key measures: speaker classification, Child Vocalization Counts (CVC), Conversational Turn Counts (CTC), and Adult Word Counts (AWC). Our assessment is based on manual annotation of clips that have been randomly or periodically sampled out of daylong recordings, collected from (a) populations similar to the system’s original training data (North American English-learning children aged 3-36 months), (b) children learning another dialect of English (UK), and (c) slightly older children growing up in a different linguistic and socio-cultural setting (Tsimane’ learners in rural Bolivia). We find reasonably high accuracy in some measures (AWC, CVC), with more problematic levels of performance in others (CTC, precision of male adults and other children). Statistical analyses do not support the view that performance is worse for children who are dissimilar from the LENA® original training set. Whether LENA® results are accurate enough for a given research, educational, or clinical application depends largely on the specifics at hand. We therefore conclude with a set of recommendations to help researchers make this determination for their goals.
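To make the speaker classification measures mentioned above concrete, here is a minimal sketch of per-class precision and recall computed from frame-level labels. It does not reproduce the paper's evaluation procedure: the function name, the label names (`FEM`, `MAL`, `CHI`), and the toy label sequences are all hypothetical.

```python
from collections import Counter


def precision_recall(reference, hypothesis, label):
    """Frame-level precision and recall for one speaker label.

    reference:  human-annotated label per frame
    hypothesis: automatically predicted label per frame
    """
    pairs = Counter(zip(reference, hypothesis))
    true_pos = pairs[(label, label)]
    predicted = sum(n for (_, hyp), n in pairs.items() if hyp == label)
    actual = sum(n for (ref, _), n in pairs.items() if ref == label)
    precision = true_pos / predicted if predicted else float("nan")
    recall = true_pos / actual if actual else float("nan")
    return precision, recall


# Invented frame labels, for illustration only.
reference = ["FEM", "FEM", "MAL", "CHI", "CHI", "MAL"]
hypothesis = ["FEM", "MAL", "MAL", "CHI", "FEM", "MAL"]

male_precision, male_recall = precision_recall(reference, hypothesis, "MAL")
```

In this toy case every male frame is found (recall of 1.0), but one of the three frames labelled male is actually female speech, so precision drops to 2/3. Low precision of this kind means that the predicted label cannot be trusted even when few true instances are missed.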
- ALICE: An open-source tool for automatic measurement of phoneme, syllable, and word counts from child-centered daylong recordings. Okko Räsänen, Shreyas Seshadri, Marvin Lavechin, and 2 more authors. Behavior Research Methods, Mar 2021
Recordings captured by wearable microphones are a standard method for investigating young children’s language environments. A key measure to quantify from such data is the amount of speech present in children’s home environments. To this end, the LENA recorder and software, a popular system for measuring linguistic input, estimates the number of adult words that children may hear over the course of a recording. However, word count estimation is challenging to do in a language-independent manner; the relationship between observable acoustic patterns and language-specific lexical entities is far from uniform across human languages. In this paper, we ask whether some alternative linguistic units, namely phone(me)s or syllables, could be measured instead of, or in parallel with, words in order to achieve improved cross-linguistic applicability and comparability of an automated system for measuring child language input. We discuss the advantages and disadvantages of measuring different units from theoretical and technical points of view. We also investigate the practical applicability of measuring such units using a novel system called Automatic LInguistic unit Count Estimator (ALICE) together with audio from seven child-centered daylong audio corpora from diverse cultural and linguistic environments. We show that language-independent measurement of phoneme counts is somewhat more accurate than syllables or words, but all three are highly correlated with human annotations on the same data. We share an open-source implementation of ALICE for use by the language research community, enabling automatic phoneme, syllable, and word count estimation from child-centered audio recordings.
- ZR-2021VG: Zero-Resource Speech Challenge, Visually-Grounded Language Modelling track. Afra Alishahi, Grzegorz Chrupała, Alejandrina Cristià, and 5 more authors. arXiv preprint arXiv:2107.06546, Mar 2021
We present the visually-grounded language modelling track that was introduced in the Zero-Resource Speech challenge, 2021 edition, 2nd round. We motivate the new track and discuss participation rules in detail. We also present the two baseline systems that were developed for this track.
- An open-source voice type classifier for child-centered daylong recordings. Marvin Lavechin, Ruben Bousbib, Hervé Bredin, and 2 more authors. Interspeech, Mar 2020
Spontaneous conversations in real-world settings such as those found in child-centered recordings have been shown to be amongst the most challenging audio files to process. Nevertheless, building speech processing models handling such a wide variety of conditions would be particularly useful for language acquisition studies in which researchers are interested in the quantity and quality of the speech that children hear and produce, as well as for early diagnosis and measuring effects of remediation. In this paper, we present our approach to designing an open-source neural network to classify audio segments into vocalizations produced by the child wearing the recording device, vocalizations produced by other children, adult male speech, and adult female speech. To this end, we gathered diverse child-centered corpora that sum to a total of 260 hours of recordings and cover 10 languages. Our model can be used as input for downstream tasks such as estimating the number of words produced by adult speakers, or the number of linguistic units produced by children. Our architecture combines SincNet filters with a stack of recurrent layers and outperforms by a large margin the state-of-the-art system, the Language ENvironment Analysis (LENA) system, which has been used in numerous child language studies.
- Pyannote.audio: neural building blocks for speaker diarization. Hervé Bredin, Ruiqing Yin, Juan Manuel Coria, and 7 more authors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar 2020
We introduce pyannote.audio, an open-source toolkit written in Python for speaker diarization. Based on the PyTorch machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines. pyannote.audio also comes with pre-trained models covering a wide range of domains for voice activity detection, speaker change detection, overlapped speech detection, and speaker embedding, reaching state-of-the-art performance for most of them.
- End-to-end domain-adversarial voice activity detection. Marvin Lavechin, Marie-Philippe Gill, Ruben Bousbib, and 2 more authors. Interspeech, Mar 2020
Voice activity detection is the task of detecting speech regions in a given audio stream or recording. First, we design a neural network combining trainable filters and recurrent layers to tackle voice activity detection directly from the waveform. Experiments on the challenging DIHARD dataset show that the proposed end-to-end model reaches state-of-the-art performance and outperforms a variant where trainable filters are replaced by standard cepstral coefficients. Our second contribution aims at making the proposed voice activity detection model robust to domain mismatch. To that end, a domain classification branch is added to the network and trained in an adversarial manner. The same DIHARD dataset, drawn from 11 different domains, is used for evaluation under two scenarios. In the in-domain scenario where the training and test sets cover the exact same domains, we show that the domain-adversarial approach does not degrade performance of the proposed end-to-end model. In the out-domain scenario where the test domain is different from training domains, it brings a relative improvement of more than 10%. Finally, our last contribution is the provision of a fully reproducible open-source pipeline that can be easily adapted to other datasets.
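One common way to train a domain classification branch adversarially is gradient reversal. The abstract above does not say which formulation is used, so the update rules below are an assumption, written with hypothetical symbols: $\theta_f$ for the shared feature extractor, $\theta_v$ for the voice activity branch, $\theta_d$ for the domain branch, $\lambda$ for the adversarial weight, and $\mu$ for the learning rate.

```latex
\begin{aligned}
\theta_v &\leftarrow \theta_v - \mu \,\frac{\partial \mathcal{L}_{\text{vad}}}{\partial \theta_v}, \\
\theta_d &\leftarrow \theta_d - \mu \,\frac{\partial \mathcal{L}_{\text{dom}}}{\partial \theta_d}, \\
\theta_f &\leftarrow \theta_f - \mu \left( \frac{\partial \mathcal{L}_{\text{vad}}}{\partial \theta_f} - \lambda \,\frac{\partial \mathcal{L}_{\text{dom}}}{\partial \theta_f} \right).
\end{aligned}
```

Under this scheme, the domain branch learns to predict the domain while the sign flip on its gradient pushes the shared features toward being uninformative about the domain, which is the property that would help when the test domain was unseen in training.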
- Speaker detection in the wild: Lessons learned from JSALT 2019. Paola García, Jesus Villalba, Hervé Bredin, and 8 more authors. Odyssey, Mar 2020
This paper presents the problems and solutions addressed at the JSALT workshop when using a single microphone for speaker detection in adverse scenarios. The main focus was to tackle a wide range of conditions that go from meetings to wild speech. We describe the research threads we explored and a set of modules that was successful for these scenarios. The ultimate goal was to explore speaker detection; but our first finding was that an effective diarization improves detection, and not having a diarization stage impoverishes the performance. All the different configurations of our research agree on this fact and follow a main backbone that includes diarization as a previous stage. With this backbone, we analyzed the following problems: voice activity detection, how to deal with noisy signals, domain mismatch, how to improve the clustering; and the overall impact of previous stages in the final speaker detection. In this paper, we show partial results for speaker diarization to have a better understanding of the problem and we present the final results for speaker detection.
- Longform recordings: Opportunities and challenges. Lucas Gautheron, Marvin Lavechin, Rachid Riad, and 2 more authors. In LIFT 2020 - 2èmes journées scientifiques du Groupement de Recherche "Linguistique informatique, formelle et de terrain", Mar 2020
Technological advances have allowed the development of lightweight, wearable recorders that collect audio (including speech) lasting up to a whole day. We provide a general description of the technique and lay out the advantages and drawbacks when using this methodology. Field linguists may gain a uniquely naturalistic viewpoint of language use as people go about their everyday activities. However, due to their duration, noisiness, and likelihood of containing sensitive information, long-form recordings remain difficult to annotate manually. Open-source tools improve reproducibility and ease-of-use for researchers, to which end speech technologists can contribute. Additionally, new approaches to human and automated annotation make the study of speech in longform recordings increasingly feasible and promising.