Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation (2023)
Training time
In this project, we contaminated read-speech audio samples (retrieved from LibriSpeech) with noise and reverberation.
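The noise level is measured with the signal-to-noise ratio (SNR), in dB: the lower the SNR, the noisier the resulting audio. Similarly, the reverberation level is measured with C50, the ratio in dB of the energy arriving in the first 50 ms of the room impulse response to the energy arriving afterwards: the lower the C50, the more reverberated the resulting audio.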
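To make the two measures concrete, here is a minimal sketch of contaminating a signal with reverberation and noise, and of computing SNR and C50. It is not Brouhaha's actual data pipeline: the helper names (`mix_at_snr`, `c50_db`, `reverberate`, `decaying_rir`), the synthetic sine "speech", the exponentially decaying impulse responses, and the order of operations (reverberation first, then noise) are all assumptions made for illustration.

```python
import math
import random


def power(x):
    """Mean squared amplitude of a signal given as a list of floats."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def c50_db(rir, sample_rate):
    """C50 in dB: energy in the first 50 ms of a room impulse response
    divided by the energy after 50 ms.
    Assumes the impulse response starts at the arrival of the direct sound."""
    split = int(round(0.050 * sample_rate))
    early = sum(v * v for v in rir[:split])
    late = sum(v * v for v in rir[split:])
    return 10 * math.log10(early / late)


def reverberate(speech, rir):
    """Direct-form convolution of a signal with a room impulse response."""
    out = [0.0] * (len(speech) + len(rir) - 1)
    for i, s in enumerate(speech):
        for j, h in enumerate(rir):
            out[i + j] += s * h
    return out


# --- Toy data (hypothetical, not taken from LibriSpeech or Brouhaha) ---
random.seed(0)
sr = 8000  # toy sample rate in Hz
speech = [math.sin(2 * math.pi * 220 * n / sr) for n in range(800)]  # stand-in for speech
noise = [random.gauss(0.0, 1.0) for _ in range(800)]


def decaying_rir(tau, seconds=0.3):
    """Synthetic impulse response whose amplitude decays with time constant tau (seconds)."""
    return [math.exp(-n / (tau * sr)) for n in range(int(round(seconds * sr)))]


dry_rir = decaying_rir(0.02)  # fast decay: little reverberation, high C50
wet_rir = decaying_rir(0.10)  # slow decay: more reverberation, low C50

reverberated = reverberate(speech, wet_rir)[: len(speech)]
noisy = mix_at_snr(reverberated, noise, snr_db=5.0)

# Check the SNR that was actually achieved by isolating the added noise.
residual = [y - x for y, x in zip(noisy, reverberated)]
measured_snr = 10 * math.log10(power(reverberated) / power(residual))

print("measured SNR (dB):", round(measured_snr, 2))
print("C50 dry (dB):", round(c50_db(dry_rir, sr), 1))
print("C50 wet (dB):", round(c50_db(wet_rir, sr), 1))
```

Lowering `snr_db` increases the noise gain, and a slower-decaying impulse response moves energy past the 50 ms boundary, which lowers C50.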
Here are some contaminated audio samples used to train Brouhaha:
| | Low SNR | High SNR |
|---|---|---|
| Low C50 | [audio sample] | [audio sample] |
| High C50 | [audio sample] | [audio sample] |
Inference time
I recorded myself with an Olympus VN-540PC in three locations: 1) my place; 2) in front of the beautiful church of Notre-Dame Dijon (a semi-enclosed space); 3) inside the same church. Here is what Brouhaha's C50 prediction looks like as a function of time:
Home: [plot: predicted C50 over time]
Church front: [plot: predicted C50 over time]
Church: [plot: predicted C50 over time]
The predicted C50 is around 57 dB in both the semi-enclosed space (Church front) and the closed space (Home). For the recording made inside the church, Brouhaha predicts a lower C50 of about 35 dB, which indicates more reverberation.
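The page does not say how a single figure per location was read from each time-varying prediction. One plausible way, sketched below, is to take the median C50 over the frames that the model classifies as speech. The function `summarize_c50`, the `(time, vad_probability, c50)` frame layout, the 0.5 threshold, and the toy numbers are all hypothetical and are not Brouhaha output.

```python
from statistics import median


def summarize_c50(frames, vad_threshold=0.5):
    """Median C50 (dB) over frames whose speech probability reaches the threshold.

    `frames` is an iterable of (time_s, vad_probability, c50_db) tuples.
    Returns None when no frame is classified as speech.
    """
    speech_c50 = [c50 for _, vad, c50 in frames if vad >= vad_threshold]
    if not speech_c50:
        return None
    return median(speech_c50)


# Invented frames for illustration only: silence at both ends, speech in between.
toy_frames = [
    (0.00, 0.10, 12.0),
    (0.02, 0.90, 30.0),
    (0.04, 0.95, 34.0),
    (0.06, 0.80, 32.0),
    (0.08, 0.20, 10.0),
]

summary = summarize_c50(toy_frames)
print(summary)
```

Ignoring non-speech frames keeps the summary from being skewed by silent stretches where the model has no speech to estimate the acoustics from, and the median is less sensitive than the mean to a few outlier frames.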