Brouhaha: multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation (2023)

Brouhaha predicts: 1) speech/non-speech segments; 2) the speech-to-noise ratio (SNR); and 3) the C50 room acoustics measure.

Training time

In this project, we contaminated read-speech audio samples (retrieved from LibriSpeech) with noise and reverberation.
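A minimal sketch of such a contamination step, assuming reverberation is simulated by convolving the clean speech with a room impulse response and that noise is added at a chosen level relative to the reverberated speech. The function names (`convolve`, `add_noise_at_snr`, `contaminate`) and the toy signals are hypothetical illustrations, not the actual Brouhaha data pipeline.

```python
import math
import random


def convolve(signal, impulse_response):
    """Full linear convolution of two sequences of floats."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def power(x):
    """Mean squared amplitude of a sequence."""
    return sum(v * v for v in x) / len(x)


def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def contaminate(speech, impulse_response, noise, snr_db):
    """Reverberate the speech, then add noise at the requested level (assumed order)."""
    reverberated = convolve(speech, impulse_response)[: len(speech)]
    return add_noise_at_snr(reverberated, noise, snr_db)


# Toy inputs: a sine wave stands in for speech, white noise for the noise source,
# and a randomly signed exponential decay for the room impulse response.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 8000) for t in range(800)]
noise = [rng.gauss(0.0, 1.0) for _ in range(800)]
rir = [math.exp(-t / 50.0) * rng.uniform(-1.0, 1.0) for t in range(200)]
rir[0] = 1.0  # direct sound

noisy = contaminate(speech, rir, noise, snr_db=5.0)
```

In this sketch the noise level is defined against the reverberated speech; whether the original training data used the clean or the reverberated signal as the reference is not stated here.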

The noise level is measured with the signal-to-noise ratio (SNR): the lower the SNR, the noisier the resulting audio. Similarly, the reverberation level is measured with C50: the lower the C50, the more reverberated the resulting audio.
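To make the C50 measure concrete, here is a small sketch that computes it from a room impulse response using the standard definition: the ratio, in dB, of the energy arriving in the first 50 ms to the energy arriving afterwards. The `synthetic_rir` helper and its decay constants are assumptions chosen for illustration; they are not measured rooms and not part of Brouhaha.

```python
import math


def c50_db(impulse_response, sample_rate):
    """Clarity index C50 in dB: early (first 50 ms) energy over late energy."""
    split = round(0.050 * sample_rate)
    early = sum(h * h for h in impulse_response[:split])
    late = sum(h * h for h in impulse_response[split:])
    return 10 * math.log10(early / late)


def synthetic_rir(decay_seconds, sample_rate, duration=1.0):
    """Toy exponentially decaying impulse response (illustrative only)."""
    n = int(duration * sample_rate)
    return [math.exp(-t / (decay_seconds * sample_rate)) for t in range(n)]


sample_rate = 16000
dry_room = synthetic_rir(0.02, sample_rate)     # energy dies out quickly
church_like = synthetic_rir(0.30, sample_rate)  # long reverberant tail

print(f"dry room:    C50 = {c50_db(dry_room, sample_rate):.1f} dB")
print(f"church-like: C50 = {c50_db(church_like, sample_rate):.1f} dB")
```

The slowly decaying response puts more of its energy after the 50 ms boundary, so its C50 is lower, which matches the "lower C50, more reverberation" reading above.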

Here are some contaminated audio samples used to train Brouhaha:

[Audio samples: a 2×2 grid with columns Low SNR and High SNR, and rows Low C50 and High C50]

Inference time

I recorded myself with an Olympus VN-540PC in three locations: 1) my place; 2) the front of the beautiful church of Notre-Dame Dijon (a semi-enclosed space); 3) inside the same church. Here is what Brouhaha's C50 prediction looks like as a function of time:

[Figures: predicted C50 as a function of time for the three recordings: Home, Church front, and Church]

The predicted C50 for both the semi-enclosed space (Church front) and the closed space (Home) is around 57 dB. Brouhaha predicts a lower C50 (more reverberation) for the audio sample recorded inside the church (35 dB).