ISCA Archive Interspeech 2020

Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task

Abinay Reddy Naini, Malla Satyapriya, Prasanta Kumar Ghosh

In this work, we propose a method, called whisper activity detection (WAD), to detect whispered speech regions in a noisy audio file. The lack of pitch and the noisy nature of whispered speech make WAD a far more challenging task than standard voice activity detection (VAD). We propose a long short-term memory (LSTM) based whisper activity detection algorithm, in which the LSTM network is trained as an attention pooling layer for a convolutional neural network (CNN) that is itself trained for a speaker identification task. WAD experiments with 186 speakers, eight noise types, and seven signal-to-noise ratio (SNR) conditions show that the proposed method performs better than the best baseline scheme in most conditions. In particular, for unknown noises and environmental conditions, the proposed WAD performs significantly better than the best baseline scheme. Another key advantage of the proposed WAD method is that it requires only a small part of the training data to be annotated with whispered speech regions in order to fine-tune the post-processing parameters, unlike the existing baseline schemes, which require the full training data to be annotated.
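To make the described architecture concrete, below is a minimal PyTorch sketch of a CNN with an LSTM-based attention pooling layer trained for speaker identification, where the per-frame attention weights can later be post-processed as whisper activity scores. This is not the authors' exact model; the layer sizes, the bidirectional LSTM, the log-mel input features, and all hyperparameters are assumptions for illustration only.

# Minimal sketch (not the authors' exact architecture): a CNN produces
# frame-level embeddings, an LSTM acts as an attention-pooling layer over
# those embeddings, and the pooled vector feeds a speaker-ID classifier.
# Layer sizes, feature dimensions, and the log-mel input are assumptions.
import torch
import torch.nn as nn

class CNNLSTMAttentionPooling(nn.Module):
    def __init__(self, n_mels=40, emb_dim=128, lstm_dim=64, n_speakers=186):
        super().__init__()
        # CNN over (batch, 1, n_mels, time): keeps the time axis, pools away frequency.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # -> (batch, 64, 1, time)
        )
        self.proj = nn.Linear(64, emb_dim)
        # LSTM-based attention pooling: one scalar score per frame.
        self.att_lstm = nn.LSTM(emb_dim, lstm_dim, batch_first=True, bidirectional=True)
        self.att_score = nn.Linear(2 * lstm_dim, 1)
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, x):
        # x: (batch, 1, n_mels, time) log-mel spectrogram (assumed input)
        feats = self.cnn(x).squeeze(2).transpose(1, 2)        # (batch, time, 64)
        feats = torch.relu(self.proj(feats))                  # (batch, time, emb_dim)
        att_hidden, _ = self.att_lstm(feats)                  # (batch, time, 2*lstm_dim)
        att = torch.softmax(self.att_score(att_hidden).squeeze(-1), dim=1)  # (batch, time)
        pooled = torch.sum(att.unsqueeze(-1) * feats, dim=1)  # (batch, emb_dim)
        logits = self.classifier(pooled)                      # speaker-ID logits
        # At test time, the per-frame attention weights `att` can be thresholded
        # (after post-processing) to mark whispered-speech regions.
        return logits, att

In this setup, only the speaker labels are needed to train the network end to end; a small annotated subset would then suffice to tune the threshold and smoothing used on the attention weights, consistent with the annotation advantage described in the abstract.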


doi: 10.21437/Interspeech.2020-3217

Cite as: Naini, A.R., Satyapriya, M., Ghosh, P.K. (2020) Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task. Proc. Interspeech 2020, 2922-2926, doi: 10.21437/Interspeech.2020-3217

@inproceedings{naini20_interspeech,
  author={Abinay Reddy Naini and Malla Satyapriya and Prasanta Kumar Ghosh},
  title={{Whisper Activity Detection Using CNN-LSTM Based Attention Pooling Network Trained for a Speaker Identification Task}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={2922--2926},
  doi={10.21437/Interspeech.2020-3217}
}