Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition

Do, Cong-Thanh; Stylianou, Yannis

doi:10.21437/Interspeech.2018-1721

This paper proposes a new method for weighting the two-dimensional (2D) time-frequency (T-F) representation of speech using auditory saliency for noise-robust automatic speech recognition (ASR). Auditory saliency is estimated via 2D auditory saliency maps, which model the mechanism for allocating human auditory attention. These maps are used to weight the T-F representation of speech, namely the 2D magnitude spectrum or spectrogram, prior to feature extraction for ASR. Experiments on the Aurora-4 corpus demonstrate the effectiveness of the proposed method for noise-robust ASR. In multi-stream ASR, relative word error rate (WER) reductions of up to 5.3% and 4.0% are observed when the multi-stream system using the proposed method is compared with the baseline single-stream system that uses no T-F representation weighting and with the single-stream system that uses a conventional spectral masking noise-robust technique, respectively. Combining the multi-stream system using the proposed method with the single-stream system using the conventional spectral masking technique further reduces the WER.
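As a rough illustration of the weighting step described in the abstract, the sketch below multiplies a magnitude spectrogram element-wise by a normalized 2D map before any feature extraction. It is not the authors' implementation: the abstract does not say how the saliency map is computed or applied, so the map here is a random placeholder, and the function names, the `floor` parameter, the 16 kHz sampling rate, and the 25 ms / 10 ms framing are all assumptions made for the example.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_len=400, hop=160, n_fft=512):
    """Magnitude STFT with a Hamming window; shape (n_frames, n_fft // 2 + 1).

    The defaults assume 16 kHz audio with 25 ms frames and a 10 ms hop
    (an assumption for this sketch, not a setting taken from the paper).
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def weight_spectrogram(spec, saliency, floor=0.1):
    """Weight a T-F representation element-wise by a 2D saliency map.

    `floor` is a hypothetical parameter that keeps low-saliency bins from
    being zeroed out completely; the paper may use a different scheme.
    """
    if spec.shape != saliency.shape:
        raise ValueError("spectrogram and saliency map must have the same shape")
    s_min, s_max = saliency.min(), saliency.max()
    # Min-max normalize the map to [0, 1), then rescale to [floor, 1).
    norm = (saliency - s_min) / (s_max - s_min + 1e-12)
    weights = floor + (1.0 - floor) * norm
    return spec * weights


rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)   # one second of noise standing in for speech
spec = magnitude_spectrogram(signal)
saliency = rng.random(spec.shape)     # placeholder, NOT a real auditory saliency map
weighted = weight_spectrogram(spec, saliency)
```

In a real pipeline the weighted spectrogram would replace the unweighted one as the input to feature extraction, and the placeholder map would be replaced by one produced by an auditory saliency model.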

Cite as: Do, C.-T., Stylianou, Y. (2018) Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition. Proc. Interspeech 2018, 1591-1595, doi: 10.21437/Interspeech.2018-1721

@inproceedings{do18_interspeech,
  author={Cong-Thanh Do and Yannis Stylianou},
  title={{Weighting Time-Frequency Representation of Speech Using Auditory Saliency for Automatic Speech Recognition}},
  year=2018,
  booktitle={Proc. Interspeech 2018},
  pages={1591--1595},
  doi={10.21437/Interspeech.2018-1721}
}