Human listeners can recognize target speech streams in complex auditory scenes. Cortical activity robustly tracks the amplitude fluctuations of target speech, modulated by auditory attention, across a range of signal-to-masker ratios (SMRs). The root-mean-square (RMS) level of the speech signal is a crucial acoustic cue for target speech perception. However, most studies have analyzed neural tracking using the intact speech temporal envelope, ignoring the decoding features characteristic of different RMS-level-specific speech segments. This study aimed to explore the contributions of high- and middle-RMS-level segments to target speech decoding in noisy conditions based on electroencephalogram (EEG) signals. The target stimulus was mixed with a competing speaker at five SMRs (i.e., 6, 3, 0, -3, and -6 dB), and the temporal response function (TRF) was then used to analyze the relationship between neural responses and the high- and middle-RMS-level segments. Experimental results showed that target and ignored speech streams elicited significantly different TRF responses when analyzed with either the high- or the middle-RMS-level segments. In addition, the high- and middle-RMS-level segments elicited TRF responses that differed in their morphological distributions. These results suggest that distinct models could be used for different RMS-level-specific speech segments to better decode target speech from the corresponding EEG signals.
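The stimulus preparation described above can be illustrated with a minimal Python sketch: scaling a competing speaker to reach a given SMR, then labeling short frames of the target by their RMS level relative to the overall RMS. The abstract does not give the segmentation parameters, so the 16 ms frame length and the thresholds used here (high: at or above the overall RMS; middle: between 0 and -10 dB below it) are assumptions, and the function names `mix_at_smr` and `rms_level_labels` are hypothetical.

```python
import numpy as np


def rms(x):
    """Root-mean-square level of a 1-D signal."""
    return np.sqrt(np.mean(np.square(x)))


def mix_at_smr(target, masker, smr_db):
    """Scale the masker so that 20*log10(rms(target)/rms(masker)) equals smr_db.

    Returns the mixture and the gain applied to the masker.
    """
    gain = rms(target) / (rms(masker) * 10 ** (smr_db / 20))
    return target + gain * masker, gain


def rms_level_labels(signal, fs, win_s=0.016, mid_floor_db=-10.0):
    """Label non-overlapping frames as 'high', 'middle', or 'low'.

    Assumed convention (not stated in the abstract): frame level is
    expressed in dB relative to the overall RMS of the signal;
    'high' is >= 0 dB, 'middle' is in [mid_floor_db, 0) dB, and
    everything below mid_floor_db is 'low'.
    """
    win = int(round(win_s * fs))
    n_frames = len(signal) // win
    frames = signal[: n_frames * win].reshape(n_frames, win)
    frame_rms = np.sqrt(np.mean(frames ** 2, axis=1))
    eps = np.finfo(float).tiny  # avoids log10(0) on silent frames
    rel_db = 20 * np.log10(np.maximum(frame_rms, eps) / rms(signal))
    labels = np.where(
        rel_db >= 0, "high", np.where(rel_db >= mid_floor_db, "middle", "low")
    )
    return labels, rel_db


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fs = 16000
    target = rng.standard_normal(fs)  # stand-in for target speech
    masker = rng.standard_normal(fs)  # stand-in for the competing speaker
    for smr in (6, 3, 0, -3, -6):     # the five SMRs named in the abstract
        mixture, _ = mix_at_smr(target, masker, smr)
    labels, _ = rms_level_labels(target, fs)
```

The frame labels could then be used to mask the target envelope so that a separate TRF is estimated for each RMS-level class; the TRF estimation itself is not shown here.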
Cite as: Wang, L., Wu, E.X., Chen, F. (2020) Contribution of RMS-Level-Based Speech Segments to Target Speech Decoding Under Noisy Conditions. Proc. Interspeech 2020, 121-124, doi: 10.21437/Interspeech.2020-1652
@inproceedings{wang20_interspeech,
  author={Lei Wang and Ed X. Wu and Fei Chen},
  title={{Contribution of RMS-Level-Based Speech Segments to Target Speech Decoding Under Noisy Conditions}},
  year=2020,
  booktitle={Proc. Interspeech 2020},
  pages={121--124},
  doi={10.21437/Interspeech.2020-1652}
}