Abstract:
Detecting vowels in a noisy speech signal is a challenging task. The problem is further aggravated when the noise exhibits speech-like characteristics, e.g., babble noise. In this work, a novel front-end feature extraction technique exploiting variational mode decomposition (VMD) is proposed to improve the detection of vowels in speech data degraded by speech-like noise. Each short-time analysis frame of speech is first decomposed into a set of variational mode functions (VMFs) using VMD. The logarithmic energy of each VMF is then used as a front-end feature for detecting vowels. A three-class classifier (vowel, non-vowel and silence) with acoustic modeling based on a long short-term memory (LSTM) architecture is developed on the TIMIT database using the proposed features as well as mel-frequency cepstral coefficients (MFCC). Using the three-class classifier, frame-level time-alignments for a given speech utterance are obtained to detect the vowel regions. The proposed features perform significantly better than the MFCC features under noisy test conditions. Further, the vowel regions detected using the proposed features are quite different from those obtained with the MFCC features. Exploiting these differences, the evidence from the two feature sets is combined to further improve the detection accuracy.
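The feature computation described in the abstract (one logarithmic energy per VMF, per short-time frame) can be illustrated with a minimal sketch. It is not the authors' implementation: the helper names `frame_signal` and `log_energy_features`, the `floor` constant, the use of the natural logarithm, and the frame length and hop are all assumptions made for illustration, since the abstract specifies none of them. The decomposition itself is left to whatever VMD implementation is available; the sketch only assumes it returns the K modes of a frame as K sample sequences.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a sample sequence into overlapping short-time frames.

    A trailing partial frame is dropped. frame_len and hop are in samples
    and are hypothetical parameters, not values taken from the paper.
    """
    return [samples[start:start + frame_len]
            for start in range(0, len(samples) - frame_len + 1, hop)]


def log_energy_features(modes, floor=1e-10):
    """Return one log-energy value per mode for a single frame.

    `modes` is a list of K sequences, each holding the samples of one
    mode (VMF) of the frame. `floor` is an assumed small constant that
    keeps the logarithm finite when a mode is all zeros.
    """
    return [math.log(sum(x * x for x in mode) + floor) for mode in modes]


# Usage: two toy "modes" of one frame give a 2-dimensional feature vector.
features = log_energy_features([[1.0, 1.0], [0.0, 0.0]])
```

With K modes per frame, each frame yields a K-dimensional feature vector, and the sequence of such vectors would be the input to the frame-level classifier.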
Published in: 2019 National Conference on Communications (NCC)
Date of Conference: 20-23 February 2019
Date Added to IEEE Xplore: 06 June 2019