Abstract
This article introduces a novel coordinated representation method, termed MFCC aided sparse representation (MSR), for speech recognition. The proposed MSR combines a top-level sparse representation feature with the conventional MFCC, i.e., a bottom-level feature of speech, so that complex information about various hidden attributes of the speech can be captured. A neural network architecture with an attention mechanism has also been designed to validate the effectiveness of the proposed MSR for speech recognition. Experiments on the TIMIT database show that the proposed MSR yields significant improvements in recognition accuracy compared with using the MFCC or the sparse representation alone.
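The abstract does not say how the two feature levels are combined, so the following is only a minimal sketch of one plausible reading: a sparse code is computed for each frame by greedy matching pursuit over a dictionary, and that code is concatenated with the frame's MFCC vector. The function names, the toy dictionary, the choice of matching pursuit, the decision to code the MFCC frame itself, and the concatenation step are all assumptions made for illustration, not the authors' method.

```python
import math


def normalize(atom):
    """Scale a dictionary atom to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in atom))
    return [a / norm for a in atom]


def matching_pursuit(signal, dictionary, n_nonzero):
    """Greedy sparse coding with at most n_nonzero selection steps.

    `dictionary` is a list of unit-norm atoms, each as long as `signal`.
    Returns one coefficient per atom; most coefficients stay at zero.
    """
    residual = list(signal)
    codes = [0.0] * len(dictionary)
    for _ in range(n_nonzero):
        # Correlate the current residual with every atom.
        scores = [sum(r * a for r, a in zip(residual, atom))
                  for atom in dictionary]
        best = max(range(len(dictionary)), key=lambda k: abs(scores[k]))
        if abs(scores[best]) < 1e-12:
            break  # nothing left to explain
        codes[best] += scores[best]
        residual = [r - scores[best] * a
                    for r, a in zip(residual, dictionary[best])]
    return codes


def coordinated_feature(mfcc_frame, dictionary, n_nonzero=2):
    """Hypothetical combination: MFCC vector followed by its sparse code."""
    sparse_codes = matching_pursuit(mfcc_frame, dictionary, n_nonzero)
    return list(mfcc_frame) + sparse_codes


# Toy example: a 3-dimensional "MFCC" frame and a 4-atom dictionary.
atoms = [normalize(a) for a in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0])]
frame = [3.0, 0.0, 4.0]
feature = coordinated_feature(frame, atoms, n_nonzero=2)
print(len(feature), feature)
```

In practice the dictionary would be learned from training data (for example with K-SVD, which the reference list cites) and the MFCC vectors would come from a real front end. The toy values here only show the shape of the combined feature: the frame dimension plus the number of dictionary atoms.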
This work was supported in part by the NSFC under Grant 61973088, and in part by the NSF of Guangdong under Grant 2019A1515011371.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhou, L., Zhang, J. (2021). From Bottom to Top: A Coordinated Feature Representation Method for Speech Recognition. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_33
DOI: https://doi.org/10.1007/978-3-030-68780-9_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9
eBook Packages: Computer Science, Computer Science (R0)