From Bottom to Top: A Coordinated Feature Representation Method for Speech Recognition

Zhou, Lixia; Zhang, Jun

doi:10.1007/978-3-030-68780-9_33

Conference paper
First Online: 25 February 2021

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12666)

Abstract

This paper introduces a novel coordinated representation method, termed MFCC-aided sparse representation (MSR), for speech recognition. MSR combines a top-level sparse representation feature with the conventional MFCC, a bottom-level speech feature, so that complex information about the various hidden attributes of the speech can be captured. A neural network architecture with an attention mechanism is also designed to validate the effectiveness of MSR for speech recognition. Experiments on the TIMIT database show that MSR yields significant improvements in recognition accuracy compared with using the MFCC or the sparse representation alone.
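
The idea of pairing a bottom-level MFCC vector with a top-level sparse code can be illustrated with a minimal sketch. The abstract does not say how the two features are fused or how the dictionary is obtained, so everything below is an assumption made for illustration: the sparse code is computed from the MFCC frame itself by a simple greedy orthogonal matching pursuit, the dictionary `D` is random rather than learned, the fusion is plain concatenation, and the names `omp` and `coordinated_features` are hypothetical. The attention-based network is not shown.

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit: approximate x with at most k atoms of D."""
    residual = x.copy()
    support = []
    coef = np.zeros(D.shape[1])
    sol = np.zeros(0)
    for _ in range(k):
        # Pick the atom most correlated with the current residual.
        idx = int(np.argmax(np.abs(D.T @ residual)))
        if idx in support:
            break
        support.append(idx)
        # Least-squares refit on all atoms selected so far.
        sol, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ sol
    if support:
        coef[support] = sol
    return coef

def coordinated_features(mfcc, D, k=3):
    """Concatenate each MFCC frame with its sparse code (assumed fusion rule).

    mfcc: (n_frames, n_mfcc) array; D: (n_mfcc, n_atoms) with unit-norm columns.
    """
    codes = np.stack([omp(D, frame, k) for frame in mfcc])
    return np.concatenate([mfcc, codes], axis=1)

rng = np.random.default_rng(0)
mfcc = rng.standard_normal((5, 13))   # placeholder standing in for real MFCC frames
D = rng.standard_normal((13, 40))     # random overcomplete dictionary (assumption)
D /= np.linalg.norm(D, axis=0)        # normalise atoms to unit length
feats = coordinated_features(mfcc, D, k=3)
print(feats.shape)  # (5, 53)
```

Each output row holds the 13 original coefficients followed by a 40-dimensional code with at most three non-zero entries. In practice the dictionary would be learned from training data and the placeholder frames replaced by MFCCs extracted from speech.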

This work was supported in part by the NSFC under Grant 61973088, and in part by the NSF of Guangdong under Grant 2019A1515011371.

Author information

Authors and Affiliations

School of Information Engineering, Guangdong University of Technology, Guangzhou, 510006, Guangdong, China
Lixia Zhou & Jun Zhang

Corresponding author

Correspondence to Jun Zhang.

Editor information

Editors and Affiliations

Dipartimento di Ingegneria dell’Informazione, University of Firenze, Firenze, Italy
Alberto Del Bimbo
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Rita Cucchiara
Department of Computer Science, Boston University, Boston, MA, USA
Stan Sclaroff
Dipartimento di Matematica e Informatica, University of Catania, Catania, Italy
Giovanni Maria Farinella
Cloud & AI, JD.COM, Beijing, China
Tao Mei
Dipartimento di Ingegneria dell’Informazione, University of Firenze, Firenze, Italy
Marco Bertini
Computational Sciences Department, National Institute of Astrophysics, Optics and Electronics (INAOE), Tonantzintla, Puebla, Mexico
Hugo Jair Escalante
Dipartimento di Ingegneria “Enzo Ferrari”, Università di Modena e Reggio Emilia, Modena, Italy
Roberto Vezzani

About this paper

Cite this paper

Zhou, L., Zhang, J. (2021). From Bottom to Top: A Coordinated Feature Representation Method for Speech Recognition. In: Del Bimbo, A., et al. Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12666. Springer, Cham. https://doi.org/10.1007/978-3-030-68780-9_33

DOI: https://doi.org/10.1007/978-3-030-68780-9_33
Published: 25 February 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-68779-3
Online ISBN: 978-3-030-68780-9
eBook Packages: Computer Science, Computer Science (R0)
