Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers

Abstract

In the context of automatic speech recognition (ASR), the power spectrum is generally warped to the Mel scale during front-end speech parameterization. This is motivated by the fact that human perception of sound is nonlinear. The Mel filterbank provides better resolution for low-frequency content, while a greater degree of averaging is applied in the high-frequency range. The work presented in this paper studies the role of linear, Mel and inverse-Mel filterbanks in the context of ASR. When speech data come from high-pitched speakers such as children, a significant amount of relevant information lies in the high-frequency region. Hence, down-sampling the spectral information in that range through the Mel filterbank reduces recognition performance, whereas employing an inverse-Mel or linear filterbank is expected to be more effective in such cases. This is experimentally validated in this work. For that purpose, an ASR system is developed on adults’ speech and tested using data from adult as well as child speakers. Significantly improved recognition rates are noted for children’s as well as adult females’ speech when a linear or inverse-Mel filterbank is used. The use of linear filters results in a relative improvement of \(21\%\) over the baseline. To further boost the performance, vocal-tract length normalization, explicit pitch scaling and pitch-adaptive spectral estimation are also explored on top of the linear filterbank.
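As a rough illustration of the three warpings discussed above, the sketch below (not taken from the paper) compares how filter centre frequencies would be laid out on a 0–8 kHz band. The Mel mapping is the standard \(2595\log_{10}(1+f/700)\) formula; the inverse-Mel construction shown here is an assumed flipped variant obtained by mirroring the Mel-spaced centres across the band, which concentrates resolution at high frequencies.

```python
# Minimal sketch, assuming a flipped-Mel construction for the inverse-Mel bank.
import numpy as np

def hz_to_mel(f):
    """Standard Mel-scale warping used in MFCC front-ends."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of the Mel warping."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def center_frequencies(kind, n_filters=26, f_low=0.0, f_high=8000.0):
    """Return filter centre frequencies (Hz) for the chosen warping.

    'linear'      : equally spaced in Hz.
    'mel'         : equally spaced on the Mel scale (dense at low frequencies).
    'inverse_mel' : Mel spacing mirrored across the band, giving dense
                    resolution at high frequencies (assumed construction).
    """
    if kind == 'linear':
        return np.linspace(f_low, f_high, n_filters + 2)[1:-1]
    mel_pts = np.linspace(hz_to_mel(f_low), hz_to_mel(f_high), n_filters + 2)
    centers = mel_to_hz(mel_pts)[1:-1]
    if kind == 'mel':
        return centers
    if kind == 'inverse_mel':
        # Mirror each centre about the band so spacing narrows toward f_high.
        return np.sort(f_low + f_high - centers)
    raise ValueError(kind)

if __name__ == '__main__':
    for kind in ('linear', 'mel', 'inverse_mel'):
        c = center_frequencies(kind, n_filters=10)
        print(kind, np.round(c).astype(int))
```

Printing the centre frequencies makes the contrast visible: the Mel bank crowds its filters below roughly 2 kHz, the linear bank spaces them uniformly, and the mirrored bank crowds them toward 8 kHz, where high-pitched speakers carry comparatively more relevant information.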



Author information


Corresponding author

Correspondence to S. Shahnawazuddin.



About this article


Cite this article

Kathania, H.K., Shahnawazuddin, S., Ahmad, W. et al. Role of Linear, Mel and Inverse-Mel Filterbanks in Automatic Recognition of Speech from High-Pitched Speakers. Circuits Syst Signal Process 38, 4667–4682 (2019). https://doi.org/10.1007/s00034-019-01072-7

