Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates

Joshi, Vikas; Prasad, N. Vishnu; Umesh, S.

doi:10.1007/s00034-015-0129-y

Published: 06 August 2015

Volume 35, pages 1593–1609, (2016)
Circuits, Systems, and Signal Processing

Abstract

Cepstral mean and variance normalization (CMVN) is an efficient noise compensation technique widely used in speech applications. CMVN reduces the mismatch between training and test utterances by transforming them to zero mean and unit variance. In this work, we argue that some useful information is lost during normalization, because every utterance is forced to have the same first- and second-order statistics, i.e., zero mean and unit variance. We propose to modify the CMVN methodology so that it retains this information and still compensates for noise. The proposed normalization transforms every test utterance to an utterance-specific clean mean (i.e., the utterance mean if noise were absent) and clean variance, instead of zero mean and unit variance. We derive expressions to estimate the clean mean and variance from a noisy utterance. The proposed normalization is effective in recognizing voice commands, which are typically short (single words or short phrases) and for which more advanced methods, such as histogram equalization (HEQ), are not effective. Recognition results show a relative improvement (RI) of 21% in word error rate over conventional CMVN on the Aurora-2 database, and RIs of 20% and 11% over CMVN and HEQ, respectively, on short utterances of the Aurora-2 database.
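
The contrast drawn in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `cmvn` and `normalize_to_target` are hypothetical, and the target statistics `target_mean` and `target_std` are simply passed in as arguments. The paper's expressions for estimating the clean mean and variance from a noisy utterance are not reproduced on this page, so that estimation step is deliberately left out.

```python
import numpy as np


def cmvn(features, eps=1e-8):
    """Conventional per-utterance CMVN.

    features: array of shape (num_frames, num_coeffs).
    Returns features with zero mean and (approximately) unit variance
    in every cepstral dimension.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)


def normalize_to_target(features, target_mean, target_std, eps=1e-8):
    """Shift and scale an utterance to given target statistics.

    ASSUMPTION: target_mean and target_std stand in for the
    utterance-specific clean estimates described in the abstract.
    How to estimate them from a noisy utterance is not shown here.
    """
    return cmvn(features, eps) * target_std + target_mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for 50 frames of 13 cepstral coefficients.
    noisy = rng.normal(loc=3.0, scale=2.0, size=(50, 13))
    normalized = cmvn(noisy)
    print(normalized.mean(axis=0).round(6))
    print(normalized.std(axis=0).round(6))
```

Setting `target_mean` to zero and `target_std` to one reduces `normalize_to_target` to conventional CMVN, which makes the relationship between the two schemes explicit: the proposed method differs only in the statistics each utterance is mapped to.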

Author information

Vikas Joshi
Present address: IBM India Research Labs, Bangalore, India
N. Vishnu Prasad
Present address: Soliton Technologies, Coimbatore, Tamil Nadu, India

Authors and Affiliations

Department of Electrical Engineering, IIT Madras, Chennai, Tamil Nadu, India
Vikas Joshi, N. Vishnu Prasad & S. Umesh

Corresponding author

Correspondence to Vikas Joshi.

About this article

Cite this article

Joshi, V., Prasad, N.V. & Umesh, S. Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates. Circuits Syst Signal Process 35, 1593–1609 (2016). https://doi.org/10.1007/s00034-015-0129-y

Received: 05 October 2014
Revised: 13 July 2015
Accepted: 14 July 2015
Published: 06 August 2015
Issue Date: May 2016
DOI: https://doi.org/10.1007/s00034-015-0129-y
