Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates

Circuits, Systems, and Signal Processing

Abstract

Cepstral mean and variance normalization (CMVN) is an efficient noise compensation technique widely used in speech applications. CMVN reduces the mismatch between training and test utterances by transforming them to zero mean and unit variance. In this work, we argue that some useful information is lost during normalization, because every utterance is forced to have the same first- and second-order statistics, i.e., zero mean and unit variance. We propose to modify the CMVN methodology so that it retains this information and still compensates for noise. The proposed normalization approach transforms every test utterance to an utterance-specific clean mean (i.e., the utterance mean if the noise were absent) and clean variance, instead of zero mean and unit variance. We derive expressions to estimate the clean mean and variance from a noisy utterance. The proposed normalization is effective in recognizing voice commands, which are typically short (single words or short phrases) and for which more advanced methods, such as histogram equalization (HEQ), are not effective. Recognition results show a relative improvement (RI) of \(21\,\%\) in word error rate over conventional CMVN on the Aurora-2 database, and an RI of \(20\,\%\) and \(11\,\%\) over CMVN and HEQ, respectively, on short utterances of the Aurora-2 database.
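
The difference between the two normalization targets described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `cmvn` and `normalize_to_target` are hypothetical, and the target mean and standard deviation are simply passed in as arguments, because the expressions the paper derives for estimating the clean statistics from a noisy utterance are not given in the abstract.

```python
from statistics import fmean, pstdev


def cmvn(frames, eps=1e-10):
    """Conventional CMVN: scale each feature dimension of one utterance
    to zero mean and unit variance.

    frames: list of feature vectors (one list of floats per frame).
    """
    dims = list(zip(*frames))                     # per-dimension sequences
    means = [fmean(d) for d in dims]
    stds = [max(pstdev(d), eps) for d in dims]    # guard against zero variance
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


def normalize_to_target(frames, target_mean, target_std, eps=1e-10):
    """Map an utterance to given per-dimension statistics instead of (0, 1).

    target_mean / target_std are assumed inputs here; in the paper they
    would be the estimated clean mean and clean standard deviation.
    """
    standardized = cmvn(frames, eps)
    return [[z * s + m for z, m, s in zip(frame, target_mean, target_std)]
            for frame in standardized]


# Toy utterance: three frames, two cepstral dimensions.
utterance = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
zero_mean_unit_var = cmvn(utterance)
utterance_specific = normalize_to_target(utterance, [2.0, -1.0], [0.5, 2.0])
```

With target statistics of zero and one, `normalize_to_target` reduces to conventional CMVN, so the proposed idea can be read as a change of target rather than a change of the normalization step itself.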


Author information

Correspondence to Vikas Joshi.

About this article

Cite this article

Joshi, V., Prasad, N.V. & Umesh, S. Modified Mean and Variance Normalization: Transforming to Utterance-Specific Estimates. Circuits Syst Signal Process 35, 1593–1609 (2016). https://doi.org/10.1007/s00034-015-0129-y

  • DOI: https://doi.org/10.1007/s00034-015-0129-y
