Abstract
Quality estimation of speech is essential for monitoring and maintenance of the quality of service at different nodes of modern telecommunication networks. It is also required in the selection of codecs in speech communication systems. There is no requirement of the original clean speech signal as a reference in non-intrusive speech quality evaluation, and thus it is of importance in evaluating the quality of speech at any node of the communication network. In this paper, non-intrusive speech quality assessment of narrowband speech is done by Gaussian Mixture Model (GMM) training using several combinations of auditory perception and speech production features, which include principal components of Lyon’s auditory model features, MFCC, LSF and their first and second differences. Results are obtained and compared for several combinations of auditory features for three sets of databases. The results are also compared with ITU-T Recommendation P.563 for non-intrusive speech quality assessment. It is found that many combinations of these feature sets outperform the ITU-T P.563 Recommendation under the test conditions.
Similar content being viewed by others
References
ITU-T Recommendation P.800 (1996). Methods for subjective determination of transmission quality. International Telecommunication Union-Telecommunication Standardization Sector, Geneva.
Fegyo, T., Szarvas, M., Tatai, P., & Gordos, G. (2000). Objective speech quality estimation for analog mobile channels: problems and solutions. International Journal of Speech Technology, 3, 277–287.
ITU-T Recommendation P.862 (1996). Perceptual evaluation of speech quality (PESQ), an objective method for end-to-end speech quality assessment of narrow-band telephone networks. International Telecommunication Union-Telecommunication Standardization Sector, Geneva.
ITU-T Recommendation P.861 (1996). Objective quality measurement of telephone band (300–3400 Hz) speech codecs. International Telecommunication Union-Telecommunication Standardization Sector, Geneva.
ITU-T Recommendation P.563 (2004). Single ended method for objective speech quality assessment in narrow-band telephony applications. International Telecommunication Union- Telecommunication Standardization Sector, Geneva.
Liang, J., & Kubichek, R. (1994). Output based objective speech quality. In IEEE 44th vehicular technology conf. (Vol. 3(8–10), pp. 1719–1723).
Au, O. C., & Lam, K. (1998). A novel output based objective speech quality measure for wireless communication. In Proc. 4th international conf. on signal processing (Vol. 1, pp. 666–669).
Gray, P., Hollier, M., & Massara, R. (2000). Non-intrusive speech quality assessment using vocal-tract models. IEE Proceedings. Vision, Image and Signal Processing, 147(6), 493–501.
Malfait, L., Berger, J., & Kastner, M. (2006). P.563-The ITU-T standard for single-ended speech quality assessment. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1924–1934.
Kim, D. S. (2005). ANIQUE: an auditory model for single ended speech quality estimation. IEEE Transactions on Audio, Speech, and Language Processing, 13(5), 821–831.
Grancharov, V., Zhao, D. Y., Lindblom, J., & Kleijn, W. B. (2006). Low-complexity, non-intrusive speech quality assessment. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1948–1956.
Falk, T., Xu, Q., & Chan, W. Y. (2005). Non-intrusive GMM based speech quality measurement. In Proc. IEEE international conf. on acoustics, speech and signal processing (Vol. 1, pp. 125–128).
Chen, G., & Parsa, V. (2006). Bayesian model based non-intrusive speech quality evaluation. In Proc. IEEE international conf. on acoustics, speech and signal processing (Vol. 1, pp. 385–388).
Falk, T., & Chan, W. Y. (2006a). Enhanced non-intrusive speech quality measurement using degradation models. In Proc. IEEE international conf. on acoustics, speech and signal processing (Vol. 1, pp. 837–840).
Falk, T., & Chan, W. Y. (2006b). Single-ended speech quality measurement using machine learning methods. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1935–1947.
Chen, G., & Parsa, V. (2005). Non-intrusive speech quality evaluation using an adaptive neuro-fuzzy inference system. IEEE Signal Processing Letters, 12(5), 403–406.
Slaney, M. (1988). Lyon’s cochlear model. Advanced technology group, apple technical report no. 13, Apple Computer Inc.
Lyon, R. F. (1982). A computational model of filtering, detection, and compression in the cochlea. In Proc. IEEE international conf. on acoustics, speech and signal processing (pp. 1282–1285).
Jing, Z., & Johnson, M. H. (2002). Auditory modeling inspired methods of feature extraction for robust automatic speech recognition. In Proc. IEEE international conf. on acoustics, speech and signal processing (Vol. 4, pp. 4176–4179).
ITU-T Recommendation P. Supplement-23 (1998). ITU-T Coded-Speech database (1998). International telecommunication union-telecommunication standardization sector, Geneva.
Hu, Y., & Loizou, P. C. (2008). Evaluation of objective quality measures for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 229–238.
Hu, Y., & Loizou, P. C. (2007). Subjective comparison and evaluation of speech enhancement algorithms. Speech Communications, 49, 588–601.
Narwaria, M., Lin, W., McLoughlin, I. V., Emmanuel, S., & Tien, C. L. (2010). Non-intrusive speech quality assessment with support vector regression. In Advances in multimedia modeling, 16th International multimedia modeling conf. (Vol. 5916, pp. 325–335). Berlin: Springer.
Hasan, M. R., Jamil, M., Rabbani, M. G., & Rahman, M. S. (2004). Speaker identification using mel frequency cepstral coefficients. In 3rd international conf. on electrical & computer engineering (pp. 565–568). Dhaka: ICECE.
Bozkurt, E., Erzin, E., Erdem, C. E., & Erdem, A. T. (2010). Use of line spectral frequencies for emotion recognition from speech. In IEEE international conf. on pattern recognition, Turkey (pp. 3708–3711).
Hermansky, H. (1990). Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America, 87, 1738–1752.
Audhkhasi, K., & Kumar, A. (2010). Two scale auditory features based non-intrusive speech quality evaluation. IETE Journal of Research, 56(2), 111–118.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B. Methodological, 39(1), 1–38.
Karmakar, A., Kumar, A., & Patney, R. K. (2006). A multiresolution model of auditory excitation pattern and its application to objective evaluation of perceived speech quality. IEEE Transactions on Audio, Speech, and Language Processing, 14(6), 1912–1923.
Rabiner, L. R., & Schafer, R. W. (1978). Digital processing of speech signals. Englewood: Prentice-Hall.
Acknowledgements
The authors would like to thank Mr. Yi Hu and Dr. Philipos C. Loizou of Department of Electrical Engineering, The University of Texas at Dallas, Richardson, TX75083-0688, USA for providing the NOIZEUS-2240 database of 2240 speech utterances passed through different speech processing algorithms at different noisy conditions to create the different degradations. The authors would also like to thank Prof. S.C. Saxena, Vice Chancellor (Acting), Jaypee Institute of Information Technology, Noida, India for providing the suitable environment to the corresponding author to complete the work.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Dubey, R.K., Kumar, A. Non-intrusive speech quality assessment using several combinations of auditory features. Int J Speech Technol 16, 89–101 (2013). https://doi.org/10.1007/s10772-012-9162-4
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10772-012-9162-4