Abstract
A novel approach based on robust regression fusion of normalized scores, termed Normalized Scores following Robust Regression Fusion (NSRRF), is proposed to enhance speaker recognition over IP networks. It can be used in both Network Speaker Recognition (NSR) and Distributed Speaker Recognition (DSR) systems. In this framework, speech is assumed to be encoded with the G729 coder on the client side and then transmitted to the server side, where the automatic speaker recognition (ASR) systems are located. Speaker recognition relies on the Gaussian Mixture Model-Universal Background Model (GMM-UBM) and the Gaussian supervector support vector machine (GMM-SVM), both with normalized scores. The feature vectors consist of Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC), both derived from the Line Spectral Pairs (LSP) extracted from the G729 bit-stream over IP. Experiments conducted with the LIA SpkDet system, based on the ALIZE platform, on the ARADIGITS database show first that features extracted directly from the G729 bit-stream significantly reduce the error rate and outperform the baseline ASR-over-IP system, which operates on the resynthesized (reconstructed) speech obtained from the G729 decoder. In addition, the results show that the proposed approach, based on score normalization followed by robust regression fusion, achieves the best result and outperforms conventional ASR over IP networks.
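The fusion idea described above can be illustrated with a minimal sketch. The abstract does not state which score normalization or which robust loss is used, so this example assumes z-normalization against a set of impostor scores and a Huber loss fitted by iteratively reweighted least squares. All function names, the tuning constant `delta`, and the synthetic scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def z_normalize(scores, ref_scores):
    """Shift and scale scores by the mean/std of a reference score set (assumed: impostor scores)."""
    mu = np.mean(ref_scores)
    sigma = np.std(ref_scores)
    return (np.asarray(scores, dtype=float) - mu) / sigma

def huber_irls(X, y, delta=1.345, n_iter=100, tol=1e-8):
    """Fit y ~ X @ w under a Huber loss using iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.linalg.lstsq(X, y, rcond=None)[0]  # ordinary least squares start
    for _ in range(n_iter):
        r = y - X @ w
        # Robust residual scale from the median absolute deviation.
        scale = max(np.median(np.abs(r - np.median(r))) / 0.6745, 1e-12)
        u = np.abs(r) / scale
        # Huber weights: 1 inside the threshold, delta/u outside it.
        weights = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(weights)
        w_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w

def design(s_ubm, s_svm):
    """Stack an intercept column with the two normalized system scores."""
    s_ubm = np.asarray(s_ubm, dtype=float)
    s_svm = np.asarray(s_svm, dtype=float)
    return np.column_stack([np.ones(len(s_ubm)), s_ubm, s_svm])

def fuse(s_ubm, s_svm, w):
    """Fused score as a linear combination of the two normalized scores."""
    return design(s_ubm, s_svm) @ w

# Synthetic stand-ins for GMM-UBM and GMM-SVM trial scores (not real data).
rng = np.random.default_rng(0)
n = 200
tgt_ubm, imp_ubm = rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n)
tgt_svm, imp_svm = rng.normal(1.5, 1.0, n), rng.normal(0.0, 1.0, n)

# Normalize each system's scores with its own impostor statistics.
ubm = z_normalize(np.concatenate([tgt_ubm, imp_ubm]), imp_ubm)
svm = z_normalize(np.concatenate([tgt_svm, imp_svm]), imp_svm)
labels = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = target, 0 = impostor

w_fusion = huber_irls(design(ubm, svm), labels)
fused = fuse(ubm, svm, w_fusion)
print("fusion weights (intercept, GMM-UBM, GMM-SVM):", w_fusion)
print("mean fused score, target vs impostor:", fused[:n].mean(), fused[n:].mean())
```

In this sketch the regression target is the trial label, so the fused score can be thresholded like any single-system score. The Huber weights reduce the influence of trials whose scores are far from the fitted plane, which is the general motivation for a robust fit over ordinary least squares.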
Cite this article
Yessad, D., Amrouche, A. Robust regression fusion of GMM-UBM and GMM-SVM normalized scores using G729 bit-stream for speaker recognition over IP. Int J Speech Technol 17, 43–51 (2014). https://doi.org/10.1007/s10772-013-9204-6