Robust regression fusion of GMM-UBM and GMM-SVM normalized scores using G729 bit-stream for speaker recognition over IP

International Journal of Speech Technology

Abstract

A novel approach based on robust regression fusion of normalized scores, termed Normalized Scores following Robust Regression Fusion (NSRRF), is proposed to enhance speaker recognition over IP networks. It can be used in both Network Speaker Recognition (NSR) and Distributed Speaker Recognition (DSR) systems. In this framework, speech is assumed to be encoded by the G729 coder on the client side and then transmitted to the server side, where the automatic speaker recognition (ASR) systems are located. The Gaussian Mixture Model with Universal Background Model (GMM-UBM) and the Gaussian supervector support vector machine (GMM-SVM), both with normalized scores, are used for speaker recognition. The feature vectors consist of Mel Frequency Cepstral Coefficients (MFCC) and Linear Prediction Cepstral Coefficients (LPCC), both derived from the Line Spectral Pairs (LSP) extracted from the G729 bit-stream over IP. Experiments conducted with the LIA SpkDet system, based on the ALIZE platform, on the ARADIGITS database show first that the proposed method, using features extracted directly from the G729 bit-stream, significantly reduces the error rate and outperforms the baseline ASR-over-IP system, which operates on the resynthesized (reconstructed) speech obtained from the G729 decoder. In addition, the results show that the proposed approach, based on score normalization followed by robust regression fusion, achieves the best result and outperforms conventional ASR over IP networks.
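The idea of normalizing two subsystem scores and then fusing them with a robust regression can be illustrated with a minimal sketch. The abstract does not specify which normalization or which robust estimator the authors use, so everything below is an assumption made for illustration: Z-norm (shift and scale by impostor-score statistics) stands in for the normalization, and a Huber M-estimator fitted by iteratively reweighted least squares stands in for the robust regression. The function names `z_norm`, `huber_irls`, and `fuse` are hypothetical and are not part of ALIZE or LIA SpkDet.

```python
import numpy as np


def z_norm(scores, impostor_scores):
    """Z-norm (assumed here): shift and scale by impostor-score statistics."""
    mu = np.mean(impostor_scores)
    sigma = np.std(impostor_scores)
    return (np.asarray(scores, dtype=float) - mu) / sigma


def huber_irls(X, y, delta=1.345, n_iter=50, tol=1e-8):
    """Fit y ~ X @ w under a Huber loss by iteratively reweighted least squares.

    This is an illustrative stand-in for "robust regression"; the estimator
    used in the paper is not stated in the abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Start from the ordinary least-squares solution.
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(n_iter):
        r = y - X @ w
        # Robust residual scale from the median absolute deviation (MAD).
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(r) / scale
        # Huber weights: 1 for small residuals, delta/|u| for large ones.
        weights = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(weights)
        w_new = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(w_new - w)) < tol
        w = w_new
        if converged:
            break
    return w


def fuse(s_ubm, s_svm, w):
    """Fused score = w[0] + w[1] * s_ubm + w[2] * s_svm."""
    s_ubm = np.asarray(s_ubm, dtype=float)
    s_svm = np.asarray(s_svm, dtype=float)
    X = np.column_stack([np.ones(len(s_ubm)), s_ubm, s_svm])
    return X @ w
```

In use, one would normalize the GMM-UBM and GMM-SVM scores of labeled development trials with `z_norm`, stack them with a constant column, fit the weights with `huber_irls` against the trial labels, and apply `fuse` to test trials. The appeal of a robust loss in this setting is that a few grossly wrong subsystem scores pull the fusion weights far less than they would under ordinary least squares.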



Author information

Correspondence to Abderrahmane Amrouche.

About this article

Cite this article

Yessad, D., Amrouche, A. Robust regression fusion of GMM-UBM and GMM-SVM normalized scores using G729 bit-stream for speaker recognition over IP. Int J Speech Technol 17, 43–51 (2014). https://doi.org/10.1007/s10772-013-9204-6

  • DOI: https://doi.org/10.1007/s10772-013-9204-6
