A method to compensate the influence of speech codec in speaker recognition

International Journal of Speech Technology

Abstract

Speaker recognition, the recognition of a person by his or her voice, is a biometric specialty increasingly used in electronic commerce, electronic banking transactions, and forensic investigations, among other applications. It relies on the discriminative information contained in a person’s speech, and its main challenge is “session variability”: the variability between different speech samples of the same person used for training and evaluation. When speech is transmitted over the Internet, for example, the coding–decoding process (“codec”) discards part of this information and degrades speaker recognition effectiveness. Several methods have been proposed to mitigate this effect. This work studies how strongly some commonly used codec types affect this information and proposes a method to compensate the session variability introduced by the codec. First, the influence of several codec types on sample quality was evaluated with a set of synthesized speech samples. Next, experiments were carried out with speech samples from international competitions, retransmitted over two different codecs, to measure the effect on speaker recognition effectiveness. Finally, the variability compensation was applied, improving recognition effectiveness, as measured by the equal error rate, by 20.8% for the G.722 codec and 27.8% for the GSM 6.20 codec.


Notes

  1. Speaker Recognition Evaluation of the National Institute of Standards and Technology (NIST), USA. https://www.nist.gov/itl/iad/mig/speaker-recognition.

  2. Asterisk is an open source communications project for building telephony applications for IP PBXs, VoIP gateways, and conference servers. Available at: https://www.asterisk.org/.

  3. Cepstral coefficients on a linear or Mel scale, normalized with respect to their mean and variance, plus energy and its derivatives, usually obtained every 20 ms of speech, with a dimension F that can vary from 39 to 60 depending on the application.

  4. VoIP (Voice over Internet Protocol): coded voice transmitted over Internet protocols.

  5. By convention, this work identifies each codec as “codec name (bit rate)”.

  6. The threshold was set using a priori knowledge of the target and impostor labels of the development samples. From the evaluation scores it is possible to determine the probabilities of acceptance and false rejection of the targets, as well as the probabilities of rejection and false acceptance of the impostors. The score at the EER point of the DET curve, where the probabilities of false acceptance and false rejection are equal, is then used as the threshold to accept or reject the result of a comparison.

  7. 2008 NIST Speaker Recognition Evaluation Plan, April 3, 2008.

  8. Private Branch Exchange (PBX): shares one or several telephone lines among a group of users.

  9. “There is something there, in the air, that changes the meaning of things. That gentle wind flies, touches your face, as you count the leaves of the trees. The water runs looking for the fields. When I open the doors of my house, I think: this country, one more morning. At my age my strength begins to run out, I am hardly young anymore, and the death of my wife in the war weighs me down. When the body reaches that hour, the science of doctors can not stop the passage of time. As a child, back in my land, I used to spend my days rummaging from one place to another. Little by little, the cars of the city were calling my attention; My mother said to be careful, but I thought I was very old, so I had no interest or time for my own sign. But I’m still, it’s true; How many good things I found among your people. If I count the beloved summers then there are not seven, nor nine, nor twenty. It must be that I am a child again in this sad body.”

  10. MOS (mean opinion score): a numerical indication of the perceptual quality of the voice after it has been processed (encoded, compressed, encrypted, etc.) and transmitted over the telephone channel. It is obtained from a survey in which users rate the perceived quality of a set of voice samples with values from 1 (worst case) to 5 (best case). The ratings are averaged to obtain the MOS. The MOS scale is: (1) Impossible to communicate. (2) Very poor quality, almost impossible to communicate. (3) Poor quality, unclear and irritating, but still functional. (4) Failure to communicate can be perceived, but it is still possible to clearly hear the speaker. (5) Perfect conversation, as in a face-to-face conversation or radio reception.

  11. Medium bit rate codec, commonly used in VoIP communications.

  12. Low bit rate codec, commonly used in mobile telephony.
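
The threshold selection described in note 6 can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names `equal_error_rate` and `accept`, the convention that a trial is accepted when its score is at or above the threshold, and the development scores are all assumptions made for the example.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Return (eer, threshold) at the point where FAR and FRR are closest.

    Assumption: higher scores mean "more likely the same speaker".
    """
    candidates = sorted(set(target_scores) | set(impostor_scores))
    best_gap, best_eer, best_threshold = None, None, None
    for threshold in candidates:
        # False acceptance rate: impostor trials scored at or above the threshold.
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        # False rejection rate: target trials scored below the threshold.
        frr = sum(s < threshold for s in target_scores) / len(target_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # With finite data FAR and FRR rarely coincide exactly,
            # so the EER is approximated by their mean at the closest point.
            best_gap, best_eer, best_threshold = gap, (far + frr) / 2.0, threshold
    return best_eer, best_threshold


def accept(score, threshold):
    """Accept a comparison when its score reaches the development threshold."""
    return score >= threshold


# Hypothetical development scores with known target/impostor labels.
dev_targets = [0.9, 0.8, 0.7, 0.4]
dev_impostors = [0.6, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(dev_targets, dev_impostors)
print(f"EER = {eer:.2%}, threshold = {threshold}")
```

The threshold found on the development scores would then be applied unchanged to evaluation trials through `accept`. A production system would more likely interpolate along the DET curve than scan the observed scores, which is the simplification made here.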


Author information

Corresponding author

Correspondence to José R. Calvo de Lara.

About this article

Cite this article

Calvo de Lara, J.R., Reyes Diaz, F.J., Hernández Sierra, G. et al. A method to compensate the influence of speech codec in speaker recognition. Int J Speech Technol 21, 975–985 (2018). https://doi.org/10.1007/s10772-018-9547-0

  • DOI: https://doi.org/10.1007/s10772-018-9547-0
