Detection of replay signals using excitation source and shifted CQCC features

International Journal of Speech Technology

Abstract

A replay attack is an unauthorized attempt to access an automatic speaker verification (ASV) system using pre-recorded speech samples of a target speaker. The attack is performed by playing back the pre-recorded speech of the target in front of the machine. Replay attacks have recently been identified as the greatest threat to ASV systems, mainly due to the availability of high-quality recording and playback devices. In this work, an excitation source feature referred to as the glottal mel frequency cepstral coefficient (GMFCC) and the shifted constant Q cepstral coefficient (SCQCC) are proposed for the detection of replay signals. The GMFCC is derived by applying the conventional mel-cepstral technique to the glottal flow derivative signal. The SCQCC is computed using constant Q cepstral processing. The effectiveness of the proposed features is demonstrated through experiments on the ASVspoof 2017 version 2.0 database. The proposed GMFCC feature provides an equal error rate (EER) of 16.78%, which is 19.63% higher than the recently proposed residual mel frequency cepstral coefficient (RMFCC) feature. The conventional CQCC feature provides an EER of 12.32%. The proposed SCQCC feature provides an EER of 11.34%, a relative improvement of 7.94% over CQCC. Further, CQCC combined with the proposed GMFCC provides an EER of 8.82%, while the proposed SCQCC+GMFCC system provides an EER of 8.60%. These results indicate the usefulness of the proposed system in countering replay attacks.
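
The EER and relative-improvement figures quoted in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names `equal_error_rate` and `relative_improvement` are hypothetical, the EER is approximated by a simple threshold sweep rather than by any particular toolkit's method, and the toy scores in the usage note are invented for illustration only.

```python
import numpy as np


def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER by sweeping a threshold over all observed scores.

    Assumption: higher scores mean "more likely genuine". The function
    returns the operating point where the false-acceptance rate (FAR) and
    the false-rejection rate (FRR) are closest, as a fraction in [0, 1].
    """
    genuine = np.asarray(genuine_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)
    thresholds = np.sort(np.concatenate([genuine, spoof]))

    # FRR: fraction of genuine trials scored below the threshold (rejected).
    frr = np.array([(genuine < t).mean() for t in thresholds])
    # FAR: fraction of replayed trials scored at or above it (accepted).
    far = np.array([(spoof >= t).mean() for t in thresholds])

    idx = np.argmin(np.abs(far - frr))
    return 0.5 * (far[idx] + frr[idx])


def relative_improvement(baseline_eer, proposed_eer):
    """Relative EER reduction of the proposed system, in percent."""
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer


# Toy scores (hypothetical, not from the paper).
toy_eer = equal_error_rate([0.1, 0.6, 0.7, 0.9], [0.2, 0.3, 0.4, 0.8])

# EERs reported in the abstract: CQCC 12.32%, SCQCC 11.34%.
scqcc_gain = relative_improvement(12.32, 11.34)
```

With the EERs reported above, `relative_improvement(12.32, 11.34)` evaluates to roughly 7.95%, in line with the 7.94% quoted in the abstract; the small gap presumably reflects rounding in the reported EERs.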


Corresponding author

Correspondence to Krishna Dutta.

Cite this article

Dutta, K., Singh, M. & Pati, D. Detection of replay signals using excitation source and shifted CQCC features. Int J Speech Technol 24, 497–507 (2021). https://doi.org/10.1007/s10772-021-09810-6

  • DOI: https://doi.org/10.1007/s10772-021-09810-6
