Detection of replay signals using excitation source and shifted CQCC features

Dutta, Krishna; Singh, Madhusudan; Pati, Debadatta

doi:10.1007/s10772-021-09810-6

Published: 04 February 2021

Volume 24, pages 497–507 (2021)

International Journal of Speech Technology

Abstract

A replay attack is an unauthorized attempt to access an automatic speaker verification (ASV) system using pre-recorded speech samples of a target speaker. The attack is performed by playing the pre-recorded speech of the target to the system. Replay has recently been identified as the greatest threat to ASV systems, mainly because high-quality recording and playback devices are widely available. In this work, an excitation source feature, referred to as the glottal mel frequency cepstral coefficient (GMFCC), and the shifted constant Q cepstral coefficient (SCQCC) are proposed for the detection of replay signals. The GMFCC is derived by applying the conventional mel-cepstral technique to the glottal flow derivative signal. The SCQCC is computed using constant Q cepstral processing. The effectiveness of the proposed features is demonstrated through experiments on the ASVspoof 2017 version 2.0 database. The proposed GMFCC feature provides an equal error rate (EER) of 16.78%, which is 19.63% higher than the recently proposed residual mel frequency cepstral coefficient (RMFCC) feature. The conventional CQCC feature provides an EER of 12.32%. The proposed SCQCC feature provides an EER of 11.34%, a relative improvement of 7.94% over CQCC. Further, CQCC combined with the proposed GMFCC provides an EER of 8.82%, while the proposed SCQCC+GMFCC system provides an EER of 8.60%. These results indicate the usefulness of the proposed system in countering replay attacks.
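The abstract reports its results as equal error rates and relative improvements. The following is a minimal Python sketch of how those two quantities can be computed; it is not the evaluation code used in the paper. The function names `equal_error_rate` and `relative_improvement` and the toy scores are hypothetical, and the threshold sweep is a simple approximation that assumes a higher score means "more likely genuine".

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate the EER (in percent) by sweeping over observed scores.

    Assumption: a trial is accepted as genuine when score >= threshold.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: a replayed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return 100.0 * eer


def relative_improvement(baseline_eer, proposed_eer):
    """Relative EER reduction of the proposed system, in percent."""
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer


# Toy scores (hypothetical, for illustration only).
genuine = [0.9, 0.8, 0.7, 0.4]
replayed = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(genuine, replayed)

# EER values quoted in the abstract: CQCC 12.32%, SCQCC 11.34%.
scqcc_gain = relative_improvement(12.32, 11.34)
```

With the EER values quoted in the abstract, `relative_improvement(12.32, 11.34)` evaluates to roughly 7.95%, which is close to the reported 7.94%; the small gap is presumably due to rounding of the published EERs.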

Author information

Authors and Affiliations

Department of Electronics and Communication Engineering, National Institute of Technology Nagaland, Dimapur, 797103, India
Krishna Dutta, Madhusudan Singh & Debadatta Pati

Corresponding author

Correspondence to Krishna Dutta.

About this article

Dutta, K., Singh, M. & Pati, D. Detection of replay signals using excitation source and shifted CQCC features. Int J Speech Technol 24, 497–507 (2021). https://doi.org/10.1007/s10772-021-09810-6

Received: 31 January 2020
Accepted: 02 January 2021
Published: 04 February 2021
Issue Date: June 2021
DOI: https://doi.org/10.1007/s10772-021-09810-6
