Abstract
The replay attack is refereed as an unauthorized attempt to access the automatic speaker verification (ASV) system by using the pre-recorded speech samples of any target. The replay attack is performed by placing the pre-recorded speech sample of the target before the machine. Of late the replay attack is identified as the greatest threat to ASV system, mainly due to the availability of high quality recording and playback devices. In this work, excitation source feature referred as glottal mel frequency cepstral coefficient (GMFCC) and shifted constant Q cepstral coefficient (SCQCC) are proposed for detection of replay signals. The GMFCC is derived by applying conventional mel-cepstral technique to glottal flow derivative signal. The SCQCC is computed by using constant Q cepstral processing. The effectiveness of the proposed features are demonstrated by conducting experiments with ASVspoof 2017 version 2.0 database. The proposed GMFCC feature provides an equal error rate (EER) of 16.78%, that is 19.63% higher than the recently proposed residual mel frequency cepstral coefficient(RMFCC) feature. The conventional CQCC feature provides an EER of 12.32%. The proposed SCQCC feature provides an EER of 11.34%, shows a relative improvement of 7.94% over CQCC. Further, the CQCC in together with proposed GMFCC provides an EER of 8.82%. On the other hand, the proposed SCQCC+GMFCC system provides an EER of 8.60%. These results signify the usefulness of the proposed system to counter replay attacks.
Similar content being viewed by others
References
Alku, P. (1991). Glottal wave analysis with pitch synchronous iterative adaptive inverse filtering. Speech Communication, 11, 109–118.
Beigi, H. (2011). Speaker recognition. In: Fundamentals of Speaker Recognition (pp. 543–559). New York: Springer.
Campbell, J. P. (1997). Speaker recognition: A tutorial. Proceedings of IEEE, 85(9), 1437–1462.
Delgado, H., Todisco, M., Sahidullah, M., Evans, N., Kinnunen, T., Lee, K., Yamagishi, J. (2018). Asvspoof 2017 version 2.0: meta-data analysis and baseline enhancements. In Odyssey 2018 The Speaker and Language Recognition Workshop.
Drugman, T., Thomas, M., Gudnason, J., Naylor, P., & Dutoit, T. (2012). Detection of glottal closure instants from speech signals: A quantitative review. IEEE Transactions on Audio, Speech, and Language Processing, 20(3), 994–1006.
Font, R., Espın, J. M., & Cano, M. J. (2017). Experimental analysis of features for replay attack detection–results on the ASVspoof 2017 challenge. in Proc Interspeech pp. 7–11.
Furui, S. (1981). Cepstral analysis technique for automatic speaker verification. IEEE Transactions on Acoustics, Speech, and Signal Processing, 29(2), 254–272.
Hermansky, H., & Morgan, N. (1994). Rasta processing of speech. IEEE Transactions on Speech and Audio Processing, 2(4), 578–589.
Jelil, S., Das, R. K., Prasanna, S. M., & Sinha, R. (2017). Spoof detection using source, instantaneous frequency and cepstral features. In Proc Interspeech (pp. 22–26).
Kamble, M., & Patil, H. (2018). Novel variable length energy separation algorithm using instantaneous amplitude features for replay detection. Proc. Interspeech, 2018, pp. 646–650.
Kinnunen, T., Evans, N., Yamagishi, J., Lee, K. A., Sahidullah, M., Todisco, M., et al. (2017). Asvspoof 2017: Automatic speaker verification spoofing and countermeasures challenge evaluation plan. Training, 10(1508), 1508.
Lee, K. A., Larcher, A., Wang, G., Kenny, P., Brümmer, N., Leeuwen, D. v., et al. (2015). The reddots data collection for speaker recognition. In Sixteenth Annual Conference of the International Speech Communication Association.
Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997). The DET curve in assessment of detection task performance. In: Proc. Eur. conf. on speech communication technology, Rhodes, Greece, Vol. 4, pp. 1895–1898.
Murthy, K. S. R., & Yegnanarayana, B. (2008). Epoch extraction from speech signals. IEEE Transactions on Audio Speech and Language Processing, 16(8), 1602–1613.
Naylor, P. A., Kounoudes, A., Gudnason, J., & Brookes, M. (2007). Estimation of glottal closure instants in voiced speech using the DYPSA algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 15(1), 34–43.
Nordin, F., & Eriksson, T. (2001). A speech spectrum distortion measure with interframe memory. Proc. ICASSP, 2, 717–720.
Patil, H. A., Kamble, M. R., Patel, T. B., & Soni, M. (2017). Novel variable length teager energy separation based instantaneous frequency features for replay detection. In Proc Interspeech (pp. 12–16).
Plumpe, M. D., Quatieri, T. F., & Reynolds, D. A. (1999). Modelling of glottal flow derivative waveform with application to speaker identification. IEEE Transactions on Speech and Audio Processing, 7(5), 569–586.
Prasanna, S. R. M., Gupta, C. S., & Yegnanarayana, B. (2006a). Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Communication, 48, 1243–1261.
Prasanna, S. R. M., Gupta, C. S., & Yegnanarayana, B. (2006b). Extraction of speaker-specific excitation information from linear prediction residual of speech. Speech Communication, 48, 1243–1261.
Prathosh, A., Ananthapadmanabha, T., & Ramakrishnan, A. (2013). Epoch extraction based on integrated linear prediction residual using plosion index. IEEE Transactions on Audio, Speech, and Language Processing, 21(12), 2471–2480.
Reynolds, D. A. (1995). Speaker identification and verification using gaussian mixture speaker models. Speech Communication, 17, 91–108.
Sailor, H., Kamble, M., Patil, H. (2018). Auditory filterbank learning for temporal modulation features in replay spoof speech detection. In Proc. Interspeech, pp. 666–670.
Singh, M., & Pati, D. (2019). Usefulness of linear prediction residual for replay attack detection. AEU-International Journal of Electronics and Communications. https://doi.org/10.1016/j.aeue.2019.152837.
Suthokumar, G., Sethu, V., Wijenayake, C., Ambikairajah, E. (2018). Modulation dynamic features for the detection of replay attacks. Proc Interspeech pp. 691–695.
Tak, H., & Patil, H. (2018). Novel linear frequency residual cepstral features for replay attack detection. Proc. Interspeech, 2018, 726–730.
The Bosaris toolkit [software package]. Retrieved 2013 from https://sites.google.com/site/bosaristoolkit.
Thomas, M. R., Gudnason, J., & Naylor, P. A. (2012). Estimation of glottal closing and opening instants in voiced speech using the YAGA algorithm. IEEE Transactions on Audio, Speech, and Language Processing, 20(1), 82–91.
Torres-Carrasquillo, P. A., Singer, E., Kohler, M. A., Greene, R. J., Reynolds, D. A., & Deller, Jr J.R. (2002). Approaches to language identification using gaussian mixture models and shifted delta cepstral features. In Seventh International Conference on Spoken Language Processing.
Villaba, J., Lieida, E. (2011). Preventing replay attacks on speaker verification systems. In Proc. Int. carnahan conf. on security technology (ICCST), pp. 1–8.
Wang, Z., Wei, G., He, Q.H. (2011). Channel pattern noise based playback attack detection algorithm for speaker recognition. in Proc IEEE Int conference of the biometrics special interest Group (BIOSIG) on machine learning and cybernetics pp 1708–1713.
Wu, Z., Evans, N., Kinnunen, T., Yamagishi, J., Alegre, F., & Li, H. (2015). Spoofing and counter measures for speaker verification: A survey. Speech Communication, 66, 130–153.
Zhang, W. Q., He, L., Deng, Y., Liu, J., & Johnson, M. T. (2010). Time-frequency cepstral features and heteroscedastic linear discriminant analysis for language recognition. IEEE Transactions on Audio, Speech, and Language Processing, 19(2), 266–276.
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
About this article
Cite this article
Dutta, K., Singh, M. & Pati, D. Detection of replay signals using excitation source and shifted CQCC features. Int J Speech Technol 24, 497–507 (2021). https://doi.org/10.1007/s10772-021-09810-6
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10772-021-09810-6