Practical Backdoor Attack Against Speaker Recognition System

Luo, Yuxiao; Tai, Jianwei; Jia, Xiaoqi; Zhang, Shengzhi

doi:10.1007/978-3-031-21280-2_26


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13620)

Included in the following conference series:

International Conference on Information Security Practice and Experience


Abstract

Deep learning-based models have achieved state-of-the-art performance in a wide variety of classification and recognition tasks. Although such models have been shown to be vulnerable to backdoor attacks in multiple domains, little is known about whether speaker recognition systems are vulnerable to such attacks, especially in the physical world. In this paper, we launch a backdoor attack on a speaker recognition system (SRS) in both digital and physical space and conduct comprehensive experiments on two common tasks of a speaker recognition system. Taking the poison position, intensity, length, frequency characteristics, and poison rate of the backdoor patterns into consideration, we design four backdoor triggers and use them to poison the training dataset. We report the attack success rate (ASR) of digital and physical attacks and show that all four backdoor patterns achieve over 89% ASR on digital attacks and at least 70% on physical attacks. We also show that the maliciously trained model provides comparable performance on clean data.
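The poisoning procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sine-tone trigger, the label-flipping (dirty-label) poisoning strategy, and every function and parameter name below are assumptions made for illustration, and the ASR function uses the common definition (the fraction of triggered inputs assigned to the attacker's target label), which the abstract does not spell out.

```python
import math
import random


def make_tone_trigger(freq_hz, length_s, intensity, sample_rate=16000):
    """Hypothetical trigger: a sine tone with the given frequency, length and intensity."""
    n = int(round(length_s * sample_rate))
    return [intensity * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def embed_trigger(waveform, trigger, position):
    """Mix the trigger into a copy of the waveform, starting at sample index `position`."""
    out = list(waveform)
    for i, t in enumerate(trigger):
        j = position + i
        if j >= len(out):
            break  # the trigger runs past the end of the utterance
        # Clip so the poisoned sample stays within the valid range [-1, 1].
        out[j] = max(-1.0, min(1.0, out[j] + t))
    return out


def poison_dataset(samples, trigger, position, poison_rate, target_label, seed=0):
    """Poison a fraction of (waveform, label) pairs and relabel them (assumed dirty-label)."""
    rng = random.Random(seed)
    n_poison = int(round(poison_rate * len(samples)))
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for idx, (wave, label) in enumerate(samples):
        if idx in chosen:
            poisoned.append((embed_trigger(wave, trigger, position), target_label))
        else:
            poisoned.append((list(wave), label))
    return poisoned, chosen


def attack_success_rate(predictions, target_label):
    """Fraction of predictions on triggered inputs that equal the target label."""
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == target_label) / len(predictions)
```

Calling `poison_dataset(samples, make_tone_trigger(4000, 0.1, 0.05), position=0, poison_rate=0.2, target_label=0)` would poison one fifth of the utterances. The five arguments correspond to the factors the abstract lists: position, intensity, length, frequency, and poison rate.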


Notes

1. Refer to [15] for a more detailed description of the model.

References

Agarap, A.F.: Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375 (2018)
Bhattacharya, G., Alam, M.J., Kenny, P.: Deep speaker recognition: modular or monolithic? In: INTERSPEECH, pp. 1143–1147 (2019)
Chen, X., Liu, C., Li, B., Lu, K., Song, D.: Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526 (2017)
Chung, S.P., Mok, A.K.: Allergy attack against automatic signature generation. In: Zamboni, D., Kruegel, C. (eds.) RAID 2006. LNCS, vol. 4219, pp. 61–80. Springer, Heidelberg (2006). https://doi.org/10.1007/11856214_4
Chung, S.P., Mok, A.K.: Advanced allergy attacks: does a corpus really help? In: Kruegel, C., Lippmann, R., Clark, A. (eds.) RAID 2007. LNCS, vol. 4637, pp. 236–255. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-74320-0_13
Conneau, A., Schwenk, H., Barrault, L., Lecun, Y.: Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781 (2016)
Dalvi, N., Domingos, P., Sanghai, S., Verma, D.: Adversarial classification. In: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 99–108 (2004)
Dehak, N., Kenny, P.J., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 19(4), 788–798 (2010)
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc.: Ser. B (Methodol.) 39(1), 1–22 (1977)
Fortuna, J., Sivakumaran, P., Ariyaeeinia, A., Malegaonkar, A.: Open-set speaker identification using adapted Gaussian mixture models. In: Ninth European Conference on Speech Communication and Technology (2005)
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press, Cambridge (2016)
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Gu, T., Dolan-Gavitt, B., Garg, S.: BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017)
Han, J., Moraga, C.: The influence of the sigmoid function parameters on the speed of backpropagation learning. In: Mira, J., Sandoval, F. (eds.) IWANN 1995. LNCS, vol. 930, pp. 195–201. Springer, Heidelberg (1995). https://doi.org/10.1007/3-540-59497-3_175
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning. Image Recogn. 7 (2015)
Huang, L., Joseph, A.D., Nelson, B., Rubinstein, B.I., Tygar, J.D.: Adversarial machine learning. In: Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, pp. 43–58 (2011)
Huang, Y.Y., Wang, W.Y.: Deep residual learning for weakly-supervised relation extraction. arXiv preprint arXiv:1707.08866 (2017)
Koffas, S., Xu, J., Conti, M., Picek, S.: Can you hear it? Backdoor attacks via ultrasonic triggers. arXiv preprint arXiv:2107.14569 (2021)
Lowd, D., Meek, C.: Adversarial learning. In: Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 641–647 (2005)
Lowd, D., Meek, C.: Good word attacks on statistical spam filters. In: CEAS, vol. 2005 (2005)
McLaren, M., Ferrer, L., Castan, D., Lawson, A.: The speakers in the wild (SITW) speaker recognition database. In: Interspeech, pp. 818–822 (2016)
Muda, L., Begam, M., Elamvazuthi, I.: Voice recognition algorithms using Mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques. arXiv preprint arXiv:1003.4083 (2010)
Multimodal Information Group (2022). https://www.nist.gov/itl/iad/mig/speaker-recognition
Nagrani, A., Chung, J.S., Xie, W., Zisserman, A.: VoxCeleb: large-scale speaker verification in the wild. Comput. Speech Lang. 60, 101027 (2020)
Nandwana, M.K., Ferrer, L., McLaren, M., Castan, D., Lawson, A.: Analysis of critical metadata factors for the calibration of speaker recognition systems. In: INTERSPEECH, pp. 4325–4329 (2019)
Newsome, J., Karp, B., Song, D.: Paragraph: thwarting signature learning by training maliciously. In: Zamboni, D., Kruegel, C. (eds.) RAID 2006. LNCS, vol. 4219, pp. 81–105. Springer, Heidelberg (2006). https://doi.org/10.1007/11856214_5
Reynolds, D.A.: Gaussian mixture models. Encyclopedia Biometrics 741(659–663) (2009)
Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digit. Signal Process. 10(1–3), 19–41 (2000)
Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3(1), 72–83 (1995)
Saha, A., Subramanya, A., Pirsiavash, H.: Hidden trigger backdoor attacks. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 11957–11965 (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Snyder, D., Garcia-Romero, D., Sell, G., McCree, A., Povey, D., Khudanpur, S.: Speaker recognition for multi-speaker conversations using X-vectors. In: 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), ICASSP 2019, pp. 5796–5800. IEEE (2019)
Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., Khudanpur, S.: X-vectors: robust DNN embeddings for speaker recognition. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5329–5333. IEEE (2018)
Turner, A., Tsipras, D., Madry, A.: Clean-label backdoor attacks (2018)
Wittel, G.L., Wu, S.F.: On attacking statistical spam filters. In: CEAS. Citeseer (2004)
Xu, M., Duan, L.-Y., Cai, J., Chia, L.-T., Xu, C., Tian, Q.: HMM-based audio keyword generation. In: Aizawa, K., Nakamura, Y., Satoh, S. (eds.) PCM 2004. LNCS, vol. 3333, pp. 566–574. Springer, Heidelberg (2004). https://doi.org/10.1007/978-3-540-30543-9_71
Ye, J., Liu, X., You, Z., Li, G., Liu, B.: DriNet: dynamic backdoor attack against automatic speech recognization models. Appl. Sci. 12(12), 5786 (2022)
Zhai, T., Li, Y., Zhang, Z.M., Wu, B., Jiang, Y., Xia, S.: Backdoor attack against speaker verification. In: 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2560–2564 (2021)

Acknowledgement

We would like to thank the reviewers for their helpful comments. Jianwei Tai is supported by the National Key Research and Development Program of China (No. 2019YFE0110300) and the National Natural Science Foundation of China under Grants 71971075, 72271076, and 71871079. Xiaoqi Jia is supported in part by the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDC02010900) and the National Key Research and Development Program of China (No. 2019YFB1005201 and No. 2021YFB2910109).

Author information

Authors and Affiliations

Graduate School of Arts and Sciences, Boston University, Boston, USA
Yuxiao Luo
School of Management, Hefei University of Technology, Anhui, China
Jianwei Tai
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Xiaoqi Jia
MET College Department of Computer Science, Boston University, Boston, USA
Shengzhi Zhang


Corresponding author

Correspondence to Shengzhi Zhang.

Editor information

Editors and Affiliations

University of Aizu, Fukushima, Japan
Chunhua Su
Athens University of Economics and Business, Athens, Greece
Dimitris Gritzalis
Università degli Studi di Milano, Milan, Italy
Vincenzo Piuri


About this paper

Cite this paper

Luo, Y., Tai, J., Jia, X., Zhang, S. (2022). Practical Backdoor Attack Against Speaker Recognition System. In: Su, C., Gritzalis, D., Piuri, V. (eds) Information Security Practice and Experience. ISPEC 2022. Lecture Notes in Computer Science, vol 13620. Springer, Cham. https://doi.org/10.1007/978-3-031-21280-2_26


DOI: https://doi.org/10.1007/978-3-031-21280-2_26
Published: 19 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21279-6
Online ISBN: 978-3-031-21280-2
eBook Packages: Computer Science, Computer Science (R0)
