Abstract
In the automatic speech recognition (ASR) domain, most, if not all, current audio adversarial examples (AEs) are generated by applying perturbations to input audio. Adversaries either constrain the norm of the perturbations or hide the perturbations below the human hearing threshold based on psychoacoustics. Both approaches have problems: norm-constrained perturbations introduce noticeable noise, while perturbations hidden below the hearing threshold can be defeated by deliberately removing inaudible components from the audio. In this paper, we present a novel method of generating targeted audio AEs. The perceptual quality of our audio AEs is significantly better than that of audio AEs generated by applying norm-constrained perturbations. Furthermore, unlike approaches that rely on psychoacoustics to hide perturbations below the hearing threshold, our audio AEs can still be successfully generated even when inaudible components are removed from the audio.
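The norm-constrained baseline the abstract contrasts against is typically an iterative gradient attack: optimize a targeted loss over the audio while keeping the perturbation inside a norm ball. The following is a minimal illustrative sketch only, not the paper's method: it uses a toy linear classifier as a stand-in for an ASR network (a real attack would backpropagate through a model such as DeepSpeech and a CTC loss), with a projected sign-gradient loop enforcing an L-infinity budget.

```python
import numpy as np

# Toy differentiable "model": scores = W @ x. This linear stand-in is an
# assumption for illustration; a real audio AE attack differentiates
# through a full ASR network.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 16))   # 3 "transcript" classes, 16 audio samples
audio = rng.normal(size=16)    # clean input waveform (toy)
target = 2                     # adversarial target class

def loss_and_grad(x):
    """Cross-entropy toward the target class, and its gradient w.r.t. x."""
    scores = W @ x
    p = np.exp(scores - scores.max())
    p /= p.sum()
    loss = -np.log(p[target])
    grad = W.T @ (p - np.eye(3)[target])  # d(loss)/dx for softmax CE
    return loss, grad

eps, step = 0.5, 0.05          # L-inf norm budget and step size
delta = np.zeros_like(audio)
for _ in range(300):
    _, g = loss_and_grad(audio + delta)
    delta -= step * np.sign(g)               # step toward the target label
    delta = np.clip(delta, -eps, eps)        # project back into the norm ball

adv = audio + delta
print("predicted class:", int(np.argmax(W @ adv)))
```

The `eps` clip is exactly the norm constraint the abstract criticizes: a budget large enough to flip the transcription is often large enough to be heard as noise, which motivates the paper's alternative.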
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Zong, W., Chow, YW., Susilo, W. (2022). High Quality Audio Adversarial Examples Without Using Psychoacoustics. In: Chen, X., Shen, J., Susilo, W. (eds) Cyberspace Safety and Security. CSS 2022. Lecture Notes in Computer Science, vol 13547. Springer, Cham. https://doi.org/10.1007/978-3-031-18067-5_12
Print ISBN: 978-3-031-18066-8
Online ISBN: 978-3-031-18067-5