research-article

Echo: Reverberation-based Fast Black-Box Adversarial Attacks on Intelligent Audio Systems

Published: 27 September 2023

Abstract

Intelligent audio systems, such as speech command recognition and speaker recognition, are ubiquitous in our lives. However, deep learning-based intelligent audio systems have been shown to be vulnerable to adversarial attacks. In this paper, we propose Echo, a physical adversarial attack that exploits reverberation, a natural indoor acoustic effect, to realize imperceptible, fast, and targeted black-box attacks. Unlike existing attacks that constrain the magnitude of adversarial perturbations within a fixed radius, we generate reverberation-like perturbations that blend naturally with the original voice sample. Additionally, by accounting for distortions in the physical environment, we can generate adversarial examples that are more robust under over-the-air propagation. We conduct extensive experiments on two popular intelligent audio systems in various situations, such as different room sizes, distances, and ambient noise levels. The results show that Echo can compromise intelligent audio systems in both digital and physical over-the-air environments.
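To make the notion of a reverberation-like perturbation concrete, the sketch below applies the standard signal-processing model of reverberation: the dry voice is convolved with a room impulse response (RIR). This is only an illustration under stated assumptions and is not the attack proposed in the paper. The function names `synthetic_rir` and `add_reverb`, the exponentially decaying noise RIR, and all parameter values (sample rate, RT60, tail gain) are hypothetical choices made for the example.

```python
import numpy as np


def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy room impulse response (assumption: exponentially decaying noise).

    rt60 is the time in seconds for the tail to decay by 60 dB.
    This is NOT the room model used in the paper.
    """
    rng = np.random.default_rng(seed)
    n = int(sample_rate * length_s)
    t = np.arange(n) / sample_rate
    # Amplitude falls by 60 dB (a factor of 10**-3) after rt60 seconds.
    decay = 10.0 ** (-3.0 * t / rt60)
    rir = 0.3 * rng.standard_normal(n) * decay  # diffuse reflections
    rir[0] = 1.0  # direct path from speaker to microphone
    return rir / np.max(np.abs(rir))


def add_reverb(voice, rir):
    """Convolve a dry signal with an RIR and keep the original length."""
    wet = np.convolve(voice, rir)[: len(voice)]
    peak = np.max(np.abs(wet))
    # Normalise to avoid clipping; leave an all-zero signal untouched.
    return wet / peak if peak > 0 else wet


if __name__ == "__main__":
    sr = 16000
    # One second of a 440 Hz tone stands in for a voice sample.
    voice = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    wet = add_reverb(voice, synthetic_rir(sample_rate=sr))
    print(wet.shape, float(np.max(np.abs(wet))))
```

In this picture the perturbation is shaped by the impulse response instead of being bounded by a fixed norm radius, which is the contrast the abstract draws with existing attacks. How Echo searches for an adversarial impulse response in the black-box setting is not described in this excerpt and is not modelled here.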

    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 7, Issue 3
    September 2023
    1734 pages
    EISSN:2474-9567
    DOI:10.1145/3626192

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Adversarial example attacks
    2. Inconspicuous attack
    3. Intelligent audio systems

    Qualifiers

    • Research-article
    • Research
    • Refereed
