1 Introduction

To date, automatic speech recognition (ASR) [2, 6, 9, 19] systems have been deployed ubiquitously in popular commercial products such as Google Assistant and Amazon Alexa. An ASR system converts speech from audio into text before further processing. Deep learning techniques play an important role in modern ASR systems. Specifically, end-to-end ASR, which relies on recurrent neural networks (RNNs), has achieved human-level performance on several benchmark datasets [2].

However, deep learning models suffer from the threat of adversarial examples (AEs), which were first found in the image recognition domain [23]. An image AE is generated by applying imperceptible perturbations to a benign (normal) image, such that the resulting modified image will fool a deep learning model. There are targeted and untargeted image AEs. Targeted AEs force a target model to output predefined labels, while untargeted image AEs merely aim to make the target model output an incorrect result [16]. In addition, adversaries can assume a white-box or black-box threat model to generate AEs [4, 10, 30]. Under a white-box threat model, adversaries can access the internal workings of the target model, including model weights, training data, etc. In contrast, under a black-box threat model only input and output pairs can be obtained.

Besides image recognition, researchers have also found that ASR models are vulnerable to audio AEs. In seminal work, Carlini and Wagner [5] generated audio AEs by solving an optimization problem that constrains the maximum norm of the perturbations. Their work was improved by Qin et al. [20], who incorporated psychoacoustics to hide perturbations below the hearing threshold. However, such adversarial perturbations can only produce an AE for a specific audio signal, and must be recalculated to produce AEs for different audio signals. To overcome this shortcoming, researchers have investigated the generation of AEs using universal adversarial perturbations (UAPs) that can be applied directly to generic audio [1]. UAPs can be used to generate both untargeted and targeted audio AEs [17, 26]. It should be mentioned that the concept of UAPs was first introduced for image AEs [15].

Although a great deal of effort has been spent on attacking speaker verification models, sound classification models, etc., there is limited research focused on generating UAPs to attack ASR systems. For a given audio, ASR models deal with an excessively large number of potential transcripts, which typically makes attacking them more difficult than attacking classification models that only output a fixed set of labels. Early work was conducted by Neekhara et al. [17], who generated UAPs for untargeted audio AEs. Compared to targeted audio AEs, untargeted audio AEs are less interesting as they only make ASR models output incorrect or even meaningless transcripts. Lu et al. [14] recently performed a preliminary study on targeted UAPs to attack ASR models. However, their method cannot generate UAPs against models with connectionist temporal classification (CTC) loss [8]. This severely limits their method since CTC loss is widely deployed in modern ASR models that achieve state-of-the-art performance [2, 9].

In this paper, we fill the research gap by proposing UAPs that can be applied directly to audio to generate targeted audio AEs. Our main contributions are summarized as follows:

  • To the best of our knowledge, our UAP method is the first to successfully attack CTC loss based ASR models. Most existing work focuses on speaker verification models, sound classification models, etc., instead of ASR models.

  • Unlike previous work by Lu et al. [14], we improve the quality of audio AEs by constraining the maximum norm of UAPs. Furthermore, we conducted a feasibility study to hide UAPs below the hearing threshold in a piece of music.

  • In addition to generating UAPs, we empirically show that UAPs can be considered to be signals that will be transcribed into the target phrase. The generation of UAPs can then be viewed as training (modifying) UAPs to be robust against modification using audio containing speech.

  • We show that the UAPs themselves preserve temporal dependency, such that the audio AEs generated by applying these UAPs also preserve temporal dependency.

2 Related Work

Early work in this field by Neekhara et al. [17] studied the generation of untargeted UAPs by maximizing CTC loss for each input audio. Compared to random noise, their UAPs can more effectively cause DeepSpeech [9] to output incorrect transcripts. However, an untargeted attack cannot predetermine the output of a target model, which makes it less interesting than a targeted attack. In contrast, our work focuses on targeted UAPs, which pose severe threats because an adversary is able to control the output of a target model. Abdoli et al. [1] proposed UAPs that can generate targeted audio AEs. Instead of attacking ASR models, they attacked environmental sound classification and speech command recognition models.

In other work, Xie et al. [26] proposed to incorporate transformations by simulated room impulse responses (RIRs), so that audio AEs generated by their UAPs were robust against such transformations. The purpose is to keep audio AEs adversarial when played through speakers and received by microphones. They focused on fooling speaker verification models. In contrast to ASR models, which transcribe voice input, speaker verification models aim to identify whether input voice comes from a valid user. Li et al. [13] demonstrated that it is unnecessary to perturb all samples in an audio signal. They generated UAPs that were much shorter than the input audio, and the UAPs could be applied to an arbitrary position within the input audio. To make audio AEs physically adversarial, they used datasets of physically recorded RIRs instead of simulated RIRs.

As opposed to generating input-agnostic UAPs, another line of work focused on training a generative model, so that perturbations can be efficiently generated for previously unseen audio. Broadly speaking, the generative model represents UAPs that are input-dependent. Wang et al. [24] trained a generative adversarial network (GAN) to produce specific perturbations for an input audio. The output of the GAN can fool command classification and music classification models into outputting predetermined labels. Recent work by Li et al. [12] trained a generator that can map random noise to targeted UAPs given an input audio.

In contrast with existing work, this research investigates targeted UAPs against ASR models.

3 Problem Definition and Assumptions

Our goal is to generate UAPs \(\delta \) that will result in targeted audio AEs when applied to input audio. Note that \(\delta \) is specific to a target phrase, such that a different target phrase will require a different \(\delta \). We assume a white-box threat model, under which the internal workings of the target model are accessible and gradients with respect to the input can be explicitly calculated. Formally, let \(\delta \in \mathbb {R}^m\) be perturbations of length m. \(\delta _{i:j}= (\delta _i, \dots , \delta _j)\) denotes a slice of \(\delta \) from the \(i^{th}\) to \(j^{th}\) elements. Let \(f(\cdot )\) represent the ASR model. Let \(\mathcal {D}\) be a set of audio with sample values in \([-1, 1]\), i.e. if \(x \in \mathcal {D}\) then \(||x||_\infty \le 1\). It should be noted that the length of \(x \in \mathcal {D}\) varies. Without loss of generality, let n represent the length of x: \(x \in \mathbb {R}^n\). It is required that \(n \le m\), as given an input audio, \(\delta \) will first be truncated to the same length as the input. Then, an audio AE is generated by applying \(\delta \) to the input audio.

Specifically, we want to generate \(\delta \) that satisfies:

$$\begin{aligned} \begin{aligned}&\mathop {P}_{x \in \mathcal {D}}(f(x')=t) \ge \eta \\&\text {such that } ||\delta ||_{\infty } \le \tau \\ \end{aligned} \end{aligned}$$
(1)

where t is a predefined target phrase, \(x'\) is the modified audio with elements clipped into \([-1, 1]\): \(x' = \max (\min (x+ \delta _{1:n}, 1), -1)\), \(\eta \) denotes the minimal success rate of attack, and \(\tau \) constrains the maximum norm of \(\delta \).
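For concreteness, the truncation and clipping above can be written in a few lines of NumPy. The sketch below (the function name is ours, not from the paper) illustrates how an adversarial example \(x'\) is formed from \(x\) and \(\delta \).

```python
import numpy as np

def apply_uap(x: np.ndarray, delta: np.ndarray) -> np.ndarray:
    """Form x' = clip(x + delta_{1:n}, -1, 1) as in Eq. 1.

    x     : input audio of length n with samples in [-1, 1]
    delta : universal perturbation of length m >= n
    """
    n = x.shape[0]
    assert delta.shape[0] >= n, "the UAP must be at least as long as the input"
    return np.clip(x + delta[:n], -1.0, 1.0)
```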

3.1 Evaluation

Given an input audio \(x\in \mathbb {R}^n\), we measure the distortion caused by \(\delta \) in decibels (dB):

$$\begin{aligned} \begin{aligned}&dB_x(\delta ) = 20 \cdot \log _{10}\frac{\max _i \delta _i}{\max _i x_i} \\&\text {for } i \in \{1, 2, \dots , n\} \end{aligned} \end{aligned}$$
(2)

This metric was initially defined by Carlini and Wagner [5] and is also used in other work [1, 14, 17, 26]. This metric is analogous to the maximum norm measurement in the image AE domain.
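A direct implementation of Eq. 2 is shown below; we take absolute sample values so the logarithm is well defined for negative samples, which is the usual reading of this metric.

```python
import numpy as np

def distortion_db(x: np.ndarray, delta: np.ndarray) -> float:
    """Relative loudness of the perturbation in decibels (Eq. 2).

    More negative values indicate a quieter perturbation relative to the audio.
    """
    n = x.shape[0]
    return 20.0 * float(np.log10(np.max(np.abs(delta[:n])) / np.max(np.abs(x))))
```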

4 Proposed Method

4.1 Universal Adversarial Perturbations

To generate UAPs that satisfy the requirements defined in Eq. 1, we solve the following optimization problem:

$$\begin{aligned} \begin{aligned}&\min _{\delta } \frac{1}{|\mathcal {D}|}\sum _{x \in \mathcal {D}} \ell _{adv}(f(x'), t) + \lambda \cdot \ell _{reg}(\delta , \tau ) \\&\text {such that } \mathop {P}_{x \in \mathcal {D}}(f(x')=t) \ge \eta \end{aligned} \end{aligned}$$
(3)

where \(\mathcal {D}\) is a set of input audio, \(x'\) is the modified audio clipped into the range \([-1, 1]\): \(x' = \max (\min (x + \delta ^{\tau }_{1:n}, 1), -1)\). \(\delta ^{\tau }\) is the perturbation applied to x and equals \(\delta \) clipped into a specific range: \(\delta ^{\tau } = \max (\min (\delta , \tau ), -\tau )\), with \(\tau \) constraining the maximum norm. \(\ell _{adv}(\cdot )\) calculates the loss of the ASR model, and minimizing \(\ell _{adv}(\cdot )\) encourages the modified input \(x'\) to be transcribed as t. If a solution is found, \(\delta ^{\tau }\) is returned as a UAP. To make \(\delta ^{\tau }\) less suspicious, it is preferred that \(\tau \) be as small as possible. Thus, \(\tau \) should be initialized to a large value, then gradually decreased until a valid solution can no longer be found.

Algorithm 1. Two-stage generation of universal adversarial perturbations.

Instead of viewing x as the input audio and \(\delta ^{\tau }\) as noise, we consider \(\delta ^{\tau }\) as a signal which is transcribed as t. From this perspective, x is considered as “noise” applied to \(\delta ^{\tau }\), and \(\delta ^{\tau }\) is robust against modification by adding \(x \in \mathcal {D}\). We will validate this point of view later in Sect. 5. A recent study by Zhang et al. [29] presented a similar idea in the image AE domain. They showed that UAPs were highly correlated with the output logits of image classifiers, so that the classification was actually dominated by the UAPs.

\(\ell _{reg}(\cdot )\) is the regularization term with \(\lambda \) for weighting. \(\ell _{reg}(\cdot )\) is defined as follows:

$$\begin{aligned} \begin{aligned} \ell _{reg}(\delta , \tau ) = \sum _{i=1}^{m} \max (|\delta _i| - \tau , 0) \end{aligned} \end{aligned}$$
(4)

Minimizing \(\ell _{reg}(\cdot )\) encourages the maximum norm of \(\delta \) to stay within \(\tau \). This matters because, due to the clipping, \(\frac{\partial \ell _{adv}(f(x'), t)}{\partial \delta _i}\) is always 0 when \(|\delta _i| > \tau \); the regularization term supplies a non-zero gradient for such elements, so that \(\delta _i\) is not permanently stuck outside \([-\tau , \tau ]\).
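In PyTorch, the regularization term of Eq. 4 is essentially a one-liner; a minimal sketch:

```python
import torch

def reg_loss(delta: torch.Tensor, tau: float) -> torch.Tensor:
    """Sum of the amounts by which |delta_i| exceeds tau (Eq. 4).

    Clipping alone gives a zero gradient for elements with |delta_i| > tau;
    this penalty keeps pushing them back towards the interval [-tau, tau].
    """
    return torch.clamp(delta.abs() - tau, min=0.0).sum()
```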

In practice, we split the generation process into two stages. During stage 1, we set \(\tau =1\) and gradually let \(\delta ^{\tau }\) be effective for more and more audio in \(\mathcal {D}\). Stage 1 finishes when \(\delta ^{\tau }\) can attack all audio in \(\mathcal {D}\), i.e. an audio AE is generated by applying \(\delta ^{\tau }\) to any audio in \(\mathcal {D}\). The purpose of this stage is to quickly find a valid \(\delta ^{\tau }\), even though \(\delta ^{\tau }\) may be too noisy. In stage 2, we focus on making \(\delta ^{\tau }\) less noisy by gradually decreasing \(\tau \) until no valid solution can be found. This two stage generation process is provided in Algorithm 1.
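The listing below is a compact sketch of this two-stage procedure rather than a faithful reproduction of Algorithm 1. The callables `adv_loss` (e.g. CTC loss of the model output against the target transcript) and `is_success` (whether the model currently transcribes an input as the target) wrap the target ASR model and are assumptions of the sketch, as are the default hyperparameters.

```python
import torch

def generate_uap(audio_set, adv_loss, is_success, m,
                 eta=0.8, lam=1.0, lr=1e-3, shrink=0.8,
                 max_shrinks=30, inner_iters=500):
    """Two-stage UAP generation (sketch of Algorithm 1, Sect. 4.1)."""
    delta = torch.zeros(m, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)

    def attack(x, tau):
        d = delta[: x.shape[0]].clamp(-tau, tau)          # delta^tau, truncated to len(x)
        return (x + d).clamp(-1.0, 1.0)

    def success_rate(active, tau):
        with torch.no_grad():
            return sum(is_success(attack(x, tau)) for x in active) / len(active)

    def optimise(active, tau, need, iters):
        for _ in range(iters):
            if success_rate(active, tau) >= need:
                return True
            opt.zero_grad()
            loss = sum(adv_loss(attack(x, tau)) for x in active) / len(active)
            loss = loss + lam * torch.clamp(delta.abs() - tau, min=0.0).sum()
            loss.backward()
            opt.step()
        return success_rate(active, tau) >= need

    # Stage 1: tau = 1, grow the audio set one clip at a time until every
    # clip currently in the set is attacked successfully.
    tau, active = 1.0, []
    for x in audio_set:
        active.append(x)
        optimise(active, tau, need=1.0, iters=inner_iters)

    # Stage 2: shrink tau while a perturbation attacking at least eta of the
    # set can still be found; keep the last valid (smallest-norm) solution.
    best = delta.detach().clamp(-tau, tau).clone()
    for _ in range(max_shrinks):
        if not optimise(active, tau * shrink, need=eta, iters=inner_iters):
            break
        tau *= shrink
        best = delta.detach().clamp(-tau, tau).clone()
    return best, tau
```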

4.2 Robustness Against Room Impulse Response

Algorithm 2. Generation of UAPs robust against room impulse responses.

In the audio AE domain, expectation over transformation (EOT) has been widely used to make audio AEs robust against RIRs [20, 22, 25]. The purpose of being robust against RIRs is to let audio AEs still be adversarial when played through speakers and received by microphones. EOT [3] was initially proposed to make image AEs robust against camera transformations.

In this research, we also deploy EOT to make our UAPs robust against RIRs. It should be mentioned that computation becomes prohibitively expensive if too many RIRs are considered [7]. To incorporate EOT, the optimization problem defined in Eq. 3 is modified as follows:

$$\begin{aligned} \begin{aligned}&\min _{\delta } \mathop {\mathbb {E}}_{h \in \mathcal {H}} [\frac{1}{|\mathcal {D}|}\sum _{x \in \mathcal {D}} \ell _{adv}(f(x' * h), t)] + \lambda \cdot \ell _{reg}(\delta , \tau ) \\&\text {such that } \mathop {\mathbb {E}}_{h \in \mathcal {H}}[\mathop {P}_{x \in \mathcal {D}}(f(x'*h)=t)] \ge \eta \end{aligned} \end{aligned}$$
(5)

where \(\mathcal {H}\) is the distribution of RIRs considered, and \(*\) denotes the convolution operation.

Algorithm 2 provides the process used to solve the optimization problem shown in Eq. 5. Specifically, \(\delta \) is initialized as the solution found in Stage 1 of Algorithm 1. For each audio, we randomly select an RIR to transform the audio. \(\tau \) constrains the maximum norm of \(\delta ^\tau \), and it gradually decreases until no valid solution can be found.
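The transformation \(x' * h\) in Eq. 5 is an ordinary 1-D convolution. The sketch below is our own helper with our own normalisation convention, showing one way such a transformation could be applied during training.

```python
import torch
import torch.nn.functional as F

def convolve_rir(x_adv: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Simulate over-the-air playback: first len(x_adv) samples of x_adv * rir."""
    h = (rir / rir.abs().max()).flip(0)                # conv1d cross-correlates, so flip the kernel
    y = F.conv1d(x_adv.view(1, 1, -1), h.view(1, 1, -1), padding=h.numel() - 1)
    y = y.view(-1)[: x_adv.numel()]
    return (y / y.abs().max().clamp(min=1e-8)).clamp(-1.0, 1.0)  # crude peak normalisation
```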

5 Results and Discussion

5.1 Setup

In this study, we used DeepSpeech2 as the target model, which is an end-to-end RNN-based ASR model with CTC loss [2]. We used the open source implementation of DeepSpeech2 V2 (see footnote 1) with Librispeech [18] as the dataset, since a pre-trained model on this dataset was released. Specifically, we randomly extracted 150 audio with durations from 2 to 4 seconds from the “dev-clean” dataset to generate UAPs. We also extracted all audio with durations from 2 to 4 seconds from the “test-clean” dataset for evaluation. We used the following 5 target phrases to generate UAPs: “power off”, “open the door”, “turn off lights”, “use airplane mode”, “visit malicious dot com”. It should be noted that target phrases cannot be too long, because it is overly challenging to force a target model to output transcripts that are too long for short input audio.

Throughout the experiments, unless otherwise indicated, we used the following settings. The Adam method [11] was used for optimization with a learning rate of 0.001. \(\tau \), which controls the maximum norm of UAPs as shown in Eq. 3 and Eq. 5, was initially set to 1.0 and then decreased by multiplying it by 0.8. The minimum success rate \(\eta \) was fixed at 0.8 for both Eq. 3 and Eq. 5. Without EOT, the maximum number of iterations used to lower the maximum norm of UAPs was set to 30. With EOT, the maximum number of iterations was set to 60, because convergence is more computationally expensive in this case.
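As a reference point, these settings map onto optimizer code roughly as follows (a sketch; the variable names are ours, and we assume LibriSpeech's 16 kHz sampling rate).

```python
import torch

SAMPLE_RATE = 16000                                         # LibriSpeech audio is sampled at 16 kHz
delta = torch.zeros(4 * SAMPLE_RATE, requires_grad=True)    # 4 s UAP (see Sect. 5.3)
optimizer = torch.optim.Adam([delta], lr=0.001)             # Adam with learning rate 0.001 [11]

tau, shrink = 1.0, 0.8    # tau starts at 1.0 and is multiplied by 0.8 each outer iteration
eta = 0.8                 # minimum attack success rate in Eq. 3 and Eq. 5
max_outer_iters = 30      # 30 without EOT; 60 when EOT over RIRs is incorporated
```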

5.2 Generating Universal Adversarial Perturbations

Fig. 1. Iteration trend when generating UAPs.

We first used the Stage1 function in Algorithm 1 to generate UAPs for the 5 target phrases. As previously mentioned, the aim of this stage is to generate valid UAPs, even though they may be noisy. The time taken to generate UAPs for the target phrases “power off”, “open the door”, “turn off lights”, “use airplane mode”, and “visit malicious dot com” was 5.0, 2.8, 7.8, 4.2, and 7.9 hours, respectively. The generation time clearly differs across target phrases; this may be because target phrases that are seen less frequently during training of the target model require more iterations. At the start of the generation process, the audio set contained only 1 audio. When the generated UAPs were able to attack all audio in the current set, we added a new audio to the set, i.e. the size of the set increased by 1. This strategy is beneficial for convergence since the UAPs only need to handle one new audio at a time. The set at the end of the process contained 150 audio.

Figure 1 shows the iteration trend to generate UAPs capable of attacking all audio as we gradually increase the size of the audio set. To clearly show the iteration trend, we present a moving average based on 3 data points. The horizontal axis represents the number of audio used to train UAPs, while the vertical axis indicates the number of iterations needed for the UAPs to attack all audio in the set. Early on when the size of the set was small, the number of iterations increased as more audio were added to the set. This is reasonable since the UAPs had to attack a greater number of audio, so more computation was required to find a solution. However, interestingly the iterations started to decrease when the size of the audio set reached around 20. This can be explained from the point of view that the generated UAPs are considered as signals that are transcribed into the target phrase, while audio containing speech are considered as noise being applied to UAPs. From that perspective, it is intuitive that after a while, the UAPs become more robust despite additional audio being added to the set. In other words, when UAPs are robust against a large set of audio, fewer iterations are required to find a solution to attack the newly added audio.

Fig. 2. Increase in success rate as the UAPs attacked an increasing number of audio.

To test the performance of the generated UAPs, we applied the UAPs to all audio with a duration between 2 and 4 seconds from the “test-clean” set. As shown in Fig. 2, the success rate of UAPs increased as more audio was used for training. In the figure, the horizontal axis represents the number of audio used to train UAPs, and the success rate was calculated by applying the UAPs to all 736 audio with a duration between 2 and 4 seconds from the “test-clean” set. The increase in success rate is consistent with the discussion above that UAPs become more robust against new audio as the size of the training set increases.

Table 1. Minimized maximum norm of universal perturbations
Fig. 3. Comparing UAPs generated using Stage1 and Stage2 with the target phrase “power off”. (a) UAPs generated using Stage1 alone were very noisy; (b) Stage2 constrained the maximum norm of UAPs to a small value.

UAPs generated using Stage1 alone were too noisy to be used in practice as they would easily arouse suspicion. Stage2 was used to constrain the maximum norm of UAPs. To effectively decrease the maximum norm, UAPs were only required to attack \(80\%\) of audio in the audio set by setting \(\eta =0.8\). Intuitively, lowering \(\eta \) leads to a smaller maximum norm of UAPs.

Table 1 presents the results of the 5 UAPs. It took around 1 hour to finish Stage2 for each UAP. We can see that the maximum norm of UAPs was greatly reduced after Stage2. UAPs generated using Stage1 and Stage2 with “power off” as the target phrase are compared in Fig. 3. Although the success rate on the test audio decreased because we set \(\eta =0.8\) instead of 1.0, the UAPs were still able to attack over \(45\%\) of audio from the test set.

To give a sense of the distortion caused by our UAPs, Carlini and Wagner [5] reported that the \(95\%\) interval for distortion using their approach was between −15 dB and −45 dB. While our UAPs introduce more distortion than their approach, the key point is that their perturbations are only effective for a specific audio input and must be recalculated for different audio, whereas UAPs are universal and able to attack generic audio.

5.3 Preserving Temporal Dependency

Table 2. An example depicting preserved temporal dependency for UAPs

Temporal dependency (TD) was proposed by Yang et al. [27] as an important property for detecting audio AEs. The key assumption is that benign audio preserves TD while audio AEs do not. Specifically, let \(S_k\) denote the transcript of the first \(k\) portion of the input audio. Let \(S_{\{whole, k\}}\) denote the prefix of the entire transcript whose length equals that of \(S_k\). If \(S_{\{whole, k\}}\) is not consistent with \(S_k\), the audio is potentially adversarial.
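A minimal, self-contained version of this consistency check is sketched below; `transcribe` stands for the target ASR model, and we use the character error rate between \(S_k\) and \(S_{\{whole, k\}}\) as the consistency score, one of the metrics used by Yang et al. [27].

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def td_score(transcribe, audio, k: float) -> float:
    """CER between S_k and S_{whole,k}; large values suggest the audio is adversarial."""
    s_whole = transcribe(audio)                        # transcript of the whole audio
    s_k = transcribe(audio[: int(k * len(audio))])     # transcript of the first k portion
    s_whole_k = s_whole[: len(s_k)]                    # prefix of matching length
    return edit_distance(s_k, s_whole_k) / max(len(s_k), 1)
```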

In our experiments, we found that UAPs generated by Stage2 were transcribed as the target phrase and preserved TD. This finding supports our point of view that UAPs can be considered as signals that are transcribed as the target phrase. The results for the target phrases “power off”, “use airplane mode” and “visit malicious dot com” are shown in Table 2. The experimental results show that the transcripts of differently sliced UAPs were consistent with the corresponding portions of the target phrase. An interesting observation is that when \(k \ge 0.6\), all the partial UAPs were accurately transcribed as the target phrase. This is intuitive because the duration of the UAPs was 4 seconds, and they were required by design to attack \(80\%\) of audio with durations between 2 and 4 seconds. Thus, the first portion of the UAPs was transcribed as the target phrase and was robust against modification. The remaining part of the UAPs then aimed to suppress output from DeepSpeech2, i.e. forcing DeepSpeech2 to output nothing for that part.

Table 3. AUC of temporal dependency detection*

As the UAPs preserved TD, this suggests that audio AEs generated by applying the UAPs would also preserve TD. Therefore, we calculated the same metrics proposed by Yang et al. [27] to validate whether our audio AEs generated using the UAPs were able to avoid TD detection (see footnote 2). These metrics were the area under curve (AUC) score of the word error rate (WER), the AUC of the character error rate (CER), and the AUC of the longest common prefix (LCP).

The audio AEs used in the experiment were those successfully generated by applying our Stage 2 UAPs to the test audio. Table 3 shows the experimental results for \(k=\frac{1}{2}, \frac{2}{3}, \frac{3}{4}\). We can see that TD detection only achieved good performance with WER and LCP on detecting audio AEs with the target phrase “power off” when \(k=\frac{1}{2}\). This implies that the first half of the UAPs for “power off” was not robust enough. To improve the robustness against TD detection for “power off” when \(k=\frac{1}{2}\), a potential solution is to increase the value of \(\eta \) in Stage2. If \(\eta = 1.0\), the first half of the UAPs for “power off” will be forced to be robust, although this will result in a larger maximum norm for UAPs. Other than the “power off” target phrase, we can see from Table 3 that most AUC scores were below 0.75. This indicates that audio AEs generated by our UAPs were overall robust against TD detection.

5.4 Robustness Against Gaussian Noise

Table 4. Success rates of audio AEs generated using UAPs against Gaussian noise

As discussed above, UAPs were trained to be robust against modification using audio containing speech. Table 4 further shows that audio AEs generated by applying UAPs to test audio also remained robust against Gaussian noise for standard deviations up to \(0.01\).
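The evaluation behind Table 4 can be reproduced along the following lines (a sketch; `transcribe` again stands for the target ASR model).

```python
import numpy as np

def success_under_noise(transcribe, adv_examples, target: str, std: float) -> float:
    """Fraction of audio AEs still transcribed as the target after Gaussian noise."""
    hits = 0
    for x_adv in adv_examples:
        noisy = np.clip(x_adv + np.random.normal(0.0, std, size=x_adv.shape), -1.0, 1.0)
        hits += int(transcribe(noisy) == target)
    return hits / len(adv_examples)
```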

5.5 Robustness Against Room Impulse Response

Table 5. Robustness of UAPs and their corresponding audio AEs against RIRs

We generated 100 RIRs from virtual rooms with dimensions \((width, length, height)\) using pyroomacoustics 0.4.2 (see footnote 3). 80 RIRs were used for training while 20 RIRs were used for testing. The height was set to 3.5, \(width = length\), and their value was randomly sampled from \(\mathcal {U}(4,6)\). The time it takes for the RIR to decay by 60 dB was randomly sampled from \(\mathcal {U}(0.15,0.20)\). The locations of microphones and audio sources were randomly sampled inside the virtual rooms.
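Under these settings, one RIR can be sampled with pyroomacoustics roughly as follows. This is a sketch: the exact placement of sources and microphones in our experiments may differ, and the ShoeBox/inverse_sabine calls reflect the library's documented API as we understand it.

```python
import numpy as np
import pyroomacoustics as pra

def sample_rir(fs: int = 16000) -> np.ndarray:
    """Sample one simulated room impulse response (sketch)."""
    side = np.random.uniform(4.0, 6.0)                 # width = length ~ U(4, 6)
    room_dim = [side, side, 3.5]                       # height fixed at 3.5
    rt60 = np.random.uniform(0.15, 0.20)               # time for the RIR to decay by 60 dB
    e_absorption, max_order = pra.inverse_sabine(rt60, room_dim)
    room = pra.ShoeBox(room_dim, fs=fs,
                       materials=pra.Material(e_absorption), max_order=max_order)
    room.add_source(np.random.uniform([0.5, 0.5, 0.5], [side - 0.5, side - 0.5, 3.0]))
    room.add_microphone(np.random.uniform([0.5, 0.5, 0.5], [side - 0.5, side - 0.5, 3.0]))
    room.compute_rir()
    return np.asarray(room.rir[0][0])
```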

To test the robustness against RIRs, each audio AE was transformed by a random RIR from the 20 test RIRs. We also transformed the UAPs by all 20 test RIRs to check whether the UAPs themselves were robust against RIRs. When using Algorithm 2 to generate robust UAPs, we set the maximum number of iterations to 60.

Table 5 compares robust UAPs generated using Algorithm 2 with UAPs generated by Stage2, and also compares the robustness of the audio AEs generated by applying the corresponding UAPs to test audio. With the exception of the UAPs for “open the door”, UAPs generated by Stage2 and their corresponding audio AEs were clearly not robust against RIRs. In contrast, UAPs generated using Algorithm 2 and their corresponding audio AEs were robust against RIRs. It should be noted that robustness against RIRs was obtained at the cost of a significantly larger maximum norm.

5.6 Limitation

Our experiments showed that the quality of audio AEs generated by applying UAPs was poor, and the distortion caused by UAPs becomes worse if we make them robust against RIRs. While it will be difficult to lower the maximum norm of UAPs further while keeping them adversarial, we can potentially hide UAPs below the hearing threshold of an unsuspicious sound. This may be a promising future direction. A potential scenario is one where an adversary plays unsuspicious adversarial audio in the background while the victim speaks to a voice interface, thereby causing the underlying ASR model to be fooled. A similar idea was proposed in CommanderSong [28], which hid perturbations within a song. However, their method may not be robust against speech, which is common for voice interfaces.

Here, we present a feasibility study on hiding UAPs below the hearing threshold in a piece of piano music. We incorporated the masking loss proposed by Qin et al. [20], which hides perturbations below the hearing threshold of speech. Specifically, we replaced \(\ell _{reg}(\cdot )\) in Eq. 3 with the masking loss. Instead of generating UAPs from scratch, we used UAPs generated by Stage2 of Algorithm 1 as initial values. It should be mentioned that audio AEs were generated by applying the UAPs together with the music.

Measuring the maximum norm of UAPs is meaningless in this case because large values in UAPs would be masked by the music. Therefore, we measured the Perceptual Evaluation of Speech Quality (PESQ), which was proposed to automatically measure degradation in the context of telephony [21]. The values range from 1.0 to 4.5 with larger values indicating better quality.
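If the `pesq` Python package (an implementation of ITU-T P.862) is available, the score can be computed directly; the snippet below is a sketch under that assumption.

```python
import numpy as np
from pesq import pesq   # assumption: the "pesq" PyPI package is installed

def music_quality(clean_music: np.ndarray, perturbed_music: np.ndarray,
                  fs: int = 16000) -> float:
    """Wide-band PESQ between the original music and the music carrying the UAP."""
    return pesq(fs, clean_music, perturbed_music, 'wb')
```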

After running 30 iterations, we successfully generated UAPs by setting \(\eta =0.5\). The PESQ score between the original music and the music distorted by UAPs was 2.97, indicating moderate quality. The success rate of generating audio AEs from the test audio was \(30.71\%\). This shows that UAPs hidden in music are still able to attack generic audio.

6 Conclusion and Future Work

In the audio AE domain, there is limited work focusing on generating UAPs against ASR models. In this research, we filled this gap by proposing the first successful targeted UAPs against ASR models with CTC loss. We analyzed UAPs from the point of view that they can be considered as signals transcribed as the target phrase. To decrease the distortion caused by UAPs, we minimized the maximum norm of UAPs. In addition, we showed that the UAPs themselves preserved temporal dependency, such that the audio AEs generated by applying them also preserved temporal dependency. The UAPs and the corresponding audio AEs were also robust against Gaussian noise. We demonstrated the possibility of hiding UAPs below the hearing threshold of an unsuspicious sound, such as music. Future work will focus on generating UAPs with reduced distortion.