Detecting adversarial examples by positive and negative representations

https://doi.org/10.1016/j.asoc.2021.108383

Highlights

  • We propose a simple and efficient PNDetector to detect adversarial perturbations.

  • We use PNClassifiers as part of the PNDetector rather than traditional classifiers.

  • We use eight methods to generate examples to test the performance of the detector.

Abstract

Deep neural networks (DNNs) have been successfully applied in various fields. However, it has been demonstrated that a well-designed and quasi-imperceptible perturbation can confuse a targeted DNN classifier with high confidence and lead to misclassification. Examples with such perturbations are called adversarial examples, and detecting them is a challenging task. In this paper, we propose a positive–negative detector (PNDetector) to detect adversarial examples. The PNDetector is based on a positive–negative classifier (PNClassifier), which is trained on both the original examples (called positive representations) and their negative representations, which share the same structural and semantic features. The principle of the PNDetector is that the positive and negative representations of an adversarial example have a high probability of being mapped to different categories by the PNClassifier, whereas the classifier's performance on clean examples is not reduced by adding negative representations to the training set. We test the PNDetector with adversarial examples generated by eight typical attack methods on four typical datasets. The experimental results demonstrate that the proposed detector is effective on all datasets and against all attack types, and that its detection performance is comparable to that of state-of-the-art methods.

Introduction

In recent years, deep neural networks (DNNs) have been widely used in machine learning tasks, in particular speech recognition [1], face recognition [2], autonomous driving [3], and riverside monitoring systems [4]. However, it has been demonstrated that DNNs are susceptible to adversarial perturbations, that is, small perturbations that can cause legitimate images to be misclassified with high confidence. Examples containing such perturbations are called adversarial examples [5]. Attacking a DNN model in this manner may have serious consequences. For example, an attack on a facial recognition system may result in illegal access, a misjudgment by an automated driving system may cause a car accident, and a virus spoofing a malicious code detection system may seriously damage a computer.

The widespread application of DNNs has drawn considerable attention to defenses against adversarial examples. However, this is a challenging task. First, different attacks generate diverse adversarial examples, and it is difficult to establish a unified model that describes their intrinsic characteristics. Second, the perturbations are mostly solutions of non-convex nonlinear optimization problems whose theoretical basis is not well understood [6]. It is therefore difficult to form a systematic theoretical explanation for the existence of adversarial examples in DNN models, which also makes defenses based on network flaws difficult.

Existing detection methods require a large amount of time-consuming computation. To develop a fast and effective detection algorithm, we consider the negative representation of information for building an adversarial example detector. The negative representation of information is a branch of immune computing used for information security and privacy protection [7], [8], [9]. Inspired by the negative representation of information [10], [11], we can easily flip the pixels of an original image to construct a negative image. Such a negative image has different pixel values from the original image but the same semantics, which is useful for two reasons. First, negative images can be generated without any auxiliary information to enrich the dataset, making the decision boundary of the model more accurate. Second, existing attacks essentially target positive examples, and with high probability a positive adversarial example and its negative representation (the negative adversarial example) behave differently; we can therefore analyze the difference between the positive and negative images to detect adversarial examples.
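As a concrete illustration of how negative images can enrich a dataset without any auxiliary information, the following minimal Python sketch builds negative representations by pixel inversion. The array shapes and the random stand-in data are hypothetical; only the inversion itself follows the paper.

```python
import numpy as np

def negative_representation(x):
    """Return the negative representation of an image batch.

    Assumes pixel values are already normalized to [0, 1], so the
    negative of a pixel value p is simply 1 - p.
    """
    return 1.0 - x

# Illustrative usage: enrich a training set with negative representations.
# x_train here is random stand-in data with shape (n, height, width, channels).
x_train = np.random.rand(8, 28, 28, 1)
x_train_enriched = np.concatenate([x_train, negative_representation(x_train)], axis=0)
```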

This paper focuses on adversarial examples from the perspective of the feature space. We propose a positive–negative detector (PNDetector) to detect adversarial examples; it is an attack-agnostic detection method, that is, when the PNDetector examines an input example, it does not know which attack method generated the example. The starting point of our study is that directly defending against adversarial perturbations is difficult, and thus we study the intrinsic feature space of negative-representation examples, in which all pixels are negated. A negative-representation image is the inversion of the original image: if the original image is x with pixel values normalized to the interval [0,1], its negative representation is 1−x. Given that negative and positive images often have the same semantic structure [12], we propose the PNClassifier, which adds negative-representation examples to the training set. The PNClassifier enforces similar outputs for the positive and negative representations of clean examples while mapping the positive and negative representations of adversarial examples into disjoint regions of the example space. Therefore, if the similarity computed by the proposed detector for an unknown example is large, the example is judged to be legitimate; if the difference is large, the example is judged to be adversarial and the input is rejected to avoid further damage.
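The decision rule described above can be sketched as follows. This is a minimal illustration, assuming a trained PNClassifier exposed as a hypothetical callable `pn_classifier` that returns softmax probabilities and assuming the same-label training variant, so the two probability vectors are directly comparable; the distance measure and threshold are illustrative, not the paper's exact decision strategy.

```python
import numpy as np

def pn_detect(pn_classifier, x, threshold=0.5):
    """Flag inputs as adversarial when the PNClassifier's outputs for the
    positive image x and its negative representation 1 - x disagree.

    pn_classifier: hypothetical callable mapping a batch of images in [0, 1]
        to softmax probability vectors; assumed to be trained on both
        positive and negative representations (same-label variant).
    threshold: illustrative cut-off on the L1 distance between the two
        probability vectors; the paper tunes its own decision strategy.
    """
    p_pos = pn_classifier(x)            # output on the positive representation
    p_neg = pn_classifier(1.0 - x)      # output on the negative representation
    dissimilarity = np.abs(p_pos - p_neg).sum(axis=-1)
    return dissimilarity > threshold    # True -> judged adversarial and rejected
```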

In summary, the main contributions of this study are as follows.

  • Based on the negative representation of information, we propose a PNDetector to detect adversarial perturbations. The detector consists of a classifier and a set of decision strategies. It judges whether an input is an adversarial example by comparing the similarity of the outputs for its positive and negative representations. The PNDetector is simple, computationally efficient, and does not depend on specific datasets, attack methods, or network models.

  • We use PNClassifiers as part of the PNDetector rather than traditional classifiers. PNClassifiers are trained on the original examples and their negative representations. According to the label-assignment strategy for negative examples, we propose three PNClassifiers: in the first, positive and negative examples share the same label, whereas the other two assign them different labels (a minimal label-construction sketch is given after this list).

  • We use eight advanced attack methods to generate adversarial examples for testing the performance of the detector. The experimental results indicate that the PNDetector performs competitively with state-of-the-art detection methods. Moreover, its good performance suggests that the ability to attack multiple representations of an example simultaneously will become an important factor in evaluating attack performance.
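As referenced in the second contribution above, the following sketch illustrates how a PNClassifier training set might be assembled under the two kinds of label-assignment strategies. The label-offset scheme used for the "different labels" case is only an illustration; the paper's own two variants may assign labels differently.

```python
import numpy as np

def build_pn_training_set(x, y, num_classes, same_label=True):
    """Assemble a PNClassifier training set from positive examples (x, y).

    same_label=True : negative representations keep the original labels
                      (the first strategy described above).
    same_label=False: negative representations receive their own labels,
                      here offset by num_classes so the classifier has
                      2 * num_classes outputs; this offset scheme is an
                      illustration of 'different labels' only.
    """
    x_neg = 1.0 - x                       # negative representations (pixels in [0, 1])
    y_neg = y if same_label else y + num_classes
    x_pn = np.concatenate([x, x_neg], axis=0)
    y_pn = np.concatenate([y, y_neg], axis=0)
    return x_pn, y_pn
```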

The rest of this paper is organized as follows. Section 2 introduces the background and related work. Section 3 introduces the concept of negative representations of images, proposes the PNDetector, and uses the PNClassifier to analyze the rationality of the PNDetector. Section 4 presents the algorithm settings and the experimental results of the PNClassifier and the PNDetector, and compares the performance of the proposed method with that of several state-of-the-art algorithms. The final section concludes the paper and proposes possible future research directions.

Section snippets

Related work

It has been demonstrated that in most classification models, examples with difficult-to-perceive perturbations can be easily misclassified. In this section, we introduce some typical adversarial attacks as well as relevant work on defense.

Motivation

DNNs, particularly convolutional neural networks (CNNs), have been successfully used for image recognition because they simulate visual object recognition, that is, local perception fields that perceive images from local to global. The key to recognition lies in the connections between image pixels, that is, the structural features of the image; comparatively, the value of any individual pixel is less important. Therefore, if some reasonable changes are applied to all pixels, DNNs should also be able to identify the image accurately

Experimental setup

The proposed PNDetector is implemented in Python and is based on Feature Squeezing [43] and the cleverhans library [50]. The attack methods used for generating adversarial examples are from the cleverhans library. The experiments are conducted on a PC with an i7-8786k 4.0 GHz CPU and a GeForce GTX 1080Ti GPU. The source code is available at https://github.com/Daftstone/PNDetector.
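The attacks in the cleverhans library can be invoked roughly as follows. This is a minimal sketch assuming the TF1-era cleverhans v3 API with a Keras classifier; the model file name, stand-in data, and attack parameters are illustrative, not the paper's exact settings.

```python
import numpy as np
import keras.backend as K
from keras.models import load_model
from cleverhans.attacks import FastGradientMethod
from cleverhans.utils_keras import KerasModelWrapper

model = load_model("pn_classifier.h5")      # hypothetical trained classifier file
wrapped = KerasModelWrapper(model)          # adapt the Keras model for cleverhans
fgsm = FastGradientMethod(wrapped, sess=K.get_session())

# x_test here is random stand-in data; real experiments use clean test images in [0, 1].
x_test = np.random.rand(16, 28, 28, 1).astype("float32")
x_adv = fgsm.generate_np(x_test, eps=0.3, clip_min=0.0, clip_max=1.0)
```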

Conclusions and future work

We proposed the PNDetector for detecting adversarial perturbations based on the positive and negative representations of examples. In the PNDetector, we used three PNClassifiers by adding negative-representation examples to the training set. Furthermore, we used the image feature space to present our insights into the cause of adversarial examples and explain the plausibility of the PNDetector. The key of the detector is that the feature space of the negative adversarial example is inconsistent

CRediT authorship contribution statement

Wenjian Luo: Methodology, Experiments, Writing – original draft, Funding acquisition, Project administration. Chenwang Wu: Methodology, Software, Experiments, Writing – original draft. Li Ni: Software checking, Experiments checking, Writing – revising draft. Nan Zhou: Methodology, Software checking, Experiments checking, Writing – revising draft. Zhenya Zhang: Experiments checking, Writing – revising draft, Funding acquisition, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (54)

  • Salazar, A., et al., Generative adversarial networks and Markov random fields for oversampling very small training sets, Expert Syst. Appl. (2021)
  • Hinton, G., et al., Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Process. Mag. (2012)
  • Taigman, Y., Yang, M., Ranzato, M., Wolf, L., Deepface: Closing the gap to human-level performance in face verification, ...
  • Chen, C., Seff, A., Kornhauser, A., Xiao, J., Deepdriving: Learning affordance for direct perception in autonomous driving, ...
  • Połap, D., et al., Automatic ship classification for a riverside monitoring system using a cascade of artificial intelligence techniques including penalties and rewards, ISA Trans. (2021)
  • Szegedy, C., et al., Intriguing properties of neural networks
  • Goodfellow, I., et al., Attacking machine learning with adversarial examples, OpenAI (2017)
  • Forrest, S., et al., Self-nonself discrimination in a computer
  • Zhao, D., et al., Negative iris recognition, IEEE Trans. Dependable Secure Comput. (2018)
  • Luo, W., et al., Authentication by encrypted negative password, IEEE Trans. Inf. Forensics Secur. (2019)
  • Esponda, F., Negative Representations of Information (2005)
  • Luo, W., et al., Three branches of negative representation of information: A survey, IEEE Trans. Emerg. Top. Comput. Intell. (2018)
  • Hosseini, H., et al., On the limitation of convolutional neural networks in recognizing negative images
  • Goodfellow, I.J., et al., Explaining and harnessing adversarial examples, stat (2015)
  • Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P., DeepFool: A simple and accurate method to fool deep neural networks, in: ...
  • Papernot, N., et al., The limitations of deep learning in adversarial settings
  • Carlini, N., et al., Towards evaluating the robustness of neural networks
  • Su, J., et al., One pixel attack for fooling deep neural networks, IEEE Trans. Evol. Comput. (2019)
  • Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J., Boosting adversarial attacks with momentum, in: Proceedings of ...
  • Athalye, A., et al., Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples
  • Peng, D., et al., Structure-preserving transformation: Generating diverse and transferable adversarial examples
  • Shamsabadi, A.S., Sanchez-Matilla, R., Cavallaro, A., ColorFool: Semantic adversarial colorization, in: Proceedings of the ...
  • Hosseini, H., Poovendran, R., Semantic adversarial examples, in: Proceedings of the IEEE Conference on Computer Vision ...
  • Akhtar, N., et al., Threat of adversarial attacks on deep learning in computer vision: A survey, IEEE Access (2018)
  • Kurakin, A., et al., Adversarial machine learning at scale
  • Papernot, N., et al., Distillation as a defense to adversarial perturbations against deep neural networks
  • Hinton, G., et al., Distilling the knowledge in a neural network, stat (2015)

This study is supported by the National Natural Science Foundation of China (No. 61573327) and the open project of the Anhui Province Key Laboratory of Intelligent Building and Building Energy Saving, China (No. IBBE2018KX01ZD).
