Detecting adversarial examples by positive and negative representations

https://doi.org/10.1016/j.asoc.2021.108383

Highlights

  • We propose a simple and efficient PNDetector to detect adversarial perturbations.

  • We use PNClassifiers as part of the PNDetector rather than traditional classifiers.

  • We use eight methods to generate examples to test the performance of the detector.

Abstract

Deep neural networks (DNNs) have been successfully applied in various fields. However, it has been demonstrated that a well-designed and quasi-imperceptible perturbation can confuse a targeted DNN classifier with high confidence and lead to misclassification. Examples with such perturbations are called adversarial examples, and detecting them is a challenging task. In this paper, we propose a positive–negative detector (PNDetector) to detect adversarial examples. The PNDetector is based on a positive–negative classifier (PNClassifier), which is trained on both the original examples (called positive representations) and their negative representations, which share the same structural and semantic features. The principle of the PNDetector is that the positive and negative representations of an adversarial example have a high probability of being mapped to different categories by the PNClassifier, whereas the classifier's performance on clean examples is not reduced by adding negative representations to the training set. We test the PNDetector with adversarial examples generated by eight typical attack methods on four typical datasets. The experimental results demonstrate that the proposed detector is effective on all datasets and against all attack types, and that its detection performance is comparable to that of state-of-the-art methods.

Introduction

In recent years, deep neural networks (DNNs) have been widely used in machine learning tasks, in particular speech recognition [1], face recognition [2], autonomous driving [3], and riverside monitoring systems [4]. However, it has been demonstrated that DNNs are susceptible to adversarial perturbations, that is, small perturbations that can cause legitimate images to be misclassified with high confidence. Examples containing such perturbations are called adversarial examples [5]. Attacking a DNN model in this manner may have serious consequences. For example, an attack on a facial recognition system may result in illegal access, a misjudgment by an automated driving system may cause a car accident, and a virus spoofing a malicious code detection system may seriously damage a computer.

The widespread application of DNNs has drawn considerable attention to defenses against adversarial examples. However, this is a challenging task. First, different attacks generate diverse adversarial examples, and it is difficult to establish a unified model that describes their intrinsic characteristics. Second, the perturbations are mostly solutions of non-convex nonlinear optimization problems whose theoretical basis is not well understood [6]. It is therefore difficult to form a systematic theoretical explanation for the existence of adversarial examples in DNN models, which also makes defenses based on network flaws difficult.

Existing detection methods require a large amount of time-consuming computation. To develop a fast and effective detection algorithm, we consider the negative representation of information for building an adversarial example detector. The negative representation of information is a branch of immune computing used for information security and privacy protection [7], [8], [9]. Inspired by the negative representation of information [10], [11], we can easily flip the pixels of an original image to construct a negative image. Such a negative image has different pixel values from the original image but the same semantics, which is useful for two reasons. First, negative images can be generated without any auxiliary information to enrich the dataset, making the decision boundary of the model more accurate. Second, existing attacks essentially target positive examples, and with high probability a positive adversarial example and its negative representation (the negative adversarial example) behave differently; we can therefore analyze the difference between the positive and negative images to detect adversarial examples.
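As a concrete illustration of how negative images can enrich a dataset without any auxiliary information, the following minimal Python sketch builds negative representations by pixel inversion. The array shapes and the random stand-in data are hypothetical; only the inversion itself follows the paper.

```python
import numpy as np

def negative_representation(x):
    """Return the negative representation of an image batch.

    Assumes pixel values are already normalized to [0, 1], so the
    negative of a pixel value p is simply 1 - p.
    """
    return 1.0 - x

# Illustrative usage: enrich a training set with negative representations.
# x_train here is random stand-in data with shape (n, height, width, channels).
x_train = np.random.rand(8, 28, 28, 1)
x_train_enriched = np.concatenate([x_train, negative_representation(x_train)], axis=0)
```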

This paper focuses on adversarial examples from the perspective of the feature space. We propose a positive–negative detector (PNDetector) to detect adversarial examples; it is an attack-agnostic detection method, that is, when the PNDetector examines an input example, it does not know which attack method generated the example. The starting point of our study is that directly defending against adversarial perturbations is difficult, and thus we study the intrinsic feature space of negative-representation examples, in which all pixels are negated. A negative-representation image is the inversion of the original image: if the original image is x with pixel values normalized to the interval [0,1], its negative representation is 1−x. Given that negative and positive images often have the same semantic structure [12], we propose the PNClassifier, which adds negative-representation examples to the training set. The PNClassifier enforces similar outputs for the positive and negative representations of clean examples while mapping the positive and negative representations of adversarial examples into disjoint regions of the example space. Therefore, if the similarity computed by the proposed detector for an unknown example is large, the example is judged to be legitimate; if the difference is large, the example is judged to be adversarial and the input is rejected to avoid further damage.
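The decision rule described above can be sketched as follows. This is a minimal illustration, assuming a trained PNClassifier exposed as a hypothetical callable `pn_classifier` that returns softmax probabilities and assuming the same-label training variant, so the two probability vectors are directly comparable; the distance measure and threshold are illustrative, not the paper's exact decision strategy.

```python
import numpy as np

def pn_detect(pn_classifier, x, threshold=0.5):
    """Flag inputs as adversarial when the PNClassifier's outputs for the
    positive image x and its negative representation 1 - x disagree.

    pn_classifier: hypothetical callable mapping a batch of images in [0, 1]
        to softmax probability vectors; assumed to be trained on both
        positive and negative representations (same-label variant).
    threshold: illustrative cut-off on the L1 distance between the two
        probability vectors; the paper tunes its own decision strategy.
    """
    p_pos = pn_classifier(x)            # output on the positive representation
    p_neg = pn_classifier(1.0 - x)      # output on the negative representation
    dissimilarity = np.abs(p_pos - p_neg).sum(axis=-1)
    return dissimilarity > threshold    # True -> judged adversarial and rejected
```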

In summary, the main contributions of this study are as follows.

  • Based on the negative representation of information, we propose a PNDetector to detect adversarial perturbations. The detector consists of a classifier and a set of decision strategies. It judges whether an input is an adversarial example by comparing the similarity of the outputs for its positive and negative representations. The PNDetector is simple, computationally efficient, and does not depend on specific datasets, attack methods, or network models.

  • We use PNClassifiers as part of the PNDetector rather than traditional classifiers. PNClassifiers are trained on the original examples and their negative representations. According to the label-assignment strategy for negative examples, we propose three PNClassifiers: in the first, positive and negative examples share the same label, whereas the other two assign them different labels (a minimal label-construction sketch is given after this list).

  • We use eight advanced attack methods to generate adversarial examples for testing the performance of the detector. The experimental results indicate that the PNDetector performs competitively with state-of-the-art detection methods. Moreover, its good performance suggests that the ability to attack multiple representations of an example simultaneously will become an important factor in evaluating attack performance.
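As referenced in the second contribution above, the following sketch illustrates how a PNClassifier training set might be assembled under the two kinds of label-assignment strategies. The label-offset scheme used for the "different labels" case is only an illustration; the paper's own two variants may assign labels differently.

```python
import numpy as np

def build_pn_training_set(x, y, num_classes, same_label=True):
    """Assemble a PNClassifier training set from positive examples (x, y).

    same_label=True : negative representations keep the original labels
                      (the first strategy described above).
    same_label=False: negative representations receive their own labels,
                      here offset by num_classes so the classifier has
                      2 * num_classes outputs; this offset scheme is an
                      illustration of 'different labels' only.
    """
    x_neg = 1.0 - x                       # negative representations (pixels in [0, 1])
    y_neg = y if same_label else y + num_classes
    x_pn = np.concatenate([x, x_neg], axis=0)
    y_pn = np.concatenate([y, y_neg], axis=0)
    return x_pn, y_pn
```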

The rest of this paper is organized as follows. Section 2 introduces the background and related work. Section 3 introduces the concept of negative representations of images, proposes the PNDetector, and uses the PNClassifier to analyze the rationality of the PNDetector. Section 4 presents the algorithm settings and the experimental results of the PNClassifier and the PNDetector, and compares the performance of the proposed method with that of several state-of-the-art algorithms. The final section concludes the paper and proposes possible future research directions.

Section snippets

Related work

It has been demonstrated that in most classification models, examples with difficult-to-perceive perturbations can be easily misclassified. In this section, we introduce some typical adversarial attacks as well as relevant work on defense.

Motivation

DNNs, particularly convolutional neural networks (CNNs), have been successfully used for image recognition because they simulate visual object recognition, that is, local perception fields that perceive images from local to global. The key to recognition lies in the connections between image pixels, that is, the structural features of the image; comparatively, the value of any individual pixel is less important. Therefore, if some reasonable changes are applied to all pixels, DNNs should also be able to identify the image accurately

Experimental setup

The proposed PNDetector is implemented in Python and is based on Feature Squeezing [43] and the cleverhans library [50]. The attack methods used for generating adversarial examples are from the cleverhans library. The experiments are conducted on a PC with an i7-8786k 4.0 GHz CPU and a GeForce GTX 1080Ti GPU. The source code is available at https://github.com/Daftstone/PNDetector.
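The attacks in the cleverhans library can be invoked roughly as follows. This is a minimal sketch assuming the TF1-era cleverhans v3 API with a Keras classifier; the model file name, stand-in data, and attack parameters are illustrative, not the paper's exact settings.

```python
import numpy as np
import keras.backend as K
from keras.models import load_model
from cleverhans.attacks import FastGradientMethod
from cleverhans.utils_keras import KerasModelWrapper

model = load_model("pn_classifier.h5")      # hypothetical trained classifier file
wrapped = KerasModelWrapper(model)          # adapt the Keras model for cleverhans
fgsm = FastGradientMethod(wrapped, sess=K.get_session())

# x_test here is random stand-in data; real experiments use clean test images in [0, 1].
x_test = np.random.rand(16, 28, 28, 1).astype("float32")
x_adv = fgsm.generate_np(x_test, eps=0.3, clip_min=0.0, clip_max=1.0)
```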

Conclusions and future work

We proposed the PNDetector for detecting adversarial perturbations based on the positive and negative representations of examples. In the PNDetector, we used three PNClassifiers by adding negative-representation examples to the training set. Furthermore, we used the image feature space to present our insights into the cause of adversarial examples and explain the plausibility of the PNDetector. The key of the detector is that the feature space of the negative adversarial example is inconsistent

CRediT authorship contribution statement

Wenjian Luo: Methodology, Experiments, Writing – original draft, Funding acquisition, Project administration. Chenwang Wu: Methodology, Software, Experiments, Writing – original draft. Li Ni: Software checking, Experiments checking, Writing – revising draft. Nan Zhou: Methodology, Software checking, Experiments checking, Writing – revising draft. Zhenya Zhang: Experiments checking, Writing – revising draft, Funding acquisition, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (54)

  • Salazar, A., et al., Generative adversarial networks and Markov random fields for oversampling very small training sets, Expert Syst. Appl. (2021)
  • Hinton, G., et al., Deep neural networks for acoustic modeling in speech recognition, IEEE Signal Process. Mag. (2012)
  • Taigman, Y., Yang, M., Ranzato, M., Wolf, L., Deepface: Closing the gap to human-level performance in face verification, ...
  • Chen, C., Seff, A., Kornhauser, A., Xiao, J., Deepdriving: Learning affordance for direct perception in autonomous driving, ...
  • Połap, D., et al., Automatic ship classification for a riverside monitoring system using a cascade of artificial intelligence techniques including penalties and rewards, ISA Trans. (2021)
  • Szegedy, C., et al., Intriguing properties of neural networks
  • Goodfellow, I., et al., Attacking machine learning with adversarial examples, OpenAI (2017)
  • Forrest, S., et al., Self-nonself discrimination in a computer
  • Zhao, D., et al., Negative iris recognition, IEEE Trans. Dependable Secure Comput. (2018)
  • Luo, W., et al., Authentication by encrypted negative password, IEEE Trans. Inf. Forensics Secur. (2019)
  • Esponda, F., Negative Representations of Information (2005)
  • Luo, W., et al., Three branches of negative representation of information: A survey, IEEE Trans. Emerg. Top. Comput. Intell. (2018)
  • Hosseini, H., et al., On the limitation of convolutional neural networks in recognizing negative images
  • Goodfellow, I.J., et al., Explaining and harnessing adversarial examples, stat (2015)
  • Moosavi-Dezfooli, S.-M., Fawzi, A., Frossard, P., DeepFool: A simple and accurate method to fool deep neural networks, in: ...
  • Papernot, N., et al., The limitations of deep learning in adversarial settings
  • Carlini, N., et al., Towards evaluating the robustness of neural networks
  • Su, J., et al., One pixel attack for fooling deep neural networks, IEEE Trans. Evol. Comput. (2019)
  • Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., Li, J., Boosting adversarial attacks with momentum, in: Proceedings of ...
  • Athalye, A., et al., Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples
  • Peng, D., et al., Structure-preserving transformation: Generating diverse and transferable adversarial examples
  • Shamsabadi, A.S., Sanchez-Matilla, R., Cavallaro, A., ColorFool: Semantic adversarial colorization, in: Proceedings of the ...
  • Hosseini, H., Poovendran, R., Semantic adversarial examples, in: Proceedings of the IEEE Conference on Computer Vision ...
  • Akhtar, N., et al., Threat of adversarial attacks on deep learning in computer vision: A survey, IEEE Access (2018)
  • Kurakin, A., et al., Adversarial machine learning at scale
  • Papernot, N., et al., Distillation as a defense to adversarial perturbations against deep neural networks
  • Hinton, G., et al., Distilling the knowledge in a neural network, stat (2015)

This study is supported by the National Natural Science Foundation of China (No. 61573327) and the open project of the Anhui Province Key Laboratory of Intelligent Building and Building Energy Saving, China (No. IBBE2018KX01ZD).
