Incorporating learning-based components in current state-of-the-art cyber-physical systems (CPS) has been a challenge due to the brittleness of the underlying deep neural networks. On the bright side, if executed correctly with safety guarantees, such components have the potential to revolutionize autonomous systems, medicine, and other safety-critical domains, because they would allow system designers to use high-dimensional outputs from sensors like cameras and LiDAR. The trepidation in deploying systems with vision and LiDAR components comes from incidents of catastrophic failures in the real world. Recent reports of self-driving cars running into difficult-to-handle scenarios can often be traced back to the software components which process such sensor inputs.
The ability to handle such high-dimensional signals is due to the explosion of algorithms which use deep neural networks. Unfortunately, deep neural networks are also the root of the safety issues. The pitfalls arise from possible over-fitting and a lack of awareness of the blind spots induced by the training distribution. Ideally, system designers would wish to cover as many scenarios during training as possible. However, achieving meaningful coverage is impossible in practice. This naturally leads to the following question: is it feasible to flag out-of-distribution (OOD) samples without causing too many false alarms? Such an OOD detector must also be computationally efficient, because OOD detectors are often executed as frequently as the sensors are sampled.
Our aim in this article is to build an effective anomaly detector. To this end, we propose the idea of a memory bank to cache data samples which are representative enough to cover most of the in-distribution data. The similarity with respect to such samples can be a measure of familiarity of the test input. This is made possible by an appropriate choice of distance function tailored to the type of sensor we are interested in. Additionally, we adapt the conformal anomaly detection framework to capture distribution shifts with a guarantee on the false alarm rate. We report the performance of our technique on two challenging scenarios: a self-driving car setting with image inputs implemented inside the CARLA simulator, and an autonomous racing car navigation setting with LiDAR inputs. From the experiments, it is clear that a deviation from the in-distribution setting can potentially lead to unsafe behavior. It should be noted that not all OOD inputs lead to precarious situations in practice, but staying in-distribution is akin to staying within a safety bubble of predictable behavior. An added benefit of our memory-based approach is that the OOD detector produces interpretable feedback for a human designer, which is of utmost importance since it also suggests a potential fix for the situation. In competing approaches, such feedback is difficult to obtain due to their reliance on variational autoencoders.
1 Introduction
In the past few years, tremendous progress has been made toward improving the building blocks of fully autonomous systems. Components like LiDAR and cameras, which were out of reach for most day-to-day applications, have become an integral part of the sensor array for driverless cars. Yet, there is a missing component that limits the adoption of autonomous systems beyond a well-monitored lab setting into the open world. The missing piece is safety. The consensus right now is to make cautious incremental progress before full autonomy is reached [45]. The first step is to incorporate Learning-Enabled Components (LECs) in modern Cyber-Physical Systems (CPS) in a limited capacity. Such LECs often rely on access to a large corpus of data curated by humans, which in turn requires well-trained experts to build. Although this is often a big challenge in practice, we assume in this article that such a dataset is available during the training phase.
We direct the attention of the reader to one such LEC in a modern car [12]. A humble setting compared to a driverless car is a simple lane-keeping controller. The control algorithm receives a sensor feed in the form of a video and makes slight adjustments to keep the vehicle on the road. This happens under the supervision of a human driver who is still responsible for the safety of the vehicle. Contrast this scenario with that of a fully self-driving car equipped with an Automated Driving System (ADS). The software in this case has full control of the car at multiple levels of abstraction, starting from navigation and choice of route to maneuvering the vehicle in heavy traffic. If realized correctly, this can have deep ramifications: it can reduce accidents, improve general vehicle safety, and provide higher mobility for seniors and those who cannot drive. For this work, we use such a lane-keeping controller based on video input.
The workhorse of machine learning systems is the Deep Neural Network (DNN). These high-capacity function approximators can take inputs from a camera or LiDAR and produce a desired output. To function properly, DNNs go through a process of training to correct their mistakes by learning from a large corpus of curated data. Training large-scale DNNs is possible only due to the development of well-engineered gradient descent tools such as PyTorch [44] and TensorFlow [4], as well as improvements in specialized hardware like GPUs. It is not surprising anymore for DNNs to achieve superhuman performance in Atari games [42] and Go [51]. But this does not come without its potential pitfalls. The mode of functioning of a DNN is quite different from the way humans operate. There is a clear lack of similarity when it comes to understanding an image. Concepts like persistence of objects, segmentation, and depth of vision do not arise naturally in a typical end-to-end training scenario. The ability to classify objects comes from fitting a high-capacity function to the training data by controlling its labeling error rate. DNNs in their most basic form have millions of nodes and parameters, arranged to form a directed acyclic computational graph. The complex nature of this graph, along with its size, makes it particularly difficult for human experts to analyze. This means that computer vision systems are likely to make mistakes in ways that humans would not. A well-known example of such a phenomenon is the presence of adversarial perturbations to an otherwise clean image, referred to as an adversarial attack in the literature. Here, one makes imperceptible changes to an image which are completely ignored by human vision but can mislead the DNN. This can adversely affect the safety of autonomous systems. For instance, it is fairly straightforward to alter a speed limit sign to be interpreted as a different speed or change a stop sign into a speed limit sign.
If neural networks could automatically learn concepts the way humans do, they would generalize well outside the training region, implying less concern for the CPS designer. But it is well known that, when pushed outside of their training region, neural networks can behave in an unpredictable fashion. Statistical machine learning tools can compute upper bounds on the generalization error of a learned model, but these are often overly conservative due to the degree of over-parameterization in a DNN.
Out-of-Distribution (OOD) detection [24, 32, 41] has been the focus of attention in a large section of the literature, mainly because of the promise it offers for infusing robustness into the otherwise delicate workflow of DNNs. The main target of the research community working on these approaches is to analyze the robustness of the system when exposed to unknown settings. But it is apparent from the literature that such approaches do not address CPS settings, which operate in a closed loop and produce closely related samples; instead, they are geared toward standard classification datasets like MNIST and CIFAR10. The main contribution of this article is that it offers an avenue to detect anomalous inputs in real time and in an interpretable fashion. The overall workflow involves building a representation system for familiar inputs (i.e., in-Distribution (iD) data, as it is commonly called) while minimizing the number of representative samples used in the process. Such prototype samples are referred to as memories in our work. To discover these prototypes, we use well-established computer vision techniques to compute the similarity of two images.
To rigorously control false detection by the proposed approach, we leverage a conformal anomaly detection framework [37]. This framework aims to test if an input conforms to the training distribution by assigning a non-conformity score to the input. The higher the score, the more non-conforming the input is to the training distribution. The probability of false detection by conformal anomaly detection is upper bounded by the detection threshold. The detection performance, however, depends on the choice of the Non-Conformity Measure (NCM) used, as described in the original framework [6]. We propose using the Structural Similarity Index Metric (SSIM) in the conformal anomaly detection framework for OOD detection in CPS. With inputs coming to the CPS as a time-series sequence, we propose to use the Harmonic Mean p-value (HMP) method [63] to combine conformal predictions from individual inputs (in the sequence) for detection on the sequence. The HMP method preserves the false detection guarantees from the conformal anomaly detection framework without assuming that the conformal predictions are computed independently [63].
Another advantage of the proposed framework is its ability to produce interpretable outcomes. This is accomplished using samples from the training dataset. In explainable machine learning systems, comparative methods which relate a test sample to a witness from the dataset [13, 52] are fairly well accepted. Such feedback is especially useful to the system designer, and it can often be turned into a potential fix for the improper behavior, as will be evident later in the article. In this work, we consider two broad sets of case studies, one involving a camera feed in the form of a video and the other involving LiDAR inputs. The video inputs feed into an Advanced Emergency Braking System (AEBS) and an end-to-end self-driving system presented in the work of Cai and Koutsoukos [11]. The system is subjected to varying types of distribution shifts, such as shifts from the training weather (low precipitation), lighting conditions (day), leading obstacles (car), and clean (or non-adversarial) images. In the LiDAR setting, the anomalous inputs appear as random reflections encountered by the emitted light beam. These reflections push the network outside of its trusted zone, causing a deviation from safe behavior. Our experimental results show that our algorithm is able to achieve state-of-the-art results in distribution shift detection in an effective and interpretable way. We summarize our contributions next.
Contributions. First, we propose an algorithm to build a memory system composed of representative data points to capture the distribution and assign non-conformity scores. Second, we demonstrate how these scores can be combined to detect distribution shifts using the HMP and the Inductive Conformal Anomaly Detection (ICAD) framework, thereby exploiting the statistical guarantees that come with it. Third, we propose a way to compute a feedback image, which can help the system designer with an interpretable report for the OOD input.
Improvements Compared to Our Previous Work [66]. The main improvement that we propose in this article is a statistical guarantee on the False Detection Rate (FDR) for window-based OOD detection. This utilizes the backbone of the memory system to compute the distance between a test sample and stored memories, employing it as a non-conformity score within the ICAD framework. Image and LiDAR frames obtained from a continuous stream of sensor inputs are often correlated in practice, which violates the independence assumption typically used. To deal with this issue, we use the HMP method to combine p-values in a window and to upper bound the FDR on windows. This moves the results from an image-level detection to guarantees on the windows. Our experimental results show that we can effectively control the False-Positive (FP) detection rate with the statistical guarantee. Additionally, we include a systematic analysis of the effect of different window sizes and different expected FDR for trace-wise detection of OOD samples. We compare these results with existing state-of-the-art Variational Autoencoder (VAE)-based OOD detectors [11, 54]. The details are discussed in Section 6 and Section 7.
2 Related Work
Autonomous systems broadly refer to a class of systems where certain tasks can be performed with minimal human intervention. Such behaviors are often hard to encode on a case-by-case basis. This necessitates learning-based components like DNNs being placed in the workflow. However, the use of DNNs has produced safety concerns. This has piqued interest from several domains, which have contributed to producing varying levels of assurance cases [29], ranging from verification of DNNs against specifications to runtime techniques for detection and recovery. Synthesizing deep learning systems which are robust by construction [17, 46] is often challenging in the more general setting. Yet, if the property of correctness can be captured more precisely as a set propagation problem, then stronger guarantees can be given. In the literature, this is referred to as the domain of neural network verification [21, 23, 27, 28, 64, 65]. A more detailed survey of verification approaches can be found in the work of Sankaranarayanan et al. [49]. Despite serious efforts, it quickly became apparent that it is often difficult to verify applications which use images as inputs, not just because of the computational aspects but because it is hard to specify the problem itself. Defining what verification would mean in the context of images is a challenge. Recently, there has been interest in addressing this issue through an approach of evidence-based trust, such as predicting the performance of these networks in novel scenarios before deploying them in real-world systems [33]. In this article, we also follow the evidence-based trust approach. If the system is able to justify its decision by presenting evidence, we deem its behavior more likely to be correct. Next, we provide a summary of techniques specialized to images.
OOD detection has been extensively studied in classification problem settings for stand-alone LECs [24, 30, 31, 34, 41, 56, 67]. These approaches either use differences in the geometrical or statistical properties of the iD and OOD data for detecting a shift in the model behavior. OOD detection through safety envelopes, in CPS with low-dimensional input space sensors such as GPS, has been studied in the past [57]. Recently, there has been growing interest for detection of OOD and adversarial inputs in closed-loop CPS using high-dimensional sensors like a camera [11, 18, 35, 36, 47, 53, 54].
Sundar et al. [54] propose using KL divergence in the latent space of \(\beta\)-VAE for detection of individual images as OOD. Feng et al. [18] propose using KL divergence in the horizontal and vertical latent sub-space of the 3D convolutional VAE from the specified prior for detection of OOD traces. The input to 3D convolutional VAE is a sequence of frames (or the trace to be detected). To our knowledge, such techniques do not provide FDR guarantees on OOD detection.
ICAD has recently been utilized for controlling the false OOD detection rate on iD data [8, 11, 30, 36, 47]. iDECODe [30] proposes using the error in the equivariant behavior of a model as the non-conformity score in CAD for OOD detection. The equivariance is learned on data drawn from the training distribution and with respect to a set of transformations such as rotation on images. Bates et al. [8] propose combining conformal predictions from different channels and layers of convolutional neural networks for OOD detection. Whereas Bates et al. [8] and Kaur et al. [30] focus on detection of individual datapoints as OOD, we consider the problem of OOD detection in time-series data to CPS.
Cai and Koutsoukos [11], Ramakrishna et al. [47], and Kaur et al. [36] leverage ICAD for OOD detection on a sequence of time-series inputs. Cai and Koutsoukos [11] propose using reconstruction error by VAE on the input image (or frame) as a non-conformity score in the ICAD framework [38] for detection of OOD frames. They further apply the Martingale test [60] along with the cumulative sum procedure (CUSUM) [7] with a window of the past and present predictions for robust detection of OOD traces.1 Ramakrishna et al. [47] use KL divergence between the disentangled feature space of \(\beta\)-VAE and normal distribution as the non-conformity score in ICAD for OOD detection of a single frame. They use the Martingale test along with CUSUM for detecting OOD traces. CODiT [36] uses error in the temporal equivariance learned by a VAE model on the training distribution of time-series windows as the non-conformity score in ICAD for OOD detection in time-series data. We propose to use memory-based distance as the non-conformity score in ICAD for OOD detection in time-series data with interpretable explanations on the detection.
To the best of our knowledge, all existing approaches for OOD detection in CPS with LEC are tied to VAE. Either reconstruction error from VAE on the input image or KL divergence in the latent space of the VAE is used for OOD detection in these approaches. Training VAEs often requires careful manual tuning [5], and the quality of the training decides the efficacy of the downstream processes. Here we set ourselves apart by not having to depend on a well-functioning VAE. In addition, unlike our approach, none of the existing approaches except for that of Ramakrishna et al. [47] provides interpretability on the source of OOD-ness of the input. We show that our approach can be extended to the case of LiDAR inputs as well without any conceptual modification.
3 Motivation and Problem Statement
The level of autonomy a CPS has to offer is often decided by how well a designer leverages the LECs. Detection of OOD inputs is one of the ways we can safeguard systems from unwarranted behavior. In Figure 1(a), we show an example of a setting where the car is running an AEBS controller. The controller uses the system states and the video feed from the camera to sense the position of the closest leading object on the road. The controller’s job is to automatically brake the car if it comes within a certain distance threshold of the leading vehicle. Because the DNN experienced only cars during training, it never learned to react to bikes on the road. As a result, the DNN completely misjudges a bike in the video, and the controller ends up in an accident. What we propose here is a method to detect such a shift in distribution.
Fig. 1. Deviation from training data leads to a crash with a biker as the front object. Training data only had cars as front objects. Our proposed method could detect deviations from iD data for detecting such OODs.
Problem Statement. We would like to solve the problem of alarming the system about distribution shifts in real time with statistical guarantees. It is extremely challenging to sample a high-dimensional input space in an exhaustive fashion. This necessitates careful analysis of the training-time iD data to come up with an effective detector that can act in real time. Additionally, it is desirable that such an alarm system produces interpretable behavior. DNNs, due to their black-box nature, often do not offer an explanation for their decisions. Here, we take up the challenge of being able to point to an explanation when samples are iD or OOD. We demonstrate this in Figure 1(b), in which the system not only flags the image with the biker ahead as OOD but also selects a set of pixels demarcating the biker to communicate why it decided to label it as OOD.
4 Background
In this section, we walk through some of the basic concepts and proofs required for our approach.
4.1 Clustering with Medoids
One of the promising steps to build an understanding of the data distribution is through unsupervised clustering. In a fashion similar to k-means clustering, we wish to form partitions of the data into distinct groups or clusters. Clustering with k-means is a well-known tool but has its challenges when used in the context of images. An issue with k-means is that it can potentially produce virtual cluster centers which are absent in the original dataset. This is essentially because a simple mean of two (or more) images might not correspond to a real image. This is important for us since we wish to use these centers to form interpretable predictions which can answer why something was flagged as OOD. The other issue with vanilla k-means is that it is often susceptible to outliers in the data. Hence, we restrict ourselves to partitioning around points which are present in the data.
The algorithm that achieves this is PAM [2], which is short for Partitioning Around Medoids. Intuitively, the algorithm searches for centrally located data samples, called medoids, which are used to define the cluster boundaries in a nearest-medoid sense. Let us assume that the set \(\mathcal {S}\) is equipped with a distance metric \(\mathcal {D}: (s_1,s_2) \rightarrow \mathbb {R}\), for \(s_i \in \mathcal {S}\) and \(n=| \mathcal {S}|\). Given a dataset \(\mathcal {S}\), PAM tries to select a set of \(r\) medoids \(M_r = \lbrace m_1, m_2, \dots , m_r \rbrace\) such that the following cost is minimized:
\[
\mathrm{cost}(M_r) \;=\; \sum _{s \in \mathcal {S}} \; \min _{m_j \in M_r} \mathcal {D}(s, m_j). \qquad (1)
\]
We assume that the inner minimization is always possible, and we are able to break ties arbitrarily among distinct members of the set \(\mathcal {S}\).
Algorithms. The challenge with PAM is that the naive implementation has a runtime complexity of \(O(n^2r^2)\) [50]. Even though there exist faster variants, it is still largely inaccessible for applications at the scale of image datasets generated from autonomous driving scenarios. To circumvent this challenge, we introduce a variant of the Clustering Large Applications based upon Randomized Search (CLARANS) [43] algorithm in Section 5.2. It combines randomized global search with a local clustering method. The medoids identified by minimizing the objective in Equation (1) are referred to as memories from here on.
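To make the objective concrete, the following is a minimal Python sketch of the partitioning cost in Equation (1); the function name and arguments are illustrative, and `dist` stands for any distance metric (for images, we later use the SSIM-based metric of Section 4.2).

```python
# Minimal sketch of the PAM partitioning cost in Equation (1).
# `samples` and `medoids` are lists of data points; `dist` is any metric.
def partitioning_cost(samples, medoids, dist):
    """Sum, over all samples, of the distance to the nearest medoid."""
    return sum(min(dist(s, m) for m in medoids) for s in samples)
```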
4.2 Structural Similarity Index Metric
A fundamental challenge in dealing with images is to capture human perceptual similarity with a mathematically meaningful distance function. To the best of our knowledge, the right candidate for this purpose is SSIM. It was first introduced in the work of Wang et al. [62] and has gained widespread popularity. It computes the degree to which two images are similar to the human eye and was originally used to quantify the degradation in quality of an image. SSIM is designed to capture statistical similarity between images. Figure 2 gives an example of how SSIM successfully captures the perceptual difference between images. To human eyes, the image on the left is more similar to the reference image (middle) than the right one is, and this is reflected by the SSIM distances below them. This makes our system more robust to random noise in comparison to vanilla DNNs. SSIM has also been used to capture image similarity for adversarial sticker attacks [39]. We exploit this feature in the context of videos, where subsequent frames are not worlds apart but are quite correlated in their information content.
Fig. 2. SSIM distance illustration. Here we show an example of how SSIM distance works.
We state the original SSIM distance function next. Assume we have two images \(\mathsf {A_1} \in \mathbb {R}^N\) and \(\mathsf {A_2} \in \mathbb {R}^N\). This allows us to compute three terms: a luminance distortion term, a contrast distortion term, and a correlation term as follows:
\[
l(\mathsf {A_1},\mathsf {A_2}) = \frac{2\,\bar{\mathsf {A_1}}\,\bar{\mathsf {A_2}} + c_1}{\bar{\mathsf {A_1}}^2 + \bar{\mathsf {A_2}}^2 + c_1}, \qquad
c(\mathsf {A_1},\mathsf {A_2}) = \frac{2\, s_{\mathsf {A_1}}\, s_{\mathsf {A_2}} + c_2}{s^2_{\mathsf {A_1}} + s^2_{\mathsf {A_2}} + c_2}, \qquad
s(\mathsf {A_1},\mathsf {A_2}) = \frac{s_{\mathsf {A_1},\mathsf {A_2}} + c_3}{s_{\mathsf {A_1}}\, s_{\mathsf {A_2}} + c_3},
\]
where \(\bar{\mathsf {A_1}}, \bar{\mathsf {A_2}}, s^2_{\mathsf {A_1}}, s^2_{\mathsf {A_2}}\), and \(s_{\mathsf {A_1},\mathsf {A_2}}\) are the local means, local variances, and local covariance of \(\mathsf {A_1}\) and \(\mathsf {A_2}\). The scalar terms \(c_1, c_2, c_3\) aim to capture the saturation effects of the visual system and provide numerical stability. The terms computed above capture the local difference in some chosen window in the image. The combination across all such local windows gives the SSIM. With \(c_3 = c_2/2\), SSIM can be written in the following form:
\[
\mathrm{SSIM}(\mathsf {A_1},\mathsf {A_2}) = \frac{(2\,\bar{\mathsf {A_1}}\,\bar{\mathsf {A_2}} + c_1)(2\, s_{\mathsf {A_1},\mathsf {A_2}} + c_2)}{(\bar{\mathsf {A_1}}^2 + \bar{\mathsf {A_2}}^2 + c_1)(s^2_{\mathsf {A_1}} + s^2_{\mathsf {A_2}} + c_2)}. \qquad (5)
\]
The computational structure of SSIM allows us to implement it efficiently in tools like PyTorch and accelerate it with a GPU. This permits a scalable and efficient implementation inside our OOD detection framework. A large gamut of algorithms in Euclidean spaces evolved under the assumption that a true distance metric is present. To leverage these methods, it is important that we work with a distance function that is a true metric. The downside of SSIM is that, as is, it does not have the mathematical properties of a true distance metric. But with some modifications, it can be turned into one. The details of this modification and the associated proof can be found in the work of Brunet et al. [10]. We use the modified SSIM to define a distance metric \(\mathcal {D}(A_1, A_2)\) in this article. The use of a proper distance metric for images allows us to capture distribution shifts in a more meaningful way.
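For illustration, the following Python sketch computes an SSIM-based dissimilarity between two grayscale images using scikit-image; the mapping \(\sqrt{1-\mathrm{SSIM}}\) used here is only a simple stand-in for the exact metric construction of Brunet et al. [10], and the function name is ours.

```python
# Illustrative SSIM-based image dissimilarity (a sketch, not the exact
# metric from Brunet et al. [10]). Images are float arrays in [0, 1].
import numpy as np
from skimage.metrics import structural_similarity

def ssim_distance(img_a: np.ndarray, img_b: np.ndarray) -> float:
    score = structural_similarity(img_a, img_b, data_range=1.0)
    # Map the similarity score to a dissimilarity; sqrt(1 - SSIM) is one
    # simple choice (an assumption made for this sketch).
    return float(np.sqrt(max(0.0, 1.0 - score)))
```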
4.3 Inductive Conformal Prediction and ICAD
Conformal prediction [6] is a general framework for testing conformance of an input with respect to the training distribution. Conformance with the training distribution is quantitatively measured by an NCM, which is a real-valued function that assigns a non-conformity score \(\alpha\) to the input with respect to data drawn from the training distribution. The higher the score is, the more non-conforming the input is with respect to the training data \(X=\lbrace x_1,x_2,\ldots ,x_l\rbrace\). NCMs based on nearest neighbors [59], support vector machines [59], random forests [14], and VAEs [11] have been proposed in the past.
Conformal anomaly detection [37] uses the non-conformity score from the conformal prediction framework to detect anomalous inputs. A p-value of the input \(x_{l+1}\) is computed by comparing its non-conformity score \(\alpha _{x_{l+1}}\) with the scores of the training data:
\[
p_{x_{l+1}} \;=\; \frac{\big |\lbrace i=1,\ldots ,l : \alpha _i \ge \alpha _{x_{l+1}}\rbrace \big | + 1}{l+1}.
\]
Here, \(\lbrace \alpha _i : i=1,\ldots ,l\rbrace\) is the set of non-conformity scores computed for the training data from an NCM defined on the new set composed of the training data and the input \(x_{l+1}\). If \(x_{l+1}\) is drawn from the training distribution, then its score is expected to lie within the range of the scores of the training data, resulting in higher p-values for iD datapoints. Conformal anomaly detection flags an input as anomalous if its p-value lies below a specified detection threshold \(\epsilon _{icad} \in (0,1)\).
Recomputing scores for the training data for every new input might be computationally expensive (and even infeasible in real time) if computing the NCM is inefficient. ICAD [38] was proposed to resolve this issue. ICAD is based on the inductive version of the conformal prediction framework, where the training data are divided into a proper training set \(X_{tr}=\lbrace x_1,\ldots ,x_m\rbrace\) and a calibration set \(X_{cal}=\lbrace x_{m+1}, \ldots ,x_l\rbrace\). The NCM is defined on the proper training set, and the p-value of the input \(x\) is computed by comparing its score with the scores of the calibration datapoints:
\[
p_{x} \;=\; \frac{\big |\lbrace i=m+1,\ldots ,l : \alpha _i \ge \alpha _{x}\rbrace \big | + 1}{l-m+1}. \qquad (6)
\]
The non-conformity scores of the calibration set are computed offline and used at inference time to compute the p-value of an input. Again, the input is detected as an anomaly if its p-value lies below a specified detection threshold \(\epsilon _{icad} \in (0,1)\).
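As a concrete illustration, the following Python sketch computes an ICAD-style p-value from precomputed calibration scores; the smoothed form with the \(+1\) terms is one standard variant and is assumed here for the sketch.

```python
# Sketch of an ICAD-style p-value from precomputed calibration scores.
def icad_p_value(alpha_x: float, calibration_scores: list[float]) -> float:
    # Fraction of calibration scores at least as non-conforming as the input
    # (smoothed so that the p-value is never exactly zero).
    n_geq = sum(1 for a in calibration_scores if a >= alpha_x)
    return (n_geq + 1) / (len(calibration_scores) + 1)
```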
Lemma 4.1 ([6]).
If an input x and the calibration datapoints are independent and identically distributed, then the p-value of x computed from (6) is uniformly distributed. The probability of misdetecting x as anomalous is therefore upper bounded by the detection threshold \(\epsilon _{icad}\).
4.4 Combining p-Values Using HMP
Previous work [36] uses Fisher’s method [20] to combine multiple p-values from the ICAD framework. To preserve the false detection (FP) rate guarantees from the conformal prediction framework, Fisher’s method requires the individual p-values to be independent [58]. Datapoints in a time-series window have a temporal dependency. Therefore, we propose to use the HMP method [63] to combine the p-values of the temporally dependent datapoints in a time-series window. We consider the following problem:
Given a window \(\mathcal {W}\) of consecutive time-series datapoints \((x_1, x_2, \dots , x_n)\), label \(\mathcal {W}\) as iD or OOD. We pose this problem as a statistical hypothesis testing problem with the null hypothesis of \(\mathcal {W} \sim\) iD. For this single hypothesis of \(\mathcal {W} \sim\) iD, we propose a solution based on multiple testing of the single hypothesis. The p-values from each test in the multiple testing framework can be combined to test the single or global null via averaging methods [61]. The HMP method [63] is one such method that can be used to combine the dependent p-values from multiple tests while testing for the global null. We compute \(n\) p-values (\(p_1, p_2,\ldots , p_n\)) by performing a test on each datapoint in \(\mathcal {W}\) and then combine these p-values from multiple tests by using the HMP method for testing the same global null of \(\mathcal {W} \sim\) iD. We denote the combined p-value by \(p_{agg}\):
\[
p_{agg} \;=\; \frac{\sum _{i=1}^{n} w_i}{\sum _{i=1}^{n} w_i / p_i}, \qquad (7)
\]
where n is the total number of datapoints in the window. In this article, we assign equal weights to all p-values: \(w_1=w_2=\cdots =w_n=1/n\).
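For concreteness, a short Python sketch of the equally weighted harmonic-mean combination in Equation (7) is shown below (the function name is illustrative).

```python
# Sketch of the harmonic-mean p-value (Equation (7)) with equal weights.
def harmonic_mean_p(p_values: list[float]) -> float:
    n = len(p_values)
    weights = [1.0 / n] * n
    # With equal weights this reduces to the harmonic mean of the p-values.
    return sum(weights) / sum(w / p for w, p in zip(weights, p_values))
```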
Let us denote hypotheses in the multiple hypothesis testing framework by \(H_{0,1},H_{0,2},\ldots\). In the proposed solution, \(H_{0,1}=H_{0,2}=\cdots =H_{0,n} = H_0\) of \(\mathcal {W} \sim\) iD. \(H_{0,1}=H_{0,2}=\cdots =H_{0,n}\) are tested by computing \(n \ p\)-values: \(p_1,p_2,\ldots ,p_n\) on the n datapoints in \(\mathcal {W}\). We want to control the FP rate on the global null of \(H_0= \mathcal {W} \sim\) iD.
Theorem 1.
When all tests for the same null \(H_0\) in multiple hypothesis testing are combined using the HMP method, the probability of incorrectly rejecting \(H_0\) (FP rate) is upper bounded by the significance level \(\epsilon\). If the \(p_{\text{agg}}\) from the HMP method (7) is less than the significance level \(\epsilon\), then the \(H_0= \mathcal {W} \sim\) iD is rejected. The probability of falsely rejecting \(H_0\) by \(p_{\text{agg}}\) is therefore upper bounded by \(\epsilon\): \(\mathbb {P}(p_{\text{agg}} \le \epsilon) \le \epsilon\).
Proof.
The HMP method controls the strong-sense Family-Wise Error Rate (FWER) in multiple hypothesis testing (Results in the work of Wilson [63]). Since strong-sense FWER control implies the weak-sense FWER control, the HMP method controls the weak-sense FWER in multiple hypothesis testing.
Controlling weak-sense FWER at the significance level \(\epsilon\) by combining p-values in multiple hypothesis testing for the same null \(H_0\) is equivalent to controlling the FP rate while testing for the single hypothesis \(H_0\) (Roquain [48, Remark 1.6]). □
The HMP method compares \(p_{agg}\) with a critical value \(\xi\) for detection (\(p_{agg}\lt \xi\)) [63]. For the desired FP rate \(\epsilon\), \(\xi\) can be calculated by inverting the following equation:
More details on the critical value computation can be found in other works [1, 63].
Lemma 4.2 ([63]).
The probability of an FP detection by comparing \(p_{agg}\) with the critical value \(\xi\) in the HMP method is upper bounded by the detection threshold \(\epsilon\) when all individual \(p\)-values are valid. A \(p\)-value is valid if it satisfies Lemma 4.1 from the ICAD framework.
5 Methodology
An overview of the proposed OOD detection approach is as follows. As shown in Figure 3, at the training stage, the clustering method is used to filter the training set for prototypical datapoints. We call these prototypes memories. At runtime, we calculate the distance of an input to its closest memory and use it as the non-conformity score in the ICAD framework for computing the p-value of the input. The intuition for using this score is that an anomalous input is highly likely to lie far from the training set. For robust detection, we use a sliding window approach on a sequence of inputs, where verdicts (or p-values) on individual datapoints in the sequence are combined using the HMP method for OOD detection with a bounded false alarm rate.
Fig. 3. Our approach can be summarized as follows. The memorization phase of the algorithm picks prototypical samples as memories. At runtime, the algorithm computes the distance of the input from its closest memory and uses it as the non-conformity score in the ICAD framework to determine the input’s anomalous behavior with respect to the training distribution.
5.1 Initializing the Memory Set
The intuition here is that high-dimensional data like images and LiDAR scans, which are generated from a real-world setting, cluster well in practice. The first step is to identify these broad categories in a quick and efficient fashion. One difficulty, however, is that the number of partitions to be made is often not known a priori. But drawing on intuitions from an image distance metric, only small enough distances have perceptual meaning. Thus, the idea here is to populate the input space densely enough with memories such that every training point is within a threshold distance d of some memory. Algorithm 1 summarizes our approach. We pick a data point at random and then compute the distance score across all samples in the current RejectedSet in a single linear pass. The data points which are similar enough are admitted as being close to a memory, and they are not considered as candidates for new memories in the next iteration. We continue this process until all data points are admitted into the set of memories M. This gives the subsequent algorithms a warm start. Algorithm 1 always terminates because the RejectedSet shrinks by at least one element at each iteration. In the worst case, we have as many memories as data points, but in most practical datasets this is not the case.
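The following Python sketch illustrates this initialization step under the above assumptions; `dist` is the image distance metric and `d` the coverage threshold, and all identifiers are ours rather than the pseudocode of Algorithm 1.

```python
# Sketch of the greedy memory initialization (in the spirit of Algorithm 1).
import random

def init_memories(samples: list, dist, d: float) -> list:
    rejected = list(samples)        # points not yet covered by any memory
    memories = []
    while rejected:
        m = random.choice(rejected) # pick a candidate memory at random
        memories.append(m)
        # Admit every point within distance d of the new memory; only the
        # remaining (far-away) points stay candidates for future memories.
        rejected = [s for s in rejected if dist(s, m) > d]
    return memories
```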
5.2 Learning Memories
To restate, we are given a dataset \(\mathcal {S}\), with q elements, and we wish to compute an r size memory set \(\mathcal {M} = \lbrace m_1, m_2, \dots , m_r\rbrace\) with certain desirable properties. The search for memories can be simplified by viewing this as a search through a graph \(\mathcal {G}\) [43] with subsets \(\mathcal {S}_r \subset \mathcal {S}\) as its nodes. Each subset of size r defines a choice for the memory set \(\mathcal {M}\).
The undirected graph \(\mathcal {G}\) is represented by an ordered pair \((V, E)\). The set of nodes \(V\) is the collection of subsets \(\mathcal {S}_r \subset \mathcal {S}\) of the original dataset. An edge \(e \in E\) exists between two nodes \(\mathcal {S}^1_r\) and \(\mathcal {S}^2_r\) iff \(|\mathcal {S}^1_r \cap \mathcal {S}^2_r| = r-1\). In other words, the two nodes differ in exactly one memory.
Each node of the graph has an associated cost given by Equation (1). Hence, starting from some node, it is possible to visit neighboring nodes with decreasing costs in the search process. What we present next is a combination of \(Global\) resets and Local minimization to approximate the optimal choice.
Algorithm 2 picks the eventual memories used in OOD detection. Similar to the standard CLARANS algorithm, each node in \(\mathcal {G}\) has \(r(q-r)\) neighbors, where r is the number of memories. The number of neighbors can be quite large given the scale of modern machine learning datasets with large q. What we do here is start with a reasonable choice of initial node in \(\mathcal {G}\) and then greedily look for local improvements for a fixed number of iterations. The global search starts by using Algorithm 1 to generate the initial set of memories as a node v in \(\mathcal {G}\). Notice that we do not choose the number of memories a priori; it is instead determined as a consequence of the distance threshold \(d\). The partitioning cost for the choice of memories is computed by the function ComputeCost, which evaluates Equation (1). Note that this can be expensive since it needs a total of \(r \times q\) distance computations. The local search (lines 7–12) implements a greedy strategy to pick a neighboring node that produces a descent. The outer loop of the algorithm keeps track of the node with the minimum cost across the resets produced in line 3. Algorithm 2 trivially terminates, as each search proceeds for a fixed number of steps.
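A simplified Python sketch of this reset-and-refine search is given below; it reuses the init_memories and partitioning_cost sketches from earlier, and the parameters num_resets and max_local_steps, as well as the random single-swap neighborhood, are illustrative simplifications rather than the exact pseudocode of Algorithm 2.

```python
# Simplified sketch of the CLARANS-style memory refinement (Algorithm 2):
# random global resets plus greedy single-swap local search.
import random

def refine_memories(samples, dist, d, num_resets=5, max_local_steps=50):
    best, best_cost = None, float("inf")
    for _ in range(num_resets):                    # global resets (line 3)
        medoids = init_memories(samples, dist, d)  # warm start via Algorithm 1
        cur_cost = partitioning_cost(samples, medoids, dist)
        for _ in range(max_local_steps):           # greedy local search
            i = random.randrange(len(medoids))
            cand = random.choice(samples)
            if any(cand is m for m in medoids):    # must swap in a non-medoid
                continue
            trial = medoids[:i] + [cand] + medoids[i + 1:]
            trial_cost = partitioning_cost(samples, trial, dist)
            if trial_cost < cur_cost:              # accept improving swaps only
                medoids, cur_cost = trial, trial_cost
        if cur_cost < best_cost:                   # keep the best reset
            best, best_cost = medoids, cur_cost
    return best
```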
Definition 5.2 (Memory System \(\mathcal {M_S}\))
A memory system is a collection of pairs \(\mathcal {M_S} := \lbrace (m_1,q_1), (m_2,q_2) \dots , (m_r,q_r)\rbrace\), where
The memory-based OOD detector does not need q for OOD detection, so we simplify the memory system as \(\mathcal {M_S} := \lbrace m_1, m_2 \dots , m_r\rbrace .\)
5.3 Scaling Memory Search
Even though the number of memories produced in Algorithm 2 might be small compared to the full dataset \(\mathcal {S}\), a linear search through the list of memories might still be challenging. To remedy this potential drawback, we deploy a simple pruning technique first introduced in the work of Fukunaga and Narendra [22]. The distance metric \(\mathcal {D}\) discussed in Section 4.2 is a proper distance metric, which implies that it respects the triangle inequality. In what follows, we describe a possible avenue to speed up the search for the k nearest memories, the intuition being that for sufficiently different memories, a single computed distance can be used to reject other memories from further consideration.
We are interested in computing the nearest neighbor (i.e., \(k=1\)) in the set \(\mathcal {M_S}\) for a test point \(x_t\). Assume that we wish to compute the distance between the test point \(x_t\) and some memory \(m_j\), and that the distance \(\mathcal {D}(x_t, m_i)\) to another memory \(m_i\) is already known. Then, in the triangle formed by the triplet \((m_i, x_t, m_j)\), the following two inequalities hold:
\[
\mathcal {D}(m_i, m_j) \le \mathcal {D}(m_i, x_t) + \mathcal {D}(x_t, m_j), \qquad
\mathcal {D}(m_i, x_t) \le \mathcal {D}(m_i, m_j) + \mathcal {D}(m_j, x_t),
\]
meaning that \(\mathcal {D}(m_j, x_t)\) is lower bounded by \(|\mathcal {D}(m_i, m_j) - \mathcal {D}(m_i, x_t)|\). If we are interested only in memories which are within a certain threshold (say h) of the test point \(x_t\), we do not actually need to compute the distance \(\mathcal {D}(m_j, x_t)\) if the following condition holds:
\[
\big |\mathcal {D}(m_i, m_j) - \mathcal {D}(m_i, x_t)\big | \gt h.
\]
For each memory \(m_i\), we can pre-compute a look-up table of the inter-memory distances \(\mathcal {Q}\): \(\lbrace (m_j, \mathcal {D}(m_i, m_j)) \mid 1 \le j \le |\mathcal {M_S}|, j \ne i \rbrace\). In practice, this can substantially reduce the search space by pruning memories from further consideration each time the distance of a memory from \(x_t\) is measured. For \(k \gt 1\), similar reasoning holds; the only difference is that the search algorithm tracks the distance of the \(k\)th closest memory found so far (i.e., the furthest of the current k candidates from the test point).
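A Python sketch of nearest-memory search with this pruning rule is shown below; `inter` is the precomputed inter-memory distance table, and the variable names are illustrative.

```python
# Sketch of nearest-memory (k = 1) search with triangle-inequality pruning.
# `inter[i][j]` holds the precomputed distance D(m_i, m_j).
def nearest_memory(x_t, memories, dist, inter):
    best_j, best_d = None, float("inf")
    measured = {}                                  # index -> D(m_i, x_t)
    for j in range(len(memories)):
        # Lower bound on D(m_j, x_t) from every already-measured memory.
        lower = max((abs(inter[i][j] - d_i) for i, d_i in measured.items()),
                    default=0.0)
        if lower >= best_d:
            continue                               # m_j cannot beat the best
        d_j = dist(x_t, memories[j])               # compute the real distance
        measured[j] = d_j
        if d_j < best_d:
            best_j, best_d = j, d_j
    return best_j, best_d
```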
5.4 Detecting Distribution Shifts
To summarize, we know how to go from the set of training data \(\mathcal {S}\) to the set of memories \(\mathcal {M_S}\). It is generated by a smart initialization of the set of memories (Algorithm 1), followed by a refinement using the medoid-based partitioning technique discussed in Algorithm 2. Additionally, to handle any potential slowdowns, we briefly discussed how one can use the inter-memory distance to prune out large parts of the search space. This allows the system to scale to larger memory systems. Now, we discuss the proposed algorithm for detecting distribution shifts in real time.
In practical scenarios, detecting a distribution shift requires a robust mechanism. We achieve this by using a sliding-window-based approach to track the number of OOD datapoints. Algorithms 3 and 4 summarize the offline and real-time stages of the proposed OOD detection algorithm, respectively. The real-time detection Algorithm 4 is based on the HMP method for combining p-values (computed from the ICAD framework) of individual datapoints in a time-series window. Given the desired FP rate \(\epsilon\) and the size of the sliding window n, the critical value \(\xi\) is computed in line 1 of the algorithm. The SSIM distance between the input and the closest memory from the memory system M is used as the non-conformity score for calculating the p-value of input \(x_i\) in line 4. The aggregated p-value \(p_{agg}\) on the sliding window of input datapoints \((x_{i-n},x_{i-n+1}, \ldots, x_i)\) is computed using the HMP method in line 6 of Algorithm 4. If \(p_{agg}\) is less than \(\xi\), the window is labeled as OOD.
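The runtime loop can be sketched in Python as follows, reusing the icad_p_value and harmonic_mean_p helpers from the earlier sketches; critical_value stands in for the HMP threshold \(\xi\) derived from the desired FP rate (its exact computation follows Wilson [63] and is not reproduced here), and all names are illustrative.

```python
# Sketch of the runtime detection loop (in the spirit of Algorithm 4).
from collections import deque

def detect_stream(frames, memories, dist, calibration_scores, n, critical_value):
    window = deque(maxlen=n)         # sliding window of per-frame p-values
    verdicts = []
    for x in frames:
        alpha = min(dist(x, m) for m in memories)    # distance to closest memory
        p = icad_p_value(alpha, calibration_scores)  # per-frame ICAD p-value
        window.append(p)
        if len(window) == n:
            p_agg = harmonic_mean_p(list(window))    # combine with HMP
            verdicts.append(p_agg < critical_value)  # True => window flagged OOD
        else:
            verdicts.append(False)                   # warm-up: window not full yet
    return verdicts
```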
Theorem 2.
The probability of misdetecting an iD window W as OOD by Algorithm 4 is upper bounded by the desired FP rate \(\epsilon\).
Proof.
If an individual datapoint \(x\) in W is drawn from the training distribution D, then x and the datapoints in the calibration set are independent and identically distributed with respect to D. The p-value of x computed in line 4 of the algorithm is uniformly distributed and therefore valid according to Lemma 4.1. For all iD datapoints \((x_{i-n},x_{i-n+1}, \dots, x_i)\) in W, \(p_{agg}\) is computed from their valid p-values in line 6. \(\xi\) is computed from the desired FP rate \(\epsilon\) in line 1. The probability of misdetecting W as OOD by comparing \(p_{agg}\) with \(\xi\) is therefore bounded by \(\epsilon\) according to Lemma 4.2. □
5.5 Heatmap Generation Algorithm
The SSIM metric and the memory system can be used to generate explanations as promised. To elucidate this aspect, we first point the reader to Equation (5) for SSIM. It is essentially an aggregate of local features around each individual pixel. This means that when dissimilarity arises, it is possible that a few pixels account for the major differences. Highlighting such pixels can produce reasonable feedback to the system designer. Even for a test image that is sufficiently different, there is still some memory it is closest to. We use this closest memory to generate an explanation. The details are presented in Algorithm 5. It takes as input a test image and its closest memory, and produces a feedback image. The first step is to create an SSIM map \(D_x\) on line 2 using ComputeFullSSIM. Next, it iterates through the pixel locations of the SSIM map and alters the color of pixels with high contributions to the dissimilarity. This is achieved through simple thresholding in line 5. The array \(x^{\prime }_t\) is the visualization of the most dissimilar parts between the test frame and its closest memory. Note that the best explanation for a certain behavior is often context specific; here we find highlighting the important pixels to be the most useful.
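A minimal Python sketch of this feedback-image generation is given below; it assumes grayscale float images in [0, 1], uses scikit-image's full SSIM map in place of ComputeFullSSIM, and the threshold tau is an illustrative parameter rather than the value used in Algorithm 5.

```python
# Sketch of feedback-image generation (in the spirit of Algorithm 5).
import numpy as np
from skimage.metrics import structural_similarity

def explain_ood(test_img: np.ndarray, closest_memory: np.ndarray,
                tau: float = 0.4) -> np.ndarray:
    # full=True returns the per-pixel SSIM map alongside the mean score.
    _, ssim_map = structural_similarity(test_img, closest_memory,
                                        data_range=1.0, full=True)
    feedback = test_img.copy()
    feedback[ssim_map < tau] = 1.0   # highlight the most dissimilar pixels
    return feedback
```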
6 Case Study 1: Simulated Scenario for Autonomous Driving Systems
6.1 System Description
In this section, we consider a driving scenario with an AEBS as described in the work of Cai and Koutsoukos [11]. The AEBS is designed to prevent potential collisions by detecting front obstacles and monitoring the behavior of a leading car. The closed loop of the AEBS (Figure 4) consists of a perception-based LEC and a controller trained with Reinforcement Learning (RL) algorithms. Given an input image from the front camera, the LEC estimates the distance between the ego vehicle and its front obstacle. The estimated distance is used as an input to the RL-based controller for determining the braking action of the ego vehicle. Driving simulators such as CARLA [15] are a convenient source for effectively training LECs and RL-based controllers. Thus, in this case study, we evaluate our OOD detection approach on simulated scenarios.
Fig. 4. Closed loop of the AEBS from Cai and Koutsoukos [11].
iD Data. A simulated driving dataset in the work of Cai and Koutsoukos [11] is generated by using an open source autonomous driving simulator, CARLA [15]. The iD traces are simulated in daytime driving scenarios with light rain (or a low precipitation level) where the front obstacle is always a car. The dataset contains 33 training traces that include 4,488 images with a precipitation level from 0 to 10 (inclusive). The average length of the training traces is about 120, and the sampling rate is 20 Hz. More details about the system setup and training can be found in the work of Cai and Koutsoukos [11].
Types of OODs. There are six different types of distribution shifts: increased precipitation level (leading to heavy rain), fog, snow, brightness change, unseen front obstacles, and adversarial perturbations.
Evaluation Metrics. In our experiments, we define iD traces as negative and OOD traces as positive. Thus, the FP rate is the number of iD traces that were detected as OOD traces, and the False-Negative (FN) rate is defined as the number of OOD traces falsely detected as iD traces. Additionally, we report the average time delay in OOD detection. In other words, if a trace becomes OOD at a certain time t, the delay in detection is computed as the difference between the time of detection and t. The average delay is the average detection delay on all successfully detected OOD traces.
Comparison with the Baselines. We compare our results with the VAE-based state-of-the-art detectors of Cai and Koutsoukos [11] and Sundar et al. [54]. We use the shorthand ‘VAE’ for the detector of Cai and Koutsoukos [11] and ‘\(\beta\)-VAE’ for the detector of Sundar et al. [54]. These baselines report their results on traces by using a point-based detection approach where an OOD detection alarm on the trace is raised on detection of the first OOD datapoint. For a fair comparison with these baselines, we also report our results on traces in a similar manner: we raise an OOD detection alarm on a trace on detection of the first OOD window in the trace. All detectors are trained on the same training traces and calibration traces. Both our detector and the baselines are trained on an NVIDIA Quadro RTX 6000.
6.2 OOD Detection for Distribution Shift Due to Change in Weather and Lighting
Weather and lighting changes in driving scenarios have proved to be challenging for autonomous vehicles with image inputs [55]. In this section, we evaluate our approach on distribution shifts introduced by heavy rain, fog, snow, and darkness. Note that for all OOD traces, the level of the weather/lighting factor (e.g., precipitation level, darkness) gradually increases with time. In our experiments, we do a 40/60 split of the proper training traces and calibration traces, respectively, and the size of the calibration set is 1,000 datapoints.
6.2.1 Distribution Shift Due to Heavy Rain.
Following the experiments in the work of Cai and Koutsoukos [11], we aim to detect OOD behavior due to an increase in the iD precipitation level. The traces with heavy rain, specifically with precipitation levels greater than 20, are called OOD traces. There are 100 test traces in total, of which 26 are iD traces and the rest are OOD traces. Note that our algorithm contains a few hyperparameters, including the distance threshold d for learning a memory system, the window length n, and the desired FP rate on a window \(\epsilon\). These hyperparameters can be set by the users to achieve a certain level of safety according to the application of interest.
As shown in Table 1, we are able to detect all heavy rain traces with a small detection delay. None of the iD traces are detected as OOD by our approach. The performance of our detection algorithm is better than that of both VAE-based detectors. We also observe that both of these baselines are required to select their detection threshold from validation data for controlling the FP rate, whereas the memory-based OOD detector uses the user-specified \(\epsilon\) as the desired upper bound on the FP rate. The training time of VAE and \(\beta\)-VAE is approximately 3 hours, whereas learning a memory set for the same training dataset takes about 10 minutes. Another noticeable result is that Algorithm 2 efficiently compresses approximately 2,000 proper training images into 141 memories. We also report the average execution time of our algorithm on test frames in Table 1 and show that our algorithm runs in real time, since this time is well below the sampling period of the system. Our detector also runs faster than the baselines.
Table 1. OOD Detection Results on Heavy Rain Traces
We also perform an ablation study of our detector on the window length n used for detection. Figure 5 compares the FP detection rate on different window lengths. We observe that the FP rate decreases with the increase in the window length, which indicates that larger windows better control the FP rate. This justifies the use of sliding windows for robust detection instead of using only a single datapoint (or window length equal to 1) for detection. Additionally, for a given window length, the FP rate decreases as the \(\epsilon\) decreases, which is as expected.
Fig. 5. Percentage of iD test traces falsely detected as OOD (FP) versus the length of the sliding window n. Each line shows results with a different expected FDR \(\epsilon \in \lbrace 0.01, 0.02, 0.03, 0.04, 0.05\rbrace\). Note that for all settings in these plots (i.e., for all values n and \(\epsilon\)), our detector is able to detect all OOD traces, resulting in 0 FN detection.
6.2.2 Distribution Shift Due to Fog, Snow, and Low Lighting Conditions.
Here, we want to detect distribution shifts due to fog, snow, or a change in lighting conditions from the iD driving scenarios. We generate 27 foggy and 27 snowy traces, where the level of fog/snow gradually increases with time. Similarly, we generate 27 night traces to mimic the scenario where the day starts getting darker. We show an example of a foggy trace in Figure 6(b) and an example of a night trace in Figure 6(c). To perform OOD detection for these three scenarios, we use the same hyperparameters that we use for detection due to heavy rain in Section 6.2.1. As shown in Table 2, our detector is able to detect all OOD traces for both the foggy and night-time scenarios with a small detection delay. Our algorithm is also able to detect all snowy traces but with a longer delay compared to the VAE method, because our detector does not raise an alarm before viewing a complete sliding window, which avoids over-sensitivity to a single frame.
                 Night Traces                  Foggy Traces                  Snowy Traces
                 AUROC    FN      Avg Delay    AUROC    FN      Avg Delay    AUROC    FN      Avg Delay
Ours             97.50    0/27    0.0          99.70    0/27    0.07         100.0    0/27    6.0
VAE              96.42    0/27    0.11         95.54    0/27    6.41         99.02    0/27    1.29
\(\beta\)-VAE    97.50    0/27    0.26         29.09    27/27   N/A          99.17    27/27   N/A
Table 2. OOD Detection Results on Night, Foggy, and Snowy Traces
Note that the hyperparameters n, \(\epsilon\), d are the same as in heavy rain OOD detection.
Fig. 6. Example of iD traces (light rain) and OOD traces (foggy and night).
6.3 OOD Detection for Distribution Shift Due to Change in the Front Obstacle
For the AEBS, one challenge is that the LEC works only for distance estimation from the types of front obstacles seen during its training. Since the LEC is trained only with a car as the front obstacle, it can provide an accurate distance estimate and avoid collisions at a safe distance from a leading car. However, if the front obstacle is changed to a bike, the LEC fails to make a correct distance estimate, leading to a crash with the bike. In this experiment, the unseen object in front is considered the distribution shift, and we show an example of such a distribution shift in Figure 7, where the ego vehicle crashes into the bike. We generate 27 OOD traces with a bike as the front obstacle in the same environmental conditions as the training. Again, we use the same set of hyperparameters as in heavy rain OOD detection and show results in Table 3.
                 FP      FN      Avg Delay
Ours             0/26    0/27    0.0
VAE              2/26    0/27    0.85
\(\beta\)-VAE    0/26    27/27   N/A
Table 3. Detection Results on OOD Traces with Bikes as the Front Obstacles
Note that the hyperparameters n, \(\epsilon\), d are the same as the OOD detection due to heavy rain.
Fig. 7. Illustration of a safety hazard (i.e., collision due to shift from the training distribution).
6.4 Bounded FDR
We report box plots of the FDR with respect to the detection threshold \(\epsilon\) on the iD test windows of clear weather. These plots are shown in Figure 8. We randomly split the calibration and test traces and compute the FDR for values of \(\epsilon\) from 0.01 to 0.2 with a step size of 0.01. This is repeated 10 times, and we show box plots with the mean and inter-quartile range. The plot shows that the false detection rate of the proposed detector on the (sliding) iD test windows is bounded by the expected FDR \(\epsilon\) on average.
Fig. 8. FDR is bounded by the expected FDR \(\epsilon\) on average (\(n=6, d = 0.2\)).
6.5 OOD Detection for Distribution Change Due to Adversarial Perturbations
Previous work [9] proposed a physical adversarial attack by painting lines on the road to confuse the autonomous driving system (highlighted in the red box of Figure 9(b)). As shown in Figure 9(c), this attack misleads the car to follow the painted lines, leading the car to crash into the fence [11]. We use the self-driving dataset with painted lines generated by Cai and Koutsoukos [11], where evaluation is focused on the right corner driving cases. A clean (or non-attacked) trace is shown in Figure 9(a).
Fig. 9. OOD-ness due to adversarial road perturbations [11].
All OOD traces contain OOD frames due to painted lines, but in some of these OOD traces, the car successfully takes the right turn without crashing into the fence. One such example of this OOD trace without the crash is shown in Figure 9(b). For OOD traces with a crash, instead of taking the right turn, the car crashes into the fence. One such example of this OOD trace with a crash is shown in Figure 9(c). Unlike previous experiments, there is no label of which frame in a trace is an OOD frame, so we report the successful crash prediction rate. A successful crash prediction on a trace is defined as predicting an OOD window before the crash actually happens. In addition to the OOD detection rate and crash detection rate, we also want to evaluate the forecast time of crashes by using our OOD detection technique. Assuming that the crash happens at time t and the detection time is \(t_{p}\), we define forecast time to be \(t-t_{p}\). We report the average forecast time for all successfully predicted crash traces. There are 105 OOD traces in total, and 64 of them result in a crash. Table 4 and Figure 10 show the performance of our detector on the distribution shift due to adversarial attack of the painted lines.
                 (n, \(\epsilon\), d)   Mem    OOD Detection Rate   Crash Detection Rate   Avg Forecast (frame)   Exec Time (ms)
Ours             (10, 0.07, 0.5)        69     100.0                100.0                  60.19                  24.74
Ours             (10, 0.05, 0.6)        26     100.0                100.0                  60.19                  19.21
VAE              N/A                    N/A    100.0                100.0                  21.41                  44.95
\(\beta\)-VAE    N/A                    N/A    100.0                100.0                  69.19                  103.44
Table 4. OOD Detection Results on Adversarial Sticker Detection
Fig. 10. OOD traces detection results for detecting adversarial attack/crashes on the road with hyperparameter sweeping.
Table 4 shows that we are able to detect all OOD traces and predict crashes around 60 timesteps before they actually occur. We observe that although both VAE and \(\beta\)-VAE successfully detect all OOD traces and anticipate the crash beforehand, our method has a better average forecast time compared to VAE and a shorter execution time compared to both baselines. Figure 10 shows the OOD/crash detection rate with respect to the expected FP rate (\(\epsilon\)) for different window lengths (n). We observe that as \(\epsilon\) decreases, the OOD/crash detection rate decreases and the average forecast time (the higher the better) also shrinks. This is expected because the memory-based detector becomes less conservative and tends to label more traces as iD, so the OOD detection results deteriorate with decreasing \(\epsilon\). Additionally, we observe that although the OOD detection rate drops slightly with decreasing \(\epsilon\), our predictor is still able to predict all crashes beforehand. This is reasonable because OOD traces with a crash deviate more from the iD traces than OOD traces without a crash, as illustrated in Figure 9(b) and (c). This shows the robustness of our sliding-window-based approach compared to single-point-based methods, as it avoids being sensitive to minor deviations from iD data. Figure 11 shows the OOD/crash detection rate with respect to the detection threshold for VAE and \(\beta\)-VAE. Here we observe that the crash detection rate generally drops more quickly as the detection threshold decreases, especially for \(\beta\)-VAE, making it harder to pick an appropriate detection threshold for detecting a crash.
Fig. 11. OOD traces detection results for detecting adversarial attack/crashes using VAE-based OOD detectors. Detection performance vs detection threshold for VAE (left) and \(\beta\)-VAE (right).
6.6 OOD Reasoning Using the Closest Memory
Although the previously proposed VAE-based OOD detectors [11, 19] achieve comparable performance in the simulated autonomous driving scenarios, they do not provide any explanation for the distribution shift along with the detection. As described in Section 5.5, in addition to the detection itself, we provide an interpretation of our OOD detection result. When the detector reports a test frame as OOD, we know that there is no matching memory from the training data. Beyond this quick intuition, our detector can provide reasoning at the pixel level: we select the memory with the smallest distance and generate a heatmap that highlights the OOD parts of the test input. If the test input is far from all of the memories, the resulting highlighted pixels can be used to alert drivers. The interpretability of our approach could assist drivers in deciding whether to disengage or continue in autonomous mode, instead of relying solely on the detection results.
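A minimal sketch of this pixel-level reasoning step is shown below. It assumes the memory bank is available as a list of grayscale images (`memories` and the other names are illustrative, not the authors' API) and uses the per-pixel SSIM map returned by `skimage.metrics.structural_similarity` with `full=True`; regions with low local SSIM against the closest memory are the candidate OOD pixels.

```python
# Sketch of closest-memory reasoning via the per-pixel SSIM map (assumed interface).
import numpy as np
from skimage.metrics import structural_similarity as ssim

def explain_ood(test_frame, memories, quantile=0.05):
    """Return the closest memory and a binary heatmap of the most dissimilar pixels."""
    data_range = float(test_frame.max() - test_frame.min())
    # Closest memory = highest global SSIM (smallest SSIM distance).
    scores = [ssim(test_frame, m, data_range=data_range) for m in memories]
    closest = memories[int(np.argmax(scores))]
    # Recompute SSIM against the closest memory, keeping the full per-pixel map.
    _, ssim_map = ssim(test_frame, closest, data_range=data_range, full=True)
    # Highlight the pixels with the lowest local similarity.
    heatmap = ssim_map <= np.quantile(ssim_map, quantile)
    return closest, heatmap
```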
We show the highlighted OOD pixels after detection in Figure 1(b) and Figure 12(b). In Figure 1(b), the unrecognized biker is highlighted in the OOD frame. In Figure 12, for the detected adversarially attacked frame, we highlight the pixels that contain the adversarial stickers on the road.
Fig. 12. OOD detection reasoning for adversarial sticker detection.
7 Case Study 2: Driving with LiDAR
7.1 System Description
LiDAR is an important component for building safe autonomous systems due to its ability to construct a comprehensive three-dimensional model of the surroundings. It can reliably complement a camera module when the system needs faster and more detailed feedback about the environment. LiDAR sensors on an autonomous car compute the relative position of the closest obstacle at each angle by measuring the time a reflected light beam takes to return to the sensor after being emitted from the source. Ivanov et al. [25] discuss a challenging setup for steering a car with LiDAR inputs and a neural network controller. Although the inputs are not as complex as camera images, LECs trained on LiDAR inputs still fail in OOD settings.
We show the setup of the F1/10 autonomous car from Ivanov et al. [25] in Figure 13. This closed-loop system consists of an F1/10 autonomous driving car from F1Tenth [3]. The LEC in this case is a neural network controller that decides the control action for the next step. The LiDAR measurements are the inputs to the F1/10 autonomous car, and the system states include the velocities and positions of the vehicle. The neural network controllers are trained with standard deep RL techniques, namely DDPG (deep deterministic policy gradient) and TD3 (twin delayed DDPG) [40], with hidden layers of size \(64 \times 64\) and \(128 \times 128\), respectively. The controller is designed for navigating a structured environment, such as the L-shaped track shown in Figure 14. The LiDAR sensor has at most 1,081 rays and a range of 5 m, and its scans sweep from \(-135\) degrees to 135 degrees, where 0 degrees indicates the front heading of the vehicle.
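The sketch below instantiates the scan geometry just described (1,081 rays spanning \(-135\) to \(+135\) degrees, range capped at 5 m) and converts a scan into Cartesian obstacle points, as visualized in the middle panel of Figure 14. The function and variable names are illustrative, not taken from the F1/10 software stack.

```python
# Sketch of the stated LiDAR geometry and a scan-to-points conversion (illustrative names).
import numpy as np

NUM_RAYS, MAX_RANGE = 1081, 5.0
ANGLES = np.deg2rad(np.linspace(-135.0, 135.0, NUM_RAYS))  # 0 rad = vehicle heading

def scan_to_points(ranges):
    """Map a 1-D scan of distances (meters) to (x, y) obstacle points in the car frame."""
    r = np.clip(np.asarray(ranges, dtype=float), 0.0, MAX_RANGE)
    x = r * np.cos(ANGLES)  # forward
    y = r * np.sin(ANGLES)  # lateral
    return np.stack([x, y], axis=1)
```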
Fig. 13. System description of the F1/10 autonomous car for navigation.
Fig. 14. Left: We show a setting where the car should take a right turn on an L-shaped track. Middle: The dots show the distance estimates as provided by the sensor. It matches well with the position of the obstacles. Right: Due to reflection from the left wall, it gives a false impression of no obstacle to the left of the car when deployed in the real world.
7.2 Simulation vs Reality
Simulation serves as an important resource for training LECs because of the cost and effort of obtaining a large real-world dataset. However, when deploying the system to the real world, the simulation-to-real gap is always a challenge and sometimes causes unpredictable behavior in safety-critical applications. Figure 14 illustrates such a failure, in which the car intends to turn right at the left corner of an L-shaped track from Ivanov et al. [25]. Reflective surfaces in the environment cause delayed ray responses, and the resulting missing measurements make the system believe that there is no obstacle at certain angles. The neural network controllers are sensitive to such inputs and fail to issue the right steering commands. Although verification techniques in other works [16, 26] can establish safety properties of these systems, they assume that the car operates in a certain expected environment. If the car could receive an alarm about OOD LiDAR measurements before the crash happens, it could take action or change mode to avoid being misguided by the neural network controller.
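To make the failure mode concrete, the sketch below injects the reflection artifact described above into a clean simulated scan: rays hitting a reflective surface return late or not at all, which the sensor reports as free space (maximum range). The sector boundaries are arbitrary values chosen for illustration, not measurements from the real track.

```python
# Hedged illustration of the reflection-induced missing measurements (arbitrary sector).
import numpy as np

def drop_reflected_rays(ranges, angles_deg, sector=(-90.0, -30.0), max_range=5.0):
    """Replace returns inside the reflective sector with max-range readings."""
    ranges = np.asarray(ranges, dtype=float).copy()
    mask = (angles_deg >= sector[0]) & (angles_deg <= sector[1])
    ranges[mask] = max_range  # appears as "no obstacle" to the controller
    return ranges
```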
7.3 Avoiding a Crash with OOD Detection
In this section, we report our experimental results on predicting crashes when driving with LiDAR measurements. Ivanov et al. [25] observed that delayed responses of LiDAR rays due to reflection are correlated with crashes. We hypothesize that the distribution shift of the LiDAR input is due to the deviation from simulation data (ideal, with no reflections) to real-world data. Hence, we define the distribution shift in this setup as LiDAR scans with reflected rays. However, there are no labels indicating which LiDAR measurements contain rays affected by reflective surfaces. Therefore, we evaluate our approach by predicting a crash beforehand, which is achieved by flagging inputs that deviate from iD data. This is also an interesting case study for evaluating whether OOD detection can be used to predict a crash. The LiDAR dataset \(\mathcal {S}\) is defined as \(\lbrace T_1, T_2,T_3, \dots \rbrace\) with \(T_i = \lbrace (x_1,p_1),(x_2,p_2),\dots \rbrace\), where \(x_i \in \mathbb {R}^q\) is the LiDAR scan and \(p_i\) indicates whether there is a crash at timestep \(i\). As described in the work of Ivanov et al. [25], the training data is obtained from a simulator for the 12 different controllers, and no crash occurs in simulation. The SSIM distance metric remains applicable in this case study: we create a two-dimensional input for the LiDAR measurements by repeating the one-dimensional measurements. There are 8 training traces, 10 calibration traces, and 236 test traces in our experiments (55 of which contain a crash). The average length of the test traces is about 410 timesteps, and the size of our calibration set is 4,000 scans. A crash prediction is successful if an OOD alarm is raised before the real crash happens. Our predictor reports a trace as OOD if at least one sliding window in the trace is flagged as OOD according to Algorithm 4. The evaluation metrics are defined as follows:
According to the results reported in Table 5 and Figure 15, our predictor detects 75% of the crashes approximately 14 timesteps ahead in the best case, with an FP rate of around 18% and an MPR of about 14% for the best choice of hyperparameters. Compared to the two baselines, we achieve a comparable TPR with a much lower false crash prediction rate, and our detector also has a lower execution time. As shown in Figure 15, for the same window size, increasing \(\epsilon\) increases both the correct crash prediction rate and the FPR. We also observe that the correct crash prediction rate and the FPR decrease as the size of the sliding window increases.
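As a concrete illustration of two details of the setup described above, the sketch below tiles a 1-D LiDAR scan into a 2-D array so that the SSIM distance remains applicable, and encodes the trace-level decision rule (a trace is OOD if at least one sliding window is flagged). The number of rows and the function names are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of the 1-D -> 2-D conversion for SSIM and the trace-level OOD rule (assumed names).
import numpy as np

def lidar_to_image(scan, rows=32):
    """Tile a 1-D scan of length q into a rows x q array so SSIM can be applied."""
    return np.tile(np.asarray(scan, dtype=float), (rows, 1))

def trace_is_ood(window_flags):
    """Trace reported OOD <=> at least one sliding window flagged OOD (Algorithm 4)."""
    return any(window_flags)
```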
Table 5. OOD Detection for LiDAR Data

Method          (n, \(\epsilon\), d)   Mem   TPR     FPR     MPR     Avg Forecast   Exec Time (ms)
Ours            (10, 0.001, 0.2)       41    85.71   35.53   12.5    22.73          3.7
Ours            (10, 0.001, 0.3)       24    91.07   40.7    8.93    24.5           1.99
Ours            (20, 5e-4, 0.3)        24    75.0    17.54   14.07   13.74          1.99
VAE             N/A                    N/A   85.71   77.23   8.93    120            11.27
\(\beta\)-VAE   N/A                    N/A   98.21   76.39   1.79    130            48.01
Fig. 15. OOD detection results for detecting LiDAR crash with different hyperparameters.
8 Conclusion
OOD detection can be of utmost importance in ensuring the safety of CPS equipped with LECs. In this article, we proposed a memory-based OOD detector to detect distribution shifts in real-time systems. Our algorithm achieves state-of-the-art OOD detection results for self-driving car applications with interpretability and statistical guarantees, without compromising execution time. In the future, we would like to extend this technique to applications beyond self-driving cars where anomalous inputs are challenging to handle.
Acknowledgments
We would like to thank Professor Radoslav Ivanov at Rensselaer Polytechnic Institute for discussions regarding the LiDAR experiments and for sharing the data. We are grateful to Professor Edgar Dobriban at the University of Pennsylvania and Shuo Li at the University of Pennsylvania for their insightful discussions regarding the application of HMP techniques and multiple hypothesis testing. We thank the reviewers for their valuable and constructive feedback that helped to improve the article.
Footnotes
1. Cai and Koutsoukos [11] also consider the distance of the input from the center of the hypersphere learned by a support vector data description (SVDD) trained on the training data as another score in ICAD for OOD detection in CPS with LECs.
2. Non-conformity scores of the calibration datapoints are also computed in the offline stage (Algorithm 3), using the same SSIM distance between each calibration datapoint and its closest memory in M.
References
D. Ter Haar (Ed.). 1965. On the energy loss of fast particles by ionisation. In Collected Papers of L. D. Landau. Pergamon, 417–424.
Leonard Kaufman and Peter J. Rousseeuw. 1990. Partitioning Around Medoids (Program PAM). Wiley Series in Probability and Statistics. Wiley, 68–125.
Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and Kevin Murphy. 2017. An information-theoretic analysis of deep latent-variable models. CoRR abs/1711.00464 (2017). http://arxiv.org/abs/1711.00464
Vineeth Balasubramanian, Shen-Shyang Ho, and Vladimir Vovk. 2014. Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications. Newnes.
Adith Boloor, Karthik Garimella, Xin He, Christopher Gill, Yevgeniy Vorobeychik, and Xuan Zhang. 2020. Attacking vision-based perception in end-to-end autonomous driving models. Journal of Systems Architecture 110 (2020), 101766.
Dominique Brunet, Edward R. Vrscay, and Zhou Wang. 2012. On the mathematical properties of the structural similarity index. IEEE Transactions on Image Processing 21, 4 (2012), 1488–1499.
Feiyang Cai and Xenofon Koutsoukos. 2020. Real-time out-of-distribution detection in learning-enabled cyber-physical systems. In Proceedings of the 2020 ACM/IEEE 11th International Conference on Cyber-Physical Systems (ICCPS ’20). IEEE, 174–183.
Dmitry Devetyarov and Ilia Nouretdinov. 2010. Prediction with confidence based on a random forest classifier. In Proceedings of the IFIP International Conference on Artificial Intelligence Applications and Innovations. 37–44.
Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. 2017. CARLA: An open urban driving simulator. In Proceedings of the 1st Annual Conference on Robot Learning. 1–16.
Souradeep Dutta, Xin Chen, Susmit Jha, Sriram Sankaranarayanan, and Ashish Tiwari. 2019. Sherlock—A tool for verification of neural network feedback systems: Demo abstract. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control (HSCC ’19). ACM, 262–263.
Souradeep Dutta, Yahan Yang, Elena Bernardis, Edgar Dobriban, and Insup Lee. 2022. Memory classifiers: Two-stage classification for robustness in machine learning. arXiv:2206.05323 (2022).
Yeli Feng, Daniel Jun Xian Ng, and Arvind Easwaran. 2021. Improving variational autoencoder based out-of-distribution detection for embedded real-time applications. ACM Transactions on Embedded Computing Systems 20, 5s (2021), 1–26.
Yeli Feng, Daniel Jun Xian Ng, and Arvind Easwaran. 2021. Improving variational autoencoder based out-of-distribution detection for embedded real-time applications. ACM Transactions on Embedded Computing Systems 20, 5s (Sept. 2021), Article 95, 26 pages. DOI:
Daniel J. Fremont, Johnathan Chiu, Dragos D. Margineantu, Denis Osipychev, and Sanjit A. Seshia. 2020. Formal analysis and redesign of a neural network-based aircraft taxiing system with VerifAI. In Computer Aided Verification, Shuvendu K. Lahiri and Chao Wang (Eds.). Springer International Publishing, Cham, 122–134.
K. Fukunaga and P. M. Narendra. 1975. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers C-24, 7 (1975), 750–753.
Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaudhuri, and Martin Vechev. 2018. AI2: Safety and robustness certification of neural networks with abstract interpretation. In Proceedings of the 2018 IEEE Symposium on Security and Privacy (SP ’18). 3–18.
Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016).
Radoslav Ivanov, Taylor J. Carpenter, James Weimer, Rajeev Alur, George J. Pappas, and Insup Lee. 2020. Case study: Verifying the safety of an autonomous racing car with a neural network controller. In Proceedings of the 23rd International Conference on Hybrid Systems: Computation and Control (HSCC ’20). ACM, 1–7.
Radoslav Ivanov, James Weimer, Rajeev Alur, George J. Pappas, and Insup Lee. 2019. Verisig: Verifying safety properties of hybrid systems with neural network controllers. In Proceedings of the 22nd ACM International Conference on Hybrid Systems: Computation and Control (HSCC ’19). ACM, 169–178.
Guy Katz, Clark Barrett, David L. Dill, Kyle Julian, and Mykel J. Kochenderfer. 2017. Reluplex: An efficient SMT solver for verifying deep neural networks. In Computer Aided Verification. Lecture Notes in Computer Science, Vol. 10426. Springer, 97–117.
Guy Katz, Derek A. Huang, Duligur Ibeling, Kyle Julian, Christopher Lazarus, Rachel Lim, Parth Shah, Shantanu Thakoor, Haoze Wu, Aleksandar Zeljić, David L. Dill, Mykel J. Kochenderfer, and Clark Barrett. 2019. The Marabou framework for verification and analysis of deep neural networks. In Computer Aided Verification. Lecture Notes in Computer Science, Vol. 11561. Springer, 443–452.
Ramneet Kaur, Radoslav Ivanov, Matthew Cleaveland, Oleg Sokolsky, and Insup Lee. 2020. Assurance case patterns for cyber-physical systems with deep neural networks. In Proceedings of the International Conference on Computer Safety, Reliability, and Security. 82–97.
Ramneet Kaur, Susmit Jha, Anirban Roy, Sangdon Park, Edgar Dobriban, Oleg Sokolsky, and Insup Lee. 2022. iDECODe: In-distribution equivariance for conformal out-of-distribution detection, association for the advancement of artificial intelligence. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 7104–7114.
Ramneet Kaur, Susmit Jha, Anirban Roy, Sangdon Park, Oleg Sokolsky, and Insup Lee. 2021. Detecting OODs as datapoints with high uncertainty. arXiv preprint arXiv:2108.06380 (2021).
Ramneet Kaur, Susmit Jha, Anirban Roy, Oleg Sokolsky, and Insup Lee. 2021. Are all outliers alike? On understanding the diversity of outliers for detecting OODs. arXiv preprint arXiv:2103.12628 (2021).
Ramneet Kaur, Susmit Jha, Anirban Roy, Oleg Sokolsky, and Insup Lee. 2023. Predicting out-of-distribution performance of deep neural networks using model conformance. In Proceedings of the 2023 IEEE International Conference on Assured Autonomy (ICAA ’23). IEEE, 19–28.
Ramneet Kaur, Xiayan Ji, Souradeep Dutta, Michele Caprio, Yahan Yang, Elena Bernardis, Oleg Sokolsky, and Insup Lee. 2023. Using semantic information for defining and detecting OOD inputs. arXiv preprint arXiv:2302.11019 (2023).
Ramneet Kaur, Kaustubh Sridhar, Sangdon Park, Yahan Yang, Susmit Jha, Anirban Roy, Oleg Sokolsky, and Insup Lee. 2023. CODiT: Conformal out-of-distribution detection in time-series data for cyber-physical systems. In Proceedings of the ACM/IEEE 14th International Conference on Cyber-Physical Systems (with CPS-IoT Week 2023). 120–131.
Rikard Laxhammar and Göran Falkman. 2011. Sequential conformal anomaly detection in trajectories based on Hausdorff distance. In Proceedings of the 14th International Conference on Information Fusion. IEEE, 1–8.
Juncheng Li, Frank R. Schmidt, and J. Zico Kolter. 2019. Adversarial camera stickers: A physical camera-based attack on deep learning systems. CoRR abs/1904.00759 (2019). http://arxiv.org/abs/1904.00759
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano L. I. Oliveira, and Teresa Ludermir. 2021. Entropic out-of-distribution detection. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN ’21). IEEE, 1–8.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv:1312.5602 [cs.LG] (2013).
R. T. Ng and Jiawei Han. 2002. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14, 5 (2002), 1003–1016.
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Ricardo Silva Peres, Xiaodong Jia, Jay Lee, Keyi Sun, Armando Walter Colombo, and Jose Barata. 2020. Industrial artificial intelligence in Industry 4.0: Systematic review, challenges and outlook. IEEE Access 8 (2020), 220121–220139.
Sriram Sankaranarayanan, Souradeep Dutta, and Sergio Mover. 2019. Reaching out towards fully verified autonomous systems. In Reachability Problems, Emmanuel Filiot, Raphaël Jungers, and Igor Potapov (Eds.). Springer International Publishing, Cham, 22–32.
Erich Schubert and Peter J. Rousseeuw. 2018. Faster k-medoids clustering: Improving the PAM, CLARA, and CLARANS algorithms. CoRR abs/1810.05691 (2018). http://arxiv.org/abs/1810.05691
David Silver, Aja Huang, Christopher Maddison, Arthur Guez, Laurent Sifre, George Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529 (2016), 484–489.
Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR abs/1312.6034 (2014).
Kaustubh Sridhar, Souradeep Dutta, Ramneet Kaur, James Weimer, Oleg Sokolsky, and Insup Lee. 2022. Towards alternative techniques for improving adversarial robustness: Analysis of adversarial training at a spectrum of perturbations. arXiv preprint arXiv:2206.06496 (2022).
Vijaya Kumar Sundar, Shreyas Ramakrishna, Zahra Rahiminasab, Arvind Easwaran, and Abhishek Dubey. 2020. Out-of-distribution detection in multi-label datasets using latent space of \(\beta\)-VAE. In Proceedings of the 2020 IEEE Security and Privacy Workshops (SPW ’20). IEEE, 250–255.
Sudharson Sundararajan, Ismail Zohdy, and Booz Allen Hamilton. 2016. Vehicle Automation and Weather: Challenges and Opportunities. U.S. Department of Transportation.
Ashish Tiwari, Bruno Dutertre, Dejan Jovanović, Thomas de Candia, Patrick D. Lincoln, John Rushby, Dorsa Sadigh, and Sanjit Seshia. 2014. Safety envelope for security. In Proceedings of the 3rd International Conference on High Confidence Networked Systems. 85–94.
Paolo Toccaceli and Alexander Gammerman. 2017. Combination of conformal predictors for classification. Proceedings of Machine Learning Research 60 (2017), 39–61.
Vladimir Vovk, Ilia Nouretdinov, and Alexander Gammerman. 2003. Testing exchangeability on-line. In Proceedings of the 20th International Conference on Machine Learning (ICML ’03). 768–775.
Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (April 2004), 600–612.
Daniel Wilson. 2019. The harmonic mean p-value for combining dependent tests. Proceedings of the National Academy of Sciences 116 (2019), 201814092.
Eric Wong and Zico Kolter. 2018. Provable defenses against adversarial examples via the convex outer adversarial polytope. Proceedings of Machine Learning Research 80 (2018), 5286–5295. https://proceedings.mlr.press/v80/wong18a.html
Weiming Xiang, Hoang-Dung Tran, and Taylor T. Johnson. 2017. Reachable set computation and safety verification for neural networks with ReLU activations. arXiv:1712.08163 (2017).
Yahan Yang, Ramneet Kaur, Souradeep Dutta, and Insup Lee. 2022. Interpretable detection of distribution shifts in learning enabled cyber-physical systems. In Proceedings of the ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS ’22).
Ev Zisselman and Aviv Tamar. 2020. Deep residual flow for out of distribution detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13994–14003.