1 Introduction

The main objective in remote sensing using hyperspectral imaging is the detection and identification of objects of interest in a relatively large area. The resulting data cube is so large that it is computationally intensive just to compute the sample covariance of the data cube. As the spectral resolution of the hyperspectral imaging system continues to increase, the size of the resulting data cube will continue to increase. Consequently, more efficient algorithms for target detection are needed to detect and identify objects of interest quickly and accurately in autonomous mode.

Target detection problem is divided into signature-based target detection and anomalous target detection. The signature-based target detection compares a reference spectral signature of the target from reference target libraries with the unknown spectral signatures of the materials in the data cube to determine if there is a match. The reference spectral signature of the target may not be available. The reference spectral signature of the target is in reflectance, but the spectral signatures of the materials are in radiance and have to be converted into reflectance using atmospheric calibration for the comparison. The anomalous target detection does not require prior knowledge of the target signature. The anomalous target detection simply compares the spectral signature of the background materials in the data cube with the spectral signatures of the unknown materials in the data cube to determine if they are anomalies which are potential target materials. Anomalous target detection is a more practical problem than signature-based target detection.

The objective of this paper is to develop an anomaly detector in analytical expression for detecting anomalous objects in a relatively large area using hyperspectral imaging. The desirable attributes of anomaly detectors for such an application in autonomous mode include a low input–output processing time, low computational time, and minimal number of user-specified parameters. Reviews of some common anomaly detectors for hyperspectral imagery are discussed in [1, 2]. Several types of anomaly detectors that have been considered are based on the linear mixing models [38], kernel functions [3, 9, 10], robust distance [1113], angle [14, 15], and statistical distance [1, 1618].

One type of anomaly detectors based on the linear mixing models requires prior knowledge of the spectra of the end members. One of the difficulties with this type of anomaly detectors is that the spectra of the end members for the complex background are contaminated with background spectra and do not really resemble the actual background spectra. The other type of anomaly detectors based on the linear mixing models finds the end members automatically from the data cube. The ORASIS detector [7] which has been implemented in real time finds the end members automatically by constructing a minimal volume simplex. However, it is computationally intensive to construct the minimal volume simplex. A recent anomaly detector [8] based on the linear mixing model uses clustering to find the background classes and then unmixes the pixels in which the pixels with higher residual are potential anomalies. Anomaly detectors that use clustering algorithm require a sorting of the distances and this is computationally intensive for a typical data cube.

Anomaly detectors based on the kernel functions like the kernelized RX detector [3] needs to compute the Gramm matrix and covariance matrix for each scanning window. Moreover, it requires every pixel in each scanning window to compute each detector output so it needs more input–output processing time in comparison with a typical anomaly detector that requires only the pixel itself to compute the detector output for that pixel. Thus, the kernelized RX detector requires high input–output processing time and high computational time. Anomaly detectors based on the robust distance obtain initial estimates of the background mean and covariance from a random sample of the background pixels. The Mahalanobis distances of all the pixels in the data cube are computed and the pixels are sorted according to their distances. A specified number of pixels with the smallest distances are selected next to estimate the mean and covariance and this iterative process continues until a stopping criterion is satisfied. Anomaly detectors based on the robust distance can be implemented in a local mode using a scanning window or in a global mode. Anomaly detectors based on the robust distance require repeated sortings of the distances and repeated computations of the covariance, so these detectors in either local mode or global mode are computationally very expensive for a large data cube. More studies are needed to analyze anomaly detectors based on angle. The anomaly detectors based on statistical distance that have been proposed are RX [16], SSRX (subspace RX) [1, 17], CSD (complementary subspace detection) [1], CTM (confined target model) [1], BCSD (Bayesian CSD) [17], and RXD [18].

The conventional anomaly detectors for detecting anomalies in a large area are the RX detector and SSRX detector and these anomaly detectors form the basis for comparing the performance of a new anomaly detector. The RX detector is given by

$$ d_{\rm{RX}}(x)=x^{\rm{T}}C^{-1}x $$
(1)

where x is a demeaned test pixel, C is a population covariance, and superscript T denotes transpose. An anomalous pixel corresponds to a large detector value in (1). The RX detector is the Mahalanobis distance of the pixel and it performs reasonably well as a general-purpose anomaly detector.

The target pixels of the objects of interest in remote sensing of a relatively large area are typically hard to detect. However, it has been noticed that the target pixels can be detected in the noise subspace. This fact leads to the development of the SSRX detector which finds the clutter subspace and then deletes it. The Mahalanobis distance of the pixel in the noise subspace is defined as the SSRX detector, i.e.

$$ d_{\rm {SSRX}}(x)=y_t^{\rm{T}}D^{-1}_ty_t $$
(2)

where x is a demeaned test pixel, y t is a projection of x onto the noise subspace, and D t is the covariance of y t . An anomalous pixel corresponds to a large detector value in (2). The SSRX detector has never failed to be reported superior to the RX detector for at least several choices of clutter subspace using real-time data.

The SSRX detector has only one user-specified parameter which is the dimension of the deleted clutter subspace. Currently, there is no reliable method to select an optimal value for the parameter automatically. The SSRX detector is sensitive to this parameter and the maximum number of possible values for this parameter is one fewer than the spectral dimension which is typically in the two hundreds. This paper proposes an anomaly detector with a parameter that assumes fewer values and the anomaly detector is developed in Sect. 2. Instead of deleting the clutter subspace, the anomaly detector partials out the effect of the clutter subspace in a pixel by predicting each spectral component of the pixel using a linear combination of the clutter subspace in which the dimension of the clutter subspace can vary from one spectral component to another one. The performance of the anomaly detector is compared with the RX detector and SSRX detector using hyperspectral data cubes in the visible and near-infrared range and the results are presented in Sect. 3.

Other anomaly detectors that are based on partialling out are discussed in [1923]. Instead of deleting the clutter subspace as in the SSRX detector, the recent anomaly detector maximized subspace model (MSM) in [23] partials out the effect of the clutter subspace in a pixel by predicting each spectral component of the pixel using a linear combination of the clutter subspace in which the dimension of the clutter subspace is the same for each spectral component of the pixel. The experimental results show that the MSM detector has a better performance than the SSRX detector, but both have the same parameter and are sensitive to their parameter which is the dimension of the clutter subspace. Instead of using the same dimension of the clutter subspace for each spectral component of the pixel as in the MSM detector, the anomaly detector that is proposed in this paper allows the dimension of the clutter subspace to vary from one spectral component to another one and has a different user-specified parameter that determines the dimension of the clutter subspace.

2 Variable subspace model

The analytical expression of the anomaly detector based on the variable subspace model for detecting anomalies in hyperspectral imagery is developed in this section. The statistical modeling for the variable subspace model is presented in Sect. 2.1 and is based on the partialling out the effect of the clutter subspace in a pixel from the data cube by predicting each spectral component of the pixel using a linear combination of high-variance principal components in which the number of high-variance principal components can vary from one spectral component to another one. The coefficients of the linear combination are derived in Sect. 2.2 and are chosen to maximize a criterion based on the squared correlation. The Mahalanobis distance of the resulting residual is defined as the anomaly detector and is derived in analytical expression in Sect. 2.3 The parameter for selecting the number of high-variance principal components for each spectral component of the pixel is described in Sect. 2.4 and is based on the maximized squared correlations.

2.1 Statistical modeling

The statistical modeling of the anomaly detector assumes that all pixels in a data cube consist of background pixels and target pixels in which there is only one dominant type of background pixels that represent the sample statistics of the data cube. The target pixels and the other types of background pixels are assumed to be non-influential on the sample statistics of the data cube so the sample statistics of the data cube are not significantly affected by the inclusion of these non-influential pixels in computing the sample statistics. A large data cube of a sufficiently large scene would typically satisfy these assumptions. In summary, all pixels in a data cube are assumed to be unlabeled and are used to compute the sample statistics which are the sample mean and sample covariance. The test pixel from the data cube is also used as a background pixel because sample statistics are computed using all pixels in the data cube and the test statistics is computed for each pixel in the data cube.

It has been noticed that target pixels for remotely sensed data cube of a relatively large area using hyperspectral imaging system are significantly correlated with the clutter subspace and the background pixels are highly correlated with the clutter subspace. To be able to detect the targets, the effect from the clutter subspace should be partialled out from the target pixels without having to delete the clutter subspace from the target pixels. Having partialled out the effect of the clutter subspace from the target pixels, the targets in the resulting residual would be more detectable. The anomaly detector is proposed as the Mahalanobis distance of the residual \(\epsilon\) that results from partialling out the effect of the clutter subspace from a pixel y, i.e.

$$ d(y)=(\epsilon-E(\epsilon))^{\rm{T}}({\rm{Var}}(\epsilon) )^{-1}(\epsilon-E(\epsilon)) $$
(3)

where the superscript T is the transpose, \(E(\epsilon)\) is the expected value of \(\epsilon, \) and \({\rm{Var}}(\epsilon)\) is the variance of y which is the same as the covariance of \(\epsilon\) and \(\epsilon, \) i.e.

$$ {\rm{Var}}(\epsilon)= {\rm{Cov}}\left(\epsilon,\epsilon^{\rm{T}}\right) $$
(4)
$$ {\rm{Var}}(\epsilon)=E\left((\epsilon-E(\epsilon))(\epsilon-E(\epsilon))^{\rm{T}}\right). $$
(5)

The high-variance principal components and low-variance principal components of y are used to estimate the clutter subspace and noise subspace, respectively. The p × 1 residual \(\epsilon\) is the error that results from predicting each spectral component y i with a linear combination of the q i components from the q i  × 1 high-variance principal component \(\bar{x}_i\) of the p × 1 pixel y, i.e.

$$ \epsilon=\left[\begin{array}{c}y_1-\beta_1^{\rm{T}}\bar{x}_1\\ y_2-\beta_2^{\rm{T}}\bar{x}_2\\ \vdots\\ y_p-\beta_p^{\rm{T}}\bar{x}_p\end{array}\right] $$
(6)

where

$$ \epsilon=[\begin{array}{ccc}\epsilon_1\; \epsilon_2\; \ldots\; \epsilon_p\end{array}]^{\rm{T}} $$
(7)
$$ y=[\begin{array}{ccc}y_1\;y_2\;\ldots\;y_p\end{array}]^{\rm{T}} $$
(8)
$$ \bar{x}_i=[\begin{array}{ccc}x_1\;x_2\;\ldots\;x_{q_i}\end{array}]^{\rm{T}} $$
(9)
$$ \beta_i=[\begin{array}{ccc}\beta_{i,1}\;\beta_{i,2}\;\ldots\;\beta_{i,q_i}\end{array}]^{\rm{T}} $$
(10)

for 1 ≤ i ≤ p and 1 ≤ q i  ≤ p − 1. The index q i which is a function of i is the number of principal components used to estimate the ith spectral component y i of y. The value of q i can change from one spectral component to another spectral component. The q i  × 1 coefficient β i consists of the q i unknown coefficients \(\beta_{i,1},\beta_{i,2},\ldots,\beta_{i,q_i}\) of the linear combination for the spectral component y i .

The number of principal components used in the linear combination can vary from one spectral component to another. The squared correlation between each spectral component and its linear combination of the high-variance principal components is typically different for different spectral components although each spectral component uses the same number of principal components in the linear combination. The anomaly detection model in (3) allows the number of principal components to vary and includes the special case in which the number of principal components for each spectral component is constrained to be the same. The performance of current hyperspectral anomaly detectors that use the same number of predictors, which are usually principal components, for each spectral component are sensitive as a function of the number of predictors. The anomaly detection model in (3) which is based on the variable number of principal components would have the potential of mitigating this sensitivity by selecting the number of high-variance principal components to account for the high variance in the background scene more accurately. The number of principal components for each spectral component is determined by specifying a threshold on the squared correlation between the spectral component and linear combination of the high-variance principal components.

2.2 Estimation of coefficient

The unknown coefficient β i of the linear combination used to predict each spectral component y i is selected to maximize the following function

$$ f(\beta_i)=\left(\frac{{\rm{Cov}}\left(y_i,\bar{x}_i^{\rm{T}}\beta_i\right)}{\sqrt{{\rm{Var}}(y_i)}\sqrt {{\rm{Var}}\left(\bar{x}_i^{\rm{T}}\beta_i\right)}}\right)^2 $$
(11)

where

$$ {\rm{Cov}}\left(y_i,\bar{x}_i^{\rm{T}}\beta_i\right)=E\left((y_i-E(y_i))\left(\beta_i^{\rm{T}}\bar{x}_i-E \left(\beta_i^{\rm{T}}\bar{x}_i\right)\right)^{\rm{T}}\right). $$
(12)

The function f i ) is the squared correlation between y i and \(\beta_i^{\rm{T}}\bar{x}_i.\) The value of the coefficient \((\beta_i)\) that maximizes the function f i ) is denoted by \(\bar{\beta}_i.\) The derivation of the maximum solution for the function f i ) is presented below.

Each prinicipal component in \(\bar{x}_i\) can be written as a linear combination of the pixel y in which the coefficients of the linear combination are the elements of an eigenvector. The principal component \(\bar{x}_i\) for predicting y i can be written as

$$ \bar{x}_i=\left[\begin{array}{cccc}\alpha_{1}^{\rm{T}}y\;\alpha_{2}^{\rm{T}}y\;\ldots\;\alpha_{q_i}^{\rm{T}}y \end{array}\right]^{\rm{T}}$$
(13)

where (λ i , α i ) for \(i=1,2,\ldots,p\) are the eigenvalue–eigenvector pairs of Var(y) and the eigenvalues satisfy the condition \(\lambda_1\geq\lambda_2\geq\dots\geq\lambda_p.\) By substituting Eq. 13 into \({\rm{Cov}}(y_i,\bar{x}_i^{\rm{T}})\) and \({\rm{Cov}}(\bar{x}_i,\bar{x}_i^{\rm{T}}), {\rm{Cov}}(y_i,\bar{x}_i^{\rm{T}})\) and \({\rm{Cov}}(\bar{x}_i,\bar{x}_i^{\rm{T}})\) become

$$ {\rm{Cov}}\left(y_i,\bar{x}_i^{\rm{T}}\right)={\rm{Cov}}\left(y_i,y^{\rm{T}}\right)\bar{\alpha}_i $$
(14)
$$ {\rm{Cov}}\left(\bar{x}_i,\bar{x}_i^{\rm{T}}\right)=\bar{\alpha_i}^{\rm{T}}{\rm{Var}}(y)\bar{\alpha}_i=\bar{D}_i $$
(15)

where \(\bar{D}_i\) is a q i  × q i diagonal matrix with diagonal elements \(\lambda_{1},\lambda_{2},\dots,\lambda_{q_i}\) and \(\bar{\alpha}_i\) is a p × q i matrix of eigenvectors, i.e.

$$ \bar{D}_i={\rm{diag}}(\lambda_{1},\lambda_{2},\ldots,\lambda_{q_i}) $$
(16)
$$ \bar{\alpha}_i=\left[\begin{array}{cccc}\alpha_{1}\;\alpha_{2}\;\ldots\; \alpha_{q_i}\end{array}\right]. $$
(17)

By substituting \({\rm{Cov}}\left(y_i,\bar{x}_i^{\rm{T}}\right)\) in (14) and \({\rm{Cov}}\left(\bar{x}_i,\bar{x}_i^{\rm{T}}\right)\) in (15) into the function f i ) in (11), f i ) can be written as

$$ f(\beta_i)=\frac{\beta_i^{\rm{T}}\bar{\alpha}_i^{\rm{T}}{\rm{Cov}}\left(y_i,y^{\rm{T}}\right)^{\rm{T}}{\rm{Cov}}\left(y_i,y^{\rm{T}}\right)\bar {\alpha}_i\beta_i}{{\rm{Cov}}\left(y_i,y_i^{\rm{T}}\right)\beta_i^{\rm{T}}\bar{D}_i\beta_i}. $$
(18)

By differentiating the function f i ) in (18) with respect to β i , the gradient of f i ) is

$$ \frac{\partial{f(\beta_i)}}{\partial{\beta_i}}=\frac{2\bar{\alpha}_i^{\rm{T}}{\rm{Cov}}\left(y_i,y^ {\rm{T}}\right)^ {\rm{T}} {\rm{Cov}}\left(y_i,y^ {\rm{T}}\right)\bar{\alpha}_i\beta_i}{ {\rm{Cov}}\left(y_i,y_i^ {\rm{T}}\right) \beta_i^ {\rm{T}}\bar{D}_i\beta_i} -\frac{2f(\beta_i)\bar{D}_i\beta_i}{\beta_i^ {\rm{T}}\bar{D}_i\beta_i}. $$
(19)

By setting the gradient of f i ) in (19) to a zero vector and then pre-multiplying it by \(\bar{D}_i^{-1}, \) the defining equation for determining β i is

$$ \left(\psi_i-f(\beta_i)I_{q_i}\right)\beta_i=0_{q_i} $$
(20)

where \(0_{{q_{i}}}\) is a q i  × 1 zero vector, \(I_{{q_{i}}}\) is a q i  × q i identity matrix, and

$$ \psi_i=\frac{\bar{D}_i^{-1}\bar{\alpha}_i^{\rm{T}}{\rm{Cov}}\left(y_i,y^{\rm{T}}\right)^{\rm{T}}{\rm{Cov}}\left(y_i,y^{\rm{T}}\right) \bar{\alpha}_i}{{\rm{Cov}}\left(y_i,y_i^{\rm{T}}\right)}.$$
(21)

The defining equation in (20) shows that f i ) is an eigenvalue of ψ i with corresponding eigenvector β i .

By cyclic permutation on ψ i in (21), the maximum value for f i ) is

$$ f(\beta_i)=\frac{{\rm{Cov}}\left(y_i,y^{\rm{T}}\right)\bar{\alpha}_i\bar{D}_i^{-1}\bar{\alpha}_i^{\rm{T}}{\rm{Cov}} \left(y_i,y^{\rm{T}}\right)^{\rm{T}}}{{\rm{Cov}}\left(y_i,y_i^{\rm{T}}\right)} $$
(22)

which is the largest eigenvalue of ψ i and the corresponding eigenvector is

$$ \beta_i=\bar{D}_i^{-1}\bar{\alpha}_i^{\rm{T}}{\rm{Cov}}\left(y_i,y^{\rm{T}}\right)^{\rm{T}}$$
(23)

which is obtained by substituting f i ) in (22) into the defining equation in (20). The value of the coefficient β i that maximizes the function f i ) is given in (23).

2.3 Mahalanobis distance of residual

By substituting the coefficient β i in (23) into the residual \(\epsilon\) in (6), the residual \(\epsilon\) can be written as

$$ \epsilon=\left[\begin{array}{cccc}y_1-{\rm{Cov}}\left(y_1,y^{\rm{T}}\right)\bar{\alpha}_1\bar{D}^{-1}_1\bar{x}_1\\ y_2-{\rm{Cov}}\left(y_2,y^{\rm{T}}\right)\bar{\alpha}_2\bar{D}^{-1}_2\bar{x}_2\\ \vdots\\ y_p-{\rm{Cov}}\left(y_p,y^{\rm{T}}\right)\bar{\alpha}_p\bar{D}^{-1}_p\bar{x}_p\end{array}\right] $$
(24)

By substituting the principal component \(\bar{x}_i\) in (13) into the residual \(\epsilon\) in (24), the expected value of the residual \(\epsilon\) is

$$ E(\epsilon)=\left(I_{p}-\phi\right)E(y) $$
(25)

where I p is a p × p identity matrix and

$$ \phi= \left[\begin{array}{cccc}{\rm{Cov}}(y_1,y^{\rm{T}})\bar{\alpha}_1\bar{D}^{-1}_1\bar{\alpha}_1^{\rm{T}}\\ {\rm{Cov}}\left(y_2,y^{\rm{T}}\right)\bar{\alpha}_2\bar{D}^{-1}_2\bar{\alpha}_2^{\rm{T}}\\ \vdots\\ {\rm{Cov}}\left(y_p,y^{\rm{T}}\right)\bar{\alpha}_p\bar{D}^{-1}_p\bar{\alpha}_p^{\rm{T}}\end{array}\right]. $$
(26)

The variance of the residual \(\epsilon\) is

$$ {\rm{Var}}(\epsilon)={\rm{Var}}(y)+{\rm{Var}}(\phi y)-2{\rm{Cov}}\left(y,y^{\rm{T}}\phi^{\rm{T}}\right) $$
(27)
$$ {\rm{Var}}(\epsilon) ={\rm{Cov}}\left(y,y^{\rm{T}}\right)+(\phi-2I_p){\rm{Cov}}\left(y,y^{\rm{T}}\right)\phi^{\rm{T}}.$$
(28)

Thus, the expected value \(E(\epsilon)\) and variance \({\rm{Var}}(\epsilon)\) of the residual \(\epsilon\) are given in (25) and (28), respectively.

The Mahalanonis distance of the residual in (3) is a function of the inverse of the variance of the residual. However, the rank of the variance of the residual is less than full rank so the inverse variance of the residual does not exist. A method of approximating the inverse variance of the residual is needed. The high-variance principal components are linear combinations of the spectral components that represent the dominant background. Having partialled out the high-variance principal components, the resulting residual consists of mostly random noise and a small amount of target signal. The off-diagonal elements in the variance of the residual are approximately zero because the residual consists mainly of random noise. Thus, the variance of the residual can be approximated with a diagonal matrix with diagonal elements from the variance of the residual, so the inverse variance of the residual can be approximated.

By substituting the expected value of the residual in (25) and the diagonal matrix of the variance of the residual in (28) into the Mahalanonis distance in (3), the Mahalanobis distance of the residual becomes

$$ d_{{\rm{VSM}}}(y)=\left[\begin{array}{cccc} y_1-\mu_1-{\rm{Cov}}\left(y_1,y^{\rm{T}}\right)\bar{\alpha}_1\bar{D}_1^{-1}\left(\bar{x}_1-\bar{\alpha}_1^{\rm{T}}\mu\right) \\ y_2-\mu_2-{\rm{Cov}}\left(y_2,y^{\rm{T}}\right)\bar{\alpha}_2\bar{D}_2^{-1}\left(\bar{x}_2-\bar{\alpha}_2^{\rm{T}}\mu\right) \\ \dots \\ y_p-\mu_p-{\rm{Cov}}\left(y_p,y^{\rm{T}}\right)\bar{\alpha}_p\bar{D}_p^{-1}\left(\bar{x}_p-\bar{\alpha}_p^{\rm{T}}\mu\right) \end{array}\right]^{\rm{T}}\times$$
(29)
$$ \left({\rm{diag}}\left({\rm{Cov}}\left(y,y^{\rm{T}}\right)+(\phi-2I_p){\rm{Cov}}\left(y,y^{\rm{T}}\right)\phi^{\rm{T}}\right)\right)^{-1}\times $$
(30)
$$ \left[\begin{array}{cccc} y_1-\mu_1-{\rm{Cov}}\left(y_1,y^{\rm{T}}\right)\bar{\alpha}_1\bar{D}_1^{-1}\left(\bar{x}_1-\bar{\alpha}_1^{\rm{T}}\mu\right) \\ y_2-\mu_2-{\rm{Cov}}\left(y_2,y^{\rm{T}}\right)\bar{\alpha}_2\bar{D}_2^{-1}\left(\bar{x}_2-\bar{\alpha}_2^{\rm{T}}\mu\right) \\ \dots \\ y_p-\mu_p-{\rm{Cov}}\left(y_p,y^{\rm{T}}\right)\bar{\alpha}_p\bar{D}_p^{-1}\left(\bar{x}_p-\bar{\alpha}_p^{\rm{T}}\mu\right) \end{array}\right]. $$
(31)

The analytical expression of the anomaly detector-based variable subspace model (VSM) is given in (29), (30), and (31). A large value in the detector output d VMS(y) would indicate that the pixel y is a potential anomalous pixel. The expected value in (25) and variance of the residual in (28) have been derived assuming only the existence of the covariance of y without making any assumption on a probability distribution for y, but defining the anomaly detector as a Mahalanobis distance implies that y has a multivariate normal distribution.

2.4 Parameter

The VSM detector is based on the Mahalanobis distance of the residual. The residual is the error that results from predicting each spectral component of the pixel with a linear combination of the high-variance principal components. The number of high-variance principal components can vary from one spectral component to another one and a criterion for determining the number of high-variance principal components for each spectral component is needed.

The proposed criterion for selecting q i , the number of high-variance principal components for the ith spectral component y i of the pixel y, is based on the maximized squared correlation \(f(\bar{\beta}_i)\) which is given by

$$ f(\bar{\beta}_i)=\frac{{\rm{Cov}}(y_i,y^{\rm{T}})\bar{\alpha}_i\bar{D}^{-1} _i\bar{\alpha}^{\rm{T}}_i {\rm{Cov}}(y_i,y^{\rm{T}})^{\rm{T}}}{{\rm{Cov}}(y_i,y_i^{\rm{T}})}.$$
(32)

As the number of principal components increases, the maximized squared correlation in (32) will continue to increase until the increase will become not so significant at some value of q i . The value of q i is selected such that the absolute difference between the maximized squared correlation for the next value of q i and the maximized squared correlation for the current value of q i is within a prescribed tolerance. Let \(f^k(\bar{\beta}_i)\) be the maximized squared correlation for q i  = k where \(k=1,2,\dots,p-1. \) Then the value of q i is determined by the value of k such that

$$ \left|f^{k+1}(\bar{\beta}_i)-f^k(\bar{\beta}_i)\right|\le {\rm{tol}},$$
(33)

where tol is a specified tolerance.

The criterion for selecting values for tol is based on observations obtained from experimental results. The maximized squared correlation between each spectral component of the pixel and the first few high-variance principal components of the pixel quickly reaches 0.9, so the maximum absolute difference between successive maximized squared correlations in (33) should be approximately 0.1 and a maximum value for tol would be tol = 10−1. As the value of tol decreases to smaller values like \({\rm{tol}}=10^{-2}, {\rm{tol}}=10^{-3},\ldots, \) it will reach a value where at least one of the diagonal elements of the variance of the residual \(\epsilon\) will approach 0 and the previous value of the tol would be the minimum value for tol.

Thus, the VSM detector which has been developed in analytical expression has only one parameter tol and the values for the parameter can be selected automatically with a sequence of decreasing values for the parameter like \({\rm{tol}}=10^{-1}, {\rm{tol}}=10^{-2}, {\rm{tol}}=10^{-3}, \dots. \) The iteration is typically terminated after a few iterations when one of the diagonal elements of the variance of the residual becomes less than the machine epsilon in double precision. The VSM detector would have fewer possible values for the parameter than the SSRX detector.

3 Experimental results

A relative comparison of the performance between the VSM detector and the SSRX detector with respect to the RX detector is presented in this section. Two hyperspectral data cubes are used in the experiment in which the targets in the first data cube (ARCHER) are more difficult to detect than those in the second data cube (RIT). The experimental results for the first data cube and the second data cube are analyzed in Sects. 3.1 and 3.2, respectively.

A quantitative assessment of the performance of the VSM detector is accomplished in this paper by generating the ROC curves in unsupervised mode and by counting pixels instead of objects. The RX, SSRX, and VSM detectors are implemented as global anomaly detectors in which the background statistics are computed using all pixels in the data cube and the pixels are assumed to be unlabeled and normally distributed for both data cubes. A visual assessment of the performance of the VSM detector in identifying the anomalies is accomplished by generating images of the detector output.

The main computational cost for the RX, SSRX, and VSM detectors is the computation of the covariance matrix. The RX, SSRX, and VSM detectors compute the covariance matrix only once. The three anomaly detectors are in analytical expression and have about the same computational cost for generating each image of the detector output. As global detectors, the RX detector does not have a parameter, but the SSRX detector and the VSM detector have only one parameter. The RX, SSRX, and VSM detectors can also be implemented in a local mode with a scanning window and this will introduce at least one more parameter.

3.1 First data cube

The ARCHER data cube is in the visible and near-infrared wavelengths with spatial dimensions of 1,000 by 1,024 and spectral dimension of 42. The percentage of anomalous pixels which consist of man-made objects in the data cube is about 0.0766%. The data cube covers a large area, and a representative 280 × 400 RGB subimage of the data cube with some man-made objects labeled in white using spectral band number 40, 35, and 20 for the red band, green band, and blue band, respectively, is shown in Fig. 1.

Fig. 1
figure 1

A 280 × 400 RGB image of the first data cube with man-made objects labeled in white using spectral band number 40, 35, and 20 for the red band, green band, and blue band, respectively

The maximized squared correlation between each spectral component of the pixel and its linear combination of the high-variance principal components for the first data cube is shown in Fig. 2 for spectral band number from 1 to 42 and number of principal components from 2 to 41. The maximized squared correlation is relatively low when only the first high-variance principal component is used in the linear combination, so it is not included to better see the differences in the other correlations. The corresponding absolute difference between two adjacent maximized squared correlations is shown in Fig. 3 and this absolute difference is used to determine the number of high-variance principal components in the linear combination for each spectral component of the pixel. The number of high-variance principal components for each spectral component of the pixel is shown in Figs. 4 and 5 for tol = 10−3, tol = 10−4, tol = 10−5, tol = 10−6, and tol = 10−7. As the tolerance for the absolute difference between two adjacent maximized squared correlations decreases from tol = 10−3 to tol = 10−7, the number of high-variance principal components for each spectral component increases. The iteration terminates at tol = 10−6 when the diagonal element of the variance of the residual for spectral component number 11 becomes 0 at tol = 10−7. The VSM detector achieves its best performance at tol = 10−6. The minimum value of the diagonal elements of the variance of the residual for the VSM detector is shown in Fig. 18 for each value of the tolerance. The minimum value decreases as the tolerance decreases from tol = 10−3 to tol = 10−7 and it becomes 0 at tol = 10−7. The iteration is terminated when the minimum value becomes less than the machine epsilon in double precision. The VMS detector generates five images of detector output at tol = 10−2, tol = 10−3, tol = 10−4, tol = 10−5, and tol = 10−6 to be analyzed, but the SSRX detector requires 41 images. Thus, the VMS detector is computational more efficient than the SSRX detector.

Fig. 2
figure 2

Maximized squared correlation as a function of spectral band number and number of principal components for the VSM detector for the data cube in Fig. 1

Fig. 3
figure 3

Absolute difference between two adjacent maximized squared correlations as a function of spectral band number and number of principal components for the VSM detector for the data cube in Fig. 1

Fig. 4
figure 4

Number of principal components selected for each spectral band number for the VSM detector for the data cube in Fig. 1 for tol = 10−3, tol = 10−5, tol = 10−7

Fig. 5
figure 5

Number of principal components selected for each spectral band number for the VSM detector for the data cube in Fig. 1 for tol = 10−4 and tol = 10−6

The parameter q in SSRX detector is the number of high-variance principal components used for the deleted clutter subspace. There is currently no reliable way to automatically select the optimal value for q. Using the ROC curve from the RX detector as a reference, the difference in probability of detection between the SSRX detector and the RX detector is shown in Figs. 6, 7, 8, and 9 for 1 ≤ q ≤ 10, 11 ≤ q ≤ 20, 21 ≤ q ≤ 30, and 31 ≤ q ≤ 41, respectively. The difference is obtained by substracting the probability of detection of the RX detector from that of the SSRX detector. A positive difference indicates that the SSRX detector performs better than the RX detector. The x-coordinates of the points in the ROC curves of the RX detector do not coincide with those of the SSRX detector in general so a linear interpolation is used to interpolate the probabilities of detection of the SSRX detector to the same x-coordinates of the RX detector. In a similar way, the difference in probability of detection between the VSM detector and RX detector using the RX detector as a reference is shown in Fig. 10 for tol = 10−3, tol = 10−4, tol = 10−5, and tol = 10−6.

Fig. 6
figure 6

Difference in probability of detection between the SSRX detector and RX detector (SSRX-RX) for the data cube in Fig. 1 for q = 1 to q = 10

Fig. 7
figure 7

Difference in probability of detection between the SSRX detector and RX detector (SSRX-RX) for the data cube in Fig. 1 for q = 11 to q = 20

Fig. 8
figure 8

Difference in probability of detection between the SSRX detector and RX detector (SSRX-RX) for the data cube in Fig. 1 for q = 21 to q = 30

Fig. 9
figure 9

Difference in probability of detection between the SSRX detector and RX detector (SSRX-RX) for the data cube in Fig. 1 for q = 31 to q = 41

Fig. 10
figure 10

Difference in probability of detection between the VSM detector and RX detector (VSM-RX) for the data cube in Fig. 1

The ROC curves in Figs. 6, 7, 8, and 9 show the SSRX detector performs like the RX detector for q = 1, performs increasingly better than the RX detector for 2 ≤ q ≤ 24 with a peak of about 0.20 located at a probability of false alarm of about 10−3 for q = 23, and performs increasingly worse than the RX detector for 25 ≤ q ≤ 41. The ROC curves in Fig. 10 show the VSM detector keeps improving its performance as the tolerance decreases, performs worse than the RX detector for tol = 10−3 and tol = 10−4, and performs better than the RX detector for tol = 10−5, tol = 10−6, and tol = 10−7 with a peak of about 0.23 located at a probability of false alarm of about 10−3 for tol = 10−6. Thus, the overall performance of the VMS detector is better than that of the SSRX detector and the VSM detector can automatically stop the iteration of the values of its parameter and can detect the target within the range of the resulting values of its parameter.

3.2 Second data cube

The RIT (Rochester Institute of Technology) data cube [24] is in the visible and near-infrared wavelengths with spatial dimensions of 280 by 400 and spectral dimension of 126. The anomalous pixels are selected to be man-made objects that are easy to detect like large buildings. The 280 × 400 RGB image of the data cube using spectral band number 17, 7, and 2 for the red band, green band, and blue band, respectively, is shown in Fig. 11 and the estimated locations of the targets are shown in white in Fig. 12.

Fig. 11
figure 11

An RGB image of the 280 × 400 data cube (second data cube) using spectral band number 17, 7, and 2 for the red band, green band, and blue band, respectively

Fig. 12
figure 12

An image of the locations of targets for the data cube in Fig. 11

The maximized squared correlation between each spectral component of the pixel and its linear combination of the high-variance principal components for the second data cube is shown in Fig. 13 for spectral band number from 1 to 42 and number of principal components from 1 to 41. The corresponding absolute difference between two adjacent maximized squared correlations is shown in Fig. 14 and this absolute difference is used to determine the number of high-variance principal components in the linear combination for each spectral component of the pixel. The number of high-variance principal components for each spectral component of the pixel is shown in Figs. 15, 16, and 17 for tol = 10−1, tol = 10−2, tol = 10−3, tol = 10−4, tol = 10−5, tol = 10−6, tol = 10−7, tol = 10−8, and tol = 10−9. As the tolerance for the absolute difference between two adjacent maximized squared correlations decreases from tol = 10−1 to tol = 10−9, the number of high-variance principal components for each spectral component increases. The iteration terminates at tol = 10−9 when the diagonal elements of the variance of the residual for spectral component number 7 and 11 become 0 at tol = 10−10. The minimum value of the diagonal elements of the variance of the residual for the VSM detector is shown in Fig. 18 for each value of the tolerance. The minimum value decreases as the tolerance decreases from tol = 10−1 to tol = 10−10 and it becomes 0 at tol = 10−10. The VMS detector generates 9 images of detector output to be analyzed, but the SSRX detector requires 125 images. Thus, the VMS detector is computational more efficient than the SSRX detector.

Fig. 13
figure 13

Maximized squared correlation as a function of spectral band number and number of principal components for the VSM detector for the data cube in Fig. 11

Fig. 14
figure 14

Absolute difference between two adjacent maximized squared correlations as a function of spectral band number and number of principal components for the VSM detector for the data cube in Fig. 11

Fig. 15
figure 15

Number of principal components selected for each spectral band number for the VSM detector for the data cube in Fig. 11 for tol = 10−1, tol = 10−4, and tol = 10−7

Fig. 16
figure 16

Number of principal components selected for each spectral band number for the VSM detector for the data cube in Fig. 11 for tol = 10−2, tol = 10−5, and tol = 10−8

Fig. 17
figure 17

Number of principal components selected for spectral band number for the VSM detector for the data cube in Fig. 11 for tol = 10−3, tol = 10−6, and tol = 10−9

The ROC curves in Figs. 19 and 20 for \(1\le q\; \le12\) and \(1\;\le q \;\le125, \) respectively, show that the SSRX detector performs like the RX detector for q = 1, but it performs increasingly worse than the RX detector for 2 ≤ q ≤ 125. The ROC curves in Fig. 21 for tolerances from tol = 10−1 to tol = 10−9 show the VSM detector performs worse than the RX detector for tol = 10−8 and tol = 10−9, and performs better than the RX detector for tolerances from tol = 10−1 to tol = 10−7 with a peak located at a probability of false alarm of about 10−4 for tol = 10−2. The VSM detector for tolerances from tol = 10−1 to tol = 10−7 performs better than the SSRX detector for 1 ≤ q ≤ 125. Thus, the VSM detector can automatically stop the iteration of the values of its parameter and can detect the target within the range of the resulting values of its parameter.

Fig. 18
figure 18

The minimum value of the diagonal elements of the variance of the residual for the VSM detector for each value of the tolerance for the data cube in Figs. 1 and 11

Fig. 19
figure 19

Difference in probability of detection between the SSRX detector and RX detector (SSRX-RX) for the data cube in Fig. 11 for q = 1 to q = 12

Fig. 20
figure 20

Difference in probability of detection between the SSRX detector and RX detector (SSRX-RX) for the data cube in Fig. 11 for q = 1 to q = 125

Fig. 21
figure 21

Difference in probability of detection between the VSM detector and RX detector (VSM-RX) for the data cube in Fig. 11

The false-color images of the RX detector, the SSRX detector for q = 1, and the VSM detector for tol = 10−2 are shown in Figs. 22, 23, and 24, respectively, for visual comparison of their performance. The value of the detector output is a double-precision floating-point number and is transformed to an unsigned 8-bit integer using a linear transformation for display purpose. The minimum value of the detector output is mapped to 0 and the 99th percentile of the detector output is mapped to 255 in which any detector output with a value greater than the 99th is mapped to 255. The images of the detector output in Figs. 22, 23, and 24 show that the images of the RX detector and the SSRX detector are significantly noisier than that of the VSM detector and the images of the SSRX detector and the RX detector are comparable. Thus, the RIT data cube shows that the VSM detector performs better than the RX detector and the SSRX detector in detecting background pixels. The ARCHER data cube has also shown that the VSM detector performs better than the RX detector and the SSRX detector in detecting background pixels.

Fig. 22
figure 22

An image of the detector output for the RX detector for the data cube in Fig. 11

Fig. 23
figure 23

An image of the detector output for the SSRX detector with q = 1 for the data cube in Fig. 11

Fig. 24
figure 24

An image of the detector output for the VSM detector with tol = 10−2 for the data cube in Fig. 11

4 Conclusion

The analytical expression of the VSM detector for detecting anomalies using hyperspectral imaging system has been developed. The VSM detector is defined as the Mahalanobis distance of the residual resulting from partialling out the effect of the clutter subspace from a pixel by predicting each component of the pixel with a linear combination of the high-variance principal components. The coefficients of the linear combination are chosen to maximize the squared correlation between each component of the pixel and its linear predictor assuming only the existence of the covariance of the pixel. However, defining the anomaly detector as a Mahalanobis distance implies that the pixel has a multivariate normal distribution so the RX, SSRX, and VSM detectors assume that the pixel has a multivariate normal distribution. The number of high-variance principal components for each spectral component of the pixel can vary and is determined by the parameter tolerance based on the difference between two maximized squared correlations. The experimental results are obtained by implementing the anomaly detector as a global anomaly detector in unsupervised mode using two hyperspectral data cubes with wavelengths in the visible and near-infrared range in which the targets in the first data cube are more difficult to detect than those in the second one. The results show that the parameter in the VSM detector has a significantly reduced number of possible values than that in the SSRX detector and the values for the parameter in the VSM detector can be determined automatically in an iterative mode. The VSM detector requires a small tolerance to achieve its optimal performance for the first data cube and a large tolerance to achieve its optimal performance for the second data cube. For the first data cube, the results show that the VSM detector performs better than the SSRX detector and both perform better than the RX detector. For the second data cube, the VSM detector outperforms the SSRX detector which performs as well as the RX detector for one value of its parameter and worse than the RX detector for the rest of the values of its parameter.