Introduction

Multivariate geochemical data resulting from rock chips, soil, or stream sediment sampling can serve as potential datasets for unsupervised outlier detection methods (e.g., Shahrestani and Sanislav 2025a). Unsupervised machine learning models have proven highly effective for anomaly detection, primarily because they do not require labeled samples (e.g., Shahrestani and Carranza 2024). Outlier detection methods rank samples based on their deviations from normal behavior, reflecting signals originating from anomalous sources. These methods are classified into different groups based on the techniques employed to isolate outlier samples, including statistical approaches, proximity-based methods, density-based techniques, clustering methods, ensemble approaches, and machine learning algorithms (e.g., Shahrestani and Carranza 2024; Zimek et al. 2012). In the context of geochemical exploration, there is an increasing trend toward using unsupervised outlier detection approaches, such as restricted Boltzmann machines (RBMs) (Chen 2015), one-class support vector machines (OCSVMs) (Chen and Wu 2017; Shahrestani and Carranza 2024), dictionary learning (Chen and Sui 2022), deep autoencoders (Xiong and Zuo 2016), variational autoencoders (VAEs) (Zhao et al. 2023; Esmaeiloghli et al. 2023), geo-HGAN (Ding et al. 2024), model averaging (Wang and Zuo 2022), isolation forest (IF) (Chen and Wu 2019), bat-optimized OCSVM models (Chen et al. 2019), k-nearest neighbor methods (Chen et al. 2021a), local outlier factor (LOF) (Puchhammer et al. 2024; Shahrestani and Carranza 2024), distance anomaly factors (Chen et al. 2021b), density-based spatial clustering of applications with noise (DBSCAN) (Hajihosseinlou et al. 2024a), ordering points to identify the clustering structure (OPTICS) (Hajihosseinlou et al. 2024b), concentration–volume (C–V) fractal modeling (Afzal et al. 2011, 2016), the U value-number fractal (U-N) and the U value-area fractal (U-A) methods (Ghannadpour and Hezarkhani 2022), generative adversarial networks (GANs) (Zhang and Zuo 2021) and self-organizing maps (Bigdeli et al. 2022), to name but a few.

From another perspective, outlier detection techniques can be categorized into global and local approaches (e.g., Liu et al. 2025; Rayana et al. 2016). Global detection algorithms analyze the entire dataset, making them effective when outliers exhibit significant deviations from the overall data distribution (e.g., Puchhammer et al. 2024). In contrast, local detection algorithms focus on specific subsets of objects (e.g., Zhao et al. 2019a; Schubert et al. 2014). The choice between global and local outlier detection methods can greatly impact performance, as the data structure plays a pivotal role in determining the most suitable approach. For instance, global methods are highly effective when anomalies differ markedly from the majority of the data, whereas local methods excel at identifying anomalies within dense clusters of normal behavior.

The integration of deep learning techniques in geochemical anomaly detection has garnered significant attention in recent years, offering advanced methods for identifying mineralization-associated anomalies. Zhang et al. (2021) introduced a method combining deep convolutional neural networks (DCNNs) with pixel pair features, enhancing anomaly detection. Huang et al. (2022) applied Bayesian convolutional neural networks to quantify uncertainty in anomaly predictions, improving geological interpretations. Cao et al. (2023) optimized a random forest model using a competitive mechanism and beetle antennae search to advance anomaly detection. To address class imbalance in training datasets, Shayilan and Chen (2024) proposed a SMOTified extreme learning machine (ELM) model, merging the synthetic minority oversampling technique (SMOTE) with ELMs. Guo and Chen (2024) improved SMOTE by integrating it with generative adversarial networks (GANs), creating a SMOTified-GAN hybrid to enhance anomaly detection and the performance of ELMs in mineralization detection.

Unsupervised deep learning methods for anomaly detection are categorized into reconstructive and generative approaches (Wang and Chen 2025). Reconstruction-based methods, such as autoencoders, are effective in identifying anomalies through reconstruction errors (Xiong and Zuo 2016). Various methods, including continuous restricted Boltzmann machines (CRBMs) (Chen and Murray 2003), deep autoencoders (DAE) (Hinton and Salakhutdinov 2006), and variational autoencoders (VAEs) (Doersch 2016), have been shown to be effective for detecting multivariate geochemical anomalies. Chen et al. (2014) applied CRBMs to model probability distributions in complex geological environments, successfully identifying geochemical anomalies related to polymetallic mineralization. Similarly, Xiong and Zuo (2016) used DAEs to identify complex multivariate distributions linked to mineralization, while Chen and Shayilan (2022) distinguished gold mineralization-related anomalies using dictionary learning techniques. Furthermore, Luo et al. (2023) enhanced the interpretability of VAE models by integrating geological loss terms. Despite their success, deep autoencoder networks (DANs) have faced challenges when recognizing multi-element geochemical anomalies, as they disregard spatial structures (e.g., Zuo and Xu 2024). Recent advancements, such as the framework developed by Soltani et al. (2024), integrate deep learning with spatial machine learning processors to incorporate local neighborhood information and spatial non-stationarities in anomaly detection.

Spatial-based deep learning methods, such as convolutional neural networks (CNNs) (LeCun et al. 1998), convolutional autoencoders (CAEs) (Masci et al. 2011), and graph neural networks (Zuo and Xu 2023), have emerged to capture spatial relationships among geochemical variables. These methods overcome the fixed-size image limitations in CNNs and CAEs by allowing more flexible, graph-based modeling of geochemical data. Physics-constrained models further enhance deep learning interpretations by embedding prior geological knowledge and physical laws into frameworks, improving alignment with geological processes (e.g., Zuo et al. 2023). For example, Xiong et al. (2022) incorporated geological constraints into variational autoencoders (VAEs) to reveal nonlinear influences on mineralization. While reconstruction-based methods use reconstruction errors for anomaly detection, they may fail to capture anomaly complexities in geochemical data (Wang and Chen 2025). Recent advancements in unsupervised deep learning, such as deep autoencoders (DAE), GANs, and graph attention networks, are addressing these challenges. Li et al. (2024) developed a 1D-CNN model that achieved 99.18% accuracy in deposit type classification, while DAGMM (Wang and Chen 2025) merges dimensionality reduction with probabilistic modeling, enhancing geochemical anomaly detection. Wang et al. (2023) constructed an AlexNet-based CNN for automatic lithological identification from stream sediment data, further demonstrating the potential of deep learning techniques in geochemical exploration.

In geochemical exploration, several key factors should be considered when selecting an outlier detection method. First, it is important to choose methods that are robust to high-dimensional feature spaces or at least evaluate whether they suffer from the "curse of dimensionality." As dimensionality increases, the distinction between the nearest and farthest distances becomes less pronounced, weakening the effectiveness of distance-based metrics (e.g., Park 2023; Beyer et al. 1999). Second, the time complexity of the chosen method is crucial. For example, the angle-based outlier detection method (ABOD: Kriegel et al. 2008) is robust to dimensionality but becomes computationally expensive when applied to large datasets (Shahrestani and Sanislav 2025b). Third, consideration must be given to the number of parameters required by an outlier detection method and its sensitivity to those parameters. For instance, the accuracy of local outlier factor (LOF; Breunig et al. 2000) scores is highly sensitive to the choice of the number of neighbours (e.g., Shahrestani and Sanislav 2025b). Fourth, high feature density in most multivariate geochemical datasets can render certain criteria, such as within-class distances from the center, ineffective in some clustering-based outlier detection methods. Fifth, in many multivariate geochemical datasets, not all samples are analyzed for every element, leading to missing data. The outlier detection method should address this issue during the scoring process. Sixth, interpretability of the resulting outlier scores varies significantly among different outlier detection methods (Li et al. 2022). Finally, the effectiveness of an outlier detection method in delineating geochemical anomalies across different scales, from regional surveys to local investigations, should be assessed. In regional multivariate datasets, for example, samples are often collected from diverse geological settings, and the outlier detection algorithm should effectively capture these intrinsic sources of variation (e.g., Shahrestani and Carranza 2024).

These considerations highlight that the application of outlier detection methods in mineral exploration remains in its early stages, indicating a significant need for implementing various detection strategies based on different criteria for identifying geochemical anomalies. In response to this need, this paper investigates the effectiveness of an outlier detection method known as rotation-based outlier detection (ROD) (Almardeny et al. 2020). In ROD, the dataset is first divided into successive 3D subspaces, and each subspace is then rotated around the geometric median to determine outlier scores. This approach is easy to implement and is parameter-free, requiring no assumptions about the underlying distribution. To evaluate the efficiency of the ROD method in detecting geochemical anomalies, the resulting scores are compared with those from two well-established local outlier detection methods, LOF and KNN, which are applied to a multivariate regional geochemical dataset from a Cu-bearing mineralization province in southeastern Iran. The performance of ROD is compared with LOF and KNN, as these methods are widely used in anomaly detection tasks and represent two key approaches. LOF, a density-based method, identifies outliers by comparing the local density of data points, while KNN, a distance-based method, detects outliers based on the distance to the nearest neighbors. Both methods are commonly applied to multivariate datasets, making them suitable benchmarks for evaluating the effectiveness of new outlier detection techniques. Moreover, the well-understood behaviors and limitations of LOF and KNN, such as sensitivity of LOF to parameter choices and challenges of KNN in high-dimensional spaces, are considered. By comparing ROD—an innovative, parameter-free method that rotates data vectors about the geometric median—the aim is to demonstrate how unique approach of ROD addresses these challenges.

Materials and methods

Case study

The Iranian Plate is situated at the boundary between the Eurasian Plate to the north and the Gondwana Plate to the south (Aminzadeh and Rahimpour 2021). The geotectonic history of the Iranian Plate has been shaped by the opening and closing of the Paleo-Tethys and Neo-Tethys Oceans (Bagheri and Stampfli 2008; Masoodi et al. 2013). It is believed that the Iranian Plate separated from the Eurasian Plate during the Late Ordovician to Early Silurian, forming part of the passive margin of the Paleo-Tethys Ocean (Nataĭin and Şengör, 2005). The subduction of the oceanic crust of the Paleo-Tethys beneath the Eurasian Plate led to the rifting of Gondwana and the spreading of the Neo-Tethys Ocean in the Early Triassic (Stampfli and Borel 2002). The Neo-Tethys Ocean, located between the Iranian and Arabian Plates, played a pivotal role in shaping the tectonic framework of the region.

The subduction of the Neo-Tethys crust beneath the southern margin of the Iranian Plate resulted in the formation of three main tectonic belts: the Urumieh-Dokhtar Magmatic Belt (UDMB), the Sanandaj-Sirjan Zone, and the Zagros Folding and Thrusting Belt, which developed between the Early Jurassic and Eocene (Ghasemi and Talbot 2006; Bagheri and Stampfli 2008). The Urumieh-Dokhtar Magmatic Arc, a key feature of the region, extends over 1700 km along the NW–SE direction and is known for hosting numerous porphyry copper deposits associated with Oligo-Miocene intrusive bodies (e.g., Richards et al. 2012).

The Baft district, located within the Kerman Cenozoic Magmatic Arc (KCMA) of the Urumieh-Dokhtar Magmatic Belt (UDMB), covers an area of approximately 1250 km2 and presents a complex geological setting (Safari et al. 2015). The basement of the district consists of Paleozoic metamorphic rocks, overlain by a Cretaceous mélange, with Eocene volcanic and sedimentary formations (Fig. 1). Miocene intrusive rocks, including the Lalezar granitoid and Rabor-Lalehzar Magmatic Complex (RLMC), intrude the Eocene volcanic rocks (Shafiei 2010; Srdic et al. 1972). Oligocene–Miocene volcanic and sedimentary sequences are also present, with Quaternary deposits covering the region (Ghasemzadeh et al. 2022).

Fig. 1
figure 1

Simplified geological map of Baft area with the scale of 100,000 (after Srdic et al. 1972)

The northern parts of the Baft district are dominated by Eocene volcanic rocks, including andesites and dacites, which are intruded by Upper Miocene quartz monzodiorite and granodiorite, part of the Lalezar granitoids or RLMC. These granitoids exhibit high-K calc-alkaline characteristics and range from gabbrodiorites to granites (Moghadam et al. 2018). Mineral compositions of the most felsic rocks in the region are characterized by the presence of Na-plagioclase, quartz, alkali feldspar, biotite, and hornblende, while gabbro-diorite rocks contain plagioclase (Ca-rich), hornblende, biotite, and clinopyroxene (Srdic et al. 1972; Dimitrijevic 1973).

The region is economically significant due to its rich porphyry copper deposits, with notable mineralization at locations such as Bid Khan, Ghaleh Asghar, and Harij. These deposits are primarily associated with Miocene intrusive bodies (e.g., Dimitrijevic 1973). Additionally, vein-type mineralization is observed at sites such as Goghar, Ab Bahri, and Paynegin (Aghazadeh et al. 2015; Mohebi et al. 2015; Ghasemzadeh et al. 2019). The mineralization in these areas is linked to the hydrothermal alteration associated with the Miocene granitoid intrusions (Ghasemzadeh et al. 2022).

Geochemical data and preprocessing

A geochemical survey was conducted in the Baft district, during which 911 stream sediment samples were systematically collected, with an average sampling density of one sample per 1 km2. These samples were analyzed by geological survey of Iran using the inductively coupled plasma—optical emission spectrometry method (ICP-OES), while Au contents were determined using fire assay followed by atomic absorption spectroscopy (AAS) (Ghasemzadeh et al. 2022). Analytical reliability was ensured through the application of duplicate analyses, following the methodology of Thompson and Howarth (1976), which demonstrated a precision of better than 10% for most analyzed elements (Ghasemzadeh et al. 2019).

In addition to using mineral occurrences as validation points for regional geochemical anomalies, several rock samples were collected from upstream areas of stream sediment sites that exhibited relatively high concentrations of individual chemical elements, as indicated by the stream sediment geochemical survey (Table 1). For the multivariate geochemical analysis, 12 elements were selected, including the primary economic metal (Cu) and pathfinder elements (Ag, As, Au, Bi, Cu, Mo, Pb, Sb, Zn) due to their association with mineralization processes (e.g., Jamali 2017; Ghasemzadeh et al. 2019), along with three geologically influenced elements (Cr, Co, Ni) (Table 2). To account for the compositional constraints inherent in geochemical data, the dataset was subjected to centered log-ratio (clr) transformation, as proposed by Aitchison (1986).

Table 1 Result of sample analysis for Au, Cu, Pb, and Zn in mineralized samples in anomaly checking stage
Table 2 Summary statistics of the raw values for the 12 elements included in the data processing procedure from the geochemical dataset of the Baft area

Outlier detection methods

Rotation-based outlier detection approach (ROD)

Stream sediment geochemical surveys play a crucial role in identifying geochemical anomalies associated with anomalous sources such as mineral deposits and understanding geochemical dispersion patterns. Elements in stream sediments can vary widely in concentration due to diverse upstream sources and roles of weathering, erosion, transportation, and deposition (e.g., Shahrestani et al. 2024; Najafian et al. 2023). These processes produce complex multivariate datasets characterized by inherent heterogeneities and anomalies that conventional statistical methods struggle to capture effectively. In this context, the rotation-based outlier detection (ROD) method offers significant advantages by considering the relationships among multivariate features, enabling the detection of subtle anomalies embedded within high-dimensional datasets. The concept of the ROD method can initially be illustrated using a three-dimensional feature space and subsequently generalized for higher dimensions, which is common in most outlier detection tasks. Consider three collinear vectors, \(\overrightarrow{x}\), \(\overrightarrow{y}\), and \(\overrightarrow{z}\) in R3, which form a parallelepiped with \(\overrightarrow{x}\) and \(\overrightarrow{y}\) serving as the base (Fig. 2). The volume of this parallelepiped can be expressed as follows (Almardeny et al. 2020):

$$\text{volume}=\left|\left(\overrightarrow{x}\times \overrightarrow{y}\right).\overrightarrow{z}\right|=\left(\Vert \overrightarrow{x}\Vert \Vert \overrightarrow{y}\Vert sin\gamma \right).\Vert \overrightarrow{z}\Vert \left|cos\theta \right|$$
(1)

where, \(\gamma =\angle (\overrightarrow{x},\overrightarrow{y})\) represents the angle between vectors \(\overrightarrow{x}\) and \(\overrightarrow{y}\), while \(\theta\) denotes the angle between \(\overrightarrow{z}\) and the height (h) that is perpendicular from \(\overrightarrow{z}\) to the base (Fig. 2a). According to the figure, vectors \(\overrightarrow{y}\) and \(\overrightarrow{z}\) can be generated by counterclockwise rotating \(\overrightarrow{x}\) around a diagonal line extending from the origin to the upper-right corner using two different angles, \({\theta }_{1}^{\prime}\) and \({\theta }_{2}^{\prime}\). Consequently, the volume of the parallelepiped is proportional to the magnitude of \(\overrightarrow{x}\) and the angle between \(\overrightarrow{x}\) and the rotation axis (Almardeny et al. 2020).

Fig. 2
figure 2

a Parameters related to the parallelepiped created by three non-collinear vectors of \(\overrightarrow{x}\), \(\overrightarrow{y}\), and \(\overrightarrow{z}\) and rotation around the diagonal line (red line), \(\gamma\) is angle between \(\overrightarrow{x}\), \(\overrightarrow{y}\) and \(\theta\) is the angle between \(\overrightarrow{z}\) and the height (h) (b) \({F}^{*}\left(\gamma \right)\) function

Another essential concept in the ROD method is the geometric median, which generalizes the median using the appropriate L1 estimator. This parameter remains unaffected by outliers, as it is not skewed by extreme values on either side of the distribution (e.g., Cheng et al. 2024). For dimensions greater than two, the computation process becomes more complex; therefore, Vardi and Zhang (2000) proposed a modified version of the Weiszfeld algorithm to determine the geometric median (\({g}_{m}\)) of a feature space S = {X1, X2, …, Xm) in Rd:

$${g}_{m}\to \text{T} ({g}_{m})={\left(1-\frac{\mu \left({g}_{m}\right)}{r\left({g}_{m}\right)}\right)}^{+}\widetilde{T}({g}_{m})+\text{min}\left(1, \frac{\mu \left({g}_{m}\right)}{r\left({g}_{m}\right)}\right){(g}_{m})$$
(2)

where

$$\widetilde{T}\left({g}_{m}\right)={\left\{\sum \nolimits_{{X}_{i}\ne {g}_{m}}\frac{{\mu }_{i}}{\Vert {g}_{m}-{X}_{i}\Vert }\right\}}^{-1}\sum \nolimits_{{X}_{i}\ne {g}_{m}}\frac{{\mu }_{i}{X}_{i}}{\Vert {g}_{m}-{X}_{i}\Vert }$$
(3)
$$r\left({g}_{m}\right)=\Vert \widetilde{R}({g}_{m})\Vert , \widetilde{R}\left({g}_{m}\right)=\sum \nolimits_{{X}_{i}\ne {g}_{m}}\frac{{\mu }_{i}({X}_{i}-{g}_{m})}{\Vert {{X}_{i}-g}_{m}\Vert }$$
(4)
$$\mu\left(g_m\right)=\left\{\begin{array}{c}\mu\;\left(p\right),if\;g_m=X_p,p=1,\dots,m\\0,otherwise\end{array}\right.$$
(5)

In the equations above, \({g}_{m}\) in Rd represents the geometric median only if it is a fixed point and satisfies the condition \(\left({g}_{m}\right)\le\) \(\mu \left({g}_{m}\right)\). Here, \(\mu \left({g}_{m}\right)\) indicates a weight variable at \({g}_{m}\), which can be either equal to \(\mu \left(p\right)\) —the number of data vectors in S that have zero Euclidean distance to \({g}_{m}\) during the iteration process—or equal to zero if \(\mu \left(p\right)\)=0. In the latter case, T(\({g}_{m}\)) is considered as a weighted average of S, resulting in T(\({g}_{m}\)) = \(\widetilde{T}\left({g}_{m}\right)\), as outlined in the Weiszfeld algorithm. It is also noteworthy that \({g}_{m}\) is unique when the points are not collinear, and it remains invariant under Euclidean similarity transformations, such as translation, rotation, and reflection (Almardeny et al. 2020).

In a three-dimensional scenario, let W = {\({\overrightarrow{v}}_{1}{,\overrightarrow{v}}_{2}\), …, \({,\overrightarrow{v}}_{n}\}\) in R3 represent a series of vectors corresponding to data samples. If \(\overrightarrow{m}\) in R3 is the unit geometric median vector of W that defines the rotation axis, it can be demonstrated that for any vector \(\overrightarrow{v}\) in W that is independent of\(\overrightarrow{m}\), the signed volume of the parallelepiped formed by rotating \(\overrightarrow{v}\) twice around \(\overrightarrow{m}\)—considering the right-hand rule and with \({\theta }_{1 }<{\theta }_{2}\) in the range of 0 to 2π—can be approximately represented as a cost function using the Rodrigues rotation formula (Almardeny et al. 2020):

$$F\left(\overrightarrow v,\gamma\right)={\Arrowvert\overrightarrow v\Arrowvert}^3(\cos\;\gamma sin\gamma^2)$$
(6)

In the above equation, the differences between the vectors in the dataset relate to their magnitudes and the angle \(\gamma\) between \(\overrightarrow{v}\) and \(\overrightarrow{m}\), indicating the degree of deviation from the median vector. For simplicity, \((\cos\;\gamma sin\gamma^2)\) is denoted as \({F}^{*}\left(\gamma \right)\) (Fig. 2b). Since \({F}^{*}\left(\gamma \right)\) changes periodically between 0 and 2π, it is advisable to restrict the analysis to the range 0 to π, as the trigonometric function is evaluated only for the smallest angle between \(\overrightarrow{v}\) and \(\overrightarrow{m}\). Additionally, for \(\gamma\) in {0, \(\pi /2\), \(\pi \}\), \({F}^{*}\left(\gamma \right)\) is equal to zero, leading to three cases: \(\gamma ={F}^{*}\left(\gamma \right)=0\) only if \(\overrightarrow{v}\)=\(\overrightarrow{m}\); \(\gamma =\frac{\pi }{2}\), where \({F}^{*}\left(\gamma \right)=0\) only if \(\overrightarrow{v}\) \(\perp\) \(\overrightarrow{m}\); and \(\gamma =\pi\), where \({F}^{*}\left(\gamma \right)=0\) only if \(\overrightarrow{v}\) \(||\) \(\overrightarrow{m}\). The maximum and minimum values of \({F}^{*}\left(\gamma \right)\) are approximately 0.385 and − 0.385 at the turning points (Fig. 1b), referred to as the threshold angles \({\omega }_{1}\) and \({\omega }_{2}\), respectively.

Another characteristic of \({F}^{*}\left(\gamma \right)\) is that \({F}^{*}\left(\gamma \in [0,\frac{\pi }{2}]\right)\)=\(-{F}^{*}\left(\gamma \in [\frac{\pi }{2},\pi ]\right)\), which ensures the uniqueness of the values by emphasizing points not lying in the same hyperplane as \(\overrightarrow{m}\), thereby indicating the orientation of the parallelepiped. To preserve the proportionality of deviation characteristics among the data vectors, \({F}^{*}\left(\gamma \right)\) is adjusted to ensure it does not approximate \(\gamma\) in (\({\omega }_{1}, \frac{\pi }{2}]\cup\)(\({\omega }_{2}, \pi ]\). The angles are thus scaled down to (\({0,\omega }_{1}]\cup\)(\({\frac{\pi }{2},\omega }_{2}]\). Subtracting the geometric median from each vector in the dataset helps reveal the rotation cost of low-magnitude vectors that are collinear with higher-magnitude vectors while maintaining the relative positional relationships among the data points (Fig. 3) (Almardeny et al. 2020).

Fig. 3
figure 3

Algorithm of the ROC approach (Almardeny et al. 2020)

To determine the rotation outlier score, it is essential to isolate the ROD costs that are inconsistent with other costs (Fig. 3). The median absolute deviation (MAD) is chosen due to its robustness and efficiency (Leys et al. 2013):

\(\text{MAD}=median(\left|{X}_{i}-\widetilde{X}\right|\))

$${M}_{i}=\frac{0.6745({X}_{i}-\widetilde{X})}{MAD}$$
(7)

where \(\widetilde{X}\) denotes the median of the dataset, and \({M}_{i}\) represents the outlierness score of the ith data sample. The constant value of 0.6745, introduced by Iglewicz and Hoaglin (1993), is included in \({M}_{i}\) because, for a normal-like distribution, \({M}_{i}\) approaches 1, as E(MAD) = 0.6745 for large datasets (Almardeny et al. 2020).

In high-dimensional scenarios (d > 3), outliers hidden within subspaces of complex manifolds often remain undetected (Müller et al. 2012). Data may originate from various mechanisms across individual dimensions or subsets, suggesting that exploring different perspectives can unveil outliers that remain concealed in a comprehensive attribute space (Kriegel et al. 2012). In high-dimensional feature spaces, authentic outliers may also become obscured due to the curse of dimensionality, which leads to data points becoming too sparse and increases noise levels (Parsons et al. 2004). Additionally, in such contexts, the concept of neighborhood loses its significance (Beyer et al. 1999), making methods reliant on the relative contrast of distances between data points increasingly unreliable (Aggarwal et al. 2001). To address a dataset W in Rn×d, ROD can be applied in the 3D subspaces {Uij, …, UIJ} that emerge from decomposing the entire feature space V into subsets representing different combinations of the feature space {si ϵ I | si = {Uj | j ϵ [J]}}. Consequently, the overall outlier score of V for each sample is determined by integrating the ROD scores from every 3D subspace (Fig. 3) (Almardeny et al. 2020).

To adapt the high-dimensional case to the standard three-dimensional approach, two considerations are made. In datasets containing features with high magnitudes, these features can dominate the algorithm, as the ROD cost correlates with vector magnitude for each sample. As a result, the MAD may vary significantly, masking some three-dimensional scores. To mitigate this issue, the data is scaled according to the quantile range, which is robust to outliers (Almardeny et al. 2020):

$${X}_{inew}=\frac{{X}_{i}-{Q}_{1}(X)}{{Q}_{3}\left(X\right)-{Q}_{1}(X)}$$
(8)

where \({X}_{i}\) represents the ith sample, and \({Q}_{1}(X)\) and \({Q}_{3}\left(X\right)\) denote the first and third quantiles of the dataset. In the second consideration, constraining the MAD within a predefined numerical range helps to correct false weights attributed by the average function to each 3D score. To address this, the results are scaled to the range [0,1] using the following formula:

$${X}_{inew}=\frac{1}{1+{e}^{-{X}_{ij}}}$$
(9)

In which \({X}_{ij}\) is the MAD of the ith sample and jth 3D-subspace.

The LOF and KNN methods

The LOF method is a density-driven method which recently used to identify subtle and localized anomalies in geochemical studies (e.g., Puchhammer et al. 2024; Shahrestani and Carranza 2024). This technique evaluates the density of a data point in relation to its neighbors by calculating the local reachable density (ρk), which measures the proximity of nearby points (Breunig et al. 2000). An LOF score is calculated by comparing the density of a point to the average density of its k-nearest neighbors. Scores greater than 1 indicate potential outliers due to lower relative densities (Breunig et al. 2000). Unlike global approaches, LOF focuses on local density variations, making it effective for detecting both isolated and clustered anomalies within irregular geochemical distributions. Studies emphasize its effectiveness in complex geological environments where traditional methods often face challenges with uneven mineralization distribution (Shahrestani and Carranza 2024).

The KNN method is a widely utilized approach for outlier detection, particularly suitable for geochemical studies where identifying deviations from expected patterns is important (e.g., Chen et al. 2021c). This technique determines the distance between a given data point and its k-nearest neighbors to assess its conformity within the dataset. Points with significantly larger distances compared to others are flagged as potential outliers, reflecting their distinctiveness within the data distribution (Angiulli and Pizzuti 2002). KNN is non-parametric, requiring no assumptions about the underlying data distribution, which enhances its versatility in handling diverse datasets. By examining local neighborhood relationships, KNN effectively detects anomalies in both clustered and dispersed datasets (Chen and Shayilan 2022).

Results and discussion

In this research, we utilized the Python implementation of the ROD algorithm (https://github.com/yzhao062/pyod/blob/master/pyod/models/rod.py; Almardeny et al. 2020; Zhao et al. 2019b) to determine the outlier scores for all stream sediment samples. Additionally, we calculated the scores from the LOF and KNN methods using the ‘scikit-learn’ Python library. Two criteria were employed to compare the efficiency of different outlier detection methods. In regional geochemical exploration, the spatial locations of known mineral occurrences, along with mineralized samples, can be leveraged to evaluate the efficiency of various outlier detection methods in delineating geochemical anomalies. The first criterion involved constructing the receiver operating characteristic (ROC) curve (Fawcett 2006) using the locations of known mineral occurrences and mineralized samples as true positive values, alongside randomly selected locations within the entire study area as false positive values (Wang and Zuo 2022; Nykänen et al. 2015). A greater area under the curve (AUC) indicates a higher efficiency of the outlier detection methods (e.g., Hastie et al. 2009). The second criterion focused on the number of detected known mineral deposits and mineralized samples in the anomaly classes of the interpolated multivariate geochemical map (e.g., Shahrestani and Sanislav, 2025a). Quantile classification was employed, dividing the interpolated outlier score maps into four quartiles: Q1, Q2, Q3, and Q4. Q1 encompasses the lowest 25% of the data, Q2 the next 25%, Q3 the following 25%, and Q4 the highest 25%. Since different outlier detection methods yield scores with varying distributions, this classification ensures that each quartile contains an equal proportion of data (e.g., Liu et al. 2024; Wang and Zuo 2022). When creating the KNN and LOF models for detecting geochemical anomalies in high-dimensional data, several parameters must be defined. The most critical of these is the value of k, which should be tailored to the characteristics of the dataset. Other parameters can typically be set to their default values. In regions with established data or “ground truth,” the optimal k can be found through trial and error. However, for unexplored or greenfield areas lacking known mineral occurrences, selecting an appropriate k becomes more challenging (Chen et al. 2021c). Given the sensitivity of LOF and KNN to the number of neighboring samples (k), we calculated the scores of the LOF and KNN methods with varying k values ranging from 10 to 90 (approximately 10% of data points) to identify the k value at which LOF and KNN generate the most effective geochemical maps for capturing known mineralization (Fig. 4). From Fig. 4, two characteristics of the LOF and KNN methods are apparent. First, LOF exhibits greater sensitivity to k values compared to the KNN method, with the AUC significantly increasing as the number of neighbor samples grows. The second observation is the peak AUC value in both outlier detection methods when 70 samples are considered as the size of the neighborhood. Therefore, this optimal k value is selected for comparison of LOF and KNN with the ROD method.

Fig. 4
figure 4

ROC curves based on LOF and KNN scores considering various number of nearest neighbors implying more sensitivity of LOF relative to KNN

After the optimal neighboring size was selected, the performances of LOF (k = 70), KNN (k = 70), and ROD were compared in delineating geochemical anomalies originating from anomalous sources. Figure 5 illustrates the anomaly maps based on outlier scores derived from LOF, KNN, and ROD. As shown in Fig. 6, which presents the ROC curves, a slight superiority of ROD over both LOF and KNN is demonstrated. However, comparable contrasts between hard data (i.e., mineralization) and randomly selected anomalous and non-anomalous points are exhibited in the anomaly maps produced by KNN and ROD. Additionally, mineralized samples were also considered validation points in this analysis (see Fig. 5). The same trend was observed when mineralized samples were used as validation points, revealing higher AUC values for ROD and an increase in the efficiency of LOF compared to ROD and KNN. This suggests that while a slight edge in overall performance is offered by ROD, improved efficacy of LOF is also shown when the validation dataset includes mineralized samples. These findings indicate that valuable insights into the delineation of geochemical anomalies are provided by all three methods, with ROD emerging as a particularly effective approach.

Fig. 5
figure 5

Classified geochemical anomaly maps based on interpolated outlier scores emerging from (a) LOF, b KNN, c ROD (all trace elements), d ROD (group_2)

Fig. 6
figure 6

ROC curves based on LOF, KNN, and ROD scores demonstrating the efficiency of ROD over LOF and KNN considering (up) mineral occurrences, and (down) mineralized samples

One important characteristic of outlier detection methods is their robustness to the number of dimensions. To examine this feature, two different datasets with varying feature space sizes were selected. In the first group, all trace elements, including Ag, As, Au, Bi, Co, Cr, Cu, Mo, Ni, Pb, Sb, and Zn, were considered. In the second group, Pearson correlation values were first calculated between all clr-transformed trace elements, and one element was selected from those with high correlation values. Consequently, six elements, namely Ag, As, Au, Cu, Mo, and Sb, were chosen for group 2.

Three outlier detection methods were then applied to both groups of elements. In the context of geochemical exploration based on regional multivariate data, the number of known mineral occurrences in the anomaly class served as a benchmark for comparing the geochemical anomaly maps produced by different anomaly detection methods (see Fig. 5). The classified geochemical maps reveal three primary zones of geochemical anomalies. The first zone is located in the northern and northeastern parts of the study area, encompassing the majority of mineral deposits and aligning with the presence of volcanic rocks in these regions (Fig. 1). The second zone is situated in the western portion of the area and is likely associated with the influence of pyroxene-andesite lava flows and altered mafic and ultramafic rocks. The third zone is in the southeastern part of the study area, corresponding to altered mafic and ultramafic rocks. Table 3 was designed to compare the clr-transformed Cu data, LOF, KNN, and ROC applied to the first and second groups of elements. The findings in Table 3 reveal that the single-element Cu anomaly map outperforms LOF in delineating known Cu mineral occurrences. Additionally, LOF was found to be more sensitive to the size of the input feature space compared to KNN. Moreover, the ROD outlier detection method demonstrated significant performance in detecting known mineral deposits compared to LOF and KNN across both dataset groups. The results suggest that adding a new dimension (i.e., a geochemical element) to the ROD procedure can enhance the effectiveness of the outlier detection practice. This is particularly relevant in geochemical exploration, where mineral occurrences typically emit unique multivariate geochemical signals that can be differentiated from non-anomalous sources. Consequently, an outlier detection method capable of managing high-dimensional data without compromising performance is of great interest.

Table 3 Number of mineral occurrences detected by single-variate clr-transformed Cu data, LOF, KNN, and ROD anomaly maps

Another characteristic of outlier detection methods is the spatial relationships between the highest outlier scores and the locations of validation points, which in this case are known mineral occurrences and mineralized samples. Figure 7 illustrates the locations of the top 10% of scores obtained from LOF, KNN, and ROD. Five zones are highlighted in Fig. 6 to facilitate the comparison of the spatial distribution of top outlier scores from the three methods. In zone 1, the top scores from KNN and ROD demonstrate spatial proximity to known Cu mineralization, with this proximity being more pronounced in ROD compared to KNN. In zone 2, only one top score stream sediment sample is observed relative to mineralization for both LOF and KNN, while no top score sample is identified in ROD. In zone 3, similar performance is noted across all three outlier detection methods. In zone 4, distinct patterns emerge for KNN and LOF in comparison to ROD, where, similar to zone 1, the top scores from ROD are closer to mineralization compared to those from KNN and LOF. In Zone 5, the absence of validation points makes it impractical to assume the superiority of any particular method.

Fig. 7
figure 7

Spatial distribution of top 10% outlier scores emerging from LOF (a), KNN (b), and ROD (c) relative to known Cu mineralization

The LOF and KNN methods are widely utilized in geochemical anomaly detection, particularly in mineral exploration. LOF, a density-based algorithm, is highly effective at identifying local anomalies within heterogeneous geological datasets by leveraging spatial neighborhood relationships (Puchhammer et al. 2024). Unlike global anomaly detection techniques such as isolation forest (iForest), LOF excels in analyzing stream sediment geochemical data with varying concentration levels, as it can detect deviations from localized geochemical patterns that might otherwise be overlooked (Puchhammer et al. 2024). Another advantage of LOF is its independence from covariance estimation, as it strictly relies on Euclidean distances, allowing for the application of isometric transformations in compositional data analysis (Puchhammer et al. 2024). Similarly, KNN, which forms the basis of LOF, is a non-parametric method that identifies anomalies by comparing each sample to its closest neighbors. Its strength lies in its ability to capture complex nonlinear patterns in geochemical datasets without requiring distributional assumptions, making it useful in both supervised and unsupervised anomaly detection scenarios (Chen et al. 2021c). Additionally, KNN is highly intuitive and can be adapted to different geological settings, particularly where geochemical anomalies exhibit localized clustering (Puchhammer et al. 2024).

Despite their advantages, both LOF and KNN have significant limitations. A major drawback of LOF is its reduced effectiveness in high-density sampling areas, where geochemical values transition gradually toward mineralized zones. In such cases, samples located directly over mineralized deposits may not be flagged as anomalies, limiting the method applicability in certain geological environments (Puchhammer et al. 2024). Furthermore, LOF performance is highly sensitive to the choice of the nearest-neighbor parameter (k), with optimal results typically achieved when the number of input variables is limited to five to ten. Beyond this range, the reliability of distance-based calculations decreases due to the curse of dimensionality, which affects both LOF and KNN (Shahrestani and Sanislav 2025b). The selection of k is particularly challenging in greenfield exploration, where no known mineral occurrences exist, making parameter optimization difficult (Chen et al. 2021c). Even in well-studied regions, determining the appropriate k value often requires a trial-and-error approach, relying on available ground truth data (Chen et al. 2021c). Additionally, the KNN framework underlying LOF is computationally intensive for large datasets, as it requires calculating pairwise distances between all samples, making it less efficient for regional-scale geochemical surveys (Aggarwal and Sathe 2017).

Based on the results, ROD is a promising outlier detection method for geochemical anomaly detection tasks due to apparent advantages over some well-stablished outlier detection methods including LOF and KNN. In geochemical exploration, where datasets often include multiple elements and complex relationships due to various sources of variability such as lithology or mineralization, ROD has demonstrated its effectiveness in improving the accuracy of anomaly detection. This research reveals a key strength of ROD to handle high-dimensional data. Geochemical datasets typically contain multiple elements that which of which may partially explain the geochemical patterns associated with mineralization. ROD performs well when additional dimensions are introduced. Unlike LOF and KNN, which are sensitive to high-dimensional data, ROD uses subspaces to minimize the curse of dimensionality. This makes ROD a proper candidate for geochemical exploration, where mineral occurrences release unique geochemical signals that may only be identifiable when the relationships between elements are considered. By applying rotation-based detection, ROD can better capture these complex signals and avoid the performance issues LOF and KNN encounter in higher-dimensional spaces.

Another advantage of ROD is its rotation cost function, which is particularly well-suited to detecting anomalies in geochemical data. The method uses a rotation-based approach to analyze the geometric relationships between samples, detecting anomalies based on their orientation in subspaces rather than relying solely on distance or density. This makes ROD highly effective in identifying geochemical outliers, where anomalies may not always be distinguished by their proximity to neighboring samples but rather by their overall deviation from expected geochemical patterns. ROD is less dependent on subjective parameters such as neighborhood size (k), which often require careful tuning in methods like LOF and KNN. The independence of ROD from this parameter provides a practical advantage over LOF and KNN, both of which require careful selection of k to achieve optimal results, particularly in the case of LOF. Furthermore, ROD, uses MAD in its evaluation of rotation costs, which enhances its ability to identify true anomalies while reducing the likelihood of classifying noisy points as outliers. The ROD method has potential limitations in identifying geochemical anomalies. Its high computational complexity arises from breaking the entire attribute space into multiple 3D subspaces, making it inefficient as dimensionality increases, leading to long processing times with high-dimensional datasets. Additionally, ROD may struggle to detect anomalies that span across multiple dimensions since it focuses on localized subspaces rather than the overall dataset structure. While it does not require assumptions about statistical distributions, its reliance on the geometric median for rotation makes it sensitive to changes in data distribution, which can affect performance in skewed or multimodal datasets. Furthermore, selecting a limited number of 3D subspaces to enhance efficiency could reduce precision, resulting in missed anomalies if the chosen subspaces are not appropriate.

The ROD method is effective in high-dimensional geochemical datasets but does not explicitly consider spatial relationships or compositional associations between elements. While ROD excels at identifying statistical outliers by focusing on geometric relationships in feature space, it fails to account for the spatial context of the data or the geological processes influencing geochemical distributions. Consequently, ROD may struggle to distinguish between anomalies linked to geological processes and those arising from noise or data inconsistencies. However, identifying geochemical anomalies requires consideration of both spatial patterns and elemental associations, as these factors are crucial in differentiating mineralization-related anomalies from background variations (Zuo and Xu 2024). To address this need, a novel methodology involving a dual-branch deep learning model is proposed, which integrates a graph convolutional autoencoder (GCN-AE) for spatial feature extraction and a long short-term memory autoencoder (LSTM-AE) for spectral feature analysis. This approach captures both spatial relationships among geochemical samples and compositional variations within individual samples (Xu et al. 2024). In contrast, while ROD is effective in high-dimensional datasets, its lack of explicit incorporation of spatial context may limit its sensitivity to mineralization patterns that are spatially controlled.

Conclusions

This study highlights the distinct advantages of the ROD method in geochemical anomaly detection, showcasing its effectiveness compared to traditional techniques like LOF and KNN. ROD consistently demonstrates superior performance in high-dimensional datasets common in geochemical exploration, where complex interactions among multiple elements, influenced by factors such as lithology and mineralization, present significant challenges for conventional methods. By addressing the curse of dimensionality through subspace analysis, ROD maintains strong performance even as additional variables are introduced, excelling in the identification of subtle geochemical signals associated with mineralization that other methods may overlook. A key advantage of ROD is its enhanced sensitivity, which allows for better detection of mineral occurrences by focusing on the intricate relationships between elements. This capability is evident in its superior AUC values and improved detection of known mineral occurrences in anomaly maps and ROC curves, effectively delineating anomalous areas while minimizing false positives. Such performance highlights the potential of ROD for advancing geochemical exploration. Despite its promising performance, there are opportunities for further improvement in the ROD method. One potential enhancement could be optimizing the selection of subspaces informed by geochemical domain knowledge. This approach may refine the identification of outliers associated with specific mineralization types, leading to improved accuracy. Additionally, future work aims to utilize ROD on a limited number of 3D subspaces of interest. This strategy is designed to speed up the running time of method while maintaining its precision, thereby enhancing its practicality for geochemical analysis.