A novel hierarchical clustering analysis method based on Kullback–Leibler divergence and application on Dalaimiao geochemical exploration data
Introduction
Hierarchical clustering analysis (HCA) builds a hierarchy of clusters of variables (R-mode) or observations (Q-mode) according to the proximity between pairs of variables or observations. The method is commonly used in geochemical data processing, for example in environmental assessment and mineral exploration (Grunsky, 2010; Hernandez et al., 2004; Li et al., 2014; Mokhtari et al., 2014; Nezhad et al., 2015; O'Shea and Jankowski, 2006; Templ et al., 2008). However, most applications of Q-mode HCA to geochemical data classify individual specimens rather than datasets or groups of samples, because measures of proximity such as the Manhattan distance (Mumm et al., 2012; O'Shea and Jankowski, 2006), D-value (Kremer et al., 2012), and Euclidean distance (Fatehi and Asadi, 2017) are based on pairwise comparisons of specimens (data points). In regional geochemical data, many sites are sampled over a common geo-object (e.g., a lithological unit, alteration zone, structural belt, or other object that occupies geographic space). When the focus is on the relationships between those geo-objects, HCA based on the proximity between pairs of geochemical data points may not perform well: it can produce a large, complex dendrogram with many leaves that is difficult to interpret.
Based on the extent of the geo-objects, the entire dataset can be divided into several sub-datasets, each containing the sites sampled over the same geo-object; a sub-dataset characterizes its geo-object more precisely than a single data point does. Using pairwise differences between sub-datasets as the measure of proximity makes it possible to design a new HCA algorithm that reveals relationships among the geo-objects. In this paper, the Kullback–Leibler divergence (KL-divergence) is used as the measure of proximity to develop such an HCA method, which is then applied to geochemical data from the Dalaimiao district.
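The idea of clustering sub-datasets rather than data points can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each sub-dataset is (after a suitable log transform) approximately multivariate normal, so that the KL-divergence has a closed form, and it symmetrizes the divergence by summing both directions so it can serve as a dissimilarity for average-linkage HCA. The function names `gaussian_kl` and `kl_distance_matrix`, the synthetic groups, and the choice of average linkage are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def gaussian_kl(mu0, cov0, mu1, cov1):
    """KL(N0 || N1) for two multivariate normals (closed form; Gaussian assumption)."""
    k = mu0.shape[0]
    diff = mu1 - mu0
    cov1_inv = np.linalg.inv(cov1)
    _, logdet0 = np.linalg.slogdet(cov0)
    _, logdet1 = np.linalg.slogdet(cov1)
    return 0.5 * (np.trace(cov1_inv @ cov0) + diff @ cov1_inv @ diff
                  - k + logdet1 - logdet0)


def kl_distance_matrix(groups):
    """Symmetrized KL-divergence between every pair of sub-datasets.

    Each element of `groups` is an (n_samples, n_elements) array holding the
    samples collected over one geo-object (hypothetical layout).
    """
    stats = [(g.mean(axis=0), np.cov(g, rowvar=False)) for g in groups]
    n = len(groups)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Sum of both directions: an assumed symmetrization, not from the paper.
            d = gaussian_kl(*stats[i], *stats[j]) + gaussian_kl(*stats[j], *stats[i])
            dist[i, j] = dist[j, i] = d
    return dist


# Synthetic stand-ins for four geo-objects: two pairs with similar compositions.
rng = np.random.default_rng(0)
groups = [
    rng.normal(loc=0.0, scale=1.0, size=(200, 3)),
    rng.normal(loc=0.2, scale=1.0, size=(200, 3)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 3)),
    rng.normal(loc=5.2, scale=1.0, size=(200, 3)),
]

dist = kl_distance_matrix(groups)
# Average-linkage HCA on the condensed form of the dissimilarity matrix.
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

With this layout the dendrogram has one leaf per geo-object rather than one per sample, which is what keeps it small enough to interpret. Real geochemical data would need a distributional check, or a different KL estimator, before the Gaussian closed form is trusted.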
Section snippets
Measure of proximity
HCA builds models based on proximity. For Q-mode clustering, the proximities represent distances or dissimilarities between observations. When the observations are datasets (groups or populations), it is necessary to measure the distance or dissimilarity between groups. For example, we can use a measure of dissimilarity (or metric) such as the Euclidean or Aitchison distance (Aitchison et al., 2000) between the centroids of the groups as a proximity measure for HCA, but it will lose the
Geology
The study area of 2952 km² is in the Inner Mongolia Autonomous Region of China. The area overlies a Neopaleozoic accretion complex, the Uliastai active continental margin, the subduction zone of the Siberian plate and the North China platform. Most of the area is covered by a thin layer of wind-transported sand and soil. The underlying lithology is recognized from saprolite (rock debris) and a few outcrops scattered over the surface. Quaternary regolith sediments cover 41% of the area (Fig. 1). The
Results using dataset G18E8
The results of the new method using dataset G18E8 are shown in Fig. 4. Fig. 4a shows the results of HCA based on . The 18 geo-objects are divided into two clusters, which represent the stratigraphy and the intrusions, respectively. There are also some small clusters of geo-objects that have a similar composition or a close spatial connection. For example, IN1-INE-IN2 are the intrusions located in the north; SOC2-SOC3 are geo-objects belonging to the Bayanhushu Formation;
Conclusion
The application of the HCA method shows that measures of KL-divergence can describe the dissimilarity of pairwise geochemical datasets based on geo-objects, and the HCA method can give a comprehensive view of geo-object associations. Additionally, the decomposition components of , and , can further characterize dissimilarity in two aspects: the rock types that are reflected via , and the spatial relationships and component similarities of geo-objects that are revealed via
Acknowledgments
We thank three anonymous reviewers for their helpful comments. This research benefited from financial support from the National Key Research and Development Program of China (2016YFC0600501), the National Natural Science Foundation of China (No. 41430320, 41602337) and a Chinese Geological Survey project (Minerals and Geological Prospecting on Shallow Covered Areas of Jinning, Inner Mongolia, No. DD20160045).
References (42)
Divergence measures for statistical data processing—an annotated bibliography. Signal Process. (2013)
The separation of geochemical anomalies from background by fractal methods. J. Geochem. Explor. (1994)
Mapping singularities with stream sediment geochemical data for prediction of undiscovered mineral deposits in Gejiu, Yunnan Province, China. Ore Geol. Rev. (2007)
Singularity theory and methods for mapping geochemical anomalies caused by buried sources and for predicting undiscovered mineral deposits in covered areas. J. Geochem. Explor. (2012)
Singularity analysis of ore-mineral and toxic trace elements in stream sediments. Comput. Geosci. (2009)
Application of semi-supervised fuzzy c-means method in clustering multivariate geochemical data, a case study from the Dalli Cu-Au porphyry deposit in central Iran. Ore Geol. Rev. (2017)
Soil volatile mercury, boron and ammonium distribution at Canadas caldera, Tenerife, Canary Islands, Spain. Appl. Geochem. (2004)
Mineral microbial structures in a bone of the Late Cretaceous dinosaur Saurolophus angustirostris from the Gobi Desert, Mongolia - a Raman spectroscopy study. Palaeogeogr. Palaeoclimatol. Palaeoecol. (2012)
The relationships between magnetic susceptibility and elemental variations for mineralized rocks. J. Geochem. Explor. (2014)
Geochemical prospecting for Cu mineralization in an arid terrain-central Iran. J. Afr. Earth Sci. (2014)
Factor analysis applied to regional geochemical data: problems and possibilities. Appl. Geochem.
Cluster analysis applied to regional geochemical data: problems and possibilities. Appl. Geochem.
Statistical analyses of geochemical variables in soils of Ireland. Geoderma
Statistical characterization of a large geochemical database and effect of sample size. Appl. Geochem.
Logratio analysis and compositional distance. Math. Geol.
Population fluctuations, power laws and mixtures of lognormal distributions. Ecol. Lett.
Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. Sci.
Families of alpha-beta-and gamma-divergences: flexible and robust measures of similarities. Entropy
An elementary proof of the triangle inequality for the Wasserstein metric. Proc. Am. Math. Soc.
Kullback–Leibler divergence measure for multivariate skew-normal distributions. Entropy
Cluster Analysis