Elsevier

Computers & Geosciences

Volume 123, February 2019, Pages 10-19
A novel hierarchical clustering analysis method based on Kullback–Leibler divergence and application on Dalaimiao geochemical exploration data

https://doi.org/10.1016/j.cageo.2018.11.003

Highlights

  • A new hierarchical clustering analysis method based on KL divergence is proposed.

  • The method can reveal relationships among geo-objects based on geochemistry.

  • This capability is verified through an application with geochemical data.

Abstract

In this paper, we propose a new hierarchical clustering analysis (HCA) method that uses the Kullback–Leibler divergence (DKLS) between pairwise geochemical datasets of geo-objects (e.g., lithological units) as a measure of proximity. The method can reveal relationships among geo-objects based on geochemistry. This capability is verified through an application to geochemical exploration data from regolith that overlies the Dalaimiao region in China. DKLSM and DKLSC, the two parts of DKLS, describe differences in the means and differences in the covariances, respectively, and are also used as measures of proximity. DKLSM characterizes rock type, and DKLSC describes spatial relationships and component similarities between geo-objects. This contribution not only provides a tool that can reveal relationships between geo-objects based on geochemical data, but also shows that DKLS and its two parts characterize geochemical differences from different perspectives. These measures hold promise for enhancing methods of geochemical pattern recognition.
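
The split of DKLS into a mean part and a covariance part can be illustrated with a worked equation. The excerpt here does not give the authors' formulas, so the following is only a sketch under an assumption that is not stated in the source: each geo-object's data are modelled as a k-variate Gaussian, P = N(μ1, Σ1) and Q = N(μ2, Σ2). The symbols μ, Σ and k are introduced for illustration, and the authors' normalization (for example, an extra factor of 1/2) may differ.

```latex
\begin{aligned}
D_{KL}(P\,\|\,Q) &= \tfrac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)
  + (\mu_2-\mu_1)^{\top}\Sigma_2^{-1}(\mu_2-\mu_1) - k
  + \ln\frac{\det\Sigma_2}{\det\Sigma_1}\right] \\[4pt]
D_{KL}(P\,\|\,Q) + D_{KL}(Q\,\|\,P) &=
  \underbrace{\tfrac{1}{2}(\mu_1-\mu_2)^{\top}\!\left(\Sigma_1^{-1}+\Sigma_2^{-1}\right)(\mu_1-\mu_2)}_{\text{mean part}}
  + \underbrace{\tfrac{1}{2}\left[\operatorname{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)
  + \operatorname{tr}\!\left(\Sigma_1^{-1}\Sigma_2\right) - 2k\right]}_{\text{covariance part}}
\end{aligned}
```

In the symmetrized sum the log-determinant terms cancel, which leaves one term that vanishes when the means coincide and one that vanishes when the covariances coincide. Under the Gaussian assumption, these two terms play the roles that the abstract assigns to DKLSM and DKLSC.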

Introduction

Hierarchical clustering analysis (HCA) is a method that builds a hierarchy of clusters of variables (R-mode) or observations (Q-mode) according to the proximity between pairs of variables or observations. This method is commonly used in geochemical data processing for purposes such as environmental assessment and mineral exploration (Grunsky, 2010; Hernandez et al., 2004; Li et al., 2014; Mokhtari et al., 2014; Nezhad et al., 2015; O'Shea and Jankowski, 2006; Templ et al., 2008). However, most applications of Q-mode HCA to geochemical data focus on classifying individual specimens rather than datasets or groups of samples. That is because measures of proximity such as the Manhattan distance (Mumm et al., 2012; O'Shea and Jankowski, 2006), D-value (Kremer et al., 2012), and Euclidean distance (Fatehi and Asadi, 2017) are based on pairwise comparisons of specimens (data points). In regional geochemical data, many sites are sampled over a common geo-object (e.g., a lithological unit, alteration zone, structural belt or other object that occupies geographic space). When the focus is on the relationships between those geo-objects, HCA based on the proximity between pairs of geochemical data points may not perform well, because it can produce a large dendrogram with many leaves that is difficult to interpret.

Based on the extent of the geo-objects, the entire dataset can be divided into several sub-datasets, each containing the sites collected over the same geo-object. A sub-dataset characterizes its geo-object more precisely than a single data point does. Using pairwise differences between sub-datasets as the measure of proximity makes it possible to design a new HCA algorithm that reveals relationships among the geo-objects. In this paper, the Kullback–Leibler divergence (KL-divergence) is used as the measure of proximity to develop such an HCA method, which is then applied to geochemical data from the Dalaimiao district.
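
The idea of clustering sub-datasets rather than individual data points can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation, and it rests on several assumptions that are not in the source: each sub-dataset is treated as multivariate Gaussian, the divergence is the symmetrized Gaussian KL divergence, average linkage is used, and the data are synthetic. The function names `symmetric_kl_parts` and `divergence_matrix` are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def symmetric_kl_parts(x, y):
    """Return (mean_part, cov_part) of KL(P||Q) + KL(Q||P) for Gaussian fits.

    x, y: arrays of shape (n_samples, n_variables), one per geo-object,
    with n_variables >= 2. The Gaussian model is an assumption of this sketch.
    """
    k = x.shape[1]
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    inv_x, inv_y = np.linalg.inv(cov_x), np.linalg.inv(cov_y)
    d = mu_x - mu_y
    # Term driven by the difference between the group means.
    mean_part = 0.5 * d @ (inv_x + inv_y) @ d
    # Term driven by the difference between the group covariances.
    cov_part = 0.5 * (np.trace(inv_y @ cov_x) + np.trace(inv_x @ cov_y) - 2 * k)
    # Both terms are non-negative in exact arithmetic; clip rounding noise.
    return max(float(mean_part), 0.0), max(float(cov_part), 0.0)


def divergence_matrix(groups, part="total"):
    """Pairwise divergence matrix; part is 'mean', 'cov' or 'total'."""
    n = len(groups)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            m, c = symmetric_kl_parts(groups[i], groups[j])
            value = {"mean": m, "cov": c, "total": m + c}[part]
            dist[i, j] = dist[j, i] = value
    return dist


# Synthetic stand-ins for four geo-objects measured on three variables.
rng = np.random.default_rng(0)
groups = [
    rng.normal(loc=0.0, scale=1.0, size=(200, 3)),
    rng.normal(loc=0.2, scale=1.0, size=(200, 3)),
    rng.normal(loc=5.0, scale=1.0, size=(200, 3)),
    rng.normal(loc=5.2, scale=1.0, size=(200, 3)),
]

dist = divergence_matrix(groups, part="total")
# linkage expects a condensed distance vector; average linkage is an assumption.
tree = linkage(squareform(dist), method="average")
labels = fcluster(tree, t=2, criterion="maxclust")
print(labels)
```

Because each geo-object contributes one leaf, the resulting dendrogram has as many leaves as there are geo-objects rather than as many as there are samples. Passing `part="mean"` or `part="cov"` clusters on one component of the divergence only. Real geochemical data are compositional, so a log-ratio transform before fitting would likely be needed; that step is omitted here.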

Section snippets

Measure of proximity

HCA builds models based on proximity. For Q-type clustering, the proximities represent distances or dissimilarities between observations. When the observations are datasets (groups or populations), it is necessary to measure the distance or dissimilarity between groups. For example, we can use a measure of dissimilarity (or metric) such as the Euclidean or Aitchison distance (Aitchison et al., 2000) between the centroids of the groups as a proximity measure for HCA, but it will lose the

Geology

The study area of 2952 km² is in the Inner Mongolia Autonomous Region of China. The area overlies a Neopaleozoic accretion complex, the Uliastai active continent margin, the subduction zone of the Siberian plate and the North China platform. Most of the area is covered by a thin layer of wind-transported sand and soil. The underlying lithology is recognized by saprolite (rock debris) and a few outcrops scattered over the surface. Quaternary regolith sediments cover 41% of the area (Fig. 1). The

Results using dataset G18E8

The results of the new method using dataset G18E8 are shown in Fig. 4. Fig. 4a shows the results of HCA based on DKLSM. It can be observed that the 18 geo-objects are divided into two clusters, which represent the stratigraphy and the intrusions, respectively. There are also some small clusters of geo-objects that have a similar composition or a close spatial connection. For example, IN1-INE-IN2 are the intrusions located in the north; SOC2-SOC3 are geo-objects belonging to the Bayanhushu Formation;

Conclusion

The application of the HCA method shows that measures of KL-divergence can describe the dissimilarity of pairwise geochemical datasets based on geo-objects, and the HCA method can give a comprehensive view of geo-object associations. Additionally, the decomposition components of DKLS, DKLSC and DKLSM, can further characterize dissimilarity in two aspects: the rock types that are reflected via DKLSM, and the spatial relationships and component similarities of geo-objects that are revealed via DKL

Acknowledgments

We thank three anonymous reviewers for their helpful comments. This research benefited from financial support from the National Key Research and Development Program of China (2016YFC0600501), the National Natural Science Foundation of China (No. 41430320, 41602337) and a Chinese Geological Survey project (Minerals and Geological Prospecting on Shallow Covered Areas of Jinning, Inner Mongolia, No. DD20160045).

References (42)

  • C. Reimann et al. Factor analysis applied to regional geochemical data: problems and possibilities. Appl. Geochem. (2002)
  • M. Templ et al. Cluster analysis applied to regional geochemical data: problems and possibilities. Appl. Geochem. (2008)
  • C.S. Zhang et al. Statistical analyses of geochemical variables in soils of Ireland. Geoderma (2008)
  • C.S. Zhang et al. Statistical characterization of a large geochemical database and effect of sample size. Appl. Geochem. (2005)
  • J. Aitchison et al. Logratio analysis and compositional distance. Math. Geol. (2000)
  • A.P. Allen et al. Population fluctuations, power laws and mixtures of lognormal distributions. Ecol. Lett. (2001)
  • S.-i. Amari et al. Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. Sci. (2010)
  • A. Cichocki et al. Families of alpha- beta- and gamma-divergences: flexible and robust measures of similarities. Entropy (2010)
  • P. Clement et al. An elementary proof of the triangle inequality for the Wasserstein metric. Proc. Am. Math. Soc. (2008)
  • J.E. Contreras-Reyes et al. Kullback–Leibler divergence measure for multivariate skew-normal distributions. Entropy (2012)
  • B. Everitt et al. Cluster Analysis (2011)