Neurocomputing, Volume 129, 10 April 2014, Pages 185–198

Globality and locality incorporation in distance metric learning

https://doi.org/10.1016/j.neucom.2013.09.041

Abstract

Supervised distance metric learning plays a substantial role in the success of statistical classification and information retrieval. Although many related algorithms have been proposed, how to incorporate both the geometric information (i.e., locality) and the label information (i.e., globality) in metric learning remains an open problem. In this paper, we propose a novel metric learning framework, called “Dependence Maximization based Metric Learning” (DMML), which efficiently integrates these two sources of information into a unified structure as instances of convex programming, without requiring balancing weights. In DMML, the metric is trained by maximizing the dependence between data distributions in reproducing kernel Hilbert spaces (RKHSs). Unlike the existing information theoretic algorithms, however, DMML requires no estimation or assumption of the data distributions. Under the proposed framework, we present two methods that employ different independence criteria, namely the Hilbert–Schmidt Independence Criterion and the generalized Distance Covariance. Comprehensive experimental results on classification, visualization and image retrieval demonstrate that DMML favorably outperforms state-of-the-art metric learning algorithms, and illustrate the respective advantages of the two proposed methods in the related applications.

Introduction

Distance functions are critical for many models and algorithms in machine learning and pattern recognition, such as k-nearest neighbor (kNN) classification and k-means clustering. Distance metrics provide a measure of dissimilarity between points and significantly influence the performance of these algorithms. Due to limited prior knowledge, most algorithms use simple Euclidean distances. However, such distances cannot ensure satisfactory results in many applications where the intrinsic space of the data is not Euclidean. Previous research [1], [4], [7] has shown that empirically learnt distance metrics lead to substantial improvements over Euclidean distances when prior information is not available. In addition, metric learning has been successfully applied to a wide range of real-world problems, including visual object categorization [34], image retrieval [36] and cartoon synthesis [37].

Recently, many excellent algorithms have been developed for metric learning [2], [3], [5], [6], [21], [33], [35]. Among the related studies, most effort has been spent on learning a Mahalanobis distance from labeled training data. Mahalanobis distances generalize standard Euclidean distances by scaling and rotating the feature space. After examining the popular Mahalanobis distance learning algorithms, we summarize them in the hierarchical diagram shown in Fig. 1, which classifies them into different categories. To highlight the significant differences between these approaches, we first classify them in terms of the information they consider: one category attempts to learn distance metrics using class labels only, while the other considers both the label information (i.e., globality) and the geometric information (i.e., locality).
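To make the role of the Mahalanobis matrix concrete, the following minimal sketch (in Python/NumPy; not taken from any of the cited algorithms) computes d_A(x, y) = sqrt((x − y)^T A (x − y)) and shows that a PSD matrix A = L^T L is equivalent to measuring the Euclidean distance after the linear map L. The matrices and points are arbitrary illustrative values.

```python
import numpy as np

def mahalanobis(x, y, A):
    """Mahalanobis distance d_A(x, y) = sqrt((x - y)^T A (x - y)).

    A must be symmetric positive semidefinite; A = I recovers the
    standard Euclidean distance.
    """
    d = x - y
    return float(np.sqrt(d @ A @ d))

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.0])

# A = I: plain Euclidean distance.
print(mahalanobis(x, y, np.eye(2)))           # 2.828...

# A = L^T L: equivalent to Euclidean distance after the linear map L,
# i.e., a scaled and rotated feature space.
L_map = np.array([[2.0, 0.0], [0.5, 1.0]])
A = L_map.T @ L_map
print(mahalanobis(x, y, A))
print(np.linalg.norm(L_map @ x - L_map @ y))  # identical value
```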

Research in the first category (i.e., globality metric learning) is driven by the need to keep all data points in the same class close together for compactness while keeping those from different classes far apart for separability. To this end, a number of algorithms [4], [5], [6], [12], [22], [24] have been proposed, which can be further divided into the following subcategories, as shown in Fig. 1.

(1) Algorithms based on similarity/dissimilarity: A natural intention in metric learning is to keep points with the same label similar (i.e., the distance between them should be relatively small) and others dissimilar (i.e., the distance should be larger). For instance, the method proposed by Xing et al. [4] formulates a convex metric learning program that reduces the average distance between similar instances under the constraint of separating dissimilar instances. Metric Learning by Collapsing Classes (MCML) [5] aims to find a distance metric that collapses all the points in the same class while maintaining the separation between different classes.
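As a rough illustration of the similarity/dissimilarity trade-off described above (a sketch only, not the exact formulation of [4] or [5]), the helper below evaluates the two competing quantities for a candidate metric A; the function name and the toy pairs are invented for this example.

```python
import numpy as np

def pairwise_objective(X, similar_pairs, dissimilar_pairs, A):
    """Evaluate the two quantities traded off in similarity/dissimilarity
    metric learning in the spirit of Xing et al. [4]:

      - compactness: sum of squared Mahalanobis distances over similar pairs
        (to be minimized),
      - separability: sum of Mahalanobis distances over dissimilar pairs
        (constrained to stay large, e.g. >= 1).
    """
    def d2(i, j):
        diff = X[i] - X[j]
        return diff @ A @ diff

    compactness = sum(d2(i, j) for i, j in similar_pairs)
    separability = sum(np.sqrt(d2(i, j)) for i, j in dissimilar_pairs)
    return compactness, separability

# Toy usage: three 2-D points, one similar pair and one dissimilar pair.
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 2.0]])
print(pairwise_objective(X, [(0, 1)], [(0, 2)], np.eye(2)))
```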

(2) Algorithms based on information theory: To introduce information theory into metric learning, these algorithms define two Gaussian distributions. The first Gaussian is based on the Mahalanobis distance to be learnt and the second is determined heuristically. In this setting, the distance metric can be learnt by minimizing the relative entropy between the two distributions. In particular, Information-Theoretic Metric Learning (ITML) [6] defines the second Gaussian from prior knowledge and searches for the optimal metric by minimizing the Kullback–Leibler (K–L) divergence between them, subject to a set of similarity and dissimilarity constraints. The information geometry algorithm [23] uses an ideal kernel derived from the class labels [24] to construct the second Gaussian and also minimizes the K–L divergence between the two distributions.
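In ITML-style methods, the K–L divergence between the two Gaussians is known to reduce, up to a constant factor, to the LogDet matrix divergence between the corresponding Mahalanobis matrices, which is what makes the optimization tractable without density estimation. Below is a minimal sketch of that divergence; the helper name and example matrices are illustrative, and this is not ITML's full constrained program.

```python
import numpy as np

def logdet_divergence(A, A0):
    """LogDet (Burg) divergence:

        D_ld(A, A0) = tr(A A0^{-1}) - log det(A A0^{-1}) - d.

    For the Gaussians associated with the Mahalanobis matrices A and A0,
    the K-L divergence between the two distributions equals (up to a factor
    of 1/2) this matrix divergence.
    """
    d = A.shape[0]
    M = A @ np.linalg.inv(A0)
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - d

A0 = np.eye(2)                      # prior metric (e.g., Euclidean)
A = np.array([[2.0, 0.3], [0.3, 1.0]])
print(logdet_divergence(A, A0))     # > 0 since A != A0
print(logdet_divergence(A0, A0))    # exactly 0
```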

The previously discussed globality metric learning algorithms have been successfully applied to various fields. In these algorithms, data points are generally assumed to have unimodal distributions. Based on this assumption, they attempt to minimize distances between all pairs in the same class. For this reason, however, the globality algorithms are not appropriate for multimodal data distributions, since their goals (i.e., compactness and separability) conflict and can hardly be achieved simultaneously [17]. To alleviate this problem, it is desirable to incorporate the geometric information (i.e., locality) with the label information (i.e., globality) in metric learning. This issue is of particular importance and has seen significant progress recently [15], [16], [17], [18], [19], [20]. As an empirical validation, we compare two globality methods (i.e., Xing et al. [4] and Globerson and Roweis [5]) with a globality + locality method (i.e., Weinberger et al. [16]) on a synthetic 2-D data set. The data set has two classes and each class has distinct modes. From the results shown in Fig. 2, it is clear that the globality methods [4], [5] actually increase the 3-NN classification error, since they focus on minimizing all the pairwise distances between similarly labeled data points. Even worse, [5] collapses the entire data set onto a straight line. By contrast, [16] reduces the error rate, owing to its adaptation to local structures. Although this evaluation is conducted on a synthetic data set, it illustrates in general the problems posed by multimodal data distributions and the advantage of considering the global and the local information simultaneously.
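To show the kind of evaluation behind Fig. 2 (a hedged re-creation, not the paper's actual data or protocol), the sketch below builds a two-class 2-D data set in which each class has two modes and measures the leave-one-out 3-NN error under a given Mahalanobis matrix; the cluster centers, noise level and candidate metrics are invented for illustration.

```python
import numpy as np

def three_nn_loo_error(X, y, A):
    """Leave-one-out 3-NN error under the Mahalanobis distance induced by A."""
    n = len(X)
    errors = 0
    for i in range(n):
        diffs = X - X[i]
        d2 = np.einsum('nd,de,ne->n', diffs, A, diffs)  # squared distances to all points
        d2[i] = np.inf                                   # exclude the query point itself
        nn = np.argsort(d2)[:3]
        pred = np.bincount(y[nn]).argmax()               # majority vote of 3 neighbors
        errors += int(pred != y[i])
    return errors / n

rng = np.random.default_rng(0)
# Two classes, each with two well-separated modes (multimodal distribution).
centers = {0: [(-4, 0), (4, 0)], 1: [(0, -4), (0, 4)]}
X, y = [], []
for label, modes in centers.items():
    for cx, cy in modes:
        X.append(rng.normal([cx, cy], 0.6, size=(30, 2)))
        y += [label] * 30
X, y = np.vstack(X), np.array(y)

print(three_nn_loo_error(X, y, np.eye(2)))            # Euclidean baseline
print(three_nn_loo_error(X, y, np.diag([1.0, 0.1])))  # error under a different (toy) metric
```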

In the globality+locality category, the fundamental challenge is how to combine locality and globality. As summarized in Fig. 1, typical algorithms [16], [19], [20] address this challenge by directly introducing balancing weight(s). These weights enter through two strategies: (1) combining a local manifold term with the objective function of globality metric learning; (2) formulating the globality learning objective subject to a set of triplet-based constraints which state that a point should be similar to a second point but dissimilar to a third point under the learnt metric. In particular, Large Margin Nearest Neighbor (LMNN) [16] learns a distance metric from the local neighborhood and is solved by semidefinite programming (SDP) [38] incorporating a great number of triplet-based constraints. Hoi et al. [19] provide an SDP that considers the topological structure of the data along with similarity and dissimilarity constraints. Zhong et al. [20] add a parametric manifold regularizer to the metric learning model based on a large set of triplet-based constraints. Although the algorithms discussed above have been extensively investigated in the literature, they share a limitation that many different balancing weights need to be tuned or optimized, such as slack variables, the number of constraints and the weight in the objective function. The values of these balancing weights strongly influence metric learning performance. Even worse, the computational complexity may increase rapidly for large-scale and high-dimensional data, since many optimizations, e.g., the general-purpose SDP solver in LMNN, involve iterative procedures and have to satisfy massive numbers of constraints in each iteration. Therefore, it is promising to design an efficient metric learning algorithm that avoids the difficulties originating from balancing weight(s). The studies most closely related to this paper are [15], [17], in which a globality+locality distance metric is learnt without optimizing balancing weight(s) or iteratively satisfying constraints. Despite similar goals, our approach differs significantly in its essential conception and optimization. Goldberger et al. [15] propose Neighbourhood Components Analysis (NCA) to minimize the expected kNN error for distance metric learning. Local Distance Metric (LDM) [17] learns a distance metric by optimizing local compactness and separability in a probabilistic framework. Unfortunately, [15], [17] lead to nonconvex problems which are prone to being trapped in local solutions and suffer from computationally expensive optimizations. We would also like to mention that the dimensionality reduction algorithm Local Fisher Discriminant Analysis (LFDA) [18] learns a transformation with locality preservation. Our work differs from LFDA in that LFDA cannot uniquely determine the distance metric in the embedding space, which depends heavily on the normalization scheme.
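For reference, the triplet-based constraints mentioned above take the hinge form sketched below in LMNN-style methods; this fragment only evaluates the penalty for a fixed metric A (the solver, slack variables and target-neighbor selection of [16], [20] are omitted), and the toy points and triplet are invented.

```python
import numpy as np

def triplet_hinge_loss(X, triplets, A, margin=1.0):
    """Sum of hinge penalties for triplet constraints of the kind used by
    LMNN-style methods:

        d_A(x_i, x_k)^2 >= d_A(x_i, x_j)^2 + margin,

    where x_j shares x_i's label (should stay close) and x_k does not
    (should be pushed beyond the margin).
    """
    def d2(a, b):
        diff = X[a] - X[b]
        return diff @ A @ diff

    loss = 0.0
    for i, j, k in triplets:
        loss += max(0.0, margin + d2(i, j) - d2(i, k))
    return loss

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.5, 0.4]])
triplets = [(0, 1, 2)]   # point 1: same-class neighbor of 0; point 2: impostor
print(triplet_hinge_loss(X, triplets, np.eye(2)))
```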

In this paper, we propose a general Mahalanobis distance learning framework referred to as “Dependence Maximization based Metric Learning” (DMML) in a statistical setting. It belongs to the Non-balancing Weight(s) category of Globality + Locality distance learning shown in Fig. 1. The key idea of DMML is to construct two distributions: one is based on the distance metric A to be learnt and the other is based on an ideal nonlinear map Φ_I defined from both class labels and local structures. The difference between the distance metric A and the ideal map Φ_I is measured by the statistical dependence of these two distributions. By maximizing this dependence, the optimal A can be found as the target Mahalanobis distance in DMML. To the best of our knowledge, DMML is the first framework that introduces statistical dependence into Mahalanobis distance learning in reproducing kernel Hilbert spaces (RKHSs). The main contributions of this paper can be summarized as follows:

  • DMML effectively incorporates two sources of information (i.e., globality and locality) into the target Mahalanobis distance A without optimizing balancing weight(s), because the constructed ideal map Φ_I preserves the local structure of the data and maximally aligns with the label information.

  • In contrast to classical dependence measuring criteria (e.g., Mutual Information and Pearson's χ2 test), DMML uses criteria computed in RKHSs to avoid estimating or assuming the data distributions. Many existing kernel-based criteria [8], [41], [43] can be incorporated into DMML to tackle the independence measurement problem.

  • Under the DMML framework, we propose two methods by employing the Hilbert–Schmidt Independence Criterion (HSIC) [8] and the generalized Distance Covariance [27], respectively. Both are formulated as convex programs and can be efficiently optimized by a first-order gradient procedure; a minimal HSIC sketch follows this list.
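As sketched below, the (biased) empirical HSIC estimator of [8] needs only two Gram matrices computed on the same samples. How DMML actually constructs the metric-induced kernel and the ideal label/locality kernel is described later in the paper, so the kernels used in the usage lines here are illustrative placeholders.

```python
import numpy as np

def hsic(K, L):
    """Biased empirical HSIC estimate from Gretton et al. [8]:

        HSIC(K, L) = trace(K H L H) / (n - 1)^2,   H = I - (1/n) 1 1^T.

    K and L are kernel (Gram) matrices computed on the same n samples in two
    different feature spaces; a larger value indicates stronger dependence.
    """
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

# Illustrative usage (not the paper's construction): K from a Gaussian kernel
# on the data, L from an "ideal" label kernel that is 1 for same-class pairs.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = rng.integers(0, 2, size=40)

d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
K = np.exp(-d2 / d2.mean())                            # data kernel
L = (y[:, None] == y[None, :]).astype(float)           # ideal label kernel
print(hsic(K, L))
```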

The rest of the paper is organized as follows. In Section 2, we discuss the dependence of data distributions. Section 3 presents the metric learning framework DMML; specifically, two methods are provided with different dependence measuring criteria. In Section 4, the application of DMML to dimensionality reduction is investigated. Section 5 reports the experimental results for classification, visualization and image retrieval. Finally, our concluding remarks are given in Section 6.

Section snippets

Dependence of data distributions

The proposed metric learning framework DMML builds on maximizing the dependence between two data distributions. Many parametric criteria (e.g., Mutual Information and Distance Covariance) have been proposed to measure the statistical dependence between data distributions. However, one has to assume or estimate specific relations between variables, since classical measurement methods require the detailed form of the distributions. Recently, to alleviate this limitation, there is a growing interest in criteria computed in reproducing kernel Hilbert spaces, such as the Hilbert–Schmidt Independence Criterion [8], which avoid such assumptions.
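For completeness, here is a minimal sketch of the standard sample distance covariance (Székely and co-workers) that the generalized Distance Covariance used by DMML-D builds on; the data in the usage lines are random and purely illustrative.

```python
import numpy as np

def distance_covariance(X, Y):
    """Sample (squared) distance covariance:

        dCov_n^2(X, Y) = (1/n^2) * sum_ij A_ij B_ij,

    where A and B are the doubly-centered pairwise Euclidean distance
    matrices of X and Y. In the population it is zero iff X and Y are
    independent, which makes it usable as a dependence criterion.
    """
    def centered_dists(Z):
        D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
        return D - D.mean(0, keepdims=True) - D.mean(1, keepdims=True) + D.mean()

    A, B = centered_dists(X), centered_dists(Y)
    return (A * B).mean()

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
print(distance_covariance(X, X @ np.ones((3, 1))))         # dependent: clearly positive
print(distance_covariance(X, rng.normal(size=(100, 2))))   # independent: near zero
```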

Dependence maximization based metric learning

In this section, we present the general framework DMML for distance metric learning. The two essential ingredients of our framework are (1) the incorporation of both globality and locality without balancing weights, and (2) no assumption or estimation of the data distributions. Under the proposed framework, two methods are described, employing HSIC and the generalized Distance Covariance, respectively. More importantly, we formulate them as instances of convex programming.
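Since the concrete programs are only summarized in this snippet, the following is a generic first-order sketch of dependence maximization over the PSD cone: it maximizes an HSIC-style objective trace((X A X^T) H L H) by gradient ascent with PSD projection. The trace normalization, step size and the ideal label kernel are assumptions made purely to keep the sketch bounded and runnable; this is not the paper's exact formulation.

```python
import numpy as np

def dmml_style_projected_gradient(X, L_ideal, n_iter=100, step=0.1):
    """Illustrative first-order procedure (not the paper's exact program):

    maximize  trace((X A X^T) H L H)  over PSD matrices A with trace(A) = 1
    (a normalization assumed here only to keep the iterates bounded).

    The objective is linear in A, so the gradient is constant; each step is
    followed by projection onto the PSD cone and trace renormalization.
    """
    n, d = X.shape
    H = np.eye(n) - np.ones((n, n)) / n
    grad = X.T @ H @ L_ideal @ H @ X        # d(objective)/dA, symmetric
    A = np.eye(d) / d
    for _ in range(n_iter):
        A = A + step * grad
        # Project onto the PSD cone by clipping negative eigenvalues ...
        w, V = np.linalg.eigh((A + A.T) / 2)
        A = (V * np.clip(w, 0, None)) @ V.T
        # ... and renormalize the trace (assumed constraint).
        A = A / max(np.trace(A), 1e-12)
    return A

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = rng.integers(0, 2, size=60)
L_ideal = (y[:, None] == y[None, :]).astype(float)   # ideal label kernel
A = dmml_style_projected_gradient(X, L_ideal)
print(np.round(A, 3))                                 # learned PSD metric, trace 1
```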

Low dimensional projections

Dimensionality reduction is advantageous for high-dimensional data: (1) it can substantially reduce the computational complexity and the storage requirement; and (2) it allows for visualization or embedding of the original data. Many techniques for dimensionality reduction have been investigated, which are classified into unsupervised methods (e.g., Principal Component Analysis (PCA) [13], Locality Preserving Projections (LPP) [14]) and supervised methods (e.g., Linear Discriminant Analysis (LDA)).
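One generic way a learnt PSD metric A yields a low-dimensional embedding (a sketch under the usual factorization A ≈ W W^T, not necessarily the construction used in this section) is to project onto the top eigenvectors of A scaled by the square roots of the eigenvalues; the example matrix and points below are arbitrary.

```python
import numpy as np

def projection_from_metric(A, r):
    """Recover an r-dimensional linear projection from a PSD metric A.

    Writing A ~= W W^T with W built from the top-r eigenpairs, the Mahalanobis
    distance under A is approximated by the Euclidean distance between the
    projected points W^T x, which is what makes a learned metric usable for
    visualization and dimensionality reduction.
    """
    w, V = np.linalg.eigh(A)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:r]            # indices of the top-r eigenvalues
    return V[:, idx] * np.sqrt(np.clip(w[idx], 0, None))

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 0.5],
              [0.0, 0.5, 0.1]])
W = projection_from_metric(A, 2)

x, y = np.array([1.0, 0.0, 2.0]), np.array([0.0, 1.0, 1.0])
d = x - y
print(np.sqrt(d @ A @ d))                    # exact Mahalanobis distance
print(np.linalg.norm(W.T @ x - W.T @ y))     # rank-2 approximation
```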

Experiments

In this section, we evaluate our methods DMML-H and DMML-D with different graph kernels for three metric learning related applications: (1) classification; (2) data visualization and (3) image retrieval.

Conclusion

In this paper, we have presented a statistical metric learning framework, Dependence Maximization based Metric Learning (DMML). DMML introduces dependence maximization into Mahalanobis distance learning in reproducing kernel Hilbert spaces (RKHSs). In contrast to the existing approaches, DMML puts emphasis on the following points. First, DMML optimally integrates the global and the local information without balancing weight(s) optimization or constraint satisfaction. Second, DMML does not require estimation or assumption of the data distributions.

Acknowledgment

This work is supported in part by NSFC Grant No. 61273196.

References (44)

  • H. Zhang et al., Semi-supervised distance metric learning based on local linear regression for data clustering, Neurocomputing (2012)
  • M. Wang et al., Metric learning with feature decomposition for image categorization, Neurocomputing (2010)
  • J. Chen, Z. Zhao, J. Ye, H. Liu, Nonlinear adaptive distance metric learning for clustering, in: 14th ACM SIGKDD...
  • J.B. Tenenbaum et al., A global geometric framework for nonlinear dimensionality reduction, Science (2000)
  • S.T. Roweis et al., Nonlinear dimensionality reduction by locally linear embedding, Science (2000)
  • E.P. Xing, A.Y. Ng, M.I. Jordan, S.J. Russell, Distance metric learning, with application to clustering with...
  • A. Globerson, S. Roweis, Metric learning by collapsing classes, in: 20th Annual Conference on Advances in Neural...
  • J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: 24th Annual...
  • R. Jin, S. Wang, Y. Zhou, Regularized distance metric learning: theory and algorithm, in: 23rd Annual Conference on...
  • A. Gretton, O. Bousquet, A.J. Smola, B. Schölkopf, Measuring statistical dependence with Hilbert–Schmidt norms, in:...
  • L. Song, A. Smola, A. Gretton, K.M. Borgwardt, A dependence maximization view of clustering, in: 24th Annual...
  • L. Song, A. Smola, A. Gretton, K.M. Borgwardt, Supervised feature selection via dependence estimation, in: 24th Annual...
  • M. Wang, F. Sha, M.I. Jordan, Unsupervised kernel dimension reduction, in: 24th Annual Conference on Advances in Neural...
  • R. Fisher, The use of multiple measurements in taxonomic problems, Ann. Hum. Genet. (1936)
  • I. Jolliffe, Principal Component Analysis (1986)
  • X. He, P. Niyogi, Locality preserving projections, in: 17th Annual Conference on Advances in Neural Information...
  • J. Goldberger, S. Roweis, G. Hinton, R. Salakhutdinov, Neighbourhood components analysis, in: 19th Annual Conference on...
  • K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in:...
  • L. Yang, R. Jin, R. Sukthankar, Y. Liu, An efficient algorithm for local distance metric learning, in: 21st National...
  • M. Sugiyama, Dimensionality reduction of multimodal labeled data by local Fisher discriminant analysis, J. Mach. Learn. Res. (2007)
  • S.C.H. Hoi, W. Liu, S.-F. Chang, Semi-supervised distance metric learning for collaborative image retrieval, in: IEEE...
  • G.Q. Zhong, K.Z. Huang, C.L. Liu, Low rank metric learning with manifold regularization, in: 11th IEEE International...

Wei Wang received her B.E. degree in Automation from the University of Science and Technology of China, Hefei, China, in 2009. Currently she is a Ph.D. candidate at the Joint Laboratory for the University of Science and Technology of China, Hefei, China and the National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. Her current research interests include machine learning, data mining and image processing.

Bao-Gang Hu (M'94–SM'99) received the M.Sc. degree from the University of Science and Technology, Beijing, China, and the Ph.D. degree from McMaster University, Hamilton, ON, Canada, both in mechanical engineering, in 1983 and 1993, respectively. He was a Research Engineer and Senior Research Engineer at C-CORE, Memorial University of Newfoundland, St. John's, NF, Canada, from 1994 to 1997. From 2000 to 2005, he was the Chinese Director of computer science, control, and applied mathematics with the Chinese-French Joint Laboratory, National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China. He is currently a Professor at NLPR. His current research interests include pattern recognition and plant growth modeling.

Zeng-Fu Wang was born in Hefei, China. He received the B.S. degree in information and system from the University of Science and Technology of China in 1982, the M.S. degree in communications and electronic engineering from the Nanjing Electronic Engineering Research Center, and the Ph.D. degree in control engineering from Osaka University, Japan. He is now a professor and director of the Institute of Intelligent Machines, Chinese Academy of Sciences, and a professor at the University of Science and Technology of China. His research interests include stereo vision, biometrics, affective computing and intelligent robotics.
