
Pattern Recognition

Volume 43, Issue 8, August 2010, Pages 2982-2992

Non-linear metric learning using pairwise similarity and dissimilarity constraints and the geometrical structure of data

https://doi.org/10.1016/j.patcog.2010.02.022

Abstract

The problem of clustering with side information has received much recent attention and metric learning has been considered as a powerful approach to this problem. Until now, various metric learning methods have been proposed for semi-supervised clustering. Although some of the existing methods can use both positive (must-link) and negative (cannot-link) constraints, they are usually limited to learning a linear transformation (i.e., finding a global Mahalanobis metric). In this paper, we propose a framework for learning linear and non-linear transformations efficiently. We use both positive and negative constraints and also the intrinsic topological structure of data. We formulate our metric learning method as an appropriate optimization problem and find the global optimum of this problem. The proposed non-linear method can be considered as an efficient kernel learning method that yields an explicit non-linear transformation and thus shows out-of-sample generalization ability. Experimental results on synthetic and real-world data sets show the effectiveness of our metric learning method for semi-supervised clustering tasks.

Introduction

Distance metrics play a key role in many machine learning algorithms [1]. Over the past few years, there has been considerable research on distance metric learning [2]. Many of the earlier studies optimize the metric with class labels for classification tasks [3], [4], [5], [6], [7], [8]. More recently, researchers have given much attention to distance learning for semi-supervised clustering tasks. Since class labels are generally not available for clustering tasks, constraints serve as a more natural form of supervisory information. Pairwise similarity (positive) and dissimilarity (negative) constraints are the most popular kind of side information used for semi-supervised clustering, although other kinds, such as relative comparisons, have also been considered in some studies.

Over the last few years, the problem of clustering with side information (semi-supervised clustering) has received increasing attention [9], [10], and distance learning has been considered a powerful approach to this problem. The two most frequently used ways to incorporate side information into clustering algorithms are constraint-based approaches [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21] and distance function learning approaches [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34] [24]. In the former, the clustering algorithm itself is modified to use the available labels or constraints to bias the search for an appropriate clustering of the data. In the latter, a distance function is learned prior to clustering; the learned function tries to put similar points close together and dissimilar points far away from each other. This approach is more flexible in the choice of distance function [33] and has received considerable attention in recent studies [1], [25], [28], [29], [30], [31], [33], [34]; we adopt it here as well.

Distance learning based on constraints has been studied by many researchers [22], [23], [24], [25], [26], [27], [28], [29], [30], [31], [32], [33], [34]. Klein et al. [22] introduced a metric adaptation method for semi-supervised clustering that derives a distance measure from shortest paths in a version of the similarity graph altered by positive constraints; negative constraints, however, are employed only after the metric adaptation phase, during complete-link clustering. Several later studies [1], [23], [25], [28], [34] followed a more popular approach that learns a global Mahalanobis metric from pairwise constraints. Xing et al. [23] proposed a convex optimization problem for learning such a metric. Bar-Hillel et al. [25] devised a more efficient, non-iterative algorithm called relevant component analysis (RCA), which can incorporate only positive constraints. Yeung and Chang [28] introduced an extension of RCA that can consider both positive and negative constraints.
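To make the Mahalanobis-metric approach concrete, the following sketch (in Python/NumPy; the function names and the small regularization constant are ours, not taken from the cited papers) illustrates an RCA-style computation: points grouped by positive constraints form "chunklets", the within-chunklet covariance is averaged over all constrained points, and its inverse serves as the metric A.

```python
import numpy as np

def rca_metric(X, chunklets):
    """RCA-style Mahalanobis metric from positive (must-link) constraints.

    X         : (n, d) data matrix.
    chunklets : list of index lists; each chunklet groups points that
                must-link constraints place in the same cluster.
    Returns the metric matrix A (inverse of the averaged within-chunklet
    covariance), so that d(x, y) = sqrt((x - y)^T A (x - y)).
    """
    d = X.shape[1]
    C = np.zeros((d, d))
    m = 0
    for idx in chunklets:
        pts = X[idx]
        centered = pts - pts.mean(axis=0)
        C += centered.T @ centered          # accumulate within-chunklet scatter
        m += len(idx)
    C /= m
    # Regularize lightly in case the chunklets do not span all dimensions.
    return np.linalg.inv(C + 1e-6 * np.eye(d))

def mahalanobis(x, y, A):
    diff = x - y
    return np.sqrt(diff @ A @ diff)
```

Negative constraints play no role in this computation, which is precisely the limitation addressed by the extension of Yeung and Chang [28].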

More recently, non-linear metric learning methods for semi-supervised clustering have been introduced. Chang and Yeung [29] proposed a locally linear metric learning method that considers only positive constraints; its objective function has many local optima, and the topology of the data cannot be preserved well during this approach [30]. Chang and Yeung [31] also proposed a metric adaptation method that adjusts the locations of data points iteratively, so that similar points tend to move closer together and dissimilar points tend to move apart. Because this method lacks an explicit transformation map, it cannot project new data points onto the transformed space straightforwardly [31]; moreover, the movement of data points may distort the structure of the data. In [30], two kernel-based metric learning methods were presented, but they can use only positive constraints.

Among the existing metric learning methods, some [1], [23], [28], [34], [39], [40] can incorporate both positive and negative constraints. However, most of these [1], [23], [28], [34] learn only a linear transformation corresponding to a Mahalanobis metric. Although some recent studies [39], [40] learn kernels from positive and negative constraints, they are based on learning non-parametric kernel matrices and can therefore compute distances only for the seen data. Additionally, the optimization problems in these methods are usually difficult to solve [40], and the corresponding models have a very high number of degrees of freedom (namely n², where n denotes the number of data points). In this paper, we propose an efficient non-linear metric learning method that considers both positive and negative constraints as well as the topological structure of the data. We formulate the proposed method as a constrained trace ratio optimization problem that can be solved efficiently using algorithms introduced for this purpose (e.g., Xiang et al.'s method [1]). The proposed non-linear method can be considered an efficient kernel learning method that does not need to learn all entries of an n×n matrix, and it yields an explicit transformation that can project new data points onto the transformed space.
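For intuition about the trace ratio form (the paper's exact constrained objective is not reproduced in this excerpt), a common unconstrained variant seeks an orthonormal W that maximizes tr(W^T S_d W) / tr(W^T S_s W), where S_s and S_d are scatter matrices accumulated over similar and dissimilar pairs, e.g., S_s is the sum of (x_i - x_j)(x_i - x_j)^T over similar pairs (i, j). The standard iterative solver, sketched below under these assumptions, alternates between updating the ratio lambda and taking the top eigenvectors of S_d - lambda * S_s:

```python
import numpy as np

def trace_ratio(S_d, S_s, k, iters=50, tol=1e-8):
    """Iterative solver for the trace ratio problem
        max_W  tr(W^T S_d W) / tr(W^T S_s W)   s.t.  W^T W = I,
    where S_d / S_s are scatter matrices of dissimilar / similar pairs.
    A sketch of the scheme used by trace-ratio solvers such as the one
    in Xiang et al.'s method.
    """
    d = S_d.shape[0]
    W = np.eye(d)[:, :k]                     # initial orthonormal basis
    lam = 0.0
    for _ in range(iters):
        _, vecs = np.linalg.eigh(S_d - lam * S_s)
        W = vecs[:, -k:]                     # eigenvectors of the k largest eigenvalues
        new_lam = np.trace(W.T @ S_d @ W) / np.trace(W.T @ S_s @ W)
        if abs(new_lam - lam) < tol:
            break
        lam = new_lam
    return W, lam
```

The columns of the returned W define a linear map x -> W^T x under which dissimilar pairs spread out relative to similar pairs.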

The rest of this paper is organized as follows. Section 2 presents a brief review of related work. Section 3 first introduces the general form of the proposed optimization problems, which incorporate both positive and negative constraints as well as the topological structure of the data; it then presents special problems that can be solved efficiently for learning linear and non-linear transformations, and finally describes a kernel-based method and the relation between the proposed non-linear method and a special form of this kernel-based method. Section 4 presents experimental results on synthetic and real-world data sets. Concluding remarks are given in the last section.


Related works

In this section, we review methods that can consider both positive and negative constraints to learn a transformation. A positive constraint denotes a pair of data points that must be in the same cluster, while a negative constraint denotes two data points that must be in two different clusters [1]. Most of the existing methods that can use both positive and negative constraints learn a Mahalanobis metric A (where A is a positive semi-definite matrix) or, equivalently, find a linear transformation W with A = W^T W, so that distances under A equal Euclidean distances in the transformed space.
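The equivalence noted above is easy to verify numerically: with A = W^T W, the Mahalanobis distance under A coincides with the Euclidean distance after mapping the points through W. A minimal check (illustrative values only):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))
A = W.T @ W                                  # positive semi-definite by construction

x, y = rng.standard_normal(3), rng.standard_normal(3)
d_mahalanobis = np.sqrt((x - y) @ A @ (x - y))
d_transformed = np.linalg.norm(W @ x - W @ y)
assert np.isclose(d_mahalanobis, d_transformed)
```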

Proposed approach

In this section, we first propose a general framework for learning an appropriate transformation from positive and negative constraints. Based on this framework, we propose problems (that can be solved efficiently) for learning linear and non-linear transformations. Finally, we introduce our kernel-based method and show the relation between a special form of this method and the proposed non-linear metric learning method.

Here, we introduce some notation used in this section. X = {x_1, x_2, …, x_n} denotes the set of n data points.
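Although the paper's exact parameterization does not appear in this excerpt, a common way to obtain an explicit non-linear map with out-of-sample ability is a kernel expansion f(x) = Alpha^T k(x), where k(x) = [k(x_1, x), …, k(x_n, x)]^T and Alpha is an n×r coefficient matrix learned from the constraints; once Alpha is fixed, any unseen point can be projected. A hedged sketch (the names rbf, make_transform, and Alpha are our own):

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Gaussian (RBF) kernel between two points.
    return np.exp(-gamma * np.sum((x - y) ** 2))

def make_transform(X_train, Alpha, gamma=1.0):
    """Explicit non-linear map of the kernel-expansion form
        f(x) = Alpha^T k(x),   k(x) = [k(x_1, x), ..., k(x_n, x)]^T.
    Given the learned (n, r) coefficient matrix Alpha, this returns a
    function that projects any point, seen or unseen, into the
    r-dimensional transformed space (out-of-sample projection).
    """
    def transform(x):
        k = np.array([rbf(xi, x, gamma) for xi in X_train])
        return Alpha.T @ k
    return transform
```

New points are then compared with ordinary Euclidean distance in the transformed space, which is what gives such a map its out-of-sample generalization ability.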

Experimental results

In this section, we describe the experiments we have conducted to compare our linear and non-linear metric learning methods with some existing methods. We measure the effectiveness of semi-supervised metric learning algorithms by comparing the clustering results obtained using different metrics. We report results on both synthetic and real-world data sets.
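As a reference point for how such comparisons are typically run (this is a generic pipeline, not necessarily the paper's exact protocol or data sets), one projects the data with the learned transformation, clusters the projected points with k-means, and scores the partition against ground-truth labels, for example with the adjusted Rand index:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def evaluate_metric(transform, X, labels, n_clusters, seed=0):
    """Project the data with a learned transform, cluster with k-means,
    and score the clustering against ground-truth labels."""
    Z = np.array([transform(x) for x in X])
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(Z)
    return adjusted_rand_score(labels, pred)
```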

Conclusions and future work

In this paper, we introduced a novel metric learning method for semi-supervised clustering. We proposed a general framework for learning linear and non-linear transformations using both positive and negative constraints. The proposed methods were formulated as constrained trace ratio problems that can be solved efficiently. We considered the geometrical structure of the data along with the pairwise constraints in the proposed optimization problems. We showed that the proposed non-linear method can be considered an efficient kernel learning method that yields an explicit transformation and thus generalizes to out-of-sample data points.


References (40)

  • J.H. Friedman, Flexible metric nearest neighbor classification, Technical Report, Statistics Department, Stanford...
  • Z.H. Zhang, J.T. Kwok, D.Y. Yeung, Parametric distance metric learning with label information, in: IJCAI, Acapulco,...
  • M.H.C. Law, Clustering, dimensionality reduction, and side information, Ph.D. Dissertation, Michigan University,...
  • S. Basu, Semi-supervised clustering: probabilistic models, algorithms and experiments, Ph.D. Dissertation, University...
  • M.H.C. Law, A. Topchy, A.K. Jain, Model-based clustering with probabilistic constraints, in: SIAM Conference on Data...
  • S. Basu, A. Banerjee, R.J. Mooney, Semi-supervised clustering by seeding, in: 19th International Conference on Machine...
  • K. Wagstaff, C. Cardie, S. Rogers, S. Schroedl, Constrained K-means clustering with background knowledge, in: 18th...
  • Z. Lu, T. Leen, Semi-supervised learning with penalized probabilistic clustering, in: Advances in NIPS 17, MIT Press,...
  • N. Bansal et al., Correlation clustering, Machine Learning (2004)
  • T. Lange, M.H. Law, A.K. Jain, J. Buhmann, Learning with constrained and unlabelled data, in: IEEE Computer Society...

About the Author—MAHDIEH SOLEYMANI BAGHSHAH received her B.S. and M.S. degrees from the Department of Computer Engineering, Sharif University of Technology, Iran, in 2003 and 2005. She is now a Ph.D. candidate at Sharif University of Technology. Her research interests include machine learning and pattern recognition, with primary emphasis on semi-supervised learning and clustering.

About the Author—SAEED BAGHERI SHOURAKI received his B.Sc. in Electrical Engineering and M.Sc. in Digital Electronics from Sharif University of Technology, Tehran, Iran, in 1985 and 1987. He then joined the Department of Computer Engineering at Sharif University of Technology as a faculty member. He received his Ph.D. in fuzzy control systems from Tsushin Daigaku (University of Electro-Communications), Tokyo, Japan, in 2000. He continued his activities in the Department of Computer Engineering until 2008 and is currently an associate professor in the Department of Electrical Engineering at Sharif University of Technology. His research interests include control, robotics, artificial life, and soft computing.
