Neural Networks

Volume 26, February 2012, Pages 141-158

Learning from pairwise constraints by Similarity Neural Networks

https://doi.org/10.1016/j.neunet.2011.10.009

Abstract

In this paper we present Similarity Neural Networks (SNNs), a neural network model able to learn a similarity measure for pairs of patterns, exploiting binary supervision on their similarity/dissimilarity relationships. Pairwise relationships, also referred to as pairwise constraints, generally contain less information than class labels but, in some contexts, are easier to obtain from human supervisors. The SNN architecture guarantees the basic properties of a similarity measure (symmetry and non-negativity) and it can deal with non-transitivity of the similarity criterion. Unlike the majority of the metric learning algorithms proposed so far, it can model non-linear relationships among data while still providing a natural out-of-sample extension to novel pairs of patterns. The theoretical properties of SNNs and their application to semi-supervised clustering are investigated. In particular, we introduce a novel technique that allows the clustering algorithm to compute the optimal representatives of a data partition by means of backpropagation on the input layer, biased by an L2-norm regularizer. An extensive set of experimental results is provided to compare SNNs with the most popular similarity learning algorithms. Both on benchmarks and real-world data, SNNs and SNN-based clustering show improved performance, confirming the advantage of the proposed neural network approach to similarity measure learning.

Introduction

Similarity learning algorithms induce a similarity measure suitable for comparing data points by exploiting a set of supervised examples. Given two points represented in a feature space, $x, y \in \mathcal{F}$, the assumption behind similarity learning is that $\mathcal{F}$ is non-Euclidean, so that the similarity degree between $x$ and $y$ may not be appropriately computed by the classical Euclidean distance. Supervision is generally provided in the form of similarity/dissimilarity labels on pairs of points, also referred to as pairwise constraints. The learned similarity function should approximate the supervisor's perception of similarity in the given feature space, and it can be used to partition data in semi-supervised clustering algorithms.

In the last few decades, the human perception of similarity has received growing attention from researchers in psychology and mathematics, who studied its properties and tried to define appropriate models of similarity functions (Richter, 1992, Tversky, 1977, Wallace, 1958). More recently, the problem of learning a similarity measure has also attracted the machine learning community. In particular, in a wide set of fields, ranging from bioinformatics to information retrieval, from robotics to computer vision, supervision on the relationships between pairs of entities is often available, and an appropriate criterion to compare new data must be inferred.

The term similarity measure is frequently used in a generic sense, describing both similarity and dissimilarity functions. Following psychological evidence on how humans learn, similarity measures are not required to satisfy all the metric properties (Santini and Jain, 1999, Tversky, 1977). In particular, the similarity relationship is not necessarily transitive.

In this paper we focus on two major aspects of similarity learning: the inductive learning of a similarity measure from pairwise constraints in a fully supervised scenario, in which the trained model provides a natural out-of-sample extension of the similarity criterion to novel pairs of points, and, subsequently, the application of the learned function to semi-supervised partitional clustering of unlabeled data. The former point is notable, since it casts the learning problem in a more challenging scenario than the transductive setting in which many existing algorithms operate (i.e., they compute distances within the training pairs only). For instance, the learned measure can be used to group or compare data that is not available at training time, or that is incrementally added to the existing collection of patterns as soon as it is acquired or provided to the system. In particular, the contributions of this paper are the following:

  • the definition of Similarity Neural Networks (SNNs), a neural network model designed to learn non-linear similarity measures from pairwise constraints and to generalize the learned criterion to compare previously unseen data pairs. The network architecture guarantees the symmetry and non-negativity of the implemented similarity measures, independently of the available supervision.

  • the analysis of the theoretical properties and approximation capabilities of SNNs, which are proven to be universal approximators for symmetric functions.

  • the definition of a technique to compute the optimal cluster representatives in semi-supervised partitional clustering, exploiting SNNs to implement the similarity concept. Due to the non-linearity of the learned function, the SNN model cannot be directly applied to K-Means clustering without considering approximations of data representatives. To overcome this issue, we describe how to compute optimal representatives with respect to the SNN function by means of a modified backpropagation scheme that is extended to the input layer of the network and biased by an L2-norm regularizer. This approach is a more efficient version of the technique that we proposed in Melacci, Maggini, and Sarti (2009).

  • the critical review of the most popular seeding and constraining policies, and the definition of new ones, showing that both the learning of the similarity measure and the clustering process can be “guided” by the available pairwise constraints, improving the quality of the data partitioning in regions of the space where SNNs were not able to perfectly model the similarity function.

  • a detailed experimental analysis, conducted to measure the performance of SNNs. We compare the new model against many popular inductive learning algorithms for similarity measure estimation, considering linear, non-linear, and kernel-based techniques. Experiments are performed on several benchmarks from the UCI repository (Asuncion & Newman, 2007) and on real data from the US Postal System. SNNs compare favorably with the considered methods, showing the advantage of their flexible and non-linear model, and the benefits of their application to tasks that require partitioning data according to an adaptive similarity criterion.

The paper is organized as follows. In the next section the notation and the properties of the considered pairwise relationships are presented. Section 3 reviews the related work on similarity learning available in the literature. The SNN model and its theoretical properties are described in Section 4. Section 5 describes the application of SNNs to semi-supervised clustering, and in Section 6 a detailed experimental analysis is reported. Finally, Section 7 draws some conclusions and delineates the directions for future research.

Section snippets

Learning pairwise relationships

In this section we introduce the general formulation of learning from pairwise constraints and the special case when pairwise relationships are obtained from the class labels available for each single data point.

General formulation. Given a set $V = \{x_1, x_2, \ldots, x_n\}$ of $n$ data points $x_i \in \mathcal{F} \subseteq \mathbb{R}^m$, we consider the case when the available supervision is represented by a set $P$ of $p$ symmetric pairwise relationships, or constraints. In detail, $P = S \cup D$, where $S$ contains the must-link constraints, or similarity
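As a concrete illustration of the special case mentioned at the start of this section (constraints derived from class labels), the following Python sketch builds the must-link set $S$ and the cannot-link set $D$ from a label vector. This is a minimal sketch of our own; `make_constraints` and all names are illustrative and do not come from the paper.

```python
# Illustrative sketch (not from the paper): deriving the constraint sets
# S (must-link) and D (cannot-link) from class labels.
from itertools import combinations

def make_constraints(labels):
    """Return (S, D): pairs with equal labels go to S, all others to D."""
    S, D = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            S.append((i, j))   # must-link: same class
        else:
            D.append((i, j))   # cannot-link: different classes
    return S, D

S, D = make_constraints([0, 0, 1, 1])
print(S)  # [(0, 1), (2, 3)]
print(D)  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```

When every pair is labeled this way, $p = n(n-1)/2$; in general only a subset of the pairs is supervised.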

Related work

In the machine learning literature, similarity-based learning collects a large number of significantly different approaches. In the following we briefly summarize the main contributions, focusing on the techniques that are most closely related to SNNs. The existing algorithms can be roughly divided into three main categories, based on the type of supervision provided.

Unsupervised. Many unsupervised algorithms that make specific assumptions on the distribution of data are frequently referred

Similarity neural networks

An SNN consists of a feedforward Multi-Layer Perceptron (MLP) (Haykin, 1998) trained to learn a similarity measure for pairs of patterns $(x_i, x_j)$, $x_i, x_j \in \mathbb{R}^m$, using binary supervision. Given a set of objects $V$ and a set of pairwise constraints $P$, the SNN learning set is defined as $L = \{([x_{i(z)}, x_{j(z)}], y_z) \mid (x_{i(z)}, x_{j(z)}) \in P,\ y_z = 1 \text{ if } (x_{i(z)}, x_{j(z)}) \in S,\ y_z = 0 \text{ if } (x_{i(z)}, x_{j(z)}) \in D\}$.

The set $L$ collects $p = |P|$ triples $([x_{i(z)}, x_{j(z)}], y_z)$, where $y_z \in \{0, 1\}$ is the similarity/dissimilarity label of the pair $(x_{i(z)}, x_{j(z)})$,
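To make the setup concrete, here is a minimal sketch of a pair-similarity MLP. The paper enforces symmetry and non-negativity through the SNN architecture itself; as a simplification, this sketch obtains symmetry by averaging the output over both input orderings and non-negativity through a logistic output unit, so it illustrates the required properties rather than the exact SNN architecture. All names and sizes are our own.

```python
# Minimal sketch of a symmetric, non-negative pair-similarity MLP.
# NOT the paper's exact architecture: symmetry is obtained here by
# averaging over the two input orderings, non-negativity by a sigmoid.
import numpy as np

rng = np.random.default_rng(0)
m, h = 4, 8                       # input dimension, hidden units (illustrative)
W1 = rng.normal(size=(h, 2 * m))  # first layer acts on the concatenated pair
b1 = np.zeros(h)
w2 = rng.normal(size=h)           # single output unit
b2 = 0.0

def _mlp(x, y):
    """Plain MLP score for the ordered pair (x, y)."""
    z = np.tanh(W1 @ np.concatenate([x, y]) + b1)
    return w2 @ z + b2

def snn_similarity(x, y):
    """Symmetric similarity score in (0, 1), hence non-negative."""
    a = 0.5 * (_mlp(x, y) + _mlp(y, x))   # symmetrize over the two orderings
    return 1.0 / (1.0 + np.exp(-a))       # logistic output

x, y = rng.normal(size=m), rng.normal(size=m)
assert np.isclose(snn_similarity(x, y), snn_similarity(y, x))
```

Training would then fit the weights by minimizing, e.g., the cross-entropy between the network output on $(x_{i(z)}, x_{j(z)})$ and the target $y_z$ over $L$.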

Semi-supervised clustering by Similarity Neural Networks

Partitional clustering algorithms, such as K-Means and K-Medoids (Duda et al., 2000, Kaufman and Rousseeuw, 1987), divide a set $V$ of objects into $k$ clusters by searching for $k$ representatives which minimize the average dissimilarity of all objects to the nearest representative. The representative $c_j$ of the $j$-th cluster $C_j$ is computed as $c_j = \arg\min_{c_j} \sum_{x_i \in C_j} d(x_i, c_j)$.

In K-Medoids, the $c_j$, $j = 1, \ldots, k$, are referred to as medoids, and they are selected from the points of $V$. In contrast, the $k$ centroids of
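The following sketch illustrates the representative-update idea introduced among the contributions: the candidate centroid is treated as a free input of the learned similarity and optimized by gradient ascent with an L2 penalty. It is a simplified stand-in, not the paper's implementation: the similarity is a fixed Gaussian rather than a trained SNN, and the gradient is approximated by finite differences instead of backpropagation extended to the input layer. All names and parameters are illustrative.

```python
# Sketch of the representative update: ascend the gradient of the average
# similarity of the candidate centroid c to the cluster members, minus an
# L2 penalty. The paper computes the gradient exactly via backpropagation
# extended to the input layer; here a fixed Gaussian similarity and a
# finite-difference gradient stand in for the trained SNN.
import numpy as np

def sim(x, c):                         # placeholder for a trained SNN
    return np.exp(-np.sum((x - c) ** 2))

def objective(c, cluster, lam):
    return np.mean([sim(x, c) for x in cluster]) - lam * np.sum(c ** 2)

def optimal_representative(cluster, lam=1e-3, lr=0.5, steps=200, eps=1e-5):
    c = np.mean(cluster, axis=0)        # start from the Euclidean mean
    for _ in range(steps):
        g = np.zeros_like(c)
        for d in range(c.size):         # finite-difference gradient in c
            e = np.zeros_like(c); e[d] = eps
            g[d] = (objective(c + e, cluster, lam)
                    - objective(c - e, cluster, lam)) / (2 * eps)
        c += lr * g                     # gradient ascent step
    return c

cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
print(optimal_representative(cluster))
```

Unlike a medoid, the resulting representative is not constrained to be one of the points of $V$, which is what makes this optimization step necessary for non-linear similarities.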

Experimental results

In order to evaluate the performance of SNNs and their application to partitional semi-supervised clustering, we selected 7 popular datasets from the UCI repository (Asuncion & Newman, 2007). The resulting optimal setup is tested on the handwritten digit data from the US Postal System. Table 1 reports the main characteristics of each dataset.

For each benchmark a set of k classes is defined. Following the framework described

Conclusions and future work

In this paper we presented the Similarity Neural Network (SNN) model. SNNs are designed to learn similarity measures from pairwise constraints that describe similarity/dissimilarity relationships between patterns. Due to their particular architecture, they are guaranteed to compute a symmetric and non-negative function, independently of the available supervision, and they naturally provide an out-of-sample extension to novel pairs of data points. The approximation capabilities of SNN have been

References (58)

  • A. Bar-Hillel et al. Learning distance functions using equivalence relations.
  • A. Bar-Hillel et al. Learning a Mahalanobis metric from equivalence constraints. The Journal of Machine Learning Research (2005).
  • Basu, S., Banerjee, A., & Mooney, R. (2002). Semi-supervised clustering by seeding. In Proceedings of the International...
  • S. Basu et al. A probabilistic framework for semi-supervised clustering.
  • S. Basu et al. Constrained clustering: advances in algorithms, theory, and applications (2008).
  • M. Belkin et al. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation (2003).
  • M. Bilenko et al. Integrating constraints and metric learning in semi-supervised clustering.
  • H. Chang et al. Kernel-based metric adaptation with pairwise constraints.
  • Y. Chen et al. Similarity-based classification: concepts and algorithms. The Journal of Machine Learning Research (2009).
  • M. Cox. Multidimensional scaling (2000).
  • De Bie, T., Momma, M., & Cristianini, N. (2003). Efficiently learning the metric using side-information. Proceedings of...
  • C. Domeniconi et al. Large margin nearest neighbor classifiers. IEEE Transactions on Neural Networks (2005).
  • R. Duda et al. Pattern classification (2000).
  • E. Frank et al. Weka: a machine learning workbench for data mining.
  • J. Goldberger et al. Neighbourhood components analysis.
  • S. Haykin. Neural networks: a comprehensive foundation (1998).
  • T. Hertz et al. Boosting margin based distance functions for clustering.
  • T. Hertz et al. Learning distance functions for image retrieval.
  • D. Hochbaum et al. A best possible heuristic for the k-center problem. Mathematics of Operations Research (1985).