Abstract
Nearest neighbour classification requires a good distance metric. Previous approaches try to learn a quadratic distance metric so that observations of different classes are well separated. For high-dimensional problems, where many uninformative variables are present, it is attractive to select a sparse distance metric, both to increase predictive accuracy and to aid interpretation of the result. We investigate the \(\ell_1\)-regularized metric learning problem, making a connection with the Lasso algorithm in the linear least squares setting. We show that the fitted transformation matrix is close to the desired transformation matrix in \(\ell_1\)-norm under a version of the compatibility condition.
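To make the connection with the Lasso concrete, the following is a minimal sketch, not the authors' implementation: each pair \((i,j)\) contributes the feature vector \(\mathrm{vec}\bigl((X_i-X_j)(X_i-X_j)^{\top}\bigr)\), so that the squared distance under a metric \(\mathbf{M}\) is linear in \(\mathrm{vec}(\mathbf{M})\) and an \(\ell_1\) penalty on \(\mathrm{vec}(\mathbf{M})\) gives an ordinary Lasso problem. The pairwise responses and the helper pair_features below are illustrative assumptions, and constraints such as symmetry or positive semi-definiteness of \(\mathbf{M}\) are not enforced.

# Sketch only: l1-regularized metric learning cast as a Lasso problem.
# The pair responses y and the helper names are illustrative, not from the paper.
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def pair_features(X):
    # For each pair (i, j), vectorize the outer product of X_i - X_j,
    # so that <vec(M), z_ij> = (X_i - X_j)^T M (X_i - X_j).
    pairs = list(combinations(range(X.shape[0]), 2))
    Z = np.array([np.outer(X[i] - X[j], X[i] - X[j]).ravel() for i, j in pairs])
    return Z, pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                                # n = 60 observations, p = 10 variables
M_true = np.zeros((10, 10))
M_true[0, 0] = M_true[1, 1] = 1.0                            # sparse ground-truth metric
Z, pairs = pair_features(X)
y = Z @ M_true.ravel() + 0.1 * rng.normal(size=len(pairs))   # noisy pairwise responses

fit = Lasso(alpha=0.05, fit_intercept=False).fit(Z, y)
M_hat = fit.coef_.reshape(10, 10)                            # fitted (sparse) metric matrix

With informative responses, most of the mass of M_hat concentrates on the few entries that are active in M_true, which is the sparsity behaviour analysed in the paper.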

References
Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Scientific, Nashua, New Hampshire
Bian W, Tao D (2011) Learning a distance metric by empirical loss minimization. In: Proceedings of the twenty-second international joint conference on artificial intelligence—vol 2, IJCAI’11. Association for the Advancement of Artificial Intelligence Press, pp 1186–1191
Bian W, Tao D (2012) Constrained empirical risk minimization framework for distance metric learning. IEEE Trans Neural Netw Learn Syst 23(8):1194–1205
Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37:1705–1732
Breiman L (2001) Random forests. Mach Learn 45:5–32
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data. Springer, Berlin
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Friedman JH, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Goldberger J, Roweis S, Hinton G, Salakhutdinov R (2004) Neighbourhood components analysis. In: Advances in neural information processing systems 17. MIT Press, Cambridge, pp 513–520
Hix S, Noury A, Roland G (2006) Dimensions of politics in the European Parliament. Am J Polit Sci 50:494–511
Negahban S, Wainwright MJ (2011) Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann Stat 39:1069–1097
Soifer A, Grünbaum B, Johnson P, Rousseau C (2008) The mathematical coloring book: mathematics of coloring and the colorful life of its creators. Springer, New York
van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electron J Stat 3:1360–1392
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
Xing EP, Ng AY, Jordan MI, Russell S (2002) Distance metric learning, with application to clustering with side-information. In: Advances in neural information processing systems 15. MIT Press, Cambridge, pp 505–512
Appendix: Proofs
1.1 Additional lemmata
Lemma 6.1
Suppose Assumptions 3.1 and 3.2 hold and suppose the covariates are bounded, that is, there exists some \(c>0\) such that \( |X_{i,h}|\le c \) for all \( h = 1, \ldots , p \) and \( i=1, \ldots ,n\). Let
Then, for \(N={n(n-1)/2}\),
Proof
The main difficulty here is that the pairwise distances are not independent. When the number of observations \(n\) is odd, we can pick a set of \((n-1)/{2}\) pairwise distances whose indices do not overlap and which are therefore independent; if \(n\) is even, we can pick such a set of \(n/2\) pairwise distances. For this set of pairwise distances, we can show that the contribution from the noise term is bounded with high probability. We can then show that the contribution of the noise to the set of all pairwise distances is also bounded by applying a union bound.
We first consider the case when \(n\) is odd. We use a result from graph theory to prove our claim. A complete graph with \(n\) vertices is a graph in which there is an edge between any two vertices; a pairwise distance can thus be represented as an edge of the complete graph. An edge colouring of a graph is an assignment of a colour to each edge such that no two adjacent edges share the same colour. This implies that pairwise distances corresponding to edges of the same colour are independent under Assumption 3.2. The edge chromatic number is the minimum number of colours required to colour the edges of a graph. A complete graph with \(n\) vertices has edge chromatic number \(n\) when \(n\) is odd and \(n-1\) when \(n\) is even (a constructive proof is given in Soifer et al. 2008, p. 133), from which we can deduce that there is a colouring of the edges such that each colour class consists of exactly \((n-1)/2\) edges when \(n\) is odd. Therefore, the set of all pairwise distances can be partitioned into \(n\) parts \(G_1,\ldots ,G_n\), each consisting of \((n-1)/{2}\) pairwise distances with no overlapping indices. Such a partition always exists because a colouring of the edges of the complete graph with \(n\) colours yields it, with the edges of each colour forming one part; an explicit construction is sketched below.
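As an illustration only (the paper gives no explicit construction beyond the reference to Soifer et al. 2008), the classical round-robin or "circle method" schedule realizes such an edge colouring: it partitions all pairs \(\{i,j\}\) into classes of vertex-disjoint pairs, with \(n\) classes of \((n-1)/2\) pairs when \(n\) is odd (via a dummy vertex) and \(n-1\) classes of \(n/2\) pairs when \(n\) is even. The function name below is hypothetical.

# Sketch: round-robin (circle-method) partition of the edges of K_n into
# colour classes of vertex-disjoint pairs; illustrative only.
def pair_partition(n):
    verts = list(range(n)) + ([None] if n % 2 else [])   # dummy vertex when n is odd
    m = len(verts)
    rounds = []
    for r in range(m - 1):
        order = [verts[0]] + [verts[1 + (r + k) % (m - 1)] for k in range(m - 1)]
        matching = [(order[k], order[m - 1 - k]) for k in range(m // 2)]
        rounds.append([tuple(sorted(e)) for e in matching if None not in e])
    return rounds

parts = pair_partition(7)                 # 7 colour classes of 3 disjoint pairs each
all_pairs = [e for g in parts for e in g]
assert len(all_pairs) == len(set(all_pairs)) == 7 * 6 // 2   # every pair appears exactly once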
Let \(V_{h,\ell ,G_k} = \frac{2}{n-1} \sum _{ \{i,j\} \in G_k} 2\varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })\).
By the boundedness of \(X\) and \(\varepsilon \),
Therefore, by Hoeffding’s inequality,
By a union bound over all choices of \(h\) and \(\ell \),
Finally, by a simple union bound over all parts \(G_k\),
And thus
When \(n\) is even, we can decompose the complete graph into \(n-1\) sets of disjoint pairs, each containing \(n/2\) pairwise distances. In this case we can derive a slightly stronger bound than when \(n\) is odd. \(\square \)
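For reference, and not reproduced from the paper, the Hoeffding bound invoked above has the standard form for independent, mean-zero summands: if \(Z_1,\ldots,Z_m\) are independent with \(\mathbb{E}Z_k=0\) and \(|Z_k|\le B_k\), then
\[
\mathbb{P}\Bigl(\Bigl|\sum_{k=1}^{m} Z_k\Bigr| > t\Bigr) \;\le\; 2\exp\Bigl(-\frac{t^2}{2\sum_{k=1}^{m} B_k^2}\Bigr).
\]
Applied to each \(V_{h,\ell,G_k}\), whose summands are bounded in terms of \(c\) and the bound on \(\varepsilon\), and followed by a union bound over the \(p^2\) choices of \((h,\ell)\) and the at most \(n\) colour classes \(G_k\), the single tail bound is multiplied by \(np^2\); this is where the logarithmic dependence on \(p\) in the resulting deviation bound typically enters.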
The proofs of Lemmata 6.2 and 6.3 follow closely those given in Chapter 6 of Bühlmann and van de Geer (2011), with modifications to handle the matrix notation.
Lemma 6.2
Assume that \(\max _{1 \le h\le p, 1\le \ell \le p}|2 N^{-1} \sum _{ij} \varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })| < \lambda _0\). Then
Proof
Since \(\hat{\mathbf{M}}\) minimizes \(L(\mathbf{M})\), we have that \(L(\hat{\mathbf{M}}) \le L(\mathbf{M}^{*})\) and thus
which completes the proof. \(\square \)
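For readers unfamiliar with this step, here is a sketch of the standard "basic inequality" argument in hypothetical notation not taken from the paper (writing \(D_{ij}=(X_i-X_j)(X_i-X_j)^{\top}\), \(\langle \mathbf{A},\mathbf{B}\rangle = \mathrm{tr}(\mathbf{A}^{\top}\mathbf{B})\) and \(y_{ij}\) for the pairwise response; cf. Chapter 6 of Bühlmann and van de Geer 2011). Since \(\hat{\mathbf{M}}\) minimizes the penalized criterion,
\[
\frac{1}{N}\sum_{i<j}\bigl(y_{ij}-\langle\hat{\mathbf{M}},D_{ij}\rangle\bigr)^2+\lambda\Vert\hat{\mathbf{M}}\Vert_1
\;\le\;
\frac{1}{N}\sum_{i<j}\bigl(y_{ij}-\langle\mathbf{M}^{*},D_{ij}\rangle\bigr)^2+\lambda\Vert\mathbf{M}^{*}\Vert_1,
\]
and substituting \(y_{ij}=\langle\mathbf{M}^{*},D_{ij}\rangle+\varepsilon_{i,j}\) and rearranging gives
\[
\frac{1}{N}\sum_{i<j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle^2
\;\le\;
\frac{2}{N}\sum_{i<j}\varepsilon_{i,j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle
+\lambda\Vert\mathbf{M}^{*}\Vert_1-\lambda\Vert\hat{\mathbf{M}}\Vert_1,
\]
where, by Hölder's inequality, the first term on the right is at most \(\lambda_0\Vert\hat{\mathbf{M}}-\mathbf{M}^{*}\Vert_1\) on the event assumed in the lemma.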
Lemma 6.3
Assume that \(\max_{1 \le h \le p, 1 \le \ell \le p}|2 N^{-1} \sum_{ij} \varepsilon_{i,j} (X_{ih}-X_{jh})(X_{i\ell} - X_{j\ell})| < \lambda_0\) holds. By picking \(\lambda \ge 2 \lambda_0\),
Proof
By the triangle inequality,
Also note that, since \(\mathbf{M}^{*}_{S^c} = 0\), we can expand \(\Vert \hat{\mathbf{M}} - \mathbf{M}^{*}\Vert = \Vert \hat{\mathbf{M}}_S - \mathbf{M}^{*}_S \Vert + \Vert \hat{\mathbf{M}}_{S^c}\Vert .\) We can hence further extend the result of Lemma 6.2,
which completes the proof. \(\square \)
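As a point of comparison only, the analogous statement for the standard Lasso (Lemma 6.3 in Bühlmann and van de Geer 2011) concludes, on the same event and with \(\lambda\ge 2\lambda_0\), that in the hypothetical notation above
\[
\frac{2}{N}\sum_{i<j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle^2+\lambda\Vert\hat{\mathbf{M}}_{S^c}\Vert_1
\;\le\;
3\lambda\Vert\hat{\mathbf{M}}_S-\mathbf{M}^{*}_S\Vert_1,
\]
which in particular places \(\hat{\mathbf{M}}-\mathbf{M}^{*}\) in the cone \(\Vert\hat{\mathbf{M}}_{S^c}\Vert_1\le 3\Vert\hat{\mathbf{M}}_S-\mathbf{M}^{*}_S\Vert_1\) on which the compatibility condition is stated; the constants in the present setting may differ.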
1.2 Proof of Theorem 3.4
Finally, we can prove the performance guarantee for the regularized method. By Lemma 6.1, it holds with probability at least \(1 - 2 \exp (-t^2/2)\) that
Hence, using Lemma 6.3, it holds with probability \(1 - 2 \exp (-t^2/2)\) that
Using the compatibility condition, there exists some \(\psi >0\) such that
Hence
which completes the proof. \(\square \)
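The final calculation mirrors Theorem 6.1 of Bühlmann and van de Geer (2011); the following sketch, again in the hypothetical notation above, indicates the shape of the argument. With \(s=|S|\), the compatibility condition with constant \(\psi\) gives
\[
\Vert\hat{\mathbf{M}}_S-\mathbf{M}^{*}_S\Vert_1^2 \;\le\; \frac{s}{\psi^2}\cdot\frac{1}{N}\sum_{i<j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle^2,
\]
and combining this with the display from Lemma 6.3 via the elementary inequality \(4uv\le u^2+4v^2\) yields a bound of the form \(\Vert\hat{\mathbf{M}}-\mathbf{M}^{*}\Vert_1 \le 4\lambda s/\psi^2\), up to the constants used in the paper.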