Abstract
Nearest neighbour classification requires a good distance metric. Previous approaches try to learn a quadratic distance metric so that observations of different classes are well separated. For high-dimensional problems, where many uninformative variables are present, it is attractive to select a sparse distance metric, both to increase predictive accuracy and to aid interpretation of the result. We investigate the \(\ell_1\)-regularized metric learning problem, making a connection with the Lasso algorithm in the linear least squares setting. We show that the fitted transformation matrix is close to the desired transformation matrix in \(\ell_1\)-norm under a version of the compatibility condition.
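To make the connection with the Lasso concrete, the following is a minimal sketch, not the authors' implementation: each pair \((i,j)\) contributes the feature vector \(\mathrm{vec}\bigl((X_i-X_j)(X_i-X_j)^{\top}\bigr)\), so that the squared distance under a metric \(\mathbf{M}\) is linear in \(\mathrm{vec}(\mathbf{M})\) and an \(\ell_1\) penalty on \(\mathrm{vec}(\mathbf{M})\) gives an ordinary Lasso problem. The pairwise responses and the helper pair_features below are illustrative assumptions, and constraints such as symmetry or positive semi-definiteness of \(\mathbf{M}\) are not enforced.

# Sketch only: l1-regularized metric learning cast as a Lasso problem.
# The pair responses y and the helper names are illustrative, not from the paper.
import numpy as np
from itertools import combinations
from sklearn.linear_model import Lasso

def pair_features(X):
    # For each pair (i, j), vectorize the outer product of X_i - X_j,
    # so that <vec(M), z_ij> = (X_i - X_j)^T M (X_i - X_j).
    pairs = list(combinations(range(X.shape[0]), 2))
    Z = np.array([np.outer(X[i] - X[j], X[i] - X[j]).ravel() for i, j in pairs])
    return Z, pairs

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))                                # n = 60 observations, p = 10 variables
M_true = np.zeros((10, 10))
M_true[0, 0] = M_true[1, 1] = 1.0                            # sparse ground-truth metric
Z, pairs = pair_features(X)
y = Z @ M_true.ravel() + 0.1 * rng.normal(size=len(pairs))   # noisy pairwise responses

fit = Lasso(alpha=0.05, fit_intercept=False).fit(Z, y)
M_hat = fit.coef_.reshape(10, 10)                            # fitted (sparse) metric matrix

With informative responses, most of the mass of M_hat concentrates on the few entries that are active in M_true, which is the sparsity behaviour analysed in the paper.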

References
Bertsekas DP (1999) Nonlinear programming, 2nd edn. Athena Scientific, Nashua, New Hampshire
Bian W, Tao D (2011) Learning a distance metric by empirical loss minimization. In: Proceedings of the twenty-second international joint conference on artificial intelligence—vol 2, IJCAI’11. Association for the Advancement of Artificial Intelligence Press, pp 1186–1191
Bian W, Tao D (2012) Constrained empirical risk minimization framework for distance metric learning. IEEE Trans Neural Netw Learn Syst 23(8):1194–1205
Bickel P, Ritov Y, Tsybakov A (2009) Simultaneous analysis of Lasso and Dantzig selector. Ann Stat 37:1705–1732
Breiman L (2001) Random forests. Mach Learn 45:5–32
Bühlmann P, van de Geer S (2011) Statistics for high-dimensional data. Springer, Berlin
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml
Friedman JH, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Goldberger J, Roweis S, Hinton G, Salakhutdinov R (2004) Neighbourhood components analysis. In: Advances in neural information processing systems 17. MIT Press, Cambridge, pp 513–520
Hix S, Noury A, Roland G (2006) Dimensions of politics in the European Parliament. Am J Polit Sci 50:494–511
Negahban S, Wainwright MJ (2011) Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Ann Stat 39:1069–1097
Soifer A, Grünbaum B, Johnson P, Rousseau C (2008) The mathematical coloring book: mathematics of coloring and the colorful life of its creators. Springer, New York
van de Geer S, Bühlmann P (2009) On the conditions used to prove oracle results for the Lasso. Electron J Stat 3:1360–1392
Weinberger KQ, Saul LK (2009) Distance metric learning for large margin nearest neighbor classification. J Mach Learn Res 10:207–244
Xing EP, Ng AY, Jordan MI, Russell S (2002) Distance metric learning, with application to clustering with side-information. In: Advances in neural information processing systems 15. MIT Press, Cambridge, pp 505–512
Appendix: Proofs
1.1 Additional lemmata
Lemma 6.1
Suppose Assumptions 3.1 and 3.2 hold and suppose the covariates are bounded, that is, there exists some \(c>0\) such that \( |X_{i,h}|\le c \) for all \( h = 1, \ldots , p \) and \( i=1, \ldots ,n\). Let
Then, for \(N={n(n-1)/2}\),
Proof
The main difficulty here is that the pairwise distances are not independent. When the number of observations \(n\) is odd, we can pick a set of \((n-1)/{2}\) pairwise distances whose indices do not overlap and which are therefore independent; if \(n\) is even, we can pick such a set of \(n/2\) pairwise distances. For this set of pairwise distances, we can show that the contribution from the noise term is bounded with high probability. We can then show that the contribution of the noise to the set of all pairwise distances is also bounded by applying a union bound.
We first consider the case when \(n\) is odd. We use a result from graph theory to prove our claim. A complete graph with \(n\) vertices is a graph in which there is an edge between any two vertices; a pairwise distance can thus be represented as an edge of the complete graph. An edge colouring of a graph is an assignment of a colour to each edge such that no two adjacent edges share the same colour. This implies that pairwise distances corresponding to edges of the same colour are independent under Assumption 3.2. The edge chromatic number is the minimum number of colours required to colour the edges of a graph. A complete graph with \(n\) vertices has edge chromatic number \(n\) when \(n\) is odd and \(n-1\) when \(n\) is even (a constructive proof is given in Soifer et al. 2008, p. 133), from which we can deduce that there is a colouring of the edges such that each colour class consists of exactly \((n-1)/2\) edges when \(n\) is odd. Therefore, the set of all pairwise distances can be partitioned into \(n\) parts \(G_1,\ldots ,G_n\), each consisting of \((n-1)/{2}\) pairwise distances with no overlapping indices. Such a partition always exists because a colouring of the edges of the complete graph with \(n\) colours yields it, with the edges of each colour forming one part; an explicit construction is sketched below.
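As an illustration only (the paper gives no explicit construction beyond the reference to Soifer et al. 2008), the classical round-robin or "circle method" schedule realizes such an edge colouring: it partitions all pairs \(\{i,j\}\) into classes of vertex-disjoint pairs, with \(n\) classes of \((n-1)/2\) pairs when \(n\) is odd (via a dummy vertex) and \(n-1\) classes of \(n/2\) pairs when \(n\) is even. The function name below is hypothetical.

# Sketch: round-robin (circle-method) partition of the edges of K_n into
# colour classes of vertex-disjoint pairs; illustrative only.
def pair_partition(n):
    verts = list(range(n)) + ([None] if n % 2 else [])   # dummy vertex when n is odd
    m = len(verts)
    rounds = []
    for r in range(m - 1):
        order = [verts[0]] + [verts[1 + (r + k) % (m - 1)] for k in range(m - 1)]
        matching = [(order[k], order[m - 1 - k]) for k in range(m // 2)]
        rounds.append([tuple(sorted(e)) for e in matching if None not in e])
    return rounds

parts = pair_partition(7)                 # 7 colour classes of 3 disjoint pairs each
all_pairs = [e for g in parts for e in g]
assert len(all_pairs) == len(set(all_pairs)) == 7 * 6 // 2   # every pair appears exactly once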
Let \(V_{h,\ell ,G_k} = \frac{2}{n-1} \sum _{ \{i,j\} \in G_k} 2\varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })\).
By the boundedness of \(X\) and \(\varepsilon \),
Therefore, by Hoeffding’s inequality,
By a union bound over all choices of \(h\) and \(\ell \),
Finally, by a simple union bound over all parts \(G_k\),
And thus
When \(n\) is even, we can decompose the complete graph into \(n-1\) sets of disjoint pairs, each containing \(n/2\) pairwise distances. In this case we can derive a slightly stronger bound than when \(n\) is odd. \(\square \)
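For reference, and not reproduced from the paper, the Hoeffding bound invoked above has the standard form for independent, mean-zero summands: if \(Z_1,\ldots,Z_m\) are independent with \(\mathbb{E}Z_k=0\) and \(|Z_k|\le B_k\), then
\[
\mathbb{P}\Bigl(\Bigl|\sum_{k=1}^{m} Z_k\Bigr| > t\Bigr) \;\le\; 2\exp\Bigl(-\frac{t^2}{2\sum_{k=1}^{m} B_k^2}\Bigr).
\]
Applied to each \(V_{h,\ell,G_k}\), whose summands are bounded in terms of \(c\) and the bound on \(\varepsilon\), and followed by a union bound over the \(p^2\) choices of \((h,\ell)\) and the at most \(n\) colour classes \(G_k\), the single tail bound is multiplied by \(np^2\); this is where the logarithmic dependence on \(p\) in the resulting deviation bound typically enters.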
The proofs of Lemmata 6.2 and 6.3 follow closely those given in Chapter 6 of Bühlmann and van de Geer (2011), with modifications to handle the matrix notation.
Lemma 6.2
Assume that \(\max _{1 \le h\le p, 1\le \ell \le p}|2 N^{-1} \sum _{ij} \varepsilon _{i,j} (X_{ih}-X_{jh})(X_{i\ell } - X_{j\ell })| < \lambda _0\). Then
Proof
Since \(\hat{\mathbf{M}}\) minimizes \(L(\mathbf{M})\), we have that \(L(\hat{\mathbf{M}}) \le L(\mathbf{M}^{*})\) and thus
which completes the proof. \(\square \)
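For readers unfamiliar with this step, here is a sketch of the standard "basic inequality" argument in hypothetical notation not taken from the paper (writing \(D_{ij}=(X_i-X_j)(X_i-X_j)^{\top}\), \(\langle \mathbf{A},\mathbf{B}\rangle = \mathrm{tr}(\mathbf{A}^{\top}\mathbf{B})\) and \(y_{ij}\) for the pairwise response; cf. Chapter 6 of Bühlmann and van de Geer 2011). Since \(\hat{\mathbf{M}}\) minimizes the penalized criterion,
\[
\frac{1}{N}\sum_{i<j}\bigl(y_{ij}-\langle\hat{\mathbf{M}},D_{ij}\rangle\bigr)^2+\lambda\Vert\hat{\mathbf{M}}\Vert_1
\;\le\;
\frac{1}{N}\sum_{i<j}\bigl(y_{ij}-\langle\mathbf{M}^{*},D_{ij}\rangle\bigr)^2+\lambda\Vert\mathbf{M}^{*}\Vert_1,
\]
and substituting \(y_{ij}=\langle\mathbf{M}^{*},D_{ij}\rangle+\varepsilon_{i,j}\) and rearranging gives
\[
\frac{1}{N}\sum_{i<j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle^2
\;\le\;
\frac{2}{N}\sum_{i<j}\varepsilon_{i,j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle
+\lambda\Vert\mathbf{M}^{*}\Vert_1-\lambda\Vert\hat{\mathbf{M}}\Vert_1,
\]
where, by Hölder's inequality, the first term on the right is at most \(\lambda_0\Vert\hat{\mathbf{M}}-\mathbf{M}^{*}\Vert_1\) on the event assumed in the lemma.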
Lemma 6.3
Assume that \(\max_{1 \le h \le p, 1 \le \ell \le p}|2 N^{-1} \sum_{ij} \varepsilon_{i,j} (X_{ih}-X_{jh})(X_{i\ell} - X_{j\ell})| < \lambda_0\) holds. By picking \(\lambda \ge 2 \lambda_0\),
Proof
By the triangle inequality,
Also note that, since \(\mathbf{M}^{*}_{S^c} = 0\), we can expand \(\Vert \hat{\mathbf{M}} - \mathbf{M}^{*}\Vert = \Vert \hat{\mathbf{M}}_S - \mathbf{M}^{*}_S \Vert + \Vert \hat{\mathbf{M}}_{S^c}\Vert .\) We can hence further extend the result of Lemma 6.2,
which completes the proof. \(\square \)
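As a point of comparison only, the analogous statement for the standard Lasso (Lemma 6.3 in Bühlmann and van de Geer 2011) concludes, on the same event and with \(\lambda\ge 2\lambda_0\), that in the hypothetical notation above
\[
\frac{2}{N}\sum_{i<j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle^2+\lambda\Vert\hat{\mathbf{M}}_{S^c}\Vert_1
\;\le\;
3\lambda\Vert\hat{\mathbf{M}}_S-\mathbf{M}^{*}_S\Vert_1,
\]
which in particular places \(\hat{\mathbf{M}}-\mathbf{M}^{*}\) in the cone \(\Vert\hat{\mathbf{M}}_{S^c}\Vert_1\le 3\Vert\hat{\mathbf{M}}_S-\mathbf{M}^{*}_S\Vert_1\) on which the compatibility condition is stated; the constants in the present setting may differ.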
1.2 Proof of Theorem 3.4
Finally, we can prove the performance guarantee for the regularized method. By Lemma 6.1, it holds with probability at least \(1 - 2 \exp (-t^2/2)\) that
Hence, using Lemma 6.3, it holds with probability \(1 - 2 \exp (-t^2/2)\) that
Using the compatibility condition, there exists some \(\psi >0\) such that
Hence
which completes the proof. \(\square \)
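The final calculation mirrors Theorem 6.1 of Bühlmann and van de Geer (2011); the following sketch, again in the hypothetical notation above, indicates the shape of the argument. With \(s=|S|\), the compatibility condition with constant \(\psi\) gives
\[
\Vert\hat{\mathbf{M}}_S-\mathbf{M}^{*}_S\Vert_1^2 \;\le\; \frac{s}{\psi^2}\cdot\frac{1}{N}\sum_{i<j}\langle\hat{\mathbf{M}}-\mathbf{M}^{*},D_{ij}\rangle^2,
\]
and combining this with the display from Lemma 6.3 via the elementary inequality \(4uv\le u^2+4v^2\) yields a bound of the form \(\Vert\hat{\mathbf{M}}-\mathbf{M}^{*}\Vert_1 \le 4\lambda s/\psi^2\), up to the constants used in the paper.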