Abstract
Metric learning aims to learn a distance metric that properly measures the similarity between pairwise examples. Most existing algorithms are designed to reduce intra-class distances while enlarging inter-class distances by introducing a margin between the two. However, such learning objectives may yield a boundless (distance) metric space, because the enlargement of inter-class distances is usually unconstrained. In this case, excessively enlarged inter-class distances shrink the ratio of the margin to the whole distance range (i.e., the margin-range-ratio), which works against the original large-margin purpose of discriminating the similarities of data pairs. To address this issue, we propose a new boundary-restricted metric (BRM), which confines the metric space via a restriction function. This restriction function is monotonic and gradually converges to an upper bound, which suppresses excessively large distances between data pairs while maintaining reliable discriminability. The learned metric is thereby restricted to a finite region, avoiding the reduction of the margin-range-ratio. Theoretically, we prove that BRM tightens the generalization error bound of the traditional learning model without sacrificing fitting capability or destroying the topological properties of the learned metric, which implies that BRM achieves a good bias-variance tradeoff for the metric learning task. Extensive experiments on toy data and real-world datasets validate the superiority of our approach over state-of-the-art metric learning methods.
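The boundary-restriction mechanism described above can be illustrated with a minimal sketch. The specific restriction function used by the paper is not shown in this excerpt, so \(\tanh\) is assumed here purely as an example of a monotonic function that converges to an upper bound: applying it to each coordinate-wise difference caps every term, so the total distance lies in a finite range no matter how far apart the embeddings are.

```python
import numpy as np

def brm_distance(x, y, R=np.tanh):
    """Boundary-restricted distance: apply a monotonic, bounded
    restriction function R to each coordinate-wise difference.
    With R = tanh, each term is capped at 1, so the total distance
    of h-dimensional embeddings lies in the finite range [0, h)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sum(R(np.abs(x - y))))

a = np.zeros(4)
b = np.full(4, 10.0)   # a very distant pair: distance saturates near the bound 4
c = np.full(4, 0.5)    # a nearby pair: distance stays well below the bound
print(brm_distance(a, b), brm_distance(a, c))
```

With an unrestricted metric, the distant pair could push the distance range arbitrarily wide; under the restriction, it saturates near the bound while nearby pairs remain clearly separated from it.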
Data availability
(All data used in this work are publicly available.)
Code availability
The code of the proposed method will be released after publication.
Notes
Here h adaptively scales the oversized measurement of very high-dimensional projected features.
To include the manifold-based metric, we let \(\varvec{\varphi }({\varvec{x}}) = \varvec{\varphi }(\text {vec}({\varvec{X}})) = d\cdot \text {vec}({\varvec{M}}^{\top } {\varvec{g}}({\varvec{X}}){\varvec{M}}) \in {\mathbb {R}}^{d^{2}}\).
A differentiable function \(f({\varvec{a}})\) defined on the domain \({\mathcal {C}}\) is Lipschitz-continuous if and only if there exists \(L>0\) such that \(\vert f({\varvec{a}})- f({\varvec{b}})\vert \le L\Vert {\varvec{a}}-{\varvec{b}}\Vert _{2}\) for any \({\varvec{a}},{\varvec{b}}\in {\mathcal {C}}\).
The distance function \(D(\cdot , \cdot )\) is a metric if and only if it satisfies the four conditions \(\forall \varvec{\alpha }_{1},\varvec{\alpha }_{2},\varvec{\alpha }_{3} \in {\mathbb {R}}^{d}\): (I). Non-negativity: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) \ge 0\); (II). Symmetry: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) = D(\varvec{\alpha }_{2},\varvec{\alpha }_{1})\); (III). Triangle: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) + D(\varvec{\alpha }_{2},\varvec{\alpha }_{3}) \ge D(\varvec{\alpha }_{1},\varvec{\alpha }_{3})\); (IV). Coincidence: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) = 0 \Leftrightarrow \varvec{\alpha }_{1} = \varvec{\alpha }_{2}\).
For simplicity, here \({\overline{\sigma }}^{2}=\frac{1}{h}\sum _{i=1}^{h}\sigma _{i}^{2}\) and \({\overline{\mu }}=\frac{1}{h}\sum _{i=1}^{h}\mu _{i}\).
References
Alpaydin, E. (2020). Introduction to machine learning. MIT Press.
Asuncion, A., & Newman, D. (2007). UCI machine learning repository.
Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In ICML (pp. 11–18).
Berrendero, J. R., Bueno-Larraz, B., & Cuevas, A. (2020). On mahalanobis distance in functional settings. Journal of Machine Learning Research, 21(9), 1–33.
Bian, W., & Tao, D. (2012). Constrained empirical risk minimization framework for distance metric learning. IEEE Transactions on Neural Networks and Learning Systems, 23(8), 1194–1205.
Biswas, A., & Parikh, D. (2013). Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR (pp. 644–651).
Brown, M., Hua, G., & Winder, S. (2010). Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 43–57.
Carlile, B., Delamarter, G., Kinney, P., Marti, A., & Whitney, B. (2017). Improving deep learning by inverse square root linear units (isrlus). arXiv:1710.09967
Chen, S., Gong, C., Yang, J., Tai, Y., Hui, L., & Li, J. (2019a). Data-adaptive metric learning with scale alignment. In AAAI (pp. 3347–3354).
Chen, S., Luo, L., Yang, J., Gong, C., Li, J., & Huang, H. (2019b). Curvilinear distance metric learning. In NeurIPS (pp. 4223–4232).
Chu, X., Lin, Y., Wang, Y., Wang, X., Yu, H., Gao, X., & Tong, Q. (2020). Distance metric learning with joint representation diversification. In ICML (pp. 1962–1973).
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I.S. (2007). Information-theoretic metric learning. In ICML (pp. 209–216).
Dong, M., Wang, Y., Yang, X., & Xue, J. H. (2019). Learning local metrics and influential regions for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 1522.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Ermolov, A., Mirvakhabova, L., Khrulkov, V., Sebe, N., & Oseledets, I. (2022). Hyperbolic vision transformers: Combining improvements in metric learning. In CVPR (pp. 7409–7419).
Fazlyab, M., Robey, A., Hassani, H., Morari, M., & Pappas, G. (2019). Efficient and accurate estimation of Lipschitz constants for deep neural networks. In NeurIPS.
Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85.
Geng, C., & Chen, S. (2018). Metric learning-guided least squares classifier learning. IEEE Transactions on Neural Networks and Learning Systems, 29(12), 6409–6414.
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS (pp. 249–256).
Goldberger, J., Hinton, G. E., Roweis, S. T., & Salakhutdinov, R. R. (2005). Neighbourhood components analysis. In NeurIPS (pp. 513–520).
Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial networks. In NeurIPS (pp. 2672–2680).
Harandi, M., Salzmann, M., & Hartley, R. (2017). Joint dimensionality reduction and metric learning: A geometric take. In ICML (pp. 1404–1413).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).
Horev, I., Yger, F., & Sugiyama, M. (2017). Geometry-aware principal component analysis for symmetric positive definite matrices. Machine Learning, 106, 493–522.
Huang, Z., Wang, R., Shan, S., Van Gool, L., & Chen, X. (2018). Cross Euclidean-to-Riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2827–2840.
Huo, Z., Nie, F., & Huang, H. (2016). Robust and effective metric learning using capped trace norm: Metric learning via capped trace norm. In SIGKDD (pp. 1605–1614).
Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML (pp. 448–456).
Kar, P., Narasimhan, H., & Jain, P. (2014). Online and stochastic gradient methods for non-decomposable loss functions. In NeurIPS.
Kelley, J. L. (2017). General topology. Courier Dover Publications.
Kim, S., Kim, D., Cho, M., & Kwak, S. (2020). Proxy anchor loss for deep metric learning. In CVPR (pp. 3238–3247).
Kim, Y., & Park, W. (2021). Multi-level distance regularization for deep metric learning. In AAAI (pp. 1827–1835).
Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In 3dRR.
Kwon, Y., Kim, W., Sugiyama, M., & Paik, M. C. (2020). Principled analytic classifier for positive-unlabeled learning via weighted integral probability metric. Machine Learning, 66, 513–532.
Law, M., Liao, R., Snell, J., & Zemel, R. (2019). Lorentzian distance learning for hyperbolic representations. In ICML (pp. 3672–3681).
Lebanon, G. (2006). Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 497–508.
Li, P., Li, Y., Xie, H., & Zhang, L. (2022). Neighborhood-adaptive structure augmented metric learning. In AAAI.
Li, Q., Haque, S., Anil, C., Lucas, J., Grosse, R. B., & Jacobsen, J. H. (2019). Preventing gradient attenuation in Lipschitz constrained convolutional networks. In NeurIPS.
Lim, D., Lanckriet, G., & McFee, B. (2013). Robust structural metric learning. In ICML (pp. 615–623).
Lu, J., Xu, C., Zhang, W., Duan, L. Y., & Mei, T. (2019). Sampling wisely: Deep image embedding by top-k precision optimization. In ICCV (pp. 7961–7970).
Luo, L., Xu, J., Deng, C., & Huang, H. (2019). Robust metric learning on grassmann manifolds with generalization guarantees. In AAAI (pp. 4480–4487).
Meyer, C. D. (2000). Matrix analysis and applied linear algebra (vol. 71). SIAM.
Montgomery, D. C., & Runger, G. C. (2010). Applied statistics and probability for engineers. Wiley.
Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR (pp. 4004–4012).
Paassen, B., Gallicchio, C., Micheli, A., & Hammer, B. (2018). Tree edit distance learning via adaptive symbol embeddings. In ICML.
Perrot, M., & Habrard, A. (2015). Regressive virtual metric learning. In NeurIPS (pp. 1810–1818).
Qian, Q., Shang, L., Sun, B., Hu, J., Li, H., & Jin, R. (2019). Softtriple loss: Deep metric learning without triplet sampling. In ICCV (pp. 6450–6458).
Ralaivola, L., Szafranski, M., & Stempfel, G. (2010). Chromatic pac-bayes bounds for non-iid data: Applications to ranking and stationary \(\beta\)-mixing processes. Journal of Machine Learning Research, 11, 1927–1956.
Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In ICML (pp. 314–323).
Rudin, W. (1964). Principles of mathematical analysis. McGraw-Hill.
Seidenschwarz, J. D., Elezi, I., & Leal-Taixe, L. (2021). Learning intra-batch connections for deep metric learning. In ICML.
Sohn, K. (2016). Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS (pp. 1857–1865).
Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., & Beyer, L. (2021). How to train your vit? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270
Suarez, J. L., Garcia, S., & Herrera, F. (2018). A tutorial on distance metric learning: Mathematical foundations, algorithms and software. arXiv:1812.05944
Suárez, J. L., Garcia, S., & Herrera, F. (2020). pydml: A python library for distance metric learning. Journal of Machine Learning Research, 21(96), 1–7.
Suarez, J. L., Garcia, S., & Herrera, F. (2021). Ordinal regression with explainable distance metric learning based on ordered sequences. Machine Learning, 66, 2729–2762.
Ting, K. M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., & Zhou, Z. H. (2019). Lowest probability mass neighbour algorithms: Relaxing the metric constraint in distance-based neighbourhood algorithms. Machine Learning, 108(2), 331–376.
Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science (vol. 47). Cambridge University Press.
Wang, H., Nie, F., & Huang, H. (2014). Robust distance metric learning via simultaneous l1-norm minimization and maximization. In ICML (pp. 1836–1844).
Wang, X., Han, X., Huang, W., Dong, D., & Scott, M. R. (2019). Multi-similarity loss with general pair weighting for deep metric learning. In CVPR (pp. 173–182).
Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In NeurIPS (pp. 1473–1480).
Weisstein, E. W. (2002). Inverse trigonometric functions. https://mathworld.wolfram.com/
Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., & Perona, P. (2010). Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology.
Xia, P., Zhang, L., & Li, F. (2015). Learning similarity with cosine similarity ensemble. Information Sciences, 307, 39–52.
Xie, P., Wu, W., Zhu, Y., & Xing, E. (2018). Orthogonality-promoting distance metric learning: Convex relaxation and theoretical analysis. In ICML (pp. 2404–2413).
Xing, E. P., Jordan, M. I., Russell, S. J., & Ng, A. (2003). Distance metric learning with application to clustering with side-information. In NeurIPS (pp. 521–528).
Xu, J., Luo, L., Deng, C., & Huang, H. (2018). Bilevel distance metric learning for robust image recognition. In NeurIPS (pp. 4198–4207).
Xu, X., Yang, Y., Deng, C., & Zheng, F. (2019). Deep asymmetric metric learning via rich relationship mining. In CVPR (pp. 4076–4085).
Yan, J., Yang, E., Deng, C., & Huang, H. (2022). Metricformer: A unified perspective of correlation exploring in similarity learning. In NeurIPS.
Yang, J., Luo, L., Qian, J., Tai, Y., Zhang, F., & Xu, Y. (2016). Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 156–171.
Yang, P., Huang, K., & Liu, C. L. (2013). Geometry preserving multi-task metric learning. Machine Learning, 66, 133–175.
Yang, X., Zhou, P., & Wang, M. (2018). Person reidentification via structural deep metric learning. IEEE Transactions on Neural Networks and Learning Systems, 30(10), 2987–2998.
Ye, H. J., Zhan, D. C., & Jiang, Y. (2019). Fast generalization rates for distance metric learning. Machine Learning, 66, 267–295.
Ye, H. J., Zhan, D. C., Jiang, Y., Si, X. M., & Zhou, Z. H. (2019). What makes objects similar: A unified multi-metric learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5), 1257–1270.
Yoshida, T., Takeuchi, I., & Karasuyama, M. (2021). Distance metric learning for graph structured data. Machine Learning, 66, 1765–1811.
Yu, B., & Tao, D. (2019). Deep metric learning with tuplet margin loss. In ICCV (pp. 6490–6499).
Zadeh, P., Hosseini, R., & Sra, S. (2016). Geometric mean metric learning. In ICML (pp. 2464–2471).
Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. In ICCV (pp. 4353–4361).
Zbontar, J., & LeCun, Y. (2016). Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1), 2287–2318.
Zhang, B., Zheng, W., Zhou, J., & Lu, J. (2022). Attributable visual similarity learning. In CVPR.
Zhang, S., Tay, Y., Yao, L., Sun, A., & An, J. (2019a). Next item recommendation with self-attentive metric learning. In AAAI.
Zhang, Y., Zhong, Q., Ma, L., Xie, D., & Pu, S. (2019b). Learning incremental triplet margin for person re-identification. In AAAI (pp. 9243–9250).
Zhu, P., Cheng, H., Hu, Q., Wang, Q., & Zhang, C. (2018). Towards generalized and efficient metric learning on riemannian manifold. In IJCAI (pp. 192–199).
Funding
(S.C., G.N., and M.S. were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. M.S. was also supported by the Institute for AI and Beyond, UTokyo. C.G., J.L., and J.Y. were supported by NSF of China (Nos: U1713208, 61973162, 62072242), NSF of Jiangsu Province (No: BZ2021013), NSF for Distinguished Young Scholar of Jiangsu Province (No: BK20220080), and the Fundamental Research Funds for the Central Universities (Nos: 30920032202, 30921013114).)
Author information
Authors and Affiliations
Contributions
(All authors contributed to the algorithm design and analysis. The first draft of the manuscript was written by Shuo Chen, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.)
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Editor: Zhi-Hua Zhou.
Appendix A
This section provides detailed proofs for all theorems in Sect. 3.1 and Sect. 4.
1.1 A.1 Proof for Theorem 1
We first introduce the following Lindeberg central limit theorem (CLT) as a Lemma to prove our Theorem 1.
Lemma 1
(Lindeberg CLT (Vershynin, 2018)) Suppose \(\{X_{1},\ldots ,X_{h}\}\) is a sequence of independent random variables, each with finite expected value \(\mu _{i}\) and variance \(\sigma _{i}^{2}\). If for any given \(\epsilon >0\)
then the distribution of the standardized sum \(1/S_{h}\sum _{i=1}^{h}(X_{i}-\mu _{i})\) converges to the standard normal distribution \({\mathcal {N}}(0,1)\), where \(S_{h}^{2}=\sum _{i=1}^{h}\sigma _{i}^{2}\) and \(1_{\{\cdot \}}\) is the indicator function.
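As a numerical illustration of Lemma 1 (not part of the original proof), the Lindeberg condition holds trivially for uniformly bounded i.n.i.d. variables, and the standardized sum is empirically close to \({\mathcal {N}}(0,1)\); the uniform widths below are arbitrary choices made only for this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
h, trials = 2000, 5000

# i.n.i.d. but uniformly bounded variables: X_i ~ Uniform(0, b_i) with
# varying widths b_i; boundedness makes the Lindeberg condition hold.
widths = rng.uniform(0.5, 2.0, size=h)
mu = widths / 2.0           # E[X_i]
var = widths ** 2 / 12.0    # Var[X_i]
S_h = np.sqrt(var.sum())

X = rng.uniform(0.0, widths, size=(trials, h))
Z = (X - mu).sum(axis=1) / S_h   # standardized sums (1/S_h) * sum_i (X_i - mu_i)

# Empirically, Z has mean close to 0 and standard deviation close to 1:
print(round(Z.mean(), 2), round(Z.std(), 2))
```

This is exactly the setting exploited in the proof of Theorem 1: the distance terms \(X_i\) are bounded by \(b^{p}({\mathcal {H}}_{v}^{u})\), so the Lindeberg condition is satisfied for sufficiently large h.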
Based on the above conclusion on i.n.i.d. random variables \(X_{1}, X_{2}, \ldots , X_{h}\), we prove Theorem 1 by investigating the probability that the distance value exceeds a given upper bound. We show that this probability is mainly determined by the boundary of the metric space.
Proof
We let \(X_{i} = \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p}\) for \(i = 1,2,\ldots ,h\) and we can obtain that \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}] \le {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert > b^{p}({\mathcal {H}}_{v}^{u}) - \mu _{i}\}}] = 0\) for sufficiently large h. Specifically, we denote \(V^{2}=1/h\sum _{i=1}^{h}\sigma _{i}^{2}>0\) and let \(h=\left\lceil (b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i})^{2}/(\epsilon V)^{2}\right\rceil\), and we have that
where \(b^{p}({\mathcal {H}}_{v}^{u})\) is the upper-bound of \(X_{i}\) so that \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\) is always non-negative (\(\mu _{i}\) is the mean of \(X_{i}\)). Then, if \(\vert X_{i}-\mu _{i}\vert <b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\), we have
If \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\le \vert X_{i}-\mu _{i} \vert \le \epsilon S_{h}\), we have
Finally, if \(\epsilon S_{h}< \vert X_{i}-\mu _{i}\vert\), we have
and thus we have \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}] \le {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}]\) for sufficiently large h. Furthermore, as \(\vert X_{i}-\mu _{i}\vert\) is always smaller than its upper bound \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\), we have \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}]=0\). Therefore, we have
which implies that the Lindeberg condition in Eq. (17) is satisfied. Therefore, the standardized sum \(\sum _{i=1}^{h}\vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p}/{\widetilde{\sigma }}\) converges to the standard normal distribution.Footnote 5 Then, for any given \(\epsilon _{1}>0\), there exists a sufficiently large h such that
where \(Z \in {\mathbb {R}}\) and \(\phi ( \cdot )\) is the cumulative distribution function of the standard normal distribution. Meanwhile, we have
so for any \(\epsilon _{1} > 0\), there exists sufficiently large h such that
As \(\phi (\cdot )\) is monotonically increasing and \(\epsilon _{1}\) is a given sufficiently small number, \(\text {sup}_{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}}\left\{ \text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \right\}\) is dominated by \(\sqrt{h}(u^{p} - {\overline{\mu }})/{\overline{\sigma }}\). According to the law of large numbers (Vershynin, 2018), it follows that for any \(\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}\) there exists a sufficiently large N such that
holds with probability at least \(1-\epsilon _{2}\), where the sample mean \(m_{N} = (1/N) \sum _{j=1}^{N} d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j})\) and sample variance \(\varSigma _{N}^{2} = (1/N) \sum _{j=1}^{N} (d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j})-m_{N})^{2}\). Then there exist sufficiently small \(\epsilon _{1}\) and \(\epsilon _{2}\) such that for any given \(\delta \in (0, \min (1,v^{p} - u^{p}))\)
Now we only have to consider the minimal value of the positive term \(\left( m_{N} - u^{p}\right) /\varSigma _{N}\) under the constraint \(d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j}) > v > u\) for \(j = 1,2,\ldots ,N\). To be specific, it holds that
where \(t = m_{N} - v^{p}\in (0, \frac{1}{2}(b^{p}({\mathcal {H}}_{v}^{u})-v^{p})]\), and \(m_{N}\) is necessarily included in \((v^{p}, \frac{1}{2}(b^{p}({\mathcal {H}}_{v}^{u})+v^{p}))\). By combining the results in Eqs. (27) and (28), we thus get
where \(\psi \left[ b({\mathcal {H}}_{v}^{u})\right] = [\sqrt{h}(1 + 2(v^{p} - u^{p})/(v^{p} - b^{p}({\mathcal {H}}_{v}^{u})))]\). Here \(\psi \left[ b({\mathcal {H}}_{v}^{u})\right]\) is a monotonically increasing function w.r.t. the boundary \(b({\mathcal {H}}_{v}^{u})\). The proof is completed. \(\square\)
1.2 A.2 Proof for Theorem 2
Proof
The Non-negativity and Symmetry can be trivially achieved by the definition of BRM. Here we prove that \({\mathcal {D}}_{\varvec{\varphi }}(\cdot ,\cdot )\) has the triangle property when \({\mathcal {R}}'\) is monotonically decreasing. Specifically, for any given \(\varvec{\alpha },\varvec{\beta },{\varvec{\gamma }} \in {\mathbb {R}}^{d}\), we invoke the mean value theorem (Rudin, 1964) and have that
where the real numbers \(Q_{i} = \vert \varphi _{i}(\varvec{\alpha }) - \varphi _{i}(\varvec{\beta })\vert\), \(T_{i} = \vert \varphi _{i}(\varvec{\beta }) - \varphi _{i}(\varvec{\gamma })\vert\), \(\xi _{1} \in [0,\min (Q_{i}, T_{i})]\), \(\xi _{2} \in [0,\max (Q_{i}, T_{i})]\), and \(\varTheta (\xi _{2}) = (1/2)\min (Q_{i}^{2},T_{i}^{2}){\mathcal {R}}''(\xi _{2}) \le 0\). Then we have that
Finally, if \({\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\beta }) = 0\), then for any given \(k \in {\mathbb {N}}_{h}\) we have \({\mathcal {R}}(\vert \varphi _{k}(\varvec{\alpha })-\varphi _{k}(\varvec{\beta })\vert )=0\), and thus it holds that \([\varphi _{1}(\varvec{\alpha }),\ldots ,\varphi _{h}(\varvec{\alpha })] = [\varphi _{1}(\varvec{\beta }),\ldots ,\varphi _{h}(\varvec{\beta })]\). By further invoking the invertibility of the mapping \(\varvec{\varphi }\), we obtain \(\varvec{\alpha } = \varvec{\beta }\), which completes the proof. \(\square\)
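The four metric properties established in this proof can be spot-checked numerically. The sketch below assumes \(\tanh\) as a concrete concave, monotonic restriction function (a choice made here for illustration only, with the identity feature mapping) and tests the axioms on random triples:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(a, b):
    # Restricted distance with the concave, monotonic choice R = tanh;
    # concavity of R is what drives the triangle inequality in Theorem 2.
    return float(np.sum(np.tanh(np.abs(a - b))))

for _ in range(1000):
    a1, a2, a3 = rng.normal(size=(3, 5))
    assert D(a1, a2) >= 0.0                              # (I)  non-negativity
    assert np.isclose(D(a1, a2), D(a2, a1))              # (II) symmetry
    assert D(a1, a2) + D(a2, a3) >= D(a1, a3) - 1e-12    # (III) triangle
    assert D(a1, a1) == 0.0                              # (IV) coincidence
print("all four metric axioms hold on 1000 random triples")
```

The triangle inequality here follows exactly the argument above: \(\tanh\) is monotonic and subadditive on \([0,\infty)\), so each coordinate-wise restricted term satisfies the inequality, and summing preserves it.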
1.3 A.3 Proof for Theorem 3
Proof
We let \(\widehat{\varvec{\varphi }}(\cdot ) = c\varvec{\varphi }(\cdot )\) (\(c > 0\)) and employ the Taylor expansion (Rudin, 1964) on each restriction function, and we get
According to the homogeneity of vector norm (Meyer, 2000), it follows that there exists \(a_{1} > a_{0} > 0\) such that \(\forall {\varvec{x}},\widehat{{\varvec{x}}}\in {\mathbb {R}}^{d}\)
so we have that
Then for \(u=a_{0}/{\mathcal {R}}'(0)\) and \(v=a_{1}/{\mathcal {R}}'(0)\), we have
which completes the proof by letting c be sufficiently small. \(\square\)
1.4 A.4 Proof for Theorem 4
We first introduce the following McDiarmid's inequality as a lemma to prove our Theorem 4.
Lemma 2
(McDiarmid Inequality (Meyer, 2000)) For independent random variables \(t_{1},t_{2},\ldots ,t_{n} \in {\mathcal {T}}\) and a given function \(\omega :{\mathcal {T}}^{n} \rightarrow {\mathbb {R}}\), if \(\forall v_{i}' \in {\mathcal {T}}\) (\(i = 1,2,\ldots ,n\)), the function satisfies
then for any given \(\mu > 0\), it holds that \(\text {pr}\{\vert \omega (t_{1},\ldots ,t_{n}) - {\mathbb {E}}[\omega (t_{1},\ldots ,t_{n})]\vert > \mu \} \le 2\text {e}^{-2\mu ^{2}/\sum _{i=1}^{n}\rho _{i}^{2}}\).
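A quick numerical sanity check of Lemma 2 (illustrative only, not part of the proof): for the sample mean \(\omega\) of n variables in \([0,1]\), changing one coordinate moves \(\omega\) by at most \(\rho_{i} = 1/n\), so the bound becomes \(2\text {e}^{-2\mu ^{2}n}\), and the empirical deviation frequency stays below it:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, mu_dev = 200, 20000, 0.1

# omega = sample mean of n variables in [0, 1]; changing one coordinate
# moves omega by at most rho_i = 1/n, so McDiarmid's inequality gives
#   pr(|omega - E[omega]| > mu) <= 2 * exp(-2 * mu^2 * n).
t = rng.uniform(0.0, 1.0, size=(trials, n))
omega = t.mean(axis=1)

empirical = np.mean(np.abs(omega - 0.5) > mu_dev)
bound = 2.0 * np.exp(-2.0 * mu_dev**2 * n)
assert empirical <= bound
print(empirical, round(bound, 4))
```

The proof of Theorem 4 applies this same concentration argument, with \(\rho_{i}\) instantiated as the per-pair perturbation of the loss function \({\mathcal {L}}\).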
We prove Theorem 4 by analyzing the perturbation [i.e., \(\rho _{i}\) in the above Eq. (36)] of the loss function \({\mathcal {L}}\).
Proof
Firstly, we denote that
and
where \(({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k})\) is an arbitrary data pair from the sample space with similarity label \(b_{k}\). Then we have that
Meanwhile, we have
By Lemma 2, we let that
for all \(i=1,2,\ldots ,N\), and we get
where the real-valued monotonically increasing function \(\theta (B)=\max (\ell ((1-c_{1})B;1),\ell (-c_{0}B;0))\). The proof is completed. \(\square\)
1.5 A.5 Proof of Lipschitz-Continuity
Here we demonstrate that our learning objective \({\mathcal {F}}(\varvec{\varphi })\) is always Lipschitz continuous based on the Lipschitz-continuity of \(\varvec{\varphi }(\cdot )\), \(\ell (\cdot )\), \(\varOmega (\cdot )\), and \({\mathcal {R}}(\cdot )\). To be more specific, for any two given \(\varvec{\varphi }\) and \(\widetilde{\varvec{\varphi }}\), we have
which implies that \((2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}+\lambda L_{2})\) is a valid Lipschitz constant of our learning objective \({\mathcal {F}}\).
About this article
Cite this article
Chen, S., Gong, C., Li, X. et al. Boundary-restricted metric learning. Mach Learn 112, 4723–4762 (2023). https://doi.org/10.1007/s10994-023-06380-3