Boundary-restricted metric learning

Abstract

Metric learning aims to learn a distance metric that properly measures the similarity between pairwise examples. Most existing learning algorithms are designed to reduce intra-class distances and meanwhile enlarge inter-class distances by introducing a margin between intra-class and inter-class distances. However, such learning objectives may yield a boundless (distance) metric space, because the enlargement of inter-class distances is usually unconstrained. In this case, excessively enlarged inter-class distances reduce the ratio of the margin to the whole distance range (i.e., the margin-range-ratio), and thus work against the original large-margin purpose of discriminating the similarities of data pairs. To address this issue, we propose a new boundary-restricted metric (BRM), which confines the metric space by a restriction function. This restriction function is monotonic and gradually converges to an upper bound, which suppresses excessively large distances of data pairs while maintaining reliable discriminability. As a result, the learned metric is restricted to a finite region, thereby avoiding the reduction of the margin-range-ratio. Theoretically, we prove that BRM tightens the generalization error bound of the traditional learning model without sacrificing the fitting capability or destroying the topological property of the learned metric, which implies that BRM achieves a good bias-variance tradeoff for the metric learning task. Extensive experiments on toy data and real-world datasets validate the superiority of our approach over state-of-the-art metric learning methods.
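
To make the boundary-restriction idea concrete, the following minimal Python sketch computes a bounded pairwise distance. It is only an illustration under stated assumptions: the restriction function \(R(t)=B\tanh (t/B)\) and the random linear embedding are hypothetical stand-ins for the paper's own choices of \({\mathcal {R}}\) and \(\varvec{\varphi }\), and the constant B plays the role of the upper bound of the metric space.

```python
import numpy as np

def restricted_distance(phi_x, phi_xhat, B=2.0, p=2):
    """Boundary-restricted distance sketch, assuming R(t) = B * tanh(t / B).

    R is monotonically increasing with R(0) = 0 and R(t) -> B as t -> infinity,
    so every per-dimension distance is confined to [0, B).
    """
    diffs = np.abs(phi_x - phi_xhat)        # per-dimension gaps |phi_i(x) - phi_i(x_hat)|
    restricted = B * np.tanh(diffs / B)     # bounded surrogate of each gap
    return np.mean(restricted ** p) ** (1.0 / p)

# Toy comparison: a single far-apart pair cannot blow up the distance range,
# so the margin-range-ratio of the whole dataset stays non-negligible.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))                 # hypothetical linear embedding phi(x) = W^T x
phi = lambda x: x @ W
x, x_hat = rng.normal(size=5), rng.normal(size=5)
far, far_hat = 50 * rng.normal(size=5), -50 * rng.normal(size=5)
print(restricted_distance(phi(x), phi(x_hat)))       # moderate value
print(restricted_distance(phi(far), phi(far_hat)))   # saturates near B instead of exploding
```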

Data availability

The data used in this work are all publicly available.

Code availability

The code for the proposed method will be released after publication.

Notes

  1. Here h adaptively scales the oversized measurement of very high-dimensional projected features.

  2. To include the manifold-based metric, we let \(\varvec{\varphi }({\varvec{x}}) = \varvec{\varphi }(\text {vec}({\varvec{X}})) = d\cdot \text {vec}({\varvec{M}}^{\top } {\varvec{g}}({\varvec{X}}){\varvec{M}}) \in {\mathbb {R}}^{d^{2}}\).

  3. A differentiable function \(f({\varvec{a}})\) defined on the domain \({\mathcal {C}}\) is Lipschitz-continuous if and only if there exists \(L>0\) such that \(\vert f({\varvec{a}})- f({\varvec{b}})\vert \le L\Vert {\varvec{a}}-{\varvec{b}}\Vert _{2}\) for any \({\varvec{a}},{\varvec{b}}\in {\mathcal {C}}\).

  4. The distance function \(D(\cdot , \cdot )\) is a metric if and only if it satisfies the four conditions \(\forall \varvec{\alpha }_{1},\varvec{\alpha }_{2},\varvec{\alpha }_{3} \in {\mathbb {R}}^{d}\): (I). Non-negativity: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) \ge 0\); (II). Symmetry: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) = D(\varvec{\alpha }_{2},\varvec{\alpha }_{1})\); (III). Triangle: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) + D(\varvec{\alpha }_{2},\varvec{\alpha }_{3}) \ge D(\varvec{\alpha }_{1},\varvec{\alpha }_{3})\); (IV). Coincidence: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) = 0 \Leftrightarrow \varvec{\alpha }_{1} = \varvec{\alpha }_{2}\).

  5. For simplicity, here \({\overline{\sigma }}^{2}=\frac{1}{h}\sum _{i=1}^{h}\sigma _{i}^{2}\) and \({\overline{\mu }}=\frac{1}{h}\sum _{i=1}^{h}\mu _{i}\).

References

  • Alpaydin, E. (2020). Introduction to machine learning. MIT Press.

  • Asuncion, A., & Newman, D. (2007). UCI Machine Learning Repository.

  • Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In ICML (pp. 11–180).

  • Berrendero, J. R., Bueno-Larraz, B., & Cuevas, A. (2020). On Mahalanobis distance in functional settings. Journal of Machine Learning Research, 21(9), 1–33.

  • Bian, W., & Tao, D. (2012). Constrained empirical risk minimization framework for distance metric learning. IEEE Transactions on Neural Networks and Learning System, 23(8), 1194–1205.

  • Biswas, A., & Parikh, D. (2013). Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR (pp. 644–651).

  • Brown, M., Hua, G., & Winder, S. (2010). Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 43–57.

  • Carlile, B., Delamarter, G., Kinney, P., Marti, A., & Whitney, B. (2017). Improving deep learning by inverse square root linear units (ISRLUs). arXiv:1710.09967

  • Chen, S., Gong, C., Yang, J., Tai, Y., Hui, L., & Li, J. (2019a). Data-adaptive metric learning with scale alignment. In AAAI (pp. 3347–3354).

  • Chen, S., Luo, L., Yang, J., Gong, C., Li, J., & Huang, H. (2019b). Curvilinear distance metric learning. In NeurIPS (pp. 4223–4232).

  • Chu, X., Lin, Y., Wang, Y., Wang, X., Yu, H., Gao, X., & Tong, Q. (2020). Distance metric learning with joint representation diversification. In ICML (pp. 1962–1973).

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.

  • Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I.S. (2007). Information-theoretic metric learning. In ICML (pp. 209–216).

  • Dong, M., Wang, Y., Yang, X., & Xue, J. H. (2019). Learning local metrics and influential regions for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 1522.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  • Ermolov, A., Mirvakhabova, L., Khrulkov, V., Sebe, N., & Oseledets, I. (2022). Hyperbolic vision transformers: Combining improvements in metric learning. In CVPR (pp. 7409–7419).

  • Fazlyab, M., Robey, A., Hassani, H., Morari, M., & Pappas, G. (2019). Efficient and accurate estimation of Lipschitz constants for deep neural networks. In NeurIPS.

  • Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85.

  • Geng, C., & Chen, S. (2018). Metric learning-guided least squares classifier learning. IEEE Transactions on Neural Networks and Learning System, 29(12), 6409–6414.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS (pp. 249–256).

  • Goldberger, J., Hinton, G. E., Roweis, S. T., & Salakhutdinov, R. R. (2005). Neighbourhood components analysis. In NeurIPS (pp. 513–520).

  • Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial networks. In NeurIPS (pp. 2672–2680).

  • Harandi, M., Salzmann, M., & Hartley, R. (2017). Joint dimensionality reduction and metric learning: A geometric take. In ICML (pp. 1404–1413).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Horev, I., Yger, F., & Sugiyama, M. (2017). Geometry-aware principal component analysis for symmetric positive definite matrices. Machine Learning, 66, 493–522.

  • Huang, Z., Wang, R., Shan, S., Van Gool, L., & Chen, X. (2018). Cross Euclidean-to-Riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2827–2840.

  • Huo, Z., Nie, F., & Huang, H. (2016). Robust and effective metric learning using capped trace norm: Metric learning via capped trace norm. In SIGKDD (pp. 1605–1614).

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML (pp. 448–456).

  • Kar, P., Narasimhan, H., & Jain, P. (2014). Online and stochastic gradient methods for non-decomposable loss functions. In NeurIPS.

  • Kelley, J. L. (2017). General topology. Courier Dover Publications.

  • Kim, S., Kim, D., Cho, M., & Kwak, S. (2020). Proxy anchor loss for deep metric learning. In CVPR (pp. 3238–3247).

  • Kim, Y., & Park, W. (2021). Multi-level distance regularization for deep metric learning. In AAAI (pp. 1827–1835).

  • Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In 3dRR.

  • Kwon, Y., Kim, W., Sugiyama, M., & Paik, M. C. (2020). Principled analytic classifier for positive-unlabeled learning via weighted integral probability metric. Machine Learning, 66, 513–532.

  • Law, M., Liao, R., Snell, J., & Zemel, R. (2019). Lorentzian distance learning for hyperbolic representations. In ICML (pp. 3672–3681).

  • Lebanon, G. (2006). Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 497–508.

  • Li, P., Li, Y., Xie, H., & Zhang, L. (2022). Neighborhood-adaptive structure augmented metric learning. In AAAI.

  • Li, Q., Haque, S., Anil, C., Lucas, J., Grosse, R. B., & Jacobsen, J. H. (2019). Preventing gradient attenuation in Lipschitz constrained convolutional networks. In NeurIPS.

  • Lim, D., Lanckriet, G., & McFee, B. (2013). Robust structural metric learning. In ICML (pp. 615–623).

  • Lu, J., Xu, C., Zhang, W., Duan, L. Y., & Mei, T. (2019). Sampling wisely: Deep image embedding by top-k precision optimization. In ICCV (pp. 7961–7970).

  • Luo, L., Xu, J., Deng, C., & Huang, H. (2019). Robust metric learning on grassmann manifolds with generalization guarantees. In AAAI (pp. 4480–4487).

  • Meyer, C. D. (2000). Matrix analysis and applied linear algebra (vol. 71). SIAM.

  • Montgomery, D. C., & Runger, G. C. (2010). Applied statistics and probability for engineers. Wiley.

  • Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR (pp. 4004–4012).

  • Paassen, B., Gallicchio, C., Micheli, A., & Hammer, B. (2018). Tree edit distance learning via adaptive symbol embeddings. In ICML.

  • Perrot, M., & Habrard, A. (2015). Regressive virtual metric learning. In NeurIPS (pp. 1810–1818).

  • Qian, Q., Shang, L., Sun, B., Hu, J., Li, H., & Jin, R. (2019). Softtriple loss: Deep metric learning without triplet sampling. In CVPR, (pp. 6450–6458).

  • Ralaivola, L., Szafranski, M., & Stempfel, G. (2010). Chromatic pac-bayes bounds for non-iid data: Applications to ranking and stationary \(\beta\)-mixing processes. Journal of Machine Learning Research, 11, 1927–1956.

  • Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In ICML (pp. 314–323).

  • Rudin, W. (1964). Principles of mathematical analysis. McGraw-Hill.

  • Seidenschwarz, J. D., Elezi, I., & Leal-Taixe, L. (2021). Learning intra-batch connections for deep metric learning. In ICML.

  • Sohn, K. (2016). Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS (pp. 1857–1865).

  • Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., & Beyer, L. (2021). How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270

  • Suarez, J. L., Garcia, S., & Herrera, F. (2018). A tutorial on distance metric learning: Mathematical foundations, algorithms and software. arXiv:1812.05944

  • Suárez, J. L., Garcia, S., & Herrera, F. (2020). pyDML: A Python library for distance metric learning. Journal of Machine Learning Research, 21(96), 1–7.

  • Suarez, J. L., Garcia, S., & Herrera, F. (2021). Ordinal regression with explainable distance metric learning based on ordered sequences. Machine Learning, 66, 2729–2762.

  • Ting, K. M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., & Zhou, Z. H. (2019). Lowest probability mass neighbour algorithms: Relaxing the metric constraint in distance-based neighbourhood algorithms. Machine Learning, 108(2), 331–376.

  • Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science (vol. 47). Cambridge University Press.

  • Wang, H., Nie, F., & Huang, H. (2014). Robust distance metric learning via simultaneous l1-norm minimization and maximization. In ICML (pp. 1836–1844).

  • Wang, X., Han, X., Huang, W., Dong, D., & Scott, M. R. (2019). Multi-similarity loss with general pair weighting for deep metric learning. In CVPR (pp. 173–182).

  • Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In NeurIPS (pp. 1473–1480).

  • Weisstein, E. W. (2002). Inverse trigonometric functions. https://mathworld.wolfram.com/

  • Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., & Perona, P. (2010). Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology.

  • Xia, P., Zhang, L., & Li, F. (2015). Learning similarity with cosine similarity ensemble. Information Sciences, 307, 39–52.

  • Xie, P., Wu, W., Zhu, Y., & Xing, E. (2018). Orthogonality-promoting distance metric learning: Convex relaxation and theoretical analysis. In ICML (pp. 2404–2413).

  • Xing, E. P., Jordan, M. I., Russell, S. J., & Ng, A. (2003). Distance metric learning with application to clustering with side-information. In NeurIPS (pp. 521–528).

  • Xu, J., Luo, L., Deng, C., & Huang, H. (2018). Bilevel distance metric learning for robust image recognition. In NeurIPS (pp. 4198–4207).

  • Xu, X., Yang, Y., Deng, C., & Zheng, F. (2019). Deep asymmetric metric learning via rich relationship mining. In CVPR (pp. 4076–4085).

  • Yan, J., Yang, E., Deng, C., & Huang, H. (2022). Metricformer: A unified perspective of correlation exploring in similarity learning. In NeurIPS.

  • Yang, J., Luo, L., Qian, J., Tai, Y., Zhang, F., & Xu, Y. (2016). Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 156–171.

  • Yang, P., Huang, K., & Liu, C. L. (2013). Geometry preserving multi-task metric learning. Machine Learning, 66, 133–175.

  • Yang, X., Zhou, P., & Wang, M. (2018). Person reidentification via structural deep metric learning. IEEE Transactions on Neural Networks and Learning System, 30(10), 2987–2998.

  • Ye, H. J., Zhan, D. C., & Jiang, Y. (2019). Fast generalization rates for distance metric learning. Machine Learning, 66, 267–295.

  • Ye, H. J., Zhan, D. C., Jiang, Y., Si, X. M., & Zhou, Z. H. (2019). What makes objects similar: A unified multi-metric learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5), 1257–1270.

  • Yoshida, T., Takeuchi, I., & Karasuyama, M. (2021). Distance metric learning for graph structured data. Machine Learning, 66, 1765–1811.

  • Yu, B., & Tao, D. (2019). Deep metric learning with tuplet margin loss. In ICCV (pp. 6490–6499).

  • Zadeh, P., Hosseini, R., & Sra, S. (2016). Geometric mean metric learning. In ICML (pp. 2464–2471).

  • Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. In ICCV (pp. 4353–4361).

  • Zbontar, J., & LeCun, Y. (2016). Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1), 2287–2318.

  • Zhang, B., Zheng, W., Zhou, J., & Lu, J. (2022). Attributable visual similarity learning. In CVPR.

  • Zhang, S., Tay, Y., Yao, L., Sun, A., & An, J. (2019a). Next item recommendation with self-attentive metric learning. In AAAI.

  • Zhang, Y., Zhong, Q., Ma, L., Xie, D., & Pu, S (2019b). Learning incremental triplet margin for person re-identification. In AAAI (pp. 9243–9250).

  • Zhu, P., Cheng, H., Hu, Q., Wang, Q., & Zhang, C. (2018). Towards generalized and efficient metric learning on riemannian manifold. In IJCAI (pp. 192–199).

Funding

S.C., G.N., and M.S. were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. M.S. was also supported by the Institute for AI and Beyond, UTokyo. C.G., J.L., and J.Y. were supported by NSF of China (Nos: U1713208, 61973162, 62072242), NSF of Jiangsu Province (No: BZ2021013), NSF for Distinguished Young Scholar of Jiangsu Province (No: BK20220080), and the Fundamental Research Funds for the Central Universities (Nos: 30920032202, 30921013114).

Author information

Contributions

All authors contributed to the algorithm design and analysis. The first draft of the manuscript was written by Shuo Chen, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shuo Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editor: Zhi-Hua Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

This section provides the detailed proofs for all theorems in Sect. 3.1 and Sect. 4.

1.1 A.1 Proof for Theorem 1

We first introduce the following Lindeberg central limit theorem (CLT) as a Lemma to prove our Theorem 1.

Lemma 1

(Lindeberg CLT (Vershynin, 2018)) Suppose \(\{X_{1},\ldots ,X_{h}\}\) is a sequence of independent random variables, each with finite expected value \(\mu _{i}\) and variance \(\sigma _{i}^{2}\). If for any given \(\epsilon >0\)

$$\begin{aligned} \lim _{h\rightarrow \infty }\frac{1}{S_{h}^{2}}\sum _{i=1}^{h}\nolimits {\mathbb {E}}[(X_{i}-\mu _{i})^{2}\cdot 1_{\{\vert X_{i}-\mu _{i}\vert >\epsilon S_{h}\}}]=0, \end{aligned}$$
(17)

then the distribution of the standardized sum \(1/S_{h}\sum _{i=1}^{h}(X_{i}-\mu _{i})\) converges to the standard normal distribution \({\mathcal {N}}(0,1)\), where \(S_{h}^{2}=\sum _{i=1}^{h}\sigma _{i}^{2}\) and \(1_{\{\cdot \}}\) is the indicator function.

Based on the above conclusion on i.n.i.d. random variables \(X_{1}, X_{2}, \ldots , X_{h}\), here we prove Theorem 1 by investigating the probability that the distance value crosses a given upper bound. We show that such a probability is mainly determined by the boundary of the metric space.
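
As an informal sanity check (not part of the proof), the Python sketch below simulates bounded i.n.i.d. terms \(X_{i}\) and verifies numerically that their standardized sum behaves like a standard normal variable, which is the behaviour that Lemma 1 guarantees; the Beta-distributed terms and the common bound \(b^{p}=4\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
h, trials = 500, 20_000

# Bounded i.n.i.d. terms X_i = |phi_i(x) - phi_i(x_hat)|^p, simulated here as
# Beta variables scaled by a common (hypothetical) upper bound b^p.
b_p = 4.0
a = rng.uniform(0.5, 3.0, size=h)
b = rng.uniform(0.5, 3.0, size=h)
X = b_p * rng.beta(a, b, size=(trials, h))      # each column follows its own law

mu = b_p * a / (a + b)                          # per-term means
var = b_p**2 * a * b / ((a + b)**2 * (a + b + 1))
S_h = np.sqrt(var.sum())

Z = (X - mu).sum(axis=1) / S_h                  # standardized sum, one value per trial
print(Z.mean(), Z.var())                        # approximately 0 and 1
print((Z <= 0).mean(), (Z <= 1).mean())         # approximately 0.5 and Phi(1) = 0.8413
```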

Proof

We let \(X_{i} = \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p}\) for \(i = 1,2,\ldots ,h\) and we can obtain that \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}] \le {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert > b^{p}({\mathcal {H}}_{v}^{u}) - \mu _{i}\}}] = 0\) for sufficiently large h. Specifically, we denote \(V^{2}=1/h\sum _{i=1}^{h}\sigma _{i}^{2}>0\) and let \(h=\left\lceil (b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i})^{2}/(\epsilon V)^{2}\right\rceil\), and we have that

$$\begin{aligned} \epsilon S_{h}=\epsilon \sqrt{h}V\ge \epsilon V\sqrt{(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i})^{2}/(\epsilon V)^{2}}=b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}, \end{aligned}$$
(18)

where \(b^{p}({\mathcal {H}}_{v}^{u})\) is the upper-bound of \(X_{i}\) so that \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\) is always non-negative (\(\mu _{i}\) is the mean of \(X_{i}\)). Then, if \(\vert X_{i}-\mu _{i}\vert <b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\), we have

$$\begin{aligned} (X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}=0=(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}. \end{aligned}$$
(19)

If \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\le \vert X_{i}-\mu _{i} \vert \le \epsilon S_{h}\), we have

$$\begin{aligned} (X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}=0\le (X_{i} - \mu _{i})^{2}=(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}. \end{aligned}$$
(20)

Finally, if \(\epsilon S_{h}< \vert X_{i}-\mu _{i}\vert\), we have

$$\begin{aligned} (X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}=(X_{i} - \mu _{i})^{2}=(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}, \end{aligned}$$
(21)

and thus we have \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}] \le {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert > b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}]\) for sufficiently large h. Furthermore, as \(\vert X_{i}-\mu _{i}\vert\) is always smaller than its upper bound \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\), we have \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}]=0\). Therefore, we have

$$\begin{aligned} \lim _{h\rightarrow \infty } \frac{\sum _{i=1}^{h} {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >\epsilon S_{h}\}}]}{S_{h}^{2}} \le \lim _{h\rightarrow \infty } \frac{h\cdot 0}{hV^{2}} = 0, \end{aligned}$$
(22)

which implies that the Lindeberg condition in Eq. (17) is satisfied. Therefore, the standardized sum \(\frac{1}{{\overline{\sigma }}\sqrt{h}}\sum _{i=1}^{h}(\vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p}-\mu _{i})\) converges to the standard normal distribution (see footnote 5). Then, for any given \(\epsilon _{1}>0\), there exists a sufficiently large h such that

$$\begin{aligned} \text {pr} \left[ \frac{1}{{\overline{\sigma }}\sqrt{h}} \sum _{i=1}^{h}\nolimits \left( \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p} - \mu _{i}\right) \le Z\right] \in \varDelta (\phi (Z),\epsilon _{1}), \end{aligned}$$
(23)

where \(Z \in {\mathbb {R}}\) and \(\phi ( \cdot )\) is the cumulative distribution function of the standard normal distribution. Meanwhile, we have

$$\begin{aligned} \text {pr}[d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u] = \text {pr}\left[ \sum _{i=1}^{h}\nolimits \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p} < hu^{p}\right] , \end{aligned}$$
(24)

so for any \(\epsilon _{1} > 0\), there exists sufficiently large h such that

$$\begin{aligned} \text {pr}[d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u] \in \varDelta (\phi (\sqrt{h}(u^{p} - {\overline{\mu }})/{\overline{\sigma }}), \epsilon _{1}). \end{aligned}$$
(25)

As \(\phi (\cdot )\) is monotonically increasing and \(\epsilon _{1}\) is a given sufficiently small number, \(\text {sup}_{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}}\left\{ \text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \right\}\) is dominated by \(\sqrt{h}(u^{p} - {\overline{\mu }})/{\overline{\sigma }}\). According to the law of large numbers (Vershynin, 2018), it follows that for any \(\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}\) there exists a sufficiently large N making

$$\begin{aligned} |{\overline{\mu }}-m_{N}|<\epsilon _{2} \quad \text {and} \quad |{\overline{\sigma }}\sqrt{h}-\varSigma _{N}|<\epsilon _{2}, \end{aligned}$$
(26)

with probability at least \(1-\epsilon _{2}\), where the sample mean \(m_{N} = (1/N) \sum _{j=1}^{N} d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j})\) and the sample variance \(\varSigma _{N}^{2} = (1/N) \sum _{j=1}^{N} (d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j})-m_{N})^{2}\). Then there exist sufficiently small \(\epsilon _{1}\) and \(\epsilon _{2}\) such that for any given \(\delta \in (0, \min (1,v^{p} - u^{p}))\)

$$\begin{aligned} \sup _{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}}\{\text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \}\in \varDelta \left( \phi \left[ \sqrt{h}\left( u^{p} - m_{N}\right) /\varSigma _{N}\right] ,\delta \right) . \end{aligned}$$
(27)

Now we only have to consider the minimal value of the positive term \(\left( m_{N} - u^{p}\right) /\varSigma _{N}\) under the constraint \(d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j}) > v > u\) for \(j = 1,2,\ldots ,N\). To be specific, it holds that

$$\begin{aligned}&\left( m_{N}-u^{p}\right) /\varSigma _{N}\nonumber \\&\quad =\left( m_{N}-v^{p}+v^{p}-u^{p}\right) /\varSigma _{N}\nonumber \\&\quad \ge \left( m_{N} - v^{p} + v^{p} - u^{p}\right) /\min (m_{N} - v^{p}, b^{p}({\mathcal {H}}_{v}^{u}) - m_{N})\nonumber \\&\quad \ge \left( m_{N} - v^{p} + v^{p} - u^{p}\right) /(m_{N} - v^{p})\nonumber \\&\quad =(t + v^{p} - u^{p})/t\nonumber \\&\quad =(1 + (v^{p} - u^{p})/t)\nonumber \\&\quad \ge (1 + 2(v^{p} - u^{p})/(b^{p}({\mathcal {H}}_{v}^{u}) - v^{p})), \end{aligned}$$
(28)

where \(t = m_{N} - v^{p}\in (0, \frac{1}{2}(b^{p}({\mathcal {H}}_{v}^{u})-v^{p})]\), and \(m_{N}\) is necessarily included in \((v^{p}, \frac{1}{2}(b^{p}({\mathcal {H}}_{v}^{u})+v^{p}))\). By combining the results in Eqs. (27) and (28), we thus get

$$\begin{aligned} \sup _{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}} \left\{ \text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \right\} \in \varDelta \left( \phi \left\{ \psi \left[ b({\mathcal {H}}_{v}^{u})\right] \right\} ,\delta \right) , \end{aligned}$$
(29)

where \(\psi \left[ b({\mathcal {H}}_{v}^{u})\right] = [\sqrt{h}(1 + 2(v^{p} - u^{p})/(v^{p} - b^{p}({\mathcal {H}}_{v}^{u})))]\). Here \(\psi \left[ b({\mathcal {H}}_{v}^{u})\right]\) is a monotonically increasing function w.r.t. the boundary \(b({\mathcal {H}}_{v}^{u})\). The proof is completed. \(\square\)

1.2 A.2 Proof for Theorem 2

Proof

Non-negativity and symmetry follow directly from the definition of BRM. Here we prove that \({\mathcal {D}}_{\varvec{\varphi }}(\cdot ,\cdot )\) satisfies the triangle inequality when \({\mathcal {R}}'\) is monotonically decreasing. Specifically, for any given \(\varvec{\alpha },\varvec{\beta },{\varvec{\gamma }} \in {\mathbb {R}}^{d}\), we invoke the mean value theorem (Rudin, 1964) and obtain

$$\begin{aligned}&{\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\beta })\vert )+{\mathcal {R}}(\vert \varphi _{i}(\varvec{\beta })-\varphi _{i}(\varvec{\gamma })\vert )\nonumber \\&\quad ={\mathcal {R}}(Q_{i})+{\mathcal {R}}(T_{i})-{\mathcal {R}}(0)\nonumber \\&\quad ={\mathcal {R}}(\max (Q_{i},T_{i}))+\min (Q_{i},T_{i}){\mathcal {R}}'(\xi _{1})\nonumber \\&\quad \ge {\mathcal {R}}(\max (Q_{i},T_{i}))+\min (Q_{i},T_{i}){\mathcal {R}}'(\max (Q_{i},T_{i}))\nonumber \\&\quad \ge {\mathcal {R}}(\max (Q_{i},T_{i}))+\min (Q_{i},T_{i}){\mathcal {R}}'(\max (Q_{i},T_{i}))+\varTheta (\xi _{2})\nonumber \\&\quad ={\mathcal {R}}(\max (Q_{i},T_{i})+\min (Q_{i},T_{i}))\nonumber \\&\quad ={\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\beta })\vert +\vert \varphi _{i}(\varvec{\beta })-\varphi _{i}(\varvec{\gamma })\vert )\nonumber \\&\quad \ge {\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\gamma })\vert ), \end{aligned}$$
(30)

where the real numbers \(Q_{i} = \vert \varphi _{i}(\varvec{\alpha }) - \varphi _{i}(\varvec{\beta })\vert\), \(T_{i} = \vert \varphi _{i}(\varvec{\beta }) - \varphi _{i}(\varvec{\gamma })\vert\), \(\xi _{1} \in [0,\min (Q_{i}, T_{i})]\), \(\xi _{2} \in [0,\max (Q_{i}, T_{i})]\), and \(\varTheta (\xi _{2}) = (1/2)\min (Q_{i}^{2},T_{i}^{2}){\mathcal {R}}''(\xi _{2}) \le 0\). Then we have that

$$\begin{aligned}&{\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\beta })+{\mathcal {D}}_{\varvec{\varphi }}(\varvec{\beta }, \varvec{\gamma })\nonumber \\&\quad =\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(Q_{i})\right] ^{p}\right) ^{ 1/p} +\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(T_{i})\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad \ge \left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(Q_{i})+{\mathcal {R}}(T_{i})\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad \ge \left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\gamma })\vert )\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad ={\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\gamma }). \end{aligned}$$
(31)

Finally, if \({\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\beta }) = 0\), then for any \(k \in {\mathbb {N}}_{h}\) we have \({\mathcal {R}}(\vert \varphi _{k}(\varvec{\alpha })-\varphi _{k}(\varvec{\beta })\vert )=0\), and thus \([\varphi _{1}(\varvec{\alpha }),\ldots ,\varphi _{h}(\varvec{\alpha })] = [\varphi _{1}(\varvec{\beta }),\ldots ,\varphi _{h}(\varvec{\beta })]\). By further invoking the invertibility of the mapping \(\varvec{\varphi }\), we have \(\varvec{\alpha } = \varvec{\beta }\), which completes the proof. \(\square\)
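
The triangle inequality established above can also be checked numerically. The sketch below uses \({\mathcal {R}}(t)=B(1-e^{-t/B})\) as one admissible restriction function (increasing, concave, bounded, with \({\mathcal {R}}(0)=0\)); this particular choice and the Gaussian embeddings are illustrative assumptions, not necessarily the paper's own.

```python
import numpy as np

rng = np.random.default_rng(2)
B, p = 1.0, 2
R = lambda t: B * (1.0 - np.exp(-t / B))   # concave, increasing, R(0) = 0, bounded by B

def brm_distance(u, v):
    """D_phi(alpha, beta) with the embedded vectors u = phi(alpha), v = phi(beta)."""
    return np.mean(R(np.abs(u - v)) ** p) ** (1.0 / p)

# Random triples (phi(alpha), phi(beta), phi(gamma)) in R^h: Theorem 2 implies
# that the triangle inequality should hold for every sampled triple.
violations = 0
for _ in range(10_000):
    a, b, c = rng.normal(scale=3.0, size=(3, 16))
    if brm_distance(a, b) + brm_distance(b, c) < brm_distance(a, c) - 1e-12:
        violations += 1
print("triangle-inequality violations:", violations)   # expected: 0
```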

1.3 A.3 Proof for Theorem 3

Proof

We let \(\widehat{\varvec{\varphi }}(\cdot ) = c\varvec{\varphi }(\cdot )\) (\(c > 0\)) and apply the Taylor expansion (Rudin, 1964) to each restriction function, which gives

$$\begin{aligned}&{\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}},\widehat{{\varvec{x}}})\nonumber \\&\quad =\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(c\vert \varphi _{i}({\varvec{x}})-\varphi _{i}(\widehat{{\varvec{x}}})\vert )\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad =\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(0)+c\vert \varphi _{i}({\varvec{x}})-\varphi _{i}(\widehat{{\varvec{x}}})\vert {\mathcal {R}}'(0)+o(c)\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad =\frac{c}{h}{\mathcal {R}}'(0)\Vert \varvec{\varphi }({\varvec{x}})-\varvec{\varphi }(\widehat{{\varvec{x}}})\Vert _{1}+o(c). \end{aligned}$$
(32)

According to the equivalence of vector norms (Meyer, 2000), it follows that there exist \(a_{1} > a_{0} > 0\) such that \(\forall {\varvec{x}},\widehat{{\varvec{x}}}\in {\mathbb {R}}^{d}\)

$$\begin{aligned} \frac{a_{0}}{h}\Vert \varvec{\varphi }({\varvec{x}}) - \varvec{\varphi }(\widehat{{\varvec{x}}})\Vert _{1}\le d_{\varvec{\varphi }}({\varvec{x}},\widehat{{\varvec{x}}})\le \frac{a_{1}}{h}\Vert \varvec{\varphi }({\varvec{x}}) - \varvec{\varphi }(\widehat{{\varvec{x}}})\Vert _{1}, \end{aligned}$$
(33)

so we have that

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{-},\widehat{{\varvec{x}}}^{-})\ge \frac{c}{a_{1}}{\mathcal {R}}'(0)d_{\varvec{\varphi }}({\varvec{x}}^{-},\widehat{{\varvec{x}}}^{-})+o(c),\\ {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{+},\widehat{{\varvec{x}}}^{+})\le \frac{c}{a_{0}}{\mathcal {R}}'(0)d_{\varvec{\varphi }}({\varvec{x}}^{+},\widehat{{\varvec{x}}}^{+})+o(c). \end{array}\right. } \end{aligned}$$
(34)

Then for \(u=a_{0}/{\mathcal {R}}'(0)\) and \(v=a_{1}/{\mathcal {R}}'(0)\), we have

$$\begin{aligned} {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{-},\widehat{{\varvec{x}}}^{-})\ge cv + o(c)>cu + o(c)\ge {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{+},\widehat{{\varvec{x}}}^{+}), \end{aligned}$$
(35)

which completes the proof by letting c be sufficiently small. \(\square\)
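
The first-order behaviour used in this proof can also be observed numerically: as the scale c shrinks, the restricted distance of the rescaled embedding \(c\varvec{\varphi }\) becomes a rescaled copy of the unrestricted one, so the ordering of intra-class and inter-class distances is preserved. The sketch below, with the illustrative restriction \({\mathcal {R}}(t)=B(1-e^{-t/B})\) (so that \({\mathcal {R}}'(0)=1\)), checks that \({\mathcal {D}}_{c\varvec{\varphi }}/(c\,d_{\varvec{\varphi }})\rightarrow {\mathcal {R}}'(0)\); the p-averaged form of both distances is an assumption of this demo, not a statement of the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(3)
B, p, h = 1.0, 2, 32
R = lambda t: B * (1.0 - np.exp(-t / B))    # illustrative restriction with R'(0) = 1

def D_restricted(u, v, c):
    """Restricted distance of the rescaled embedding c * phi."""
    return np.mean(R(c * np.abs(u - v)) ** p) ** (1.0 / p)

def d_unrestricted(u, v):
    """Unrestricted counterpart with the same 1/h averaging."""
    return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

u, v = rng.normal(size=h), rng.normal(size=h)
for c in [1.0, 0.1, 0.01, 0.001]:
    print(c, D_restricted(u, v, c) / (c * d_unrestricted(u, v)))  # tends to R'(0) = 1
```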

1.4 A.4 Proof for Theorem 4

We first introduce the following McDiarmid's inequality as a Lemma to prove our Theorem 4.

Lemma 2

(McDiarmid's Inequality (Meyer, 2000)) For independent random variables \(t_{1},t_{2},\ldots ,t_{n} \in {\mathcal {T}}\) and a given function \(\omega :{\mathcal {T}}^{n} \rightarrow {\mathbb {R}}\), if \(\forall t_{i}' \in {\mathcal {T}}\) (\(i = 1,2,\ldots ,n\)) the function satisfies

$$\begin{aligned} \vert \omega (t_{1},\ldots ,t_{i},\ldots ,t_{n}) - \omega (t_{1},\ldots ,t_{i}',\ldots ,t_{n})\vert \le \rho _{i}, \end{aligned}$$
(36)

then for any given \(\mu > 0\), it holds that \(\text {pr}\{\vert \omega (t_{1},\ldots ,t_{n}) - {\mathbb {E}}[\omega (t_{1},\ldots ,t_{n})]\vert > \mu \} \le 2\text {e}^{-2\mu ^{2}/\sum _{i=1}^{n}\rho _{i}^{2}}\).

We prove Theorem 4 by analyzing the perturbation [i.e., \(\rho _{i}\) in the above Eq. (36)] of the loss function \({\mathcal {L}}\).
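
The bounded-differences mechanism behind this proof can be illustrated with a small Monte Carlo simulation: when every per-pair loss is bounded by a constant playing the role of \(\theta (B)\), the empirical risk stays within \(\theta (B)\sqrt{[\ln (2/\delta )]/(2N)}\) of its expectation with probability at least \(1-\delta\). The Beta-distributed losses and the cap used below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, delta, trials = 400, 0.05, 5_000
loss_cap = 1.0                                  # stands in for theta(B); losses lie in [0, loss_cap]

# Replacing any single pair changes the empirical mean by at most rho_i = loss_cap / N,
# so McDiarmid's inequality bounds the deviation of the mean from its expectation.
losses = loss_cap * rng.beta(2.0, 5.0, size=(trials, N))   # illustrative bounded per-pair losses
emp_means = losses.mean(axis=1)
true_mean = loss_cap * 2.0 / (2.0 + 5.0)                   # expectation of the Beta(2, 5) losses

bound = loss_cap * np.sqrt(np.log(2.0 / delta) / (2.0 * N))
coverage = (np.abs(emp_means - true_mean) < bound).mean()
print(f"deviation bound = {bound:.4f}, empirical coverage = {coverage:.3f} (at least {1 - delta})")
```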

Proof

Firstly, we denote that

$$\begin{aligned} \omega = \frac{1}{N}\sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}), \end{aligned}$$
(37)

and

$$\begin{aligned} \omega _{(k)} = \frac{1}{N} \left[ \sum _{i=1,i\ne k}^{N} \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}) + \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k});b_{k}) \right] , \end{aligned}$$
(38)

where \(({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k})\) is an arbitrary data pair from the sample space with similarity label \(b_{k}\). Then we have that

$$\begin{aligned}&|\omega -\omega _{(k)}|\nonumber \\&\quad =\frac{1}{N}|\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{k},\widehat{{\varvec{x}}}_{k});y_{k})-\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k});b_{k})|\nonumber \\&\quad \le \frac{1}{N}\max (\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{k},\widehat{{\varvec{x}}}_{k});y_{k}),\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k});b_{k}))\nonumber \\&\quad \le \frac{1}{N}\max (\ell ((1-c_{1})B;1),\ell ((0-c_{0})B;0)). \end{aligned}$$
(39)

Meanwhile, we have

$$\begin{aligned}&\frac{1}{N} \sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}) - {\mathbb {E}}_{{\mathcal {X}}}\left( \frac{1}{N} \sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i})\right) \nonumber \\&\quad = \frac{1}{N} \sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}) - {\mathbb {E}}_{({\varvec{x}},\widehat{{\varvec{x}}})}\left[ \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}},\widehat{{\varvec{x}}});y_{({\varvec{x}},\widehat{{\varvec{x}}})})\right] \nonumber \\&\quad ={\mathcal {L}}(\varvec{\varphi })-\widetilde{{\mathcal {L}}}(\varvec{\varphi }). \end{aligned}$$
(40)

To apply Lemma 2, we set

$$\begin{aligned} \rho _{i}=\frac{1}{N}\max (\ell ((1-c_{1})B;1),\ell ((0-c_{0})B;0)), \end{aligned}$$
(41)

for all \(i=1,2,\ldots ,N\), and we get

$$\begin{aligned}&\text {pr}\left\{ |{\mathcal {L}}(\varvec{\varphi })-\widetilde{{\mathcal {L}}}(\varvec{\varphi })|<\theta (B)\sqrt{[\text {ln}(2/\delta )]/(2N)}\right\} \nonumber \\&\quad =1 - 2\text {e}^{-2\mu ^{2}/\sum _{i=1}^{n}\rho _{i}^{2}}\nonumber \\&\quad \ge 1 - 2\text {e}^{\frac{-2N(\theta (B)\sqrt{[\text {ln}(2/\delta )]/(2N)})^{2}}{\max ^{2}(\ell ((1-c_{1})B;1),\ell ((0-c_{0})B;0))}}\nonumber \\&\quad =1 - 2\text {e}^{-2N\left( \sqrt{[\text {ln}(2/\delta )]/(2N)}\right) ^{2}}\nonumber \\&\quad =1 - 2\text {e}^{-\text {ln}(2/\delta )}\nonumber \\&\quad =1-\delta , \end{aligned}$$
(42)

where \(\theta (B)=\max (\ell ((1-c_{1})B;1),\ell (-c_{0}B;0))\) is a real-valued, monotonically increasing function. The proof is completed. \(\square\)

1.5 A.5 Proof of Lipschitz-Continuity

Here we demonstrate that our learning objective \({\mathcal {F}}(\varvec{\varphi })\) is always Lipschitz continuous based on the Lipschitz-continuity of \(\varvec{\varphi }(\cdot )\), \(\ell (\cdot )\), \(\varOmega (\cdot )\), and \({\mathcal {R}}(\cdot )\). To be more specific, for any two given \(\varvec{\varphi }\) and \(\widetilde{\varvec{\varphi }}\), we have

$$\begin{aligned}&\vert {\mathcal {F}}(\varvec{\varphi })-{\mathcal {F}}(\widetilde{\varvec{\varphi }})\vert \nonumber \\&\quad =\vert 1/N\sum _{i=1}^{N}\nolimits \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i})+\lambda \varOmega (\varvec{\varphi })\nonumber \\&\quad \quad -1/N\sum _{i=1}^{N}\nolimits \ell ({\mathcal {D}}_{\widetilde{\varvec{\varphi }}}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i})-\lambda \varOmega (\widetilde{\varvec{\varphi }})\vert \nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits \vert {\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i})-{\mathcal {D}}_{\widetilde{\varvec{\varphi }}}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i})\vert +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad =L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits \vert \left( 1/h\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}\left( \vert \varphi _{i}({\varvec{x}}_{i})-\varphi _{i}(\widehat{{\varvec{x}}}_{i})\vert \right) \right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad \quad -\left( 1/h\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}\left( \vert {\widetilde{\varphi }}_{i}({\varvec{x}}_{i})-{\widetilde{\varphi }}_{i}(\widehat{{\varvec{x}}}_{i})\vert \right) \right] ^{p}\right) ^{ 1/p}\vert +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad =L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits \vert \omega (\vert \varvec{\varphi }({\varvec{x}}_{i})-\varvec{\varphi }(\widehat{{\varvec{x}}}_{i})\vert )-\omega (\vert \widetilde{\varvec{\varphi }}({\varvec{x}}_{i})-\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})\vert )\vert +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\Vert \vert \varvec{\varphi }({\varvec{x}}_{i})-\varvec{\varphi }(\widehat{{\varvec{x}}}_{i})\vert -\vert \widetilde{\varvec{\varphi }}({\varvec{x}}_{i})-\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})\vert \Vert _{2}\nonumber \\&\quad \quad +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\Vert \varvec{\varphi }({\varvec{x}}_{i})-\varvec{\varphi }(\widehat{{\varvec{x}}}_{i})+\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})-\widetilde{\varvec{\varphi }}({\varvec{x}}_{i})\Vert _{2}\nonumber \\&\quad \quad +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\left( \Vert \varvec{\varphi }({\varvec{x}}_{i})-\widetilde{\varvec{\varphi }}({\varvec{x}}_{i})\Vert _{2}+\Vert \varvec{\varphi }(\widehat{{\varvec{x}}}_{i})-\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})\Vert _{2}\right) \nonumber \\&\quad \quad +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le 2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}+\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad =(2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}+\lambda L_{2})\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}, \end{aligned}$$
(43)

which implies that \((2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}+\lambda L_{2})\) is a valid Lipschitz constant of our learning objective \({\mathcal {F}}\).
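
As a rough numerical counterpart of this bound, the sketch below estimates the Lipschitz behaviour of a toy objective with a linear embedding \(\varvec{\varphi }({\varvec{x}})={\varvec{W}}^{\top }{\varvec{x}}\); the Frobenius norm of \({\varvec{W}}-\widetilde{{\varvec{W}}}\) is used as a stand-in for \(\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\), and the 1-Lipschitz surrogate loss and the restriction \({\mathcal {R}}(t)=B(1-e^{-t/B})\) are illustrative assumptions. The observed difference quotients stay bounded, which is what Lipschitz continuity of \({\mathcal {F}}\) predicts.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, h, lam, B, p = 200, 10, 16, 0.1, 1.0, 2
X, X_hat = rng.normal(size=(N, d)), rng.normal(size=(N, d))
y = rng.integers(0, 2, size=N)

R = lambda t: B * (1.0 - np.exp(-t / B))                            # illustrative restriction
loss = lambda D, y: np.where(y == 1, D, np.maximum(0.0, 1.0 - D))   # 1-Lipschitz surrogate loss

def F(W):
    """Toy objective: averaged pairwise loss on D_phi plus a Frobenius-norm regularizer."""
    D_vals = np.mean(R(np.abs(X @ W - X_hat @ W)) ** p, axis=1) ** (1.0 / p)
    return loss(D_vals, y).mean() + lam * np.linalg.norm(W)

W = rng.normal(size=(d, h))
ratios = []
for _ in range(2_000):
    W_tilde = W + 0.01 * rng.normal(size=(d, h))
    ratios.append(abs(F(W) - F(W_tilde)) / np.linalg.norm(W - W_tilde))
print("largest observed ratio |F(W) - F(W~)| / ||W - W~||_F:", max(ratios))
```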

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, S., Gong, C., Li, X. et al. Boundary-restricted metric learning. Mach Learn 112, 4723–4762 (2023). https://doi.org/10.1007/s10994-023-06380-3
