Boundary-restricted metric learning

Abstract

Metric learning aims to learn a distance metric that properly measures the similarity between pairwise examples. Most existing learning algorithms are designed to reduce intra-class distances and meanwhile enlarge inter-class distances by introducing a margin between intra-class and inter-class distances. However, such learning objectives may yield a boundless (distance) metric space, because the enlargement of inter-class distances is usually unconstrained. In this case, excessively enlarged inter-class distances reduce the ratio of the margin to the whole distance range (i.e., the margin-range-ratio), and thus work against the original large-margin purpose of discriminating the similarities of data pairs. To address this issue, we propose a new boundary-restricted metric (BRM), which confines the metric space by a restriction function. This restriction function is monotonic and gradually converges to an upper bound, which suppresses excessively large distances of data pairs while maintaining reliable discriminability. As a result, the learned metric is restricted to a finite region, thereby avoiding the reduction of the margin-range-ratio. Theoretically, we prove that BRM tightens the generalization error bound of the traditional learning model without sacrificing the fitting capability or destroying the topological property of the learned metric, which implies that BRM achieves a good bias-variance tradeoff for the metric learning task. Extensive experiments on toy data and real-world datasets validate the superiority of our approach over state-of-the-art metric learning methods.
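
To make the boundary-restriction idea concrete, the following minimal Python sketch computes a bounded pairwise distance. It is only an illustration under stated assumptions: the restriction function \(R(t)=B\tanh (t/B)\) and the random linear embedding are hypothetical stand-ins for the paper's own choices of \({\mathcal {R}}\) and \(\varvec{\varphi }\), and the constant B plays the role of the upper bound of the metric space.

```python
import numpy as np

def restricted_distance(phi_x, phi_xhat, B=2.0, p=2):
    """Boundary-restricted distance sketch, assuming R(t) = B * tanh(t / B).

    R is monotonically increasing with R(0) = 0 and R(t) -> B as t -> infinity,
    so every per-dimension distance is confined to [0, B).
    """
    diffs = np.abs(phi_x - phi_xhat)        # per-dimension gaps |phi_i(x) - phi_i(x_hat)|
    restricted = B * np.tanh(diffs / B)     # bounded surrogate of each gap
    return np.mean(restricted ** p) ** (1.0 / p)

# Toy comparison: a single far-apart pair cannot blow up the distance range,
# so the margin-range-ratio of the whole dataset stays non-negligible.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))                 # hypothetical linear embedding phi(x) = W^T x
phi = lambda x: x @ W
x, x_hat = rng.normal(size=5), rng.normal(size=5)
far, far_hat = 50 * rng.normal(size=5), -50 * rng.normal(size=5)
print(restricted_distance(phi(x), phi(x_hat)))       # moderate value
print(restricted_distance(phi(far), phi(far_hat)))   # saturates near B instead of exploding
```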

Data availability

The data used in this work are all publicly available.

Code availability

The code for the proposed method will be released after publication.

Notes

  1. Here h adaptively scales the oversized measurement of very high-dimensional projected features.

  2. To include the manifold-based metric, we let \(\varvec{\varphi }({\varvec{x}}) = \varvec{\varphi }(\text {vec}({\varvec{X}})) = d\cdot \text {vec}({\varvec{M}}^{\top } {\varvec{g}}({\varvec{X}}){\varvec{M}}) \in {\mathbb {R}}^{d^{2}}\).

  3. A differentiable function \(f({\varvec{a}})\) defined on the domain \({\mathcal {C}}\) is Lipschitz-continuous if and only if there exists \(L>0\) such that \(\vert f({\varvec{a}})- f({\varvec{b}})\vert \le L\Vert {\varvec{a}}-{\varvec{b}}\Vert _{2}\) for any \({\varvec{a}},{\varvec{b}}\in {\mathcal {C}}\).

  4. The distance function \(D(\cdot , \cdot )\) is a metric if and only if it satisfies the four conditions \(\forall \varvec{\alpha }_{1},\varvec{\alpha }_{2},\varvec{\alpha }_{3} \in {\mathbb {R}}^{d}\): (I). Non-negativity: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) \ge 0\); (II). Symmetry: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) = D(\varvec{\alpha }_{2},\varvec{\alpha }_{1})\); (III). Triangle: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) + D(\varvec{\alpha }_{2},\varvec{\alpha }_{3}) \ge D(\varvec{\alpha }_{1},\varvec{\alpha }_{3})\); (IV). Coincidence: \(D(\varvec{\alpha }_{1},\varvec{\alpha }_{2}) = 0 \Leftrightarrow \varvec{\alpha }_{1} = \varvec{\alpha }_{2}\).

  5. For simplicity, here \({\overline{\sigma }}^{2}=\frac{1}{h}\sum _{i=1}^{h}\sigma _{i}^{2}\) and \({\overline{\mu }}=\frac{1}{h}\sum _{i=1}^{h}\mu _{i}\).

References

  • Alpaydin, E. (2020). Introduction to machine learning. MIT Press.

  • Asuncion, A., & Newman, D. (2007). UCI Machine Learning Repository.

  • Bar-Hillel, A., Hertz, T., Shental, N., & Weinshall, D. (2003). Learning distance functions using equivalence relations. In ICML (pp. 11–180).

  • Berrendero, J. R., Bueno-Larraz, B., & Cuevas, A. (2020). On Mahalanobis distance in functional settings. Journal of Machine Learning Research, 21(9), 1–33.

  • Bian, W., & Tao, D. (2012). Constrained empirical risk minimization framework for distance metric learning. IEEE Transactions on Neural Networks and Learning System, 23(8), 1194–1205.

  • Biswas, A., & Parikh, D. (2013). Simultaneous active learning of classifiers & attributes via relative feedback. In CVPR (pp. 644–651).

  • Brown, M., Hua, G., & Winder, S. (2010). Discriminative learning of local image descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1), 43–57.

  • Carlile, B., Delamarter, G., Kinney, P., Marti, A., & Whitney, B. (2017). Improving deep learning by inverse square root linear units (ISRLUs). arXiv:1710.09967

  • Chen, S., Gong, C., Yang, J., Tai, Y., Hui, L., & Li, J. (2019a). Data-adaptive metric learning with scale alignment. In AAAI (pp. 3347–3354).

  • Chen, S., Luo, L., Yang, J., Gong, C., Li, J., & Huang, H. (2019b). Curvilinear distance metric learning. In NeurIPS (pp. 4223–4232).

  • Chu, X., Lin, Y., Wang, Y., Wang, X., Yu, H., Gao, X., & Tong, Q. (2020). Distance metric learning with joint representation diversification. In ICML (pp. 1962–1973).

  • Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.

  • Davis, J. V., Kulis, B., Jain, P., Sra, S., & Dhillon, I.S. (2007). Information-theoretic metric learning. In ICML (pp. 209–216).

  • Dong, M., Wang, Y., Yang, X., & Xue, J. H. (2019). Learning local metrics and influential regions for classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42, 1522.

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., & Uszkoreit, J. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  • Ermolov, A., Mirvakhabova, L., Khrulkov, V., Sebe, N., & Oseledets, I. (2022). Hyperbolic vision transformers: Combining improvements in metric learning. In CVPR (pp. 7409–7419).

  • Fazlyab, M., Robey, A., Hassani, H., Morari, M., & Pappas, G. (2019). Efficient and accurate estimation of Lipschitz constants for deep neural networks. In NeurIPS.

  • Franklin, J. (2005). The elements of statistical learning: Data mining, inference and prediction. The Mathematical Intelligencer, 27(2), 83–85.

  • Geng, C., & Chen, S. (2018). Metric learning-guided least squares classifier learning. IEEE Transactions on Neural Networks and Learning System, 29(12), 6409–6414.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In AISTATS (pp. 249–256).

  • Goldberger, J., Hinton, G. E., Roweis, S. T., & Salakhutdinov, R. R. (2005). Neighbourhood components analysis. In NeurIPS (pp. 513–520).

  • Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative adversarial networks. In NeurIPS (pp. 2672–2680).

  • Harandi, M., Salzmann, M., & Hartley, R. (2017). Joint dimensionality reduction and metric learning: A geometric take. In ICML (pp. 1404–1413).

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR (pp. 770–778).

  • Horev, I., Yger, F., & Sugiyama, M. (2017). Geometry-aware principal component analysis for symmetric positive definite matrices. Machine Learning, 66, 493–522.

  • Huang, Z., Wang, R., Shan, S., Van Gool, L., & Chen, X. (2018). Cross Euclidean-to-Riemannian metric learning with application to face recognition from video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12), 2827–2840.

  • Huo, Z., Nie, F., & Huang, H. (2016). Robust and effective metric learning using capped trace norm: Metric learning via capped trace norm. In SIGKDD (pp. 1605–1614).

  • Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML (pp. 448–456).

  • Kar, P., Narasimhan, H., & Jain, P. (2014). Online and stochastic gradient methods for non-decomposable loss functions. In NeurIPS.

  • Kelley, J. L. (2017). General topology. Courier Dover Publications.

  • Kim, S., Kim, D., Cho, M., & Kwak, S. (2020). Proxy anchor loss for deep metric learning. In CVPR (pp. 3238–3247).

  • Kim, Y., & Park, W. (2021). Multi-level distance regularization for deep metric learning. In AAAI (pp. 1827–1835).

  • Krause, J., Stark, M., Deng, J., & Fei-Fei, L. (2013). 3d object representations for fine-grained categorization. In 3dRR.

  • Kwon, Y., Kim, W., Sugiyama, M., & Paik, M. C. (2020). Principled analytic classifier for positive-unlabeled learning via weighted integral probability metric. Machine Learning, 66, 513–532.

  • Law, M., Liao, R., Snell, J., & Zemel, R. (2019). Lorentzian distance learning for hyperbolic representations. In ICML (pp. 3672–3681).

  • Lebanon, G. (2006). Metric learning for text documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 497–508.

  • Li, P., Li, Y., Xie, H., & Zhang, L. (2022). Neighborhood-adaptive structure augmented metric learning. In AAAI.

  • Li, Q., Haque, S., Anil, C., Lucas, J., Grosse, R. B., & Jacobsen, J. H. (2019). Preventing gradient attenuation in Lipschitz constrained convolutional networks. In NeurIPS.

  • Lim, D., Lanckriet, G., & McFee, B. (2013). Robust structural metric learning. In ICML (pp. 615–623).

  • Lu, J., Xu, C., Zhang, W., Duan, L. Y., & Mei, T. (2019). Sampling wisely: Deep image embedding by top-k precision optimization. In ICCV (pp. 7961–7970).

  • Luo, L., Xu, J., Deng, C., & Huang, H. (2019). Robust metric learning on grassmann manifolds with generalization guarantees. In AAAI (pp. 4480–4487).

  • Meyer, C. D. (2000). Matrix analysis and applied linear algebra (vol. 71). SIAM.

  • Montgomery, D. C., & Runger, G. C. (2010). Applied statistics and probability for engineers. Wiley.

  • Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In CVPR (pp. 4004–4012).

  • Paassen, B., Gallicchio, C., Micheli, A., & Hammer, B. (2018). Tree edit distance learning via adaptive symbol embeddings. In ICML.

  • Perrot, M., & Habrard, A. (2015). Regressive virtual metric learning. In NeurIPS (pp. 1810–1818).

  • Qian, Q., Shang, L., Sun, B., Hu, J., Li, H., & Jin, R. (2019). Softtriple loss: Deep metric learning without triplet sampling. In CVPR, (pp. 6450–6458).

  • Ralaivola, L., Szafranski, M., & Stempfel, G. (2010). Chromatic pac-bayes bounds for non-iid data: Applications to ranking and stationary \(\beta\)-mixing processes. Journal of Machine Learning Research, 11, 1927–1956.

  • Reddi, S. J., Hefny, A., Sra, S., Poczos, B., & Smola, A. (2016). Stochastic variance reduction for nonconvex optimization. In ICML (pp. 314–323).

  • Rudin, W. (1964). Principles of mathematical analysis. McGraw-Hill.

  • Seidenschwarz, J. D., Elezi, I., & Leal-Taixe, L. (2021). Learning intra-batch connections for deep metric learning. In ICML.

  • Sohn, K. (2016). Improved deep metric learning with multi-class n-pair loss objective. In NeurIPS (pp. 1857–1865).

  • Steiner, A., Kolesnikov, A., Zhai, X., Wightman, R., Uszkoreit, J., & Beyer, L. (2021). How to train your ViT? Data, augmentation, and regularization in vision transformers. arXiv preprint arXiv:2106.10270

  • Suarez, J. L., Garcia, S., & Herrera, F. (2018). A tutorial on distance metric learning: Mathematical foundations, algorithms and software. arXiv:1812.05944

  • Suárez, J. L., Garcia, S., & Herrera, F. (2020). pyDML: A Python library for distance metric learning. Journal of Machine Learning Research, 21(96), 1–7.

  • Suarez, J. L., Garcia, S., & Herrera, F. (2021). Ordinal regression with explainable distance metric learning based on ordered sequences. Machine Learning, 66, 2729–2762.

  • Ting, K. M., Zhu, Y., Carman, M., Zhu, Y., Washio, T., & Zhou, Z. H. (2019). Lowest probability mass neighbour algorithms: Relaxing the metric constraint in distance-based neighbourhood algorithms. Machine Learning, 108(2), 331–376.

  • Vershynin, R. (2018). High-dimensional probability: An introduction with applications in data science (vol. 47). Cambridge University Press.

  • Wang, H., Nie, F., & Huang, H. (2014). Robust distance metric learning via simultaneous l1-norm minimization and maximization. In ICML (pp. 1836–1844).

  • Wang, X., Han, X., Huang, W., Dong, D., & Scott, M. R. (2019). Multi-similarity loss with general pair weighting for deep metric learning. In CVPR (pp. 173–182).

  • Weinberger, K. Q., Blitzer, J., & Saul, L. K. (2006). Distance metric learning for large margin nearest neighbor classification. In NeurIPS (pp. 1473–1480).

  • Weisstein, E. W. (2002). Inverse trigonometric functions. https://mathworld.wolfram.com/

  • Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., & Perona, P. (2010). Caltech-UCSD Birds 200. Tech. Rep. CNS-TR-2010-001, California Institute of Technology.

  • Xia, P., Zhang, L., & Li, F. (2015). Learning similarity with cosine similarity ensemble. Information Sciences, 307, 39–52.

  • Xie, P., Wu, W., Zhu, Y., & Xing, E. (2018). Orthogonality-promoting distance metric learning: Convex relaxation and theoretical analysis. In ICML (pp. 2404–2413).

  • Xing, E. P., Jordan, M. I., Russell, S. J., & Ng, A. (2003). Distance metric learning with application to clustering with side-information. In NeurIPS (pp. 521–528).

  • Xu, J., Luo, L., Deng, C., & Huang, H. (2018). Bilevel distance metric learning for robust image recognition. In NeurIPS (pp. 4198–4207).

  • Xu, X., Yang, Y., Deng, C., & Zheng, F. (2019). Deep asymmetric metric learning via rich relationship mining. In CVPR (pp. 4076–4085).

  • Yan, J., Yang, E., Deng, C., & Huang, H. (2022). Metricformer: A unified perspective of correlation exploring in similarity learning. In NeurIPS.

  • Yang, J., Luo, L., Qian, J., Tai, Y., Zhang, F., & Xu, Y. (2016). Nuclear norm based matrix regression with applications to face recognition with occlusion and illumination changes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1), 156–171.

  • Yang, P., Huang, K., & Liu, C. L. (2013). Geometry preserving multi-task metric learning. Machine Learning, 66, 133–175.

  • Yang, X., Zhou, P., & Wang, M. (2018). Person reidentification via structural deep metric learning. IEEE Transactions on Neural Networks and Learning System, 30(10), 2987–2998.

  • Ye, H. J., Zhan, D. C., & Jiang, Y. (2019). Fast generalization rates for distance metric learning. Machine Learning, 66, 267–295.

  • Ye, H. J., Zhan, D. C., Jiang, Y., Si, X. M., & Zhou, Z. H. (2019). What makes objects similar: A unified multi-metric learning approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5), 1257–1270.

  • Yoshida, T., Takeuchi, I., & Karasuyama, M. (2021). Distance metric learning for graph structured data. Machine Learning, 66, 1765–1811.

  • Yu, B., & Tao, D. (2019). Deep metric learning with tuplet margin loss. In ICCV (pp. 6490–6499).

  • Zadeh, P., Hosseini, R., & Sra, S. (2016). Geometric mean metric learning. In ICML (pp. 2464–2471).

  • Zagoruyko, S., & Komodakis, N. (2015). Learning to compare image patches via convolutional neural networks. In ICCV (pp. 4353–4361).

  • Zbontar, J., & LeCun, Y. (2016). Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research, 17(1), 2287–2318.

  • Zhang, B., Zheng, W., Zhou, J., & Lu, J. (2022). Attributable visual similarity learning. In CVPR.

  • Zhang, S., Tay, Y., Yao, L., Sun, A., & An, J. (2019a). Next item recommendation with self-attentive metric learning. In AAAI.

  • Zhang, Y., Zhong, Q., Ma, L., Xie, D., & Pu, S (2019b). Learning incremental triplet margin for person re-identification. In AAAI (pp. 9243–9250).

  • Zhu, P., Cheng, H., Hu, Q., Wang, Q., & Zhang, C. (2018). Towards generalized and efficient metric learning on riemannian manifold. In IJCAI (pp. 192–199).

Funding

S.C., G.N., and M.S. were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. M.S. was also supported by the Institute for AI and Beyond, UTokyo. C.G., J.L., and J.Y. were supported by NSF of China (Nos: U1713208, 61973162, 62072242), NSF of Jiangsu Province (No: BZ2021013), NSF for Distinguished Young Scholar of Jiangsu Province (No: BK20220080), and the Fundamental Research Funds for the Central Universities (Nos: 30920032202, 30921013114).

Author information

Contributions

All authors contributed to the algorithm design and analysis. The first draft of the manuscript was written by Shuo Chen, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Shuo Chen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Editor: Zhi-Hua Zhou.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A

This section provides the detailed proofs for all theorems in Sect. 3.1 and Sect. 4.

1.1 A.1 Proof for Theorem 1

We first introduce the following Lindeberg central limit theorem (CLT) as a Lemma to prove our Theorem 1.

Lemma 1

(Lindeberg CLT (Vershynin, 2018)) Suppose \(\{X_{1},\ldots ,X_{h}\}\) is a sequence of independent random variables, each with finite expected value \(\mu _{i}\) and variance \(\sigma _{i}^{2}\). If for any given \(\epsilon >0\)

$$\begin{aligned} \lim _{h\rightarrow \infty }\frac{1}{S_{h}^{2}}\sum _{i=1}^{h}\nolimits {\mathbb {E}}[(X_{i}-\mu _{i})^{2}\cdot 1_{\{\vert X_{i}-\mu _{i}\vert >\epsilon S_{h}\}}]=0, \end{aligned}$$
(17)

then the distribution of the standardized sum \(1/S_{h}\sum _{i=1}^{h}(X_{i}-\mu _{i})\) converges to the standard normal distribution \({\mathcal {N}}(0,1)\), where \(S_{h}^{2}=\sum _{i=1}^{h}\sigma _{i}^{2}\) and \(1_{\{\cdot \}}\) is the indicator function.

Based on the above conclusion on i.n.i.d. random variables \(X_{1}, X_{2}, \ldots , X_{h}\), here we prove Theorem 1 by investigating the probability that the distance value crosses a given upper bound. We show that such a probability is mainly determined by the boundary of the metric space.
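
As an informal sanity check (not part of the proof), the Python sketch below simulates bounded i.n.i.d. terms \(X_{i}\) and verifies numerically that their standardized sum behaves like a standard normal variable, which is the behaviour that Lemma 1 guarantees; the Beta-distributed terms and the common bound \(b^{p}=4\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
h, trials = 500, 20_000

# Bounded i.n.i.d. terms X_i = |phi_i(x) - phi_i(x_hat)|^p, simulated here as
# Beta variables scaled by a common (hypothetical) upper bound b^p.
b_p = 4.0
a = rng.uniform(0.5, 3.0, size=h)
b = rng.uniform(0.5, 3.0, size=h)
X = b_p * rng.beta(a, b, size=(trials, h))      # each column follows its own law

mu = b_p * a / (a + b)                          # per-term means
var = b_p**2 * a * b / ((a + b)**2 * (a + b + 1))
S_h = np.sqrt(var.sum())

Z = (X - mu).sum(axis=1) / S_h                  # standardized sum, one value per trial
print(Z.mean(), Z.var())                        # approximately 0 and 1
print((Z <= 0).mean(), (Z <= 1).mean())         # approximately 0.5 and Phi(1) = 0.8413
```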

Proof

We let \(X_{i} = \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p}\) for \(i = 1,2,\ldots ,h\) and we can obtain that \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}] \le {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert > b^{p}({\mathcal {H}}_{v}^{u}) - \mu _{i}\}}] = 0\) for sufficiently large h. Specifically, we denote \(V^{2}=1/h\sum _{i=1}^{h}\sigma _{i}^{2}>0\) and let \(h=\left\lceil (b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i})^{2}/(\epsilon V)^{2}\right\rceil\), and we have that

$$\begin{aligned} \epsilon S_{h}=\epsilon \sqrt{h}V\ge \epsilon V\sqrt{(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i})^{2}/(\epsilon V)^{2}}=b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}, \end{aligned}$$
(18)

where \(b^{p}({\mathcal {H}}_{v}^{u})\) is the upper-bound of \(X_{i}\) so that \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\) is always non-negative (\(\mu _{i}\) is the mean of \(X_{i}\)). Then, if \(\vert X_{i}-\mu _{i}\vert <b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\), we have

$$\begin{aligned} (X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}=0=(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}. \end{aligned}$$
(19)

If \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\le \vert X_{i}-\mu _{i} \vert \le \epsilon S_{h}\), we have

$$\begin{aligned} (X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}=0\le (X_{i} - \mu _{i})^{2}=(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}. \end{aligned}$$
(20)

Finally, if \(\epsilon S_{h}< \vert X_{i}-\mu _{i}\vert\), we have

$$\begin{aligned} (X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}=(X_{i} - \mu _{i})^{2}=(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}, \end{aligned}$$
(21)

and thus we have \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i}-\mu _{i}\vert>\epsilon S_{h}\}}] \le {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert > b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}]\) for sufficiently large h. Furthermore, as \(\vert X_{i}-\mu _{i}\vert\) is always smaller than its upper bound \(b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\), we have \({\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >b^{p}({\mathcal {H}}_{v}^{u})-\mu _{i}\}}]=0\). Therefore, we have

$$\begin{aligned} \lim _{h\rightarrow \infty } \frac{\sum _{i=1}^{h} {\mathbb {E}}[(X_{i} - \mu _{i})^{2} \cdot 1_{\{\vert X_{i} - \mu _{i}\vert >\epsilon S_{h}\}}]}{S_{h}^{2}} \le \lim _{h\rightarrow \infty } \frac{h\cdot 0}{hV^{2}} = 0, \end{aligned}$$
(22)

which implies that the Lindeberg condition in Eq. (17) is satisfied. Therefore, the standardized sum \(\frac{1}{{\overline{\sigma }}\sqrt{h}}\sum _{i=1}^{h}(\vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p}-\mu _{i})\) converges to the standard normal distribution (see footnote 5). Then, for any given \(\epsilon _{1}>0\), there exists a sufficiently large h such that

$$\begin{aligned} \text {pr} \left[ \frac{1}{{\overline{\sigma }}\sqrt{h}} \sum _{i=1}^{h}\nolimits \left( \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p} - \mu _{i}\right) \le Z\right] \in \varDelta (\phi (Z),\epsilon _{1}), \end{aligned}$$
(23)

where \(Z \in {\mathbb {R}}\) and \(\phi ( \cdot )\) is the cumulative distribution function of the standard normal distribution. Meanwhile, we have

$$\begin{aligned} \text {pr}[d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u] = \text {pr}\left[ \sum _{i=1}^{h}\nolimits \vert \varphi _{i}({\varvec{x}}) - \varphi _{i}(\widehat{{\varvec{x}}})\vert ^{p} < hu^{p}\right] , \end{aligned}$$
(24)

so for any \(\epsilon _{1} > 0\), there exists sufficiently large h such that

$$\begin{aligned} \text {pr}[d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u] \in \varDelta (\phi (\sqrt{h}(u^{p} - {\overline{\mu }})/{\overline{\sigma }}), \epsilon _{1}). \end{aligned}$$
(25)

As \(\phi (\cdot )\) is monotonically increasing and \(\epsilon _{1}\) is a given sufficiently small number, \(\text {sup}_{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}}\left\{ \text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \right\}\) is dominated by \(\sqrt{h}(u^{p} - {\overline{\mu }})/{\overline{\sigma }}\). According to the law of large numbers (Vershynin, 2018), it follows that for any \(\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}\) there exists a sufficiently large N making

$$\begin{aligned} |{\overline{\mu }}-m_{N}|<\epsilon _{2} \quad \text {and} \quad |{\overline{\sigma }}\sqrt{h}-\varSigma _{N}|<\epsilon _{2}, \end{aligned}$$
(26)

with probability at least \(1-\epsilon _{2}\), where the sample mean \(m_{N} = (1/N) \sum _{j=1}^{N} d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j})\) and the sample variance \(\varSigma _{N}^{2} = (1/N) \sum _{j=1}^{N} (d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j})-m_{N})^{2}\). Then there exist sufficiently small \(\epsilon _{1}\) and \(\epsilon _{2}\) such that for any given \(\delta \in (0, \min (1,v^{p} - u^{p}))\)

$$\begin{aligned} \sup _{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}}\{\text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \}\in \varDelta \left( \phi \left[ \sqrt{h}\left( u^{p} - m_{N}\right) /\varSigma _{N}\right] ,\delta \right) . \end{aligned}$$
(27)

Now we only have to consider the minimal value of the positive term \(\left( m_{N} - u^{p}\right) /\varSigma _{N}\) under the constraint \(d_{\varvec{\varphi }}({\varvec{z}}_{j}, \widehat{{\varvec{z}}}_{j}) > v > u\) for \(j = 1,2,\ldots ,N\). To be specific, it holds that

$$\begin{aligned}&\left( m_{N}-u^{p}\right) /\varSigma _{N}\nonumber \\&\quad =\left( m_{N}-v^{p}+v^{p}-u^{p}\right) /\varSigma _{N}\nonumber \\&\quad \ge \left( m_{N} - v^{p} + v^{p} - u^{p}\right) /\min (m_{N} - v^{p}, b^{p}({\mathcal {H}}_{v}^{u}) - m_{N})\nonumber \\&\quad \ge \left( m_{N} - v^{p} + v^{p} - u^{p}\right) /(m_{N} - v^{p})\nonumber \\&\quad =(t + v^{p} - u^{p})/t\nonumber \\&\quad =(1 + (v^{p} - u^{p})/t)\nonumber \\&\quad \ge (1 + 2(v^{p} - u^{p})/(b^{p}({\mathcal {H}}_{v}^{u}) - v^{p})), \end{aligned}$$
(28)

where \(t = m_{N} - v^{p}\in (0, \frac{1}{2}(b^{p}({\mathcal {H}}_{v}^{u})-v^{p})]\), and \(m_{N}\) is necessarily included in \((v^{p}, \frac{1}{2}(b^{p}({\mathcal {H}}_{v}^{u})+v^{p}))\). By combining the results in Eqs. (27) and (28), we thus get

$$\begin{aligned} \sup _{\varvec{\varphi }\in {\mathcal {H}}_{v}^{u}} \left\{ \text {pr}\left[ d_{\varvec{\varphi }}({\varvec{z}},\widehat{{\varvec{z}}}) < u\right] \right\} \in \varDelta \left( \phi \left\{ \psi \left[ b({\mathcal {H}}_{v}^{u})\right] \right\} ,\delta \right) , \end{aligned}$$
(29)

where \(\psi \left[ b({\mathcal {H}}_{v}^{u})\right] = [\sqrt{h}(1 + 2(v^{p} - u^{p})/(v^{p} - b^{p}({\mathcal {H}}_{v}^{u})))]\). Here \(\psi \left[ b({\mathcal {H}}_{v}^{u})\right]\) is a monotonically increasing function w.r.t. the boundary \(b({\mathcal {H}}_{v}^{u})\). The proof is completed. \(\square\)

1.2 A.2 Proof for Theorem 2

Proof

Non-negativity and symmetry follow directly from the definition of BRM. Here we prove that \({\mathcal {D}}_{\varvec{\varphi }}(\cdot ,\cdot )\) satisfies the triangle inequality when \({\mathcal {R}}'\) is monotonically decreasing. Specifically, for any given \(\varvec{\alpha },\varvec{\beta },{\varvec{\gamma }} \in {\mathbb {R}}^{d}\), we invoke the mean value theorem (Rudin, 1964) and obtain

$$\begin{aligned}&{\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\beta })\vert )+{\mathcal {R}}(\vert \varphi _{i}(\varvec{\beta })-\varphi _{i}(\varvec{\gamma })\vert )\nonumber \\&\quad ={\mathcal {R}}(Q_{i})+{\mathcal {R}}(T_{i})-{\mathcal {R}}(0)\nonumber \\&\quad ={\mathcal {R}}(\max (Q_{i},T_{i}))+\min (Q_{i},T_{i}){\mathcal {R}}'(\xi _{1})\nonumber \\&\quad \ge {\mathcal {R}}(\max (Q_{i},T_{i}))+\min (Q_{i},T_{i}){\mathcal {R}}'(\max (Q_{i},T_{i}))\nonumber \\&\quad \ge {\mathcal {R}}(\max (Q_{i},T_{i}))+\min (Q_{i},T_{i}){\mathcal {R}}'(\max (Q_{i},T_{i}))+\varTheta (\xi _{2})\nonumber \\&\quad ={\mathcal {R}}(\max (Q_{i},T_{i})+\min (Q_{i},T_{i}))\nonumber \\&\quad ={\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\beta })\vert +\vert \varphi _{i}(\varvec{\beta })-\varphi _{i}(\varvec{\gamma })\vert )\nonumber \\&\quad \ge {\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\gamma })\vert ), \end{aligned}$$
(30)

where the real numbers \(Q_{i} = \vert \varphi _{i}(\varvec{\alpha }) - \varphi _{i}(\varvec{\beta })\vert\), \(T_{i} = \vert \varphi _{i}(\varvec{\beta }) - \varphi _{i}(\varvec{\gamma })\vert\), \(\xi _{1} \in [0,\min (Q_{i}, T_{i})]\), \(\xi _{2} \in [0,\max (Q_{i}, T_{i})]\), and \(\varTheta (\xi _{2}) = (1/2)\min (Q_{i}^{2},T_{i}^{2}){\mathcal {R}}''(\xi _{2}) \le 0\). Then we have that

$$\begin{aligned}&{\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\beta })+{\mathcal {D}}_{\varvec{\varphi }}(\varvec{\beta }, \varvec{\gamma })\nonumber \\&\quad =\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(Q_{i})\right] ^{p}\right) ^{ 1/p} +\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(T_{i})\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad \ge \left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(Q_{i})+{\mathcal {R}}(T_{i})\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad \ge \left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(\vert \varphi _{i}(\varvec{\alpha })-\varphi _{i}(\varvec{\gamma })\vert )\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad ={\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\gamma }). \end{aligned}$$
(31)

Finally, if \({\mathcal {D}}_{\varvec{\varphi }}(\varvec{\alpha }, \varvec{\beta }) = 0\), then for any \(k \in {\mathbb {N}}_{h}\) we have \({\mathcal {R}}(\vert \varphi _{k}(\varvec{\alpha })-\varphi _{k}(\varvec{\beta })\vert )=0\), and thus \([\varphi _{1}(\varvec{\alpha }),\ldots ,\varphi _{h}(\varvec{\alpha })] = [\varphi _{1}(\varvec{\beta }),\ldots ,\varphi _{h}(\varvec{\beta })]\). By further invoking the invertibility of the mapping \(\varvec{\varphi }\), we have \(\varvec{\alpha } = \varvec{\beta }\), which completes the proof. \(\square\)
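
The triangle inequality established above can also be checked numerically. The sketch below uses \({\mathcal {R}}(t)=B(1-e^{-t/B})\) as one admissible restriction function (increasing, concave, bounded, with \({\mathcal {R}}(0)=0\)); this particular choice and the Gaussian embeddings are illustrative assumptions, not necessarily the paper's own.

```python
import numpy as np

rng = np.random.default_rng(2)
B, p = 1.0, 2
R = lambda t: B * (1.0 - np.exp(-t / B))   # concave, increasing, R(0) = 0, bounded by B

def brm_distance(u, v):
    """D_phi(alpha, beta) with the embedded vectors u = phi(alpha), v = phi(beta)."""
    return np.mean(R(np.abs(u - v)) ** p) ** (1.0 / p)

# Random triples (phi(alpha), phi(beta), phi(gamma)) in R^h: Theorem 2 implies
# that the triangle inequality should hold for every sampled triple.
violations = 0
for _ in range(10_000):
    a, b, c = rng.normal(scale=3.0, size=(3, 16))
    if brm_distance(a, b) + brm_distance(b, c) < brm_distance(a, c) - 1e-12:
        violations += 1
print("triangle-inequality violations:", violations)   # expected: 0
```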

1.3 A.3 Proof for Theorem 3

Proof

We let \(\widehat{\varvec{\varphi }}(\cdot ) = c\varvec{\varphi }(\cdot )\) (\(c > 0\)) and apply the Taylor expansion (Rudin, 1964) to each restriction function, which gives

$$\begin{aligned}&{\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}},\widehat{{\varvec{x}}})\nonumber \\&\quad =\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(c\vert \varphi _{i}({\varvec{x}})-\varphi _{i}(\widehat{{\varvec{x}}})\vert )\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad =\left( \frac{1}{h}\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}(0)+c\vert \varphi _{i}({\varvec{x}})-\varphi _{i}(\widehat{{\varvec{x}}})\vert {\mathcal {R}}'(0)+o(c)\right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad =\frac{c}{h}{\mathcal {R}}'(0)\Vert \varvec{\varphi }({\varvec{x}})-\varvec{\varphi }(\widehat{{\varvec{x}}})\Vert _{1}+o(c). \end{aligned}$$
(32)

According to the equivalence of vector norms (Meyer, 2000), it follows that there exist \(a_{1} > a_{0} > 0\) such that \(\forall {\varvec{x}},\widehat{{\varvec{x}}}\in {\mathbb {R}}^{d}\)

$$\begin{aligned} \frac{a_{0}}{h}\Vert \varvec{\varphi }({\varvec{x}}) - \varvec{\varphi }(\widehat{{\varvec{x}}})\Vert _{1}\le d_{\varvec{\varphi }}({\varvec{x}},\widehat{{\varvec{x}}})\le \frac{a_{1}}{h}\Vert \varvec{\varphi }({\varvec{x}}) - \varvec{\varphi }(\widehat{{\varvec{x}}})\Vert _{1}, \end{aligned}$$
(33)

so we have that

$$\begin{aligned} {\left\{ \begin{array}{ll} {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{-},\widehat{{\varvec{x}}}^{-})\ge \frac{c}{a_{1}}{\mathcal {R}}'(0)d_{\varvec{\varphi }}({\varvec{x}}^{-},\widehat{{\varvec{x}}}^{-})+o(c),\\ {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{+},\widehat{{\varvec{x}}}^{+})\le \frac{c}{a_{0}}{\mathcal {R}}'(0)d_{\varvec{\varphi }}({\varvec{x}}^{+},\widehat{{\varvec{x}}}^{+})+o(c). \end{array}\right. } \end{aligned}$$
(34)

Then for \(u=a_{0}/{\mathcal {R}}'(0)\) and \(v=a_{1}/{\mathcal {R}}'(0)\), we have

$$\begin{aligned} {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{-},\widehat{{\varvec{x}}}^{-})\ge cv + o(c)>cu + o(c)\ge {\mathcal {D}}_{\widehat{\varvec{\varphi }}}({\varvec{x}}^{+},\widehat{{\varvec{x}}}^{+}), \end{aligned}$$
(35)

which completes the proof by letting c be sufficiently small. \(\square\)
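
The first-order behaviour used in this proof can also be observed numerically: as the scale c shrinks, the restricted distance of the rescaled embedding \(c\varvec{\varphi }\) becomes a rescaled copy of the unrestricted one, so the ordering of intra-class and inter-class distances is preserved. The sketch below, with the illustrative restriction \({\mathcal {R}}(t)=B(1-e^{-t/B})\) (so that \({\mathcal {R}}'(0)=1\)), checks that \({\mathcal {D}}_{c\varvec{\varphi }}/(c\,d_{\varvec{\varphi }})\rightarrow {\mathcal {R}}'(0)\); the p-averaged form of both distances is an assumption of this demo, not a statement of the paper's exact definitions.

```python
import numpy as np

rng = np.random.default_rng(3)
B, p, h = 1.0, 2, 32
R = lambda t: B * (1.0 - np.exp(-t / B))    # illustrative restriction with R'(0) = 1

def D_restricted(u, v, c):
    """Restricted distance of the rescaled embedding c * phi."""
    return np.mean(R(c * np.abs(u - v)) ** p) ** (1.0 / p)

def d_unrestricted(u, v):
    """Unrestricted counterpart with the same 1/h averaging."""
    return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

u, v = rng.normal(size=h), rng.normal(size=h)
for c in [1.0, 0.1, 0.01, 0.001]:
    print(c, D_restricted(u, v, c) / (c * d_unrestricted(u, v)))  # tends to R'(0) = 1
```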

1.4 A.4 Proof for Theorem 4

We first introduce the following McDiarmid's inequality as a Lemma to prove our Theorem 4.

Lemma 2

(McDiarmid's Inequality (Meyer, 2000)) For independent random variables \(t_{1},t_{2},\ldots ,t_{n} \in {\mathcal {T}}\) and a given function \(\omega :{\mathcal {T}}^{n} \rightarrow {\mathbb {R}}\), if \(\forall t_{i}' \in {\mathcal {T}}\) (\(i = 1,2,\ldots ,n\)) the function satisfies

$$\begin{aligned} \vert \omega (t_{1},\ldots ,t_{i},\ldots ,t_{n}) - \omega (t_{1},\ldots ,t_{i}',\ldots ,t_{n})\vert \le \rho _{i}, \end{aligned}$$
(36)

then for any given \(\mu > 0\), it holds that \(\text {pr}\{\vert \omega (t_{1},\ldots ,t_{n}) - {\mathbb {E}}[\omega (t_{1},\ldots ,t_{n})]\vert > \mu \} \le 2\text {e}^{-2\mu ^{2}/\sum _{i=1}^{n}\rho _{i}^{2}}\).

We prove Theorem 4 by analyzing the perturbation [i.e., \(\rho _{i}\) in the above Eq. (36)] of the loss function \({\mathcal {L}}\).
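
The bounded-differences mechanism behind this proof can be illustrated with a small Monte Carlo simulation: when every per-pair loss is bounded by a constant playing the role of \(\theta (B)\), the empirical risk stays within \(\theta (B)\sqrt{[\ln (2/\delta )]/(2N)}\) of its expectation with probability at least \(1-\delta\). The Beta-distributed losses and the cap used below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
N, delta, trials = 400, 0.05, 5_000
loss_cap = 1.0                                  # stands in for theta(B); losses lie in [0, loss_cap]

# Replacing any single pair changes the empirical mean by at most rho_i = loss_cap / N,
# so McDiarmid's inequality bounds the deviation of the mean from its expectation.
losses = loss_cap * rng.beta(2.0, 5.0, size=(trials, N))   # illustrative bounded per-pair losses
emp_means = losses.mean(axis=1)
true_mean = loss_cap * 2.0 / (2.0 + 5.0)                   # expectation of the Beta(2, 5) losses

bound = loss_cap * np.sqrt(np.log(2.0 / delta) / (2.0 * N))
coverage = (np.abs(emp_means - true_mean) < bound).mean()
print(f"deviation bound = {bound:.4f}, empirical coverage = {coverage:.3f} (at least {1 - delta})")
```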

Proof

Firstly, we denote that

$$\begin{aligned} \omega = \frac{1}{N}\sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}), \end{aligned}$$
(37)

and

$$\begin{aligned} \omega _{(k)} = \frac{1}{N} \left[ \sum _{i=1,i\ne k}^{N} \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}) + \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k});b_{k}) \right] , \end{aligned}$$
(38)

where \(({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k})\) is an arbitrary data pair from the sample space with similarity label \(b_{k}\). Then we have that

$$\begin{aligned}&|\omega -\omega _{(k)}|\nonumber \\&\quad =\frac{1}{N}|\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{k},\widehat{{\varvec{x}}}_{k});y_{k})-\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k});b_{k})|\nonumber \\&\quad \le \frac{1}{N}\max (\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{k},\widehat{{\varvec{x}}}_{k});y_{k}),\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{a}}_{k},\widehat{{\varvec{a}}}_{k});b_{k}))\nonumber \\&\quad \le \frac{1}{N}\max (\ell ((1-c_{1})B;1),\ell ((0-c_{0})B;0)). \end{aligned}$$
(39)

Meanwhile, we have

$$\begin{aligned}&\frac{1}{N} \sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}) - {\mathbb {E}}_{{\mathcal {X}}}\left( \frac{1}{N} \sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i})\right) \nonumber \\&\quad = \frac{1}{N} \sum _{i=1}^{N}\ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i}) - {\mathbb {E}}_{({\varvec{x}},\widehat{{\varvec{x}}})}\left[ \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}},\widehat{{\varvec{x}}});y_{({\varvec{x}},\widehat{{\varvec{x}}})})\right] \nonumber \\&\quad ={\mathcal {L}}(\varvec{\varphi })-\widetilde{{\mathcal {L}}}(\varvec{\varphi }). \end{aligned}$$
(40)

To apply Lemma 2, we set

$$\begin{aligned} \rho _{i}=\frac{1}{N}\max (\ell ((1-c_{1})B;1),\ell ((0-c_{0})B;0)), \end{aligned}$$
(41)

for all \(i=1,2,\ldots ,N\), and we get

$$\begin{aligned}&\text {pr}\left\{ |{\mathcal {L}}(\varvec{\varphi })-\widetilde{{\mathcal {L}}}(\varvec{\varphi })|<\theta (B)\sqrt{[\text {ln}(2/\delta )]/(2N)}\right\} \nonumber \\&\quad =1 - 2\text {e}^{-2\mu ^{2}/\sum _{i=1}^{n}\rho _{i}^{2}}\nonumber \\&\quad \ge 1 - 2\text {e}^{\frac{-2N(\theta (B)\sqrt{[\text {ln}(2/\delta )]/(2N)})^{2}}{\max ^{2}(\ell ((1-c_{1})B;1),\ell ((0-c_{0})B;0))}}\nonumber \\&\quad =1 - 2\text {e}^{-2N\left( \sqrt{[\text {ln}(2/\delta )]/(2N)}\right) ^{2}}\nonumber \\&\quad =1 - 2\text {e}^{-\text {ln}(2/\delta )}\nonumber \\&\quad =1-\delta , \end{aligned}$$
(42)

where \(\theta (B)=\max (\ell ((1-c_{1})B;1),\ell (-c_{0}B;0))\) is a real-valued, monotonically increasing function. The proof is completed. \(\square\)

1.5 A.5 Proof of Lipschitz-Continuity

Here we demonstrate that our learning objective \({\mathcal {F}}(\varvec{\varphi })\) is always Lipschitz continuous based on the Lipschitz-continuity of \(\varvec{\varphi }(\cdot )\), \(\ell (\cdot )\), \(\varOmega (\cdot )\), and \({\mathcal {R}}(\cdot )\). To be more specific, for any two given \(\varvec{\varphi }\) and \(\widetilde{\varvec{\varphi }}\), we have

$$\begin{aligned}&\vert {\mathcal {F}}(\varvec{\varphi })-{\mathcal {F}}(\widetilde{\varvec{\varphi }})\vert \nonumber \\&\quad =\vert 1/N\sum _{i=1}^{N}\nolimits \ell ({\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i})+\lambda \varOmega (\varvec{\varphi })\nonumber \\&\quad \quad -1/N\sum _{i=1}^{N}\nolimits \ell ({\mathcal {D}}_{\widetilde{\varvec{\varphi }}}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i});y_{i})-\lambda \varOmega (\widetilde{\varvec{\varphi }})\vert \nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits \vert {\mathcal {D}}_{\varvec{\varphi }}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i})-{\mathcal {D}}_{\widetilde{\varvec{\varphi }}}({\varvec{x}}_{i},\widehat{{\varvec{x}}}_{i})\vert +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad =L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits \vert \left( 1/h\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}\left( \vert \varphi _{i}({\varvec{x}}_{i})-\varphi _{i}(\widehat{{\varvec{x}}}_{i})\vert \right) \right] ^{p}\right) ^{ 1/p}\nonumber \\&\quad \quad -\left( 1/h\sum _{i=1}^{h}\nolimits \left[ {\mathcal {R}}\left( \vert {\widetilde{\varphi }}_{i}({\varvec{x}}_{i})-{\widetilde{\varphi }}_{i}(\widehat{{\varvec{x}}}_{i})\vert \right) \right] ^{p}\right) ^{ 1/p}\vert +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad =L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits \vert \omega (\vert \varvec{\varphi }({\varvec{x}}_{i})-\varvec{\varphi }(\widehat{{\varvec{x}}}_{i})\vert )-\omega (\vert \widetilde{\varvec{\varphi }}({\varvec{x}}_{i})-\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})\vert )\vert +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\Vert \vert \varvec{\varphi }({\varvec{x}}_{i})-\varvec{\varphi }(\widehat{{\varvec{x}}}_{i})\vert -\vert \widetilde{\varvec{\varphi }}({\varvec{x}}_{i})-\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})\vert \Vert _{2}\nonumber \\&\quad \quad +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\Vert \varvec{\varphi }({\varvec{x}}_{i})-\varvec{\varphi }(\widehat{{\varvec{x}}}_{i})+\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})-\widetilde{\varvec{\varphi }}({\varvec{x}}_{i})\Vert _{2}\nonumber \\&\quad \quad +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le L_{1}\frac{1}{N}\sum _{i=1}^{N}\nolimits B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\left( \Vert \varvec{\varphi }({\varvec{x}}_{i})-\widetilde{\varvec{\varphi }}({\varvec{x}}_{i})\Vert _{2}+\Vert \varvec{\varphi }(\widehat{{\varvec{x}}}_{i})-\widetilde{\varvec{\varphi }}(\widehat{{\varvec{x}}}_{i})\Vert _{2}\right) \nonumber \\&\quad \quad +\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad \le 2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}+\lambda L_{2}\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\nonumber \\&\quad =(2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}+\lambda L_{2})\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}, \end{aligned}$$
(43)

which implies that \((2L_{0}L_{1}B^{-(1-p)^{2}}L_{{{{\mathcal {R}}}}}+\lambda L_{2})\) is a valid Lipschitz constant of our learning objective \({\mathcal {F}}\).
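
As a rough numerical counterpart of this bound, the sketch below estimates the Lipschitz behaviour of a toy objective with a linear embedding \(\varvec{\varphi }({\varvec{x}})={\varvec{W}}^{\top }{\varvec{x}}\); the Frobenius norm of \({\varvec{W}}-\widetilde{{\varvec{W}}}\) is used as a stand-in for \(\Vert \varvec{\varphi }-\widetilde{\varvec{\varphi }}\Vert _{2}\), and the 1-Lipschitz surrogate loss and the restriction \({\mathcal {R}}(t)=B(1-e^{-t/B})\) are illustrative assumptions. The observed difference quotients stay bounded, which is what Lipschitz continuity of \({\mathcal {F}}\) predicts.

```python
import numpy as np

rng = np.random.default_rng(5)
N, d, h, lam, B, p = 200, 10, 16, 0.1, 1.0, 2
X, X_hat = rng.normal(size=(N, d)), rng.normal(size=(N, d))
y = rng.integers(0, 2, size=N)

R = lambda t: B * (1.0 - np.exp(-t / B))                            # illustrative restriction
loss = lambda D, y: np.where(y == 1, D, np.maximum(0.0, 1.0 - D))   # 1-Lipschitz surrogate loss

def F(W):
    """Toy objective: averaged pairwise loss on D_phi plus a Frobenius-norm regularizer."""
    D_vals = np.mean(R(np.abs(X @ W - X_hat @ W)) ** p, axis=1) ** (1.0 / p)
    return loss(D_vals, y).mean() + lam * np.linalg.norm(W)

W = rng.normal(size=(d, h))
ratios = []
for _ in range(2_000):
    W_tilde = W + 0.01 * rng.normal(size=(d, h))
    ratios.append(abs(F(W) - F(W_tilde)) / np.linalg.norm(W - W_tilde))
print("largest observed ratio |F(W) - F(W~)| / ||W - W~||_F:", max(ratios))
```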

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Chen, S., Gong, C., Li, X. et al. Boundary-restricted metric learning. Mach Learn 112, 4723–4762 (2023). https://doi.org/10.1007/s10994-023-06380-3
