Abstract
Estimating distance metrics from given data samples is essential to many machine learning algorithms and their applications. When the observations are noisy or contain missing values, the metric can no longer be determined accurately. In this work, we propose an approach to calibrating distance metrics. Compared with standard practices, which rely primarily on data imputation, our proposal makes fewer assumptions about the data and comes with a theoretical guarantee of improving the quality of the estimate. We develop a simple and efficient computing procedure that scales to realize the calibration process. A series of empirical evaluations supports the benefits of the proposed approach and demonstrates its potential in practical applications.
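To make the problem concrete, the following minimal sketch shows how a single perturbed entry can stop a distance matrix from being a valid metric. It is an illustration only, not the calibration procedure proposed in the paper; the helper names `pairwise_distances` and `count_triangle_violations`, the three toy points, and the perturbed value 2.5 are all assumptions introduced here.

```python
import itertools
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix for the rows of X (hypothetical helper)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def count_triangle_violations(D, tol=1e-9):
    """Count (pair, intermediate point) combinations that break the
    triangle inequality, i.e. D[i, j] > D[i, k] + D[k, j] + tol."""
    n = D.shape[0]
    count = 0
    for i, j in itertools.combinations(range(n), 2):
        for k in range(n):
            if k != i and k != j and D[i, j] > D[i, k] + D[k, j] + tol:
                count += 1
    return count

# Three collinear toy points: the true distances satisfy the triangle inequality.
X = np.array([[0.0], [1.0], [2.0]])
D_true = pairwise_distances(X)

# Assumed noisy estimate: one entry is overestimated (kept symmetric).
D_noisy = D_true.copy()
D_noisy[0, 2] = D_noisy[2, 0] = 2.5

print(count_triangle_violations(D_true))   # violations in the exact matrix
print(count_triangle_violations(D_noisy))  # violations after the perturbation
```

A check of this kind only detects an inconsistent estimate; restoring a valid metric from it is the calibration task the paper addresses.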
Notes
1. We set \(\mu = \max \left\{ d^{0}_{ij}\right\}\) and \(\epsilon = 0.02\) in the study.
2. Implementation downloaded from http://optml.mit.edu/software.html.
3. Implementation downloaded from https://candes.su.domains/software/.
Acknowledgments
We thank the reviewers for their helpful comments. This work is supported by the Guangdong Basic and Applied Basic Research Foundation (2021A1515011825), the Shenzhen Science and Technology Program (CUHKSZWDZC0004), and the Shenzhen Research Institute of Big Data.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, W., Yu, F. (2023). Calibrating Distance Metrics Under Uncertainty. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol 13715. Springer, Cham. https://doi.org/10.1007/978-3-031-26409-2_14
DOI: https://doi.org/10.1007/978-3-031-26409-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-26408-5
Online ISBN: 978-3-031-26409-2
eBook Packages: Computer Science (R0)