
Calibrating Distance Metrics Under Uncertainty

  • Conference paper
  • First Online:
Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2022)

Abstract

Estimating distance metrics for given data samples is an essential task in machine learning, with a wide range of applications. When observations are noisy or values are missing, however, the metric cannot be determined accurately. In this work, we propose an approach to calibrating distance metrics. Compared with standard practices, which rely primarily on data imputation, our proposal makes fewer assumptions about the data and provides a solid theoretical guarantee on improving the quality of the estimate. We also develop a simple, efficient, yet effective computing procedure that scales the calibration process to large problems. Experimental results from a series of empirical evaluations confirm the benefits of the proposed approach and demonstrate its high potential in practical applications.
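For readers unfamiliar with distance-metric calibration, the minimal sketch below illustrates the general idea of repairing a corrupted distance matrix by projection, using Schoenberg's classical connection between Euclidean distance matrices and positive semidefinite Gram matrices. The function name, the noise model, and the eigenvalue-clipping projection are illustrative assumptions for this sketch only; the paper's own calibration procedure and its theoretical guarantee are given in the full text.

import numpy as np

def calibrate_distance_matrix(D_noisy):
    """Calibrate a noisy squared-distance matrix by projecting its
    double-centered Gram matrix onto the positive semidefinite cone
    (Schoenberg's characterization of Euclidean distance matrices).

    Illustrative sketch only; NOT the procedure proposed in the paper.
    """
    n = D_noisy.shape[0]
    # Enforce basic symmetry and a zero diagonal.
    D = 0.5 * (D_noisy + D_noisy.T)
    np.fill_diagonal(D, 0.0)

    # Double-centering: G = -1/2 * J D J with J = I - (1/n) * 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * J @ D @ J

    # Project G onto the PSD cone by clipping negative eigenvalues.
    w, V = np.linalg.eigh(G)
    G_psd = (V * np.clip(w, 0.0, None)) @ V.T

    # Map the calibrated Gram matrix back to squared distances.
    g = np.diag(G_psd)
    D_cal = g[:, None] + g[None, :] - 2.0 * G_psd
    return np.maximum(D_cal, 0.0)

# Toy usage: perturb a valid Euclidean distance matrix and calibrate it.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
D_true = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
D_noisy = D_true + 0.1 * rng.normal(size=D_true.shape)
D_cal = calibrate_distance_matrix(D_noisy)
print("error before calibration:", np.linalg.norm(D_noisy - D_true))
print("error after calibration: ", np.linalg.norm(D_cal - D_true))

On such toy data the projection typically shrinks the error relative to the noisy input, which is the kind of estimation-quality improvement that calibration approaches of this type target.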


Notes

  1. We set \(\mu = \max \left\{ d^{0}_{ij}\right\} \) and \(\epsilon = 0.02\) in the study.

  2. Implementation downloaded from http://optml.mit.edu/software.html.

  3. Implementation downloaded from https://candes.su.domains/software/.



Acknowledgments

We thank the reviewers for their helpful comments. This work was supported by the Guangdong Basic and Applied Basic Research Foundation (2021A1515011825), the Shenzhen Science and Technology Program (CUHKSZWDZC0004), and the Shenzhen Research Institute of Big Data.

Author information

Corresponding author

Correspondence to Wenye Li.



Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Li, W., Yu, F. (2023). Calibrating Distance Metrics Under Uncertainty. In: Amini, MR., Canu, S., Fischer, A., Guns, T., Kralj Novak, P., Tsoumakas, G. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2022. Lecture Notes in Computer Science, vol. 13715. Springer, Cham. https://doi.org/10.1007/978-3-031-26409-2_14


  • DOI: https://doi.org/10.1007/978-3-031-26409-2_14

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26408-5

  • Online ISBN: 978-3-031-26409-2

  • eBook Packages: Computer Science, Computer Science (R0)
