DOI: 10.1145/3589334.3645456
research-article

A Fast Similarity Matrix Calibration Method with Incomplete Query

Published: 13 May 2024

Abstract

The similarity matrix is at the core of similarity search problems. However, incomplete observations are ubiquitous in real scenarios, leading to a less accurate similarity matrix. To alleviate this problem, based on the key insight that the similarity matrix is both symmetric and positive semi-definite (PSD), we propose a novel similarity matrix calibration method that is scalable, effective, and sound. Specifically, we establish the PSD property as a constraint for the similarity matrix calibration problem and estimate a similarity matrix that approximates the unknown complete ground-truth similarity matrix. To enable a fast optimization process, we further develop a general approximated algorithm that bypasses the computation of singular values. Theoretical analysis ensures stable calibration performance and convergence speed. Extensive experiments on similarity matrix calibration with real-world datasets demonstrate that our proposed method outperforms baseline methods in both accuracy and speed.
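
To make the PSD constraint concrete, the following is a minimal sketch of the classical baseline idea only: symmetrize an observed similarity matrix and clip its negative eigenvalues, which yields the nearest PSD matrix in Frobenius norm. It is not the fast algorithm proposed in the paper, which is described as bypassing the computation of singular values. The function name `project_to_psd` and the 3-by-3 matrix `s_obs` are hypothetical and chosen purely for illustration.

```python
import numpy as np


def project_to_psd(s: np.ndarray) -> np.ndarray:
    """Return the nearest symmetric PSD matrix to `s` in Frobenius norm.

    Classical eigenvalue-clipping projection (illustrative baseline,
    not the paper's approximated algorithm).
    """
    # Enforce symmetry first; estimates from incomplete data may be asymmetric.
    sym = (s + s.T) / 2.0
    # eigh is the eigen-decomposition for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(sym)
    # Negative eigenvalues violate the PSD property; set them to zero.
    clipped = np.clip(eigvals, 0.0, None)
    # Rebuild V * diag(clipped) * V^T.
    return (eigvecs * clipped) @ eigvecs.T


# Hypothetical similarity matrix estimated from incomplete observations.
# The pairwise values are mutually inconsistent, so it is not PSD.
s_obs = np.array([
    [1.0, 0.9, -0.9],
    [0.9, 1.0, 0.9],
    [-0.9, 0.9, 1.0],
])

s_cal = project_to_psd(s_obs)

min_eig_before = np.linalg.eigvalsh(s_obs).min()
min_eig_after = np.linalg.eigvalsh(s_cal).min()
print("smallest eigenvalue before calibration:", min_eig_before)
print("smallest eigenvalue after calibration: ", min_eig_after)
```

The full eigen-decomposition in this sketch costs cubic time in the number of items, which is the kind of expense a method that avoids computing singular values aims to reduce.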

      Published In

      WWW '24: Proceedings of the ACM Web Conference 2024
      May 2024
      4826 pages
      ISBN:9798400701719
      DOI:10.1145/3589334
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. optimization
      2. positive semi-definiteness
      3. similarity matrix calibration

      Qualifiers

      • Research-article

      Funding Sources

      • InnoHK Fund

      Conference

      WWW '24
      WWW '24: The ACM Web Conference 2024
      May 13 - 17, 2024
      Singapore, Singapore

      Acceptance Rates

      Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

      Article Metrics

      • Total Citations: 0
      • Total Downloads: 124
      • Downloads (Last 12 months): 124
      • Downloads (Last 6 weeks): 18
      Reflects downloads up to 05 Mar 2025
