research-article

A Bayesian Perspective on Locality Sensitive Hashing with Extensions for Kernel Methods

Published: 12 October 2015

Abstract

Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects whose similarity exceeds a user-specified threshold. Locality-sensitive hashing (LSH) based indexing methods are very effective at reducing the number of candidates to search. However, most such methods use LSH only for the first phase of similarity search, that is, efficient indexing for candidate generation. In this article, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search: performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, in terms of both accuracy and recall. Finally, the quality of BayesLSH’s output can be easily tuned and, unlike standard approaches, does not require any manual setting of the number of hashes to use for similarity estimation. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range of 2× to 20×, for a wide variety of datasets.
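The pruning idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes Jaccard similarity with minwise hashes (so each hash position matches with probability equal to the similarity), a uniform Beta(1, 1) prior, and hypothetical names and defaults (`prob_similarity_at_least`, `bayes_prune`, `epsilon`, `batch`). Under those assumptions, after observing m matches in n hashes the posterior is Beta(m + 1, n - m + 1), and a candidate pair is dropped as soon as the posterior probability of meeting the threshold falls below `epsilon`.

```python
import math


def prob_similarity_at_least(t, m, n):
    """Pr[S >= t | m of n hashes matched], assuming a uniform Beta(1, 1) prior.

    The posterior is Beta(m + 1, n - m + 1). For integer parameters its upper
    tail equals the binomial CDF Pr[Bin(n + 1, t) <= m], so only the standard
    library is needed.
    """
    return sum(
        math.comb(n + 1, j) * t**j * (1 - t) ** (n + 1 - j)
        for j in range(m + 1)
    )


def bayes_prune(sig_x, sig_y, threshold, epsilon=0.03, batch=8):
    """Compare two equal-length hash signatures a batch at a time.

    Returns (pruned, hashes_used). The pair is pruned as soon as the posterior
    probability that its similarity reaches `threshold` drops below `epsilon`.
    `epsilon` and `batch` are illustrative defaults, not values from the paper.
    """
    matches = 0
    n = 0
    while n < len(sig_x):
        end = min(n + batch, len(sig_x))
        matches += sum(a == b for a, b in zip(sig_x[n:end], sig_y[n:end]))
        n = end
        if prob_similarity_at_least(threshold, matches, n) < epsilon:
            return True, n  # very unlikely to be a true positive: stop early
    return False, n  # survived every check: keep as a candidate


# Two signatures that never agree are discarded after the first batch,
# while identical signatures are never pruned.
print(bayes_prune(list(range(64)), list(range(100, 164)), threshold=0.7))
print(bayes_prune(list(range(64)), list(range(64)), threshold=0.7))
```

Comparing hashes in small batches is what lets dissimilar pairs exit early: most false positives are rejected after a handful of hash comparisons, and only plausible pairs pay for the full signature.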

We also extend the BayesLSH algorithm to kernel methods, in which the similarity between two data objects is defined by a kernel function. Since the embedding of data points in the transformed kernel space is unknown, algorithms such as AllPairs, which rely on building an inverted index structure for fast similarity search, do not work with kernel functions. Exhaustive search across all possible pairs is also not an option, since the dataset can be huge and computing the kernel values for each pair can be prohibitive. We propose K-BayesLSH, an all-pairs similarity search algorithm for kernel functions. K-BayesLSH leverages a recently proposed idea, kernelized locality sensitive hashing (KLSH), for hash bit computation and candidate generation, and uses the aforementioned BayesLSH idea for candidate pruning and similarity estimation. We ran a broad spectrum of experiments on a variety of datasets drawn from different domains and with distinct kernels, and found a speedup of 2× to 7× over vanilla KLSH.
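As a rough illustration of how hash bits relate to similarity in the kernel setting, the sketch below assumes the hash bits behave like random-hyperplane (sign) hashes, for which two objects at angle θ agree on a bit with probability 1 - θ/π. The function name `estimate_cosine` is hypothetical, and this point estimate is only the standard plug-in estimator, not the Bayesian procedure of the article.

```python
import math


def estimate_cosine(matches, n_bits):
    """Plug-in cosine estimate from agreeing hash bits.

    Assumes Pr[bit agrees] = 1 - theta / pi (random-hyperplane hashing), so the
    angle is estimated as pi * (1 - matches / n_bits).
    """
    if n_bits <= 0:
        raise ValueError("n_bits must be positive")
    theta = math.pi * (1.0 - matches / n_bits)
    return math.cos(theta)


# Full agreement gives cosine 1, half agreement is close to 0 (orthogonal),
# and no agreement gives cosine -1.
for m in (64, 32, 0):
    print(m, estimate_cosine(m, 64))
```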


    • Published in

      ACM Transactions on Knowledge Discovery from Data, Volume 10, Issue 2
      October 2015
      291 pages
      ISSN: 1556-4681
      EISSN: 1556-472X
      DOI: 10.1145/2835206

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 12 October 2015
      • Accepted: 1 May 2015
      • Revised: 1 April 2015
      • Received: 1 December 2014


      Qualifiers

      • research-article
      • Research
      • Refereed
