research-article

A Bayesian Perspective on Locality Sensitive Hashing with Extensions for Kernel Methods

Authors:
Aniket Chakrabarti, The Ohio State University
Venu Satuluri, Twitter Inc.
Atreya Srivathsan, Amazon.com Inc.
Srinivasan Parthasarathy, The Ohio State University

ACM Transactions on Knowledge Discovery from Data, Volume 10, Issue 2, Article No. 19, pp. 1–32. https://doi.org/10.1145/2778990

Published: 12 October 2015

Abstract

Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a user-specified threshold. Locality-sensitive hashing (LSH) based indexing methods are very effective at reducing the number of candidates to search. However, most such methods use LSH only for the first phase of similarity search, that is, efficient indexing for candidate generation. In this article, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search: candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, in terms of both accuracy and recall. Finally, the quality of BayesLSH's output can be easily tuned and, unlike standard approaches, does not require manually setting the number of hashes to use for similarity estimation. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range of 2× to 20×, for a wide variety of datasets.
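The pruning idea can be illustrated with a minimal sketch. The abstract gives no formulas, so everything specific here is an assumption made for illustration: the hashes are taken to be minhashes (for which the probability that one hash matches equals the Jaccard similarity), the prior on the similarity S is uniform, and the names `prob_similarity_at_least`, `prune_or_keep`, `eps`, and `batch` are hypothetical. Under these assumptions, after observing `matches` agreements in `n` hashes the posterior on S is Beta(matches + 1, n - matches + 1), and a pair is dropped once the posterior probability that S reaches the threshold `t` falls below `eps`.

```python
import math


def prob_similarity_at_least(t, matches, n):
    """Posterior Pr[S >= t | matches of n hashes agree], uniform prior on S.

    The posterior is Beta(matches + 1, n - matches + 1). For integer
    parameters its upper tail equals a binomial CDF:
    Pr[S >= t] = Pr[Binomial(n + 1, t) <= matches].
    """
    return sum(
        math.comb(n + 1, j) * t**j * (1.0 - t) ** (n + 1 - j)
        for j in range(matches + 1)
    )


def prune_or_keep(sig_x, sig_y, t, eps, batch=8):
    """Compare two hash signatures batch by batch (hypothetical helper).

    Returns (pruned, hashes_compared, matches). The pair is pruned as soon
    as the posterior probability of reaching threshold t drops below eps,
    so dissimilar pairs are rejected after only a few hash comparisons.
    """
    matches = 0
    n = 0
    total = min(len(sig_x), len(sig_y))
    while n < total:
        end = min(n + batch, total)
        matches += sum(1 for a, b in zip(sig_x[n:end], sig_y[n:end]) if a == b)
        n = end
        if prob_similarity_at_least(t, matches, n) < eps:
            return True, n, matches
    return False, n, matches
```

Because the stopping rule is evaluated after every batch, the number of hashes examined adapts to each pair rather than being fixed in advance, which is the behavior the abstract describes. The sketch covers pruning only; it does not implement the similarity estimation step or the accuracy and recall guarantees.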

We also extend the BayesLSH algorithm to kernel methods, in which the similarity between two data objects is defined by a kernel function. Since the embedding of data points in the transformed kernel space is unknown, algorithms such as AllPairs, which rely on building an inverted index structure for fast similarity search, do not work with kernel functions. Exhaustive search across all possible pairs is also not an option, since the dataset can be huge and computing the kernel values for each pair can be prohibitive. We propose K-BayesLSH, an all-pairs similarity search algorithm for kernel functions. K-BayesLSH leverages a recently proposed idea, kernelized locality sensitive hashing (KLSH), for hash bit computation and candidate generation, and uses the aforementioned BayesLSH idea for candidate pruning and similarity estimation. We ran a broad spectrum of experiments on a variety of datasets drawn from different domains and with distinct kernels, and found a speedup of 2× to 7× over vanilla KLSH.
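A rough sketch of how the same pruning test might carry over to kernel hash bits follows. It rests on assumptions the abstract does not state: KLSH bits are treated like random-hyperplane bits in the kernel-induced space, so two objects with normalized kernel similarity s agree on a bit with probability approximately r = 1 - arccos(s)/π, and the prior is taken to be uniform on r. The function names are hypothetical.

```python
import math


def threshold_to_match_rate(t):
    """Bit-agreement probability that corresponds to similarity t (assumed model)."""
    return 1.0 - math.acos(t) / math.pi


def match_rate_to_similarity(r):
    """Inverse mapping: similarity implied by a bit-agreement probability r."""
    return math.cos(math.pi * (1.0 - r))


def prob_kernel_similarity_at_least(t, matches, n):
    """Posterior Pr[similarity >= t | matches of n bits agree].

    Similarity is monotone in r, so the event is r >= r_t. With a uniform
    prior on r the posterior is Beta(matches + 1, n - matches + 1), whose
    upper tail is the binomial CDF Pr[Binomial(n + 1, r_t) <= matches].
    """
    r_t = threshold_to_match_rate(t)
    return sum(
        math.comb(n + 1, j) * r_t**j * (1.0 - r_t) ** (n + 1 - j)
        for j in range(matches + 1)
    )
```

A candidate pair would be pruned once `prob_kernel_similarity_at_least` falls below a chosen tolerance, exactly as in the non-kernel case. Only hash bits are compared, so the kernel function itself is not evaluated for pairs that are pruned.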

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Probabilistic retrieval models

Published in
ACM Transactions on Knowledge Discovery from Data Volume 10, Issue 2
October 2015
291 pages
ISSN:1556-4681
EISSN:1556-472X
DOI:10.1145/2835206
Editor:
Philip S. Yu
University of Illinois at Chicago, USA
Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 12 October 2015
- Accepted: 1 May 2015
- Revised: 1 April 2015
- Received: 1 December 2014
Author Tags
Locality-sensitive hashing
all-pairs similarity search
bayesian inference
kernel similarity measure
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total Citations: 7
- Total Downloads: 388
- Downloads (Last 12 months): 12
- Downloads (Last 6 weeks): 3