Abstract
Given a collection of objects and an associated similarity measure, the all-pairs similarity search problem asks us to find all pairs of objects with similarity greater than a user-specified threshold. Locality-sensitive hashing (LSH) based indexing methods are very effective at reducing the number of candidates to search. However, most such methods use LSH only for the first phase of similarity search, that is, efficient indexing for candidate generation. In this article, we present BayesLSH, a principled Bayesian algorithm for the subsequent phase of similarity search: performing candidate pruning and similarity estimation using LSH. A simpler variant, BayesLSH-Lite, which calculates similarities exactly, is also presented. Our algorithms quickly prune away a large majority of the false positive candidate pairs, leading to significant speedups over baseline approaches. For BayesLSH, we also provide probabilistic guarantees on the quality of the output, in terms of both accuracy and recall. Finally, the quality of BayesLSH’s output can be easily tuned and, unlike standard approaches, does not require any manual setting of the number of hashes to use for similarity estimation. For two state-of-the-art candidate generation algorithms, AllPairs and LSH, BayesLSH enables significant speedups, typically in the range 2× to 20×, for a wide variety of datasets.
We also extend the BayesLSH algorithm to kernel methods, in which the similarity between two data objects is defined by a kernel function. Since the embedding of data points in the transformed kernel space is unknown, algorithms such as AllPairs, which rely on building an inverted index structure for fast similarity search, do not work with kernel functions. Exhaustive search across all possible pairs is also not an option, since the dataset can be huge and computing the kernel values for each pair can be prohibitive. We propose K-BayesLSH, an all-pairs similarity search algorithm for kernel functions. K-BayesLSH leverages a recently proposed idea, kernelized locality-sensitive hashing (KLSH), for hash bit computation and candidate generation, and uses the aforementioned BayesLSH idea for candidate pruning and similarity estimation. We ran a broad spectrum of experiments on a variety of datasets drawn from different domains and with distinct kernels, and find a speedup of 2× to 7× over vanilla KLSH.
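The candidate pruning idea described in the abstract can be illustrated with a minimal sketch. It is not the article's implementation. It assumes a hash family whose collision probability equals the similarity itself (as with minhash for Jaccard similarity) and a uniform prior on the similarity, so that after observing m matches in n hashes the posterior is Beta(m+1, n-m+1). The function names, the batch size of 32, and the default `epsilon` are hypothetical choices made for illustration. The sketch covers only early pruning and omits the accuracy-based stopping rule.

```python
from math import comb


def prob_similarity_at_least(threshold, matches, hashes):
    """Posterior P(S >= threshold) after `matches` of `hashes` hash values agree.

    Assumes a uniform prior, giving a Beta(matches+1, hashes-matches+1) posterior.
    For integer parameters its upper tail equals the binomial CDF
    P(Binomial(hashes+1, threshold) <= matches), which needs only the stdlib.
    """
    n = hashes + 1
    return sum(
        comb(n, k) * threshold**k * (1 - threshold) ** (n - k)
        for k in range(matches + 1)
    )


def should_prune(threshold, matches, hashes, epsilon=0.03):
    """Prune the pair if it is unlikely (< epsilon) to meet the threshold."""
    return prob_similarity_at_least(threshold, matches, hashes) < epsilon


def compare_incrementally(sig_a, sig_b, threshold, epsilon=0.03, batch=32):
    """Compare two hash signatures batch by batch, stopping early when pruned.

    Returns (estimate, hashes_used); estimate is None if the pair was pruned.
    """
    if not sig_a or len(sig_a) != len(sig_b):
        raise ValueError("signatures must be non-empty and of equal length")
    matches = 0
    for start in range(0, len(sig_a), batch):
        end = min(start + batch, len(sig_a))
        matches += sum(1 for x, y in zip(sig_a[start:end], sig_b[start:end]) if x == y)
        if should_prune(threshold, matches, end, epsilon):
            return None, end
    # Fraction of matching hashes: the posterior mode under the uniform prior.
    return matches / len(sig_a), len(sig_a)
```

In this sketch, a pair whose signatures disagree everywhere is discarded after the first batch of hashes, whereas a pair with many matches is compared over the full signature. This is where the savings over a fixed number of hashes per pair come from.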
Index Terms
- A Bayesian Perspective on Locality Sensitive Hashing with Extensions for Kernel Methods