research-article

LoSHa: A General Framework for Scalable Locality Sensitive Hashing

Authors:
Jinfeng Li (Chinese University of Hong Kong, Hong Kong)
James Cheng (Chinese University of Hong Kong, Hong Kong)
Fan Yang (Chinese University of Hong Kong, Hong Kong)
Yuzhen Huang (Chinese University of Hong Kong, Hong Kong)
Yunjian Zhao (Chinese University of Hong Kong, Hong Kong)
Xiao Yan (Chinese University of Hong Kong, Hong Kong)
Ruihao Zhao (Chinese University of Hong Kong, Hong Kong)

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2017, Pages 635–644. https://doi.org/10.1145/3077136.3080800

Published: 07 August 2017

ABSTRACT

Locality Sensitive Hashing (LSH) algorithms are widely adopted to index similar items in high-dimensional space for approximate nearest neighbor search. As the volume of real-world datasets keeps growing, it has become necessary to develop distributed LSH solutions. Implementing a distributed LSH algorithm from scratch incurs high development costs, so most existing solutions are built on general-purpose platforms such as Hadoop and Spark. However, we argue that these platforms are both hard to use for programming LSH algorithms and inefficient for LSH computation. We propose LoSHa, a distributed computing framework that reduces development cost through a tailor-made, general programming interface and achieves high efficiency through LSH-specific system implementation and optimizations. We show that many LSH algorithms can be easily expressed in LoSHa's API. We evaluate LoSHa and compare it with general-purpose platforms on the same LSH algorithms. Our results show that LoSHa can be an order of magnitude faster, while implementations on LoSHa are more intuitive and require only a few lines of code.
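
For readers unfamiliar with the basic idea of LSH indexing, the following is a minimal single-machine sketch, not LoSHa's API and not a distributed implementation. It assumes the random-hyperplane hash family for cosine similarity; the class `LSHIndex` and the parameters `num_tables`, `num_bits`, and `seed` are hypothetical names chosen for illustration only.

```python
import math
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, rng):
    """Draw num_bits random Gaussian hyperplanes in dim dimensions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """One bit per hyperplane: the side of the plane the vector falls on."""
    return tuple(
        1 if sum(p * x for p, x in zip(plane, vec)) >= 0 else 0
        for plane in planes
    )


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


class LSHIndex:
    """Illustrative (hypothetical) in-memory LSH index with several hash tables."""

    def __init__(self, dim, num_tables=8, num_bits=6, seed=0):
        rng = random.Random(seed)
        # Each table pairs its own hyperplanes with a bucket map.
        self.tables = [
            (make_hyperplanes(num_bits, dim, rng), defaultdict(list))
            for _ in range(num_tables)
        ]
        self.items = {}

    def add(self, item_id, vec):
        """Insert the item into one bucket per table."""
        self.items[item_id] = vec
        for planes, buckets in self.tables:
            buckets[signature(vec, planes)].append(item_id)

    def candidates(self, query):
        """Union of the buckets the query hashes to, across all tables."""
        found = set()
        for planes, buckets in self.tables:
            found.update(buckets.get(signature(query, planes), []))
        return found

    def query(self, query, k=1):
        """Rank only the candidates by exact cosine similarity."""
        ranked = sorted(
            self.candidates(query),
            key=lambda i: cosine(self.items[i], query),
            reverse=True,
        )
        return ranked[:k]
```

In this sketch, similar vectors tend to share a bucket in at least one table, so a query only compares against a small candidate set rather than the whole dataset. A distributed solution would additionally have to partition the buckets and items across machines, which is the part a framework such as LoSHa is meant to handle.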

Index Terms

LoSHa: A General Framework for Scalable Locality Sensitive Hashing
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
2. Theory of computation
  1. Design and analysis of algorithms
    1. Distributed algorithms
      1. Self-organization
  2. Models of computation
    1. Concurrency
      1. Distributed computing models

Published in
SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017
1476 pages
ISBN: 9781450350228
DOI: 10.1145/3077136
General Chairs: Noriko Kando (National Institute of Informatics), Tetsuya Sakai (Waseda University), Hideo Joho (University of Tsukuba)
Program Chairs: Hang Li (Huawei Noah's Ark Lab), Arjen P. de Vries (Radboud University), Ryen W. White (Microsoft Cortana)
Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
distributed similarity search
locality sensitive hashing
retrieval of high-dimensional data
Acceptance Rates
SIGIR '17 paper acceptance rate: 78 of 362 submissions, 22%
Overall acceptance rate: 792 of 3,983 submissions, 20%