DOI: 10.1145/3077136.3080800
Research article

LoSHa: A General Framework for Scalable Locality Sensitive Hashing

Published: 07 August 2017

ABSTRACT

Locality Sensitive Hashing (LSH) algorithms are widely adopted to index similar items in high-dimensional space for approximate nearest neighbor search. As the volume of real-world datasets keeps growing, it has become necessary to develop distributed LSH solutions. Implementing a distributed LSH algorithm from scratch incurs high development costs, so most existing solutions are built on general-purpose platforms such as Hadoop and Spark. However, we argue that these platforms are both hard to use for programming LSH algorithms and inefficient for LSH computation. We propose LoSHa, a distributed computing framework that reduces development cost through a tailor-made, general programming interface and achieves high efficiency through LSH-specific system implementation and optimizations. We show that many LSH algorithms can be easily expressed in LoSHa's API. We evaluate LoSHa and compare it with general-purpose platforms running the same LSH algorithms. Our results show that LoSHa can be an order of magnitude faster, while the implementations on LoSHa are more intuitive and require only a few lines of code.
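To make the indexing idea in the abstract concrete, here is a minimal sketch of one classic LSH scheme, random-hyperplane (SimHash-style) hashing for cosine similarity. This is an illustration only: it is not LoSHa's actual API (which the paper defines), and all names here (`make_hyperplanes`, `lsh_signature`, `build_index`) are hypothetical.

```python
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Generate n_bits random hyperplanes (Gaussian normal vectors)."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_signature(vec, planes):
    """SimHash-style signature: one bit per hyperplane, the sign of the dot product."""
    bits = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def build_index(vectors, planes):
    """Bucket item ids by signature; similar vectors tend to collide in a bucket."""
    buckets = {}
    for item_id, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, planes), []).append(item_id)
    return buckets

# Usage: index a few toy vectors; candidates for a query are the items
# sharing its bucket, checked exactly afterwards (the "filter" step).
planes = make_hyperplanes(dim=3, n_bits=8, seed=42)
data = {"a": [1.0, 0.9, 0.1], "b": [1.0, 1.0, 0.0], "c": [-1.0, 0.1, 5.0]}
index = build_index(data, planes)
```

In a distributed setting, the expensive parts are shuffling items into buckets across machines and de-duplicating candidate pairs; these are the LSH-specific steps that, per the abstract, general-purpose platforms handle inefficiently.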


Published in

SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2017, 1476 pages
ISBN: 9781450350228
DOI: 10.1145/3077136
Copyright © 2017 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGIR '17 paper acceptance rate: 78 of 362 submissions (22%). Overall acceptance rate: 792 of 3,983 submissions (20%).
