Bucket-size balancing locality sensitive hashing using the map reduce paradigm

Lee, Kyung Mi; Jeong, Yoon-Su; Lee, Sang Ho; Lee, Keon Myung

doi:10.1007/s10586-017-1013-2

Published: 15 July 2017

Volume 22, pages 1959–1971, (2019)

Cluster Computing

Kyung Mi Lee¹,
Yoon-Su Jeong²,
Sang Ho Lee¹ &
…
Keon Myung Lee ORCID: orcid.org/0000-0003-0132-0260¹

Abstract

Similarity search is an essential operation in domains such as data mining and content-based information retrieval. This simple operation becomes a considerable burden when the number of data records grows large, especially in big data applications. Approximate methods, which trade some accuracy for speed, have been developed to find similar items and deliver effective services in a reasonable amount of time. Locality sensitive hashing is a class of efficient approximate similarity search techniques. Various locality sensitive hashing algorithms have been proposed, all of which essentially try to narrow down the candidate data set to be examined. The candidate data set does not always contain all the data similar to the query, and thus the search results are approximate. Increasing the size of the candidate set improves the recall of similar items, but it slows down processing. This paper is concerned with a method that increases the recall rate without incurring significant cost. The method uses a random hyperplane partitioning technique to create buckets across which data objects are distributed. Nearest neighbors located on the other side of such hyperplanes can be false negatives when only the bucket to which the query belongs is examined. The proposed method extends each hyperplane to occupy its vicinity, so that data objects in the vicinity of a hyperplane are treated as belonging to both sides of the hyperplane simultaneously. Over-sized buckets are further split by adding hyperplanes to control the bucket sizes. To improve the processing speed, the algorithm is implemented in the MapReduce paradigm on a Hadoop cluster. Experimental results are presented to show its applicability.
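
The three ideas in the abstract (random hyperplane bucketing, treating objects near a hyperplane as belonging to both sides, and splitting over-sized buckets with additional hyperplanes) can be illustrated with a minimal sketch. This is not the authors' implementation: the full text is not available here, so every function name, the `margin` and `max_size` parameters, the `max_depth` cap, and the choice of hyperplanes through the origin are assumptions made for illustration. Plain Python functions stand in for the map and reduce steps that would run on a Hadoop cluster.

```python
import random
from collections import defaultdict
from itertools import product


def random_unit_vector(dim, rng):
    """Draw a random direction (a normalised Gaussian vector)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(c * c for c in v) ** 0.5
    return [c / norm for c in v]


def bucket_keys(x, planes, margin):
    """Return every bucket key for x (assumed scheme, one bit per hyperplane).

    A point within `margin` of a hyperplane takes both bits for it,
    so it is placed in the buckets on both sides.
    """
    options = []
    for w in planes:
        d = sum(wi * xi for wi, xi in zip(w, x))  # signed distance (w has unit length)
        if abs(d) <= margin:
            options.append("01")  # in the vicinity: both sides
        else:
            options.append("1" if d > 0 else "0")
    return ["".join(bits) for bits in product(*options)]


def map_phase(points, planes, margin):
    """Map step: emit (bucket key, point id) pairs."""
    for pid, x in enumerate(points):
        for key in bucket_keys(x, planes, margin):
            yield key, pid


def reduce_phase(pairs):
    """Reduce step: group point ids by bucket key."""
    buckets = defaultdict(list)
    for key, pid in pairs:
        buckets[key].append(pid)
    return dict(buckets)


def split_oversized(buckets, points, max_size, margin, rng, max_depth=8):
    """Split buckets larger than max_size with extra hyperplanes (hypothetical policy).

    Returns the final buckets and the extra hyperplane used for each split key.
    max_depth caps the number of extra hyperplanes applied to any one bucket.
    """
    dim = len(points[0])
    final, extra_planes = {}, {}
    work = [(key, ids, 0) for key, ids in buckets.items()]
    while work:
        key, ids, depth = work.pop()
        if len(ids) <= max_size or depth >= max_depth:
            final[key] = ids
            continue
        w = random_unit_vector(dim, rng)
        extra_planes[key] = w
        children = defaultdict(list)
        for pid in ids:
            for bit in bucket_keys(points[pid], [w], margin):
                children[key + bit].append(pid)
        for child_key, child_ids in children.items():
            work.append((child_key, child_ids, depth + 1))
    return final, extra_planes


if __name__ == "__main__":
    rng = random.Random(42)
    data = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(1000)]
    planes = [random_unit_vector(8, rng) for _ in range(4)]
    coarse = reduce_phase(map_phase(data, planes, margin=0.1))
    balanced, extra = split_oversized(coarse, data, max_size=50, margin=0.1, rng=rng)
    print(len(coarse), "buckets before splitting,", len(balanced), "after")
```

In this sketch a larger `margin` replicates more objects across neighboring buckets, which should raise recall at the cost of larger buckets, and `max_size` bounds how many candidates a query has to examine. A query would be hashed with the same hyperplanes, following the stored extra hyperplanes wherever its bucket was split.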

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) (Grant No. 2015R1D1A1A01061062).

Author information

Authors and Affiliations

Department of Computer Science, Chungbuk National University, Chungdae-ro 1, Cheongju, Chungbuk, 28644, Korea
Kyung Mi Lee, Sang Ho Lee & Keon Myung Lee
Division of Information and Communication Convergence Engineering, Mokwon University, Daejeon, Korea
Yoon-Su Jeong

Corresponding author

Correspondence to Keon Myung Lee.

About this article

Cite this article

Lee, K.M., Jeong, YS., Lee, S.H. et al. Bucket-size balancing locality sensitive hashing using the map reduce paradigm. Cluster Comput 22 (Suppl 1), 1959–1971 (2019). https://doi.org/10.1007/s10586-017-1013-2

Received: 05 October 2016
Revised: 29 May 2017
Accepted: 22 June 2017
Published: 15 July 2017
Issue Date: 16 January 2019
DOI: https://doi.org/10.1007/s10586-017-1013-2
