Bucket-size balancing locality sensitive hashing using the map reduce paradigm

Cluster Computing

Abstract

Similarity search is an essential operation in domains such as data mining and content-based information retrieval. Although simple, the operation becomes a considerable burden as the number of data records grows, especially in big data applications. Approximate methods that trade some accuracy for speed have been developed to deliver effective services in a reasonable amount of time. Locality sensitive hashing (LSH) is a class of efficient approximate similarity search techniques. The various LSH algorithms that have been proposed all try to narrow down the candidate set of data to be examined. Because the candidate set does not always contain every item similar to the query, the search results are approximate. Enlarging the candidate set improves recall but slows processing. This paper is concerned with a method that increases recall without incurring significant cost. The method uses random hyperplane partitioning to create buckets among which data objects are distributed. When only the bucket containing the query is examined, nearest neighbors located on the other side of a hyperplane can become false negatives. The proposed method extends each hyperplane to cover its vicinity, so that data objects near a hyperplane are treated as belonging to both sides of it simultaneously. Over-sized buckets are further split by adding hyperplanes to control bucket sizes. To improve processing speed, the algorithm is implemented in the MapReduce paradigm on a Hadoop cluster. Experimental results are presented to show its applicability.
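The ideas in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names, the `margin` and `max_size` parameters, the use of hyperplanes through the origin, and the choice to split an over-sized bucket with one extra random hyperplane per round are all assumptions made for illustration, and plain Python functions stand in for the Hadoop map and reduce phases.

```python
import random
from collections import defaultdict
from itertools import product

def signed_distance(plane, point):
    # Distance from point to a hyperplane through the origin (assumed form).
    norm = sum(w * w for w in plane) ** 0.5
    return sum(w * x for w, x in zip(plane, point)) / norm

def bucket_keys(point, planes, margin):
    # One bit per hyperplane; a point within `margin` of a hyperplane
    # takes both bits, so it is replicated into the buckets on both sides.
    choices = []
    for plane in planes:
        d = signed_distance(plane, point)
        choices.append("01" if abs(d) <= margin else ("1" if d > 0 else "0"))
    return ["".join(bits) for bits in product(*choices)]

def query_key(point, planes):
    # A query is hashed strictly, to exactly one bucket.
    return "".join("1" if signed_distance(p, point) > 0 else "0" for p in planes)

def map_phase(records, planes, margin):
    # Stand-in for the mapper: emit (bucket key, record id) pairs.
    for rec_id, point in records:
        for key in bucket_keys(point, planes, margin):
            yield key, rec_id

def reduce_phase(pairs):
    # Stand-in for the reducer: group record ids by bucket key.
    buckets = defaultdict(list)
    for key, rec_id in pairs:
        buckets[key].append(rec_id)
    return dict(buckets)

def split_oversized(buckets, points, max_size, dim, margin, seed=1, max_rounds=8):
    # Hypothetical balancing step: each over-sized bucket gets one extra
    # random hyperplane per round, which extends its key by one bit.
    rng = random.Random(seed)
    splits = {}  # bucket key -> extra hyperplane used to split that bucket
    for _ in range(max_rounds):
        big = [k for k, ids in buckets.items() if len(ids) > max_size]
        if not big:
            break
        for key in big:
            plane = [rng.gauss(0.0, 1.0) for _ in range(dim)]
            splits[key] = plane
            for rec_id in buckets.pop(key):
                for bit in bucket_keys(points[rec_id], [plane], margin):
                    buckets.setdefault(key + bit, []).append(rec_id)
    return buckets, splits

def lookup(point, planes, buckets, splits):
    # Follow the same splits the data went through, then return the bucket.
    key = query_key(point, planes)
    while key in splits:
        key += query_key(point, [splits[key]])
    return buckets.get(key, [])

planes = [[1.0, 0.0], [0.0, 1.0]]
records = [("a", (0.05, 2.0)), ("b", (3.0, -4.0))]
buckets = reduce_phase(map_phase(records, planes, margin=0.1))
print(buckets)                                   # {'01': ['a'], '11': ['a'], '10': ['b']}
print(lookup((-0.05, 2.0), planes, buckets, {}))  # ['a']
```

In this example record `a` lies just to the positive side of the first hyperplane while the query lies just to the negative side. Without the margin the query's bucket `01` would be empty and `a` would be a false negative; with the margin `a` is stored in both `01` and `11`, at the cost of one extra copy.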

Acknowledgements

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) (Grant No. 2015R1D1A1A01061062).

Author information

Corresponding author

Correspondence to Keon Myung Lee.

About this article

Cite this article

Lee, K.M., Jeong, YS., Lee, S.H. et al. Bucket-size balancing locality sensitive hashing using the map reduce paradigm. Cluster Comput 22 (Suppl 1), 1959–1971 (2019). https://doi.org/10.1007/s10586-017-1013-2

  • DOI: https://doi.org/10.1007/s10586-017-1013-2
