LSH-based distributed similarity indexing with load balancing in high-dimensional space

The Journal of Supercomputing

Abstract

Locality-sensitive hashing (LSH) and its variants are well-known indexing schemes for similarity search in high-dimensional space. Traditionally, these indexing schemes are centrally managed, and multiple hash tables are needed to guarantee search quality. However, because the storage space and processing capacity of a single server are limited, centralized indexing schemes become impractical for massive numbers of data objects. Several distributed indexing schemes based on peer-to-peer (P2P) networks have therefore been proposed, but ensuring load balancing remains a key issue. To address this problem, we propose two theoretical LSH-based data distribution models in P2P networks, one for datasets with homogeneous \(l_2\) norms and one for datasets with heterogeneous \(l_2\) norms. Unlike earlier schemes, we focus on load balancing for a single hash table rather than for multiple tables, which, to our knowledge, has not been considered previously. Based on these models, we propose a static distributed indexing scheme with a novel load-balancing index mapping method that uses the cumulative distribution function derived from our models. Furthermore, we propose a dynamic load rebalancing algorithm that uses the virtual node method of P2P networks to make the static indexing scheme more practical and robust. Experiments on synthetic and real datasets show that the proposed distributed similarity indexing schemes balance load effectively and efficiently for similarity indexing in high-dimensional space.
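
The idea of mapping index values to nodes through a cumulative distribution function can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes that the LSH projection values of objects with a common \(l_2\) norm r behave like samples from N(0, r^2), which is what the 2-stability of Gaussian projections suggests. It then compares a CDF-based mapping, where each node covers an equal probability mass, against an equal-width baseline. The function names (`cdf_node`, `equal_width_node`, `simulate`) and all parameter values are hypothetical and chosen only for illustration.

```python
import math
import random


def gaussian_cdf(x, sigma):
    """CDF of N(0, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def cdf_node(x, sigma, num_nodes):
    """Map a projection value to a node so each node covers equal probability mass."""
    u = gaussian_cdf(x, sigma)  # approximately uniform on [0, 1] if x ~ N(0, sigma^2)
    return min(int(u * num_nodes), num_nodes - 1)


def equal_width_node(x, sigma, num_nodes, span=4.0):
    """Baseline: split [-span*sigma, span*sigma] into equal-width ranges."""
    lo, hi = -span * sigma, span * sigma
    clipped = min(max(x, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * num_nodes)
    return min(idx, num_nodes - 1)


def load_per_node(values, mapper, num_nodes):
    """Count how many values each node receives under the given mapping."""
    counts = [0] * num_nodes
    for x in values:
        counts[mapper(x)] += 1
    return counts


def simulate(num_objects=20000, num_nodes=8, r=1.0, seed=0):
    rng = random.Random(seed)
    # Assumption: projections of norm-r vectors are modelled as N(0, r^2) samples.
    values = [rng.gauss(0.0, r) for _ in range(num_objects)]
    balanced = load_per_node(values, lambda x: cdf_node(x, r, num_nodes), num_nodes)
    baseline = load_per_node(values, lambda x: equal_width_node(x, r, num_nodes), num_nodes)
    return balanced, baseline


if __name__ == "__main__":
    balanced, baseline = simulate()
    print("CDF mapping:", balanced)
    print("Equal width:", baseline)
```

Under these assumptions, the CDF mapping should give each node a roughly equal share of objects, while the equal-width baseline concentrates load on the nodes that cover the centre of the distribution. Datasets with heterogeneous norms would need a different distribution model and are not covered by this sketch.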

Notes

  1. http://corpus-texmex.irisa.fr/.

Acknowledgements

This research is supported by the National Natural Science Foundation of China under Grant Nos. 41571389 and 61373139, the Open Research Fund from the Key Laboratory of Computer Network and Information Integration (SEU), Ministry of Education, China, under Grant No. K93-9-2014-05B, and the Scientific Research Foundation of NUPT under Grant No. NY214063.

Author information

Corresponding author

Correspondence to Jiagao Wu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Wu, J., Shen, L. & Liu, L. LSH-based distributed similarity indexing with load balancing in high-dimensional space. J Supercomput 76, 636–665 (2020). https://doi.org/10.1007/s11227-019-03047-6

  • DOI: https://doi.org/10.1007/s11227-019-03047-6
