Abstract
Locality-sensitive hashing (LSH) and its variants are well-known indexing schemes for similarity search in high-dimensional space. Traditionally, these indexing schemes are centrally managed, and multiple hash tables are needed to guarantee search quality. However, because the storage space and processing capacity of a single server are limited, centralized indexing schemes become impractical for massive datasets. Several distributed indexing schemes based on peer-to-peer (P2P) networks have therefore been proposed, but ensuring load balancing remains one of the key issues. To address this problem, we propose two theoretical LSH-based data distribution models in P2P networks, one for datasets with homogeneous \(l_2\) norms and one for datasets with heterogeneous \(l_2\) norms. Unlike earlier schemes, we focus on load balancing for a single hash table rather than for multiple tables, which, to our knowledge, has not been considered previously. Based on these models, we then propose a static distributed indexing scheme with a novel load-balancing index mapping method built on the cumulative distribution function derived from our models. Furthermore, we propose a dynamic load rebalancing algorithm that uses the virtual node method of P2P networks to make the static indexing scheme more practical and robust. Experiments on synthetic and real datasets show that the proposed distributed similarity indexing schemes balance load effectively and efficiently for similarity indexing in high-dimensional space.
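The idea of mapping index keys through a cumulative distribution function can be illustrated with a minimal sketch. It is not the scheme from the paper, and it rests on several assumptions that the abstract does not state: a single Gaussian (2-stable) projection is used as the hash value, "homogeneous \(l_2\) norms" is taken to mean that every point has the same norm, the points are spread uniformly over a sphere so that the projections are approximately normal, and peers own equal slices of a key space. All function names and parameters (`peer_by_cdf`, `peer_by_width`, `simulate`, and so on) are hypothetical.

```python
import math
import random


def normal_cdf(x, sigma):
    """CDF of a zero-mean normal distribution with standard deviation sigma."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def peer_by_cdf(projection, sigma, num_peers):
    """Assign a projection to a peer by splitting the CDF range [0, 1] evenly."""
    u = normal_cdf(projection, sigma)
    return min(int(u * num_peers), num_peers - 1)


def peer_by_width(projection, sigma, num_peers):
    """Baseline: split [-3*sigma, 3*sigma] into equal-width ranges, one per peer."""
    u = (projection + 3.0 * sigma) / (6.0 * sigma)
    return min(max(int(u * num_peers), 0), num_peers - 1)


def imbalance(loads):
    """Ratio of the heaviest peer's load to the mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


def simulate(num_points=20000, dim=32, norm=5.0, num_peers=8, seed=0):
    """Return (CDF-mapping imbalance, equal-width imbalance) on synthetic data."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # Gaussian projection vector
    a_norm = math.sqrt(sum(x * x for x in a))
    # Assumption: for points uniform on a sphere of radius `norm`, the
    # projection a . v is roughly normal with this standard deviation.
    sigma = norm * a_norm / math.sqrt(dim)
    cdf_loads = [0] * num_peers
    width_loads = [0] * num_peers
    for _ in range(num_points):
        # Draw a point with l2 norm exactly `norm` (normalized Gaussian vector).
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = norm / math.sqrt(sum(x * x for x in g))
        projection = sum(ai * gi for ai, gi in zip(a, g)) * scale
        cdf_loads[peer_by_cdf(projection, sigma, num_peers)] += 1
        width_loads[peer_by_width(projection, sigma, num_peers)] += 1
    return imbalance(cdf_loads), imbalance(width_loads)


if __name__ == "__main__":
    cdf_ratio, width_ratio = simulate()
    print(f"max/mean load with CDF mapping:      {cdf_ratio:.2f}")
    print(f"max/mean load with equal-width keys: {width_ratio:.2f}")
```

Under these assumptions, the CDF turns bell-shaped hash values into keys that are close to uniform, so equal key ranges carry roughly equal load. The equal-width baseline concentrates points on the peers that own the middle of the range.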
Acknowledgements
This research is supported by the National Natural Science Foundation of China under Grant Nos. 41571389 and 61373139, the Open Research Fund from Key Laboratory of Computer Network and Information Integration (SEU), Ministry of Education, China, under Grant No. K93-9-2014-05B, and the Scientific Research Foundation of NUPT under Grant No. NY214063.
About this article
Cite this article
Wu, J., Shen, L. & Liu, L. LSH-based distributed similarity indexing with load balancing in high-dimensional space. J Supercomput 76, 636–665 (2020). https://doi.org/10.1007/s11227-019-03047-6
DOI: https://doi.org/10.1007/s11227-019-03047-6