Abstract
Top-k similarity join on high-dimensional data plays an important role in many applications. Traditional tree-based index approaches cannot handle large-scale high-dimensional data efficiently because of the “curse of dimensionality”. In this paper, we first propose an approach that constructs a similarity distribution histogram using stratified sampling and then estimates the similarity threshold from the number of results to be returned. Based on this histogram, we then propose a novel Top-k similarity join algorithm. Comprehensive experiments show that the proposed approaches achieve good efficiency and scalability.
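The three steps named in the abstract (sampled histogram, threshold estimate, thresholded Top-k join) can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract gives no details, so the cosine similarity measure, the caller-supplied strata, the equal-width bins, and the function names `build_histogram`, `estimate_threshold` and `top_k_join` are all assumptions made for illustration. The join itself is a brute-force scan that uses the threshold only to skip low-similarity pairs.

```python
import heapq
import math
import random
from itertools import combinations


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def build_histogram(strata, sample_per_stratum, bins, rng):
    """Draw up to sample_per_stratum vectors from each stratum (assumed to be
    given by the caller) and histogram the pairwise similarities of the sample."""
    sample = []
    for stratum in strata:
        take = min(sample_per_stratum, len(stratum))
        sample.extend(rng.sample(stratum, take))
    counts = [0] * bins
    for u, v in combinations(sample, 2):
        s = max(cosine(u, v), 0.0)            # negative similarities go to bin 0
        idx = min(int(s * bins), bins - 1)    # similarity 1.0 goes to the last bin
        counts[idx] += 1
    return counts, len(sample)


def estimate_threshold(counts, sample_size, total_size, k):
    """Scale sample counts up to the full data set, then walk the histogram from
    the highest bin down until at least k pairs are expected above the bin edge."""
    bins = len(counts)
    sample_pairs = sample_size * (sample_size - 1) / 2
    total_pairs = total_size * (total_size - 1) / 2
    scale = total_pairs / sample_pairs if sample_pairs else 0.0
    expected = 0.0
    for idx in range(bins - 1, -1, -1):
        expected += counts[idx] * scale
        if expected >= k:
            return idx / bins                 # lower edge of this bin
    return 0.0                                # histogram too sparse: no pruning


def top_k_join(data, k, threshold):
    """Brute-force self-join that keeps the k most similar pairs at or above
    the threshold, using a min-heap of (similarity, i, j)."""
    heap = []
    for i, j in combinations(range(len(data)), 2):
        s = cosine(data[i], data[j])
        if s < threshold:
            continue                          # pruned by the estimated threshold
        if len(heap) < k:
            heapq.heappush(heap, (s, i, j))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, i, j))
    return sorted(heap, reverse=True)
```

Because the threshold is only an estimate, a join run with it may return fewer than k pairs. A caller using this sketch would then need to lower the threshold and rerun, a fallback the sketch leaves out.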
This research was partially supported by grants from the National Natural Science Foundation of China (No. 61602231); Training plan for young backbone teachers of Colleges and universities in Henan (No. 2017GGJS134); Key Scientific Research Project of Higher Education of Henan Province (No. 16A520022); Outstanding talents of scientific and technological innovation in Henan (No. 184200510011); National key research and development program (No. 2016YFE0104600); and Scientific and Technological Project of Henan Province (No. 192102210122).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Ma, Y., Zhang, R., Zhang, Y. (2019). Similarity Histogram Estimation Based Top-k Similarity Join Algorithm on High-Dimensional Data. In: Ni, W., Wang, X., Song, W., Li, Y. (eds) Web Information Systems and Applications. WISA 2019. Lecture Notes in Computer Science, vol 11817. Springer, Cham. https://doi.org/10.1007/978-3-030-30952-7_60
DOI: https://doi.org/10.1007/978-3-030-30952-7_60
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30951-0
Online ISBN: 978-3-030-30952-7
eBook Packages: Computer Science, Computer Science (R0)