Efficient top-k similarity document search utilizing distributed file systems and cosine similarity

Alewiwi, Mahmoud; Orencik, Cengiz; Savaş, Erkay

doi:10.1007/s10586-015-0506-0

Published: 09 November 2015

Volume 19, pages 109–126 (2016)

Cluster Computing

Abstract

Document similarity has important real-life applications, such as finding duplicate web sites and identifying plagiarism. While basic techniques such as k-similarity algorithms have long been known, the overwhelming amount of data being collected, as in big data settings, calls for novel algorithms that find highly similar documents in a reasonably short amount of time. In particular, pairwise comparison of documents' features, a key operation in calculating document similarity, requires prohibitively high storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set needed to find highly similar documents. The proposed filtering technique utilizes a Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach, where the phases are near-duplicate detection, common important terms, and join. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to big data settings. Our experimental results on real data show that the proposed method performs better than previous work in the literature in terms of the number of joins and, therefore, speed.
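The general idea of filtering before comparing can be illustrated with a minimal single-machine sketch. It is not the paper's Z-order prefix method or its three-phase MapReduce pipeline, which the abstract does not specify in detail. It assumes documents are sparse term-weight dictionaries (for example tf-idf weights) and uses a hypothetical heuristic: a document is compared with the query only if the two share at least one of their `m` highest-weighted terms. The names `cosine`, `top_terms`, `top_k_similar`, and the parameter `m` are illustrative assumptions, and this simple filter can discard true matches.

```python
import heapq
import math
from typing import Dict, List, Set, Tuple

Vector = Dict[str, float]  # sparse vector: term -> weight (e.g. tf-idf)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_terms(v: Vector, m: int) -> Set[str]:
    """The m highest-weighted terms of v (hypothetical 'important terms')."""
    return set(heapq.nlargest(m, v, key=v.get))


def top_k_similar(
    query: Vector, docs: Dict[str, Vector], k: int, m: int = 2
) -> List[Tuple[float, str]]:
    """Return up to k (score, doc_id) pairs, best first.

    Assumption: documents sharing none of their top-m terms with the
    query are skipped without computing the full cosine similarity.
    """
    q_terms = top_terms(query, m)
    scored = []
    for doc_id, vec in docs.items():
        if q_terms.isdisjoint(top_terms(vec, m)):
            continue  # filtered out: no full comparison needed
        scored.append((cosine(query, vec), doc_id))
    return heapq.nlargest(k, scored)


query = {"hadoop": 3.0, "mapreduce": 2.0, "join": 1.0}
docs = {
    "d1": {"hadoop": 3.0, "mapreduce": 2.0, "join": 1.0},
    "d2": {"hadoop": 1.0, "cluster": 1.0},
    "d3": {"cooking": 5.0, "recipe": 4.0, "join": 1.0},
}
result = top_k_similar(query, docs, k=2)
```

In this toy data, `d3` shares only a low-weight term with the query, so it is skipped before any cosine computation. The saving in a real system comes from avoiding exactly such full comparisons.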

Acknowledgments

This project is supported by TUBITAK under Grant Number 113E537.

Author information

Authors and Affiliations

Faculty of Science and Engineering, Sabanci University, Istanbul, Turkey
Mahmoud Alewiwi, Cengiz Orencik & Erkay Savaş

Corresponding author

Correspondence to Cengiz Orencik.

About this article

Cite this article

Alewiwi, M., Orencik, C. & Savaş, E. Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. Cluster Comput 19, 109–126 (2016). https://doi.org/10.1007/s10586-015-0506-0

Received: 19 February 2015
Revised: 29 September 2015
Accepted: 29 October 2015
Published: 09 November 2015
Issue Date: March 2016
DOI: https://doi.org/10.1007/s10586-015-0506-0
