Skip to main content
Log in

Efficient top-k similarity document search utilizing distributed file systems and cosine similarity

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

Document similarity has important real life applications such as finding duplicate web sites and identifying plagiarism. While the basic techniques such as k-similarity algorithms have been long known, overwhelming amount of data, being collected such as in big data setting, calls for novel algorithms to find highly similar documents in reasonably short amount of time. In particular, pairwise comparison of documents’ features, a key operation in calculating document similarity, necessitates prohibitively high storage and computation power. In this paper, we propose a new filtering technique that decreases the number of comparisons between the query set and the search set to find highly similar documents. The proposed filtering technique utilizes Z-order prefix, based on the cosine similarity measure, in which only the most important features are used first to find highly similar documents. We propose a three-phase approach, where the phases are near duplicate detection, common important terms and join phase. We utilize the Hadoop distributed file system and the MapReduce parallel programming model to scale our techniques to big data setting. Our experimental results on real data show that the proposed method performs better than the previous work in the literature in terms of the number of joins, and therefore, speed.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12

Similar content being viewed by others

References

  1. Angiulli, F., Pizzuti, C.: An approximate algorithm for top-k closest pairs join query in large high dimensional data. Data Knowl. Eng. 53(3), 263–281 (2005)

    Article  MATH  Google Scholar 

  2. Apache Hadoop. http://hadoop.apache.org

  3. Arasu, A., Ganti, V., Kaushik, R.: Efficient exact set-similarity joins. In: Proceedings of the 32nd International Conference on Very Large Data Bases, VLDB ’06, pp. 918–929. VLDB Endowment (2006)

  4. Baraglia, R., De Francisci Morales, G., Lucchese, C.: Document similarity self-join with MapReduce. In: 2010 IEEE 10th International Conference on Data Mining (ICDM), pp. 731–736 (2010). doi:10.1109/ICDM.2010.70

  5. Bayardo, R.J., Ma, Y., Srikant, R.: Scaling up all pairs similarity search. In: Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pp. 131–140. ACM, New York (2007). doi:10.1145/1242572.1242591

  6. Brown, R.A.: Hadoop at home: large-scale computing at a small college. SIGCSE Bull. 41(1), 106–110 (2009). doi:10.1145/1539024.1508904

    Article  Google Scholar 

  7. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: Proceedings of the 22nd International Conference on Data Engineering, ICDE ’06, p. 5. IEEE Computer Society, Washington, DC (2006). doi:10.1109/ICDE.2006.9

  8. Connor, M., Kumar, P.: Fast construction of k-nearest neighbor graphs for point clouds. IEEE Trans. Vis. Comput. Graph. 16(4), 599–608 (2010). doi:10.1109/TVCG.2010.9

    Article  Google Scholar 

  9. Dean, J., Ghemawat, S.: Mapreduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008). doi:10.1145/1327452.1327492

    Article  Google Scholar 

  10. Elsayed, T., Lin, J., Oard, D.W.: Pairwise document similarity in large collections with mapreduce. In: Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers. HLT-Short ’08, pp. 265–268. Association for Computational Linguistics, Stroudsburg (2008)

  11. Enron Dataset. http://www.cs.cmu.edu/~./enron/

  12. Falchi, F., Perego, R., Lucchese, C., Rabitti, F., Orlando, S.: A metric cache for similarity search. In: LSDS-IR (2008)

  13. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. J. Mach. Learn. Res. 5, 361–397 (2004). http://dl.acm.org/citation.cfm?id=1005332.1005345

  14. Li, R., Ju, L., Peng, Z., Yu, Z., Wang, C.: Batch text similarity search with mapreduce. In: Du, X., Fan, W., Peng, Z., Sharaf, M.A. (eds.) APWeb. Lecture Notes in Computer Science, vol. 6612, pp. 412–423. Springer, Heidelberg (2011)

    Google Scholar 

  15. Lucene. http://lucene.apache.org/

  16. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, New York (2008)

    Book  MATH  Google Scholar 

  17. Phan, T.C., d’Orazio, L., Rigaux, P.: Toward intersection filter-based optimization for joins in mapreduce. In: Cloud-I’13, p. 2 (2013)

  18. Rajaraman, A., Ullman, J.D.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)

    Google Scholar 

  19. Sarawagi, S., Kirpal, A.: Efficient set joins on similarity predicates. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, SIGMOD ’04, pp. 743–754. ACM, New York (2004). doi:10.1145/1007568.1007652

  20. Tao, Y., Yi, K., Sheng, C., Kalnis, P.: Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst. 35(3), 20:1–20:46 (2010). doi:10.1145/1806907.1806912

    Article  Google Scholar 

  21. Vernica, R., Carey, M.J., Li, C.: Efficient parallel set-similarity joins using mapreduce. In: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, SIGMOD ’10, pp. 495–506. ACM, New York (2010). doi:10.1145/1807167.1807222

  22. Xiao, C., Wang, W., Lin, X., Yu, J.X.: Efficient similarity joins for near duplicate detection. In: Proceedings of the 17th International Conference on World Wide Web, WWW ’08, pp. 131–140. ACM, New York (2008). doi:10.1145/1367497.1367516

  23. Yang, B., Myung, J., Lee, S.G., Lee, D.: A mapreduce-based filtering algorithm for vector similarity join. In: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication, ICUIMC ’13, pp. 71:1–71:5. ACM, New York (2013). doi:10.1145/2448556.2448627

  24. Zhang, C., Li, F., Jestes, J.: Efficient parallel knn joins for large data in mapreduce. In: Proceedings of the 15th International Conference on Extending Database Technology, EDBT ’12, pp. 38–49. ACM, New York (2012). doi:10.1145/2247596.2247602

  25. Zhu, S., Wu, J., Xiong, H., Xia, G.: Scaling up top-k cosine similarity search. Data Knowl. Eng. 70(1), 60–83 (2011)

    Article  Google Scholar 

Download references

Acknowledgments

This project is supported by TUBITAK under Grant Number 113E537.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Cengiz Orencik.

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Alewiwi, M., Orencik, C. & Savaş, E. Efficient top-k similarity document search utilizing distributed file systems and cosine similarity. Cluster Comput 19, 109–126 (2016). https://doi.org/10.1007/s10586-015-0506-0

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-015-0506-0

Keywords

Navigation