Strategic and suave processing for performing similarity joins using MapReduce

The Journal of Supercomputing

Abstract

An efficient MapReduce algorithm for performing similarity joins between multisets is proposed. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence improve the efficiency of the algorithm. Multisets represent real-world data better than sets because they account for the frequency of their elements. Prior serial algorithms incorporate filtering techniques only for sets, not for multisets, while prior MapReduce algorithms either incorporate no filtering technique or incorporate prefix filtering in a way that is inefficient and does not scale. This work extends three filtering techniques, namely prefix, size, and positional filtering, to multisets, and also achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model, thereby minimizing the pairs generated and joined and improving I/O, network, and computational efficiency. A technique to enhance the scalability of the algorithm, for use when the need arises, is also presented. The algorithms are implemented using Hadoop and tested on real-world Twitter data. Experimental results demonstrate unprecedented performance gains.
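To make the idea of filtering concrete, the following is a minimal single-process sketch of a size filter applied to multisets. It is not the algorithm proposed in the article: the abstract does not state which similarity measure is used, so the sketch assumes the multiset Jaccard (Ruzicka) similarity, and the names `multiset_jaccard`, `passes_size_filter`, and `similarity_join` are hypothetical. Under that assumption, two multisets can only reach a threshold t if the smaller total size is at least t times the larger, so pairs failing this check can be skipped without computing their similarity.

```python
from collections import Counter
from itertools import combinations


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard: sum of min multiplicities over sum of max multiplicities."""
    inter = sum((a & b).values())  # Counter '&' keeps the minimum count per element
    union = sum((a | b).values())  # Counter '|' keeps the maximum count per element
    return inter / union if union else 1.0


def passes_size_filter(a: Counter, b: Counter, t: float) -> bool:
    """Necessary condition for multiset_jaccard(a, b) >= t.

    The similarity is at most min(|a|, |b|) / max(|a|, |b|), where |x| is the
    sum of multiplicities, so a pair failing this test can be discarded.
    """
    size_a, size_b = sum(a.values()), sum(b.values())
    return min(size_a, size_b) >= t * max(size_a, size_b)


def similarity_join(records: dict, t: float):
    """Naive self-join with a size filter (assumption: everything fits in memory)."""
    results, verified = [], 0
    for (id_a, a), (id_b, b) in combinations(records.items(), 2):
        if not passes_size_filter(a, b, t):
            continue  # pruned without computing the similarity
        verified += 1
        sim = multiset_jaccard(a, b)
        if sim >= t:
            results.append((id_a, id_b, sim))
    return results, verified


if __name__ == "__main__":
    # Toy data: each record is a multiset of tokens (illustrative only).
    records = {
        "u1": Counter("aabbc"),
        "u2": Counter("aabbd"),
        "u3": Counter("a"),
    }
    pairs, verified = similarity_join(records, t=0.6)
    print(pairs, verified)
```

In this toy input, the two pairs involving `u3` are pruned by the size check alone, so only one of the three candidate pairs has its similarity computed. Prefix and positional filtering prune further by exploiting a global ordering of elements; they are omitted here because the abstract does not describe their multiset formulation.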



Author information

Corresponding author

Correspondence to Robert C. Green II.


About this article


Cite this article

Lakshminarayanan, M., Acosta, W.F., Green, R.C. et al. Strategic and suave processing for performing similarity joins using MapReduce. J Supercomput 69, 930–954 (2014). https://doi.org/10.1007/s11227-014-1197-7


  • DOI: https://doi.org/10.1007/s11227-014-1197-7
