Strategic and suave processing for performing similarity joins using MapReduce

Lakshminarayanan, Mahalakshmi; Acosta, William F.; Green, Robert C.; Devabhaktuni, Vijay

doi:10.1007/s11227-014-1197-7

Published: 05 May 2014

Volume 69, pages 930–954, (2014)

The Journal of Supercomputing

Mahalakshmi Lakshminarayanan¹,
William F. Acosta²,
Robert C. Green II³ &
Vijay Devabhaktuni¹

Abstract

An efficient MapReduce algorithm for performing similarity joins between multisets is proposed. Filtering techniques for similarity joins minimize the number of pairs of entities joined and hence improve the efficiency of the algorithm. Multisets represent real-world data better than sets because they take the frequency of each element into account. Prior serial algorithms incorporate filtering techniques only for sets, not multisets, while prior MapReduce algorithms either incorporate no filtering technique or incorporate prefix filtering in an inefficient and unscalable way. This work extends the prefix, size, and positional filtering techniques to multisets and also achieves the challenging task of incorporating them efficiently in the shared-nothing MapReduce model, thereby minimizing the pairs generated and joined and improving I/O, network, and computational efficiency. A technique to enhance the scalability of the algorithm is also presented as a contingency measure. The algorithms are implemented using Hadoop and tested on real-world Twitter data. Experimental results demonstrate unprecedented performance gain.
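To make the filtering idea in the abstract concrete, the following is a minimal single-machine sketch of size and prefix filtering applied to multisets under a Jaccard threshold. It is not the paper's MapReduce algorithm, and it omits positional filtering. It assumes multiset Jaccard similarity (sum of minimum counts over sum of maximum counts), treats the k-th occurrence of a token as a distinct element so that the standard set-based prefix filter applies, and uses plain lexicographic order as the global token ordering. All function names are hypothetical.

```python
from collections import Counter
from math import ceil

EPS = 1e-9  # guards against floating-point error in threshold arithmetic


def multiset_jaccard(x: Counter, y: Counter) -> float:
    """Sum of minimum counts divided by sum of maximum counts."""
    inter = sum((x & y).values())  # Counter & keeps the minimum count per token
    union = sum((x | y).values())  # Counter | keeps the maximum count per token
    return inter / union if union else 0.0


def expand(ms: Counter) -> list:
    """Turn a multiset into a sorted list of distinct (token, occurrence) pairs."""
    return sorted((tok, k) for tok, c in ms.items() for k in range(1, c + 1))


def prefix(ms: Counter, t: float) -> set:
    """First n - ceil(t * n) + 1 expanded elements under the global ordering."""
    items = expand(ms)
    n = len(items)
    return set(items[: n - ceil(t * n - EPS) + 1])


def size_compatible(x: Counter, y: Counter, t: float) -> bool:
    """Size filter: Jaccard >= t requires min size >= t * max size."""
    nx, ny = sum(x.values()), sum(y.values())
    return t * max(nx, ny) - EPS <= min(nx, ny)


def similarity_join(records: dict, t: float) -> list:
    """Return id pairs whose multiset Jaccard similarity is at least t."""
    ids = sorted(records)
    result = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            x, y = records[a], records[b]
            if not size_compatible(x, y, t):
                continue  # pruned by the size filter
            if not (prefix(x, t) & prefix(y, t)):
                continue  # pruned by the prefix filter
            if multiset_jaccard(x, y) >= t:  # verify surviving candidates
                result.append((a, b))
    return result


records = {
    "r1": Counter("aabbc"),
    "r2": Counter("aabbd"),
    "r3": Counter("xyz"),
}
pairs = similarity_join(records, 0.6)
```

In this toy input, `r1` and `r2` share four of six elements (similarity 2/3), so they pass both filters and the final check, while `r3` shares no prefix element with either and is pruned before its similarity is computed. A practical implementation would typically order tokens by increasing frequency so that prefixes contain rare tokens, and would build an inverted index over prefixes rather than compare all pairs.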


Author information

Authors and Affiliations

Electrical Engineering and Computer Science, University of Toledo, 2801 W Bancroft Street, Toledo, OH 43606, USA
Mahalakshmi Lakshminarayanan & Vijay Devabhaktuni
Harman International, 702 Deerpath Dr, Vernon Hills, IL 60061, USA
William F. Acosta
Department of Computer Science, Bowling Green State University, Bowling Green, OH 43403, USA
Robert C. Green II

Corresponding author

Correspondence to Robert C. Green II.

About this article

Cite this article

Lakshminarayanan, M., Acosta, W.F., Green, R.C. et al. Strategic and suave processing for performing similarity joins using MapReduce. J Supercomput 69, 930–954 (2014). https://doi.org/10.1007/s11227-014-1197-7

Issue Date: August 2014
