A Spark-based Apriori algorithm with reduced shuffle overhead

Raj, Shashi; Ramesh, Dharavath; Sethi, Krishan Kumar

doi:10.1007/s11227-020-03253-7

Published: 27 March 2020

Volume 77, pages 133–151 (2021)
The Journal of Supercomputing

Abstract

Mining frequent itemsets is a core activity in finding association rules from transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm is the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets. However, bottlenecks associated with Apriori, such as repeated scans of the input dataset and the generation of all candidate itemsets before counting their support values, reduce its effectiveness for large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, is scanned again, causing high disk I/O. Apache Spark implementations of Apriori show better performance due to Spark's in-memory processing capabilities. Spark makes iterative scanning faster by keeping the dataset in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities. It improves efficiency further by reducing the shuffle overhead caused by RDD operations at each iteration. In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO has better performance in terms of running time and scalability.
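The abstract does not spell out SARSO's partitioning method, so the following is only a minimal, Spark-free sketch of the general idea it alludes to: level-wise Apriori in which support is first counted inside each partition, so that fewer (itemset, count) records have to cross partition boundaries. Everything here is an assumption made for illustration: the function names `generate_candidates`, `count_partition` and `apriori`, the toy data, and the use of plain Python lists to stand in for RDD partitions. It is not the authors' implementation.

```python
from collections import Counter
from itertools import combinations


def generate_candidates(frequent, k):
    """Build k-itemset candidates whose (k-1)-subsets are all frequent."""
    items = sorted({item for itemset in frequent for item in itemset})
    candidates = set()
    for combo in combinations(items, k):
        # Apriori property: every subset of a frequent itemset is frequent.
        if all(frozenset(sub) in frequent for sub in combinations(combo, k - 1)):
            candidates.add(frozenset(combo))
    return candidates


def count_partition(partition, candidates):
    """Count candidate occurrences using only the transactions of one partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:
                counts[candidate] += 1
    return counts


def apriori(partitions, min_support):
    """Return (frequent itemsets with support, merged records, naive records).

    merged records: (itemset, local count) pairs sent after local counting.
    naive records:  (itemset, 1) pairs that would be sent without it.
    """
    # Level 1 candidates are all single items seen in any transaction.
    candidates = {frozenset([i]) for p in partitions for t in p for i in t}
    result, merged_records, naive_records, k = {}, 0, 0, 1
    while candidates:
        local_counts = [count_partition(p, candidates) for p in partitions]
        merged_records += sum(len(c) for c in local_counts)
        naive_records += sum(sum(c.values()) for c in local_counts)
        total = Counter()
        for c in local_counts:
            total.update(c)  # stands in for the cross-partition merge
        frequent = {s: n for s, n in total.items() if n >= min_support}
        result.update(frequent)
        k += 1
        candidates = generate_candidates(set(frequent), k)
    return result, merged_records, naive_records


# Hypothetical toy dataset split into two partitions.
partitions = [
    [frozenset("abc"), frozenset("ab")],
    [frozenset("ac"), frozenset("bc"), frozenset("abc")],
]
result, merged, naive = apriori(partitions, min_support=3)
for itemset, support in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), support)
print("records merged across partitions:", merged, "vs naive:", naive)
```

In this toy model the saving comes purely from aggregating counts locally before merging them. A real Spark job would also have to consider how keys are assigned to partitions, which is the part the abstract attributes to SARSO's partitioning method and which this sketch does not attempt to reproduce.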

Acknowledgments

This work is partially funded by IIT(ISM), Govt. of India, Dhanbad. The authors would like to express their gratitude and heartiest thanks to the Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for providing their research support.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Bakhtiyarpur College of Engineering, Patliputra, Patna, Bihar, 800013, India
Shashi Raj
Department of Computer Science and Engineering, Indian Institute of Technology (ISM), Dhanbad, Jharkhand, 826004, India
Dharavath Ramesh & Krishan Kumar Sethi

Corresponding author

Correspondence to Dharavath Ramesh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Raj, S., Ramesh, D. & Sethi, K.K. A Spark-based Apriori algorithm with reduced shuffle overhead. J Supercomput 77, 133–151 (2021). https://doi.org/10.1007/s11227-020-03253-7

Published: 27 March 2020
Issue Date: January 2021
DOI: https://doi.org/10.1007/s11227-020-03253-7
