Abstract
Mining frequent itemsets is a core step in finding association rules in transactional datasets. Among the well-known approaches to finding frequent itemsets, the Apriori algorithm was the earliest proposed. Many attempts have been made to adapt the Apriori algorithm to large-scale datasets. However, bottlenecks inherent in Apriori, such as repeated scans of the input dataset and the generation of all candidate itemsets before their support values are counted, reduce its effectiveness on large datasets. When the data size is large, even distributed and parallel implementations of Apriori using the MapReduce framework do not perform well. This is due to the iterative nature of the algorithm, which incurs high disk overhead: in each iteration, the input dataset, which resides on disk, is scanned again, causing high disk I/O. Apache Spark implementations of Apriori perform better because of Spark’s in-memory processing capabilities. Spark makes iterative scanning of a dataset faster by keeping it in a memory abstraction called a resilient distributed dataset (RDD). An RDD keeps datasets in the form of key-value pairs spread across the cluster nodes. RDD operations require these key-value pairs to be redistributed among cluster nodes in the course of processing. This redistribution, or shuffle operation, incurs communication and synchronization overhead. In this manuscript, we propose a novel approach, namely the Spark-based Apriori algorithm with reduced shuffle overhead (SARSO). It utilizes the benefits of Spark’s parallel and distributed computing environment and its in-memory processing capabilities. It further improves efficiency by reducing the shuffle overhead caused by RDD operations at each iteration.
In other words, it restricts the movement of key-value pairs across the cluster nodes by using a partitioning method and hence reduces the communication and synchronization overhead incurred by the Spark shuffle operation. Extensive experiments have been conducted to measure the performance of SARSO on benchmark datasets and to compare it with an existing algorithm. Experimental results show that SARSO performs better in terms of running time and scalability.
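The effect of shuffle volume on support counting can be illustrated with a small, self-contained sketch. This is not the SARSO algorithm, whose partitioning method is not described in the abstract; it is a toy model in plain Python, with no Spark dependency, in which two lists stand in for cluster partitions and all function names are hypothetical. It compares emitting one (itemset, 1) pair per occurrence with counting inside each partition first, which is comparable in spirit to the map-side combining that Spark's `reduceByKey` performs.

```python
from collections import Counter
from itertools import combinations

# Toy transactions split into two "partitions" (stand-ins for cluster nodes).
# The data and the partition layout are assumptions made for illustration.
partitions = [
    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}],
    [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}],
]

def pairs_per_occurrence(parts, k):
    """Emit one (itemset, 1) pair per occurrence; every pair would be shuffled."""
    emitted = []
    for part in parts:
        for tx in part:
            for itemset in combinations(sorted(tx), k):
                emitted.append((itemset, 1))
    return emitted

def pairs_per_partition(parts, k):
    """Count inside each partition first; only local totals would be shuffled."""
    emitted = []
    for part in parts:
        local = Counter()
        for tx in part:
            local.update(combinations(sorted(tx), k))
        emitted.extend(local.items())
    return emitted

def merge(pairs):
    """Sum the counts that arrive for each itemset (the reduce side)."""
    total = Counter()
    for itemset, count in pairs:
        total[itemset] += count
    return total

naive = pairs_per_occurrence(partitions, 2)
combined = pairs_per_partition(partitions, 2)
support_naive = merge(naive)
support_combined = merge(combined)

print(len(naive), len(combined))      # key-value pairs moved in each scheme
print(support_naive == support_combined)  # both schemes give the same supports
```

In this toy input, the per-occurrence scheme moves 10 key-value pairs while the per-partition scheme moves 6, and both yield identical support counts. The gap widens as itemsets repeat more often within a partition, which is the kind of saving a shuffle-aware design aims for.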
Acknowledgments
This work is partially funded by IIT (ISM), Govt. of India, Dhanbad. The authors express their gratitude to the Department of Computer Science & Engineering, Indian Institute of Technology (ISM), Dhanbad, India, for its research support.
About this article
Cite this article
Raj, S., Ramesh, D. & Sethi, K.K. A Spark-based Apriori algorithm with reduced shuffle overhead. J Supercomput 77, 133–151 (2021). https://doi.org/10.1007/s11227-020-03253-7
DOI: https://doi.org/10.1007/s11227-020-03253-7