Abstract
Frequent itemset mining serves as the main method of association rule mining. With the limitations in computing space and performance, the association of frequent items in large data mining requires both extensive time and effort, particularly when the datasets become increasingly larger. In the process of associated data mining in a big data environment, the MapReduce programming model is typically used to perform task partitioning and parallel processing, which could improve the execution efficiency of the algorithm. However, to ensure that the associated rule is not destroyed during task partitioning and parallel processing, the inner-relationship data must be stored in the computer space. Because inner-relationship data are redundant, storage of these data will significantly increase the space usage in comparison with the original dataset. In this study, we find that the formation of the frequent pattern (FP) mining algorithm depends mainly on the conditional pattern bases. Based on the parallel frequent pattern (PFP) algorithm theory, the grouping model divides frequent items into several groups according to their frequencies. We propose a non-group PFP (NG-PFP) mining algorithm that cancels the grouping model and reduces the data redundancy between sub-tasks. Moreover, we present the NG-PFP algorithm for task partition and parallel processing, and its performance in the Hadoop cluster environment is analyzed and discussed. Experimental results indicate that the non-group model shows obvious improvement in terms of computational efficiency and the space utilization rate.
Similar content being viewed by others
References
Agrawal R, Srikant R, 1994. Fast algorithms for mining association rules in large databases. Proc Int Conf on Very Large Data Bases, p.487–499.
Agarwal RC, Aggarwal CC, Prasad VVV, 2002. A tree projection algorithm for generation of frequent item sets. J Parall Distrib Comput, 61(3):350–371. https://doi.org/10.1006/jpdc.2000.1693
Bauer M, Bruveris M, Charon N, et al., 2018. A relaxed approach for curve matching with elastic metrics. https://arxiv.gg363.site/abs/1803.10893
Berti-Équille L, Harmouch H, Naumann F, et al., 2018. Discovery of genuine functional dependencies from relational data with missing values. Proc VLDB Endowm, 11(8):880–892. https://doi.org/10.14778/3204028.3204032
Caruccio L, Deufemia V, Polese G, 2016. On the discovery of relaxed functional dependencies. Proc 20th Int Database Engineering & Applications Symp, p.53–61. https://doi.org/10.1145/2938503.2938519
Caruccio L, Deufemia V, Polese G, 2017. Evolutionary mining of relaxed dependencies from big data collections. Proc 7th Int Conf on Web Intelligence, Mining and Semantics, Article 5. https://doi.org/10.1145/3102254.3102259
Caruccio L, Polese G, Tortora G, 2018. Dependency-based query/view synchronization upon schema evolutions. Int Conf on Conceptual Modeling, p.91–105. https://doi.org/10.1007/978-3-030-01391-2_17
Chen JC, Chen YG, Du XY, et al., 2013. Big data challenge: a data management perspective. Front Comput Sci, 7(2):157–164. https://doi.org/10.1007/s11704-013-3903-7
Cong S, Han J, Padua D, 2005. Parallel mining of closed sequential patterns. Proc 11th ACM SIGKDD Int Conf on Knowledge Discovery in Data Mining, p.562–567.
Deng LL, Lou YS, 2015. Improvement and research of FP-growth algorithm based on distributed spark. Proc Int Conf on Cloud Computing and Big Data, p.105–108. https://doi.org/10.1109/CCBD.2015.15
di-Jorio L, Laurent A, Teisseire M, 2009. Mining frequent gradual itemsets from large databases. Int Symp on Intelligent Data Analysis, p.297–308. https://doi.org/10.1007/978-3-642-03915-7_26
El-Hajj M, Zaïane OR, 2006. Parallel bifold: large-scale parallel pattern mining with constraints. Distrib Parall Datab, 20(3):225–243. https://doi.org/10.1007/s10619-006-0445-0
Ge KS, Su HY, Li DS, et al., 2017. Efficient parallel implementation of a density peaks clustering algorithm on graphics processing unit. Front Inform Technol Electron Eng, 18(7):915–927. https://doi.org/10.1631/FITEE.1601786
Han JW, Pei J, Yin YW, 2000. Mining frequent patterns without candidate generation. ACM SIGMOD Rec, 29(2):1–12. https://doi.org/10.1145/335191.335372
Huhtala Y, Kärkkäinen J, Porkka P, et al., 1999. Tane: an efficient algorithm for discovering functional and approximate dependencies. Comput J, 42(2):100–111. https://doi.org/10.1093/comjnl/42.2.100
Kruse S, Naumann F, 2018. Efficient discovery of approximate dependencies. Proc VLDB Endowm, 11(7):759–772. https://doi.org/10.14778/3192965.3192968
Li HY, Wang Y, Zhang D, et al., 2008. PFP: parallel FP-growth for query recommendation. Proc ACM Conf on Recommender Systems, p.107–114. https://doi.org/10.1145/1454008.1454027
Li N, Zeng L, He Q, et al., 2012. Parallel implementation of Apriori algorithm based on MapReduce. Proc. 13th ACIS Int Conf on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, p.236–241. https://doi.org/10.1109/SNPD.2012.31
Lin KW, Chung SH, 2015. A fast and resource efficient mining algorithm for discovering frequent patterns in distributed computing environments. Fut Gener Comput Syst, 52:49–58. https://doi.org/10.1016/j.future.2015.05.009
Lin MY, Lee PY, Hsueh SC, 2012. Apriori-based frequent itemset mining algorithms on MapReduce. Proc 6th Int Conf on Ubiquitous Information Management and Communication, Article 26. https://doi.org/10.1145/2184751.2184842
Liu JQ, Wu YS, Zhou QF, et al., 2015. Parallel Eclat for opportunistic mining of frequent itemsets. Int Conf on Database and Expert Systems Applications, p.401–415. https://doi.org/10.1007/978-3-319-22849-5_27
Lucchese C, Orlando S, Perego R, et al., 2004. WebDocs: a real-life huge transactional dataset. Proc IEEE ICDM Workshop on Frequent Itemset Mining Implementations.
Mandros P, Boley M, Vreeken J, 2017. Discovering reliable approximate functional dependencies. Proc 23rd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining, p.355–363. https://doi.org/10.1145/3097983.3098062
Riondato M, DeBrabant JA, Fonseca R, et al., 2012. PARMA: a parallel randomized algorithm for approximate association rules mining in MapReduce. Proc 21st ACM Int Conf on Information and Knowledge Management, p.85–94. https://doi.org/10.1145/2396761.2396776
Siddiqa A, Karim A, Gani A, 2017. Big data storage technologies: a survey. Front Inform Technol Electron Eng, 18(8):1040–1070. https://doi.org/10.1631/FITEE.1500441
Srikant R, Agrawal R, 1996. Mining sequential patterns: generalizations and performance improvements. Int Conf on Extending Database Technology, p.1–17. https://doi.org/10.1007/BFb0014140
Wang F, Hu L, Zhou J, et al., 2015. A survey from the perspective of evolutionary process in the Internet of Things. Int J Distrib Sen Networks, 11(3):462752. https://doi.org/10.1155/2015/462752
Wang J, Han J, 2004. BIDE: efficient mining of frequent closed sequences. Proc 20th Int Conf on Data Engineering, p.79–90. https://doi.org/10.1109/ICDE.2004.1319986
Xia D, Zhou Y, Rong Z, et al., 2013. IPFP: an improved parallel FP-growth algorithm for frequent itemsets mining. Proc 59th ISI World Statistics Congress, p.4034–4039.
Xia D, Rong Z, Zhou Y, 2014. A novel parallel algorithm for frequent itemsets mining in massive small files datasets. ICIC Expr Lett Part B, 5(2):459–466.
Yang Q, Du FY, Zhu X, et al., 2016. Improved balanced parallel FP-growth with MapReduce. Joint Int Conf on Artificial Intelligence and Computer Engineering and Int Conf on Network and Communication Security, p.1–5. https://doi.org/10.12783/dtcse/aice-ncs2016/5681
Yang XY, Liu Z, Fu Y, 2010. MapReduce as a programming model for association rules algorithm on Hadoop. Proc 3rd Int Conf on Information Sciences and Interaction Sciences, p.99–102. https://doi.org/10.1109/ICICIS.2010.5534718
Yu KM, Zhou JY, Hsiao WC, 2007. Load balancing approach parallel algorithm for frequent pattern mining. Proc Int Conf on Parallel Computing Technologies, p.623–631. https://doi.org/10.1007/978-3-540-73940-1_63
Zaki MJ, 2000. Scalable algorithms for association mining. IEEE Trans Know Data Eng, 12(3):372–390. https://doi.org/10.1109/69.846291
Zaki MJ, 2001a. Parallel sequence mining on shared-memory machines. J Parall Distrib Comput, 61(3):401–426.
Zaki MJ, 2001b. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn, 42(1–2):31–60. https://doi.org/10.1023/A:1007652502315
Zhang XL, Breitinger F, Baggili I, 2016. Rapid Android parser for investigating DEX files (RAPID). Dig Investig, 17:28–39. https://doi.org/10.1016/j.diin.2016.03.002
Zhang XL, Baggili I, Breitinger F, 2017. Breaking into the vault: privacy, security and forensic analysis of Android vault applications. Comput Secur, 70:516–531. https://doi.org/10.1016/j.cose.2017.07.011
Zhang ZG, Ji GL, Tang MM, 2013. MREclat: an algorithm for parallel mining frequent itemsets. Proc Int Conf on Advanced Cloud and Big Data, p.177–180. https://doi.org/10.1109/CBD.2013.22
Zhao YX, Zhang WX, Li DS, et al., 2016. Pegasus: a distributed and load-balancing fingerprint identification system. Front Inform Technol Electron Eng, 17(8):766–780. https://doi.org/10.1631/FITEE.1500487
Zheng XF, Wang S, 2014. Study on the method of road transport management information data mining based on pruning Eclat algorithm and MapReduce. Proc Soc Behav Sci, 138:757–766. https://doi.org/10.1016/j.sbspro.2014.07.254
Zhou L, Zhong ZY, Chang J, et al., 2010. Balanced parallel FP-growth with MapReduce. Proc. IEEE Youth Conf on Information, Computing and Telecommunications, p.243–246. https://doi.org/10.1109/YCICT.2010.5713090
Zhuang YT, Wu F, Chen C, et al., 2017. Challenges and opportunities: from big data to knowledge in AI 2.0. Front Inform Technol Electron Eng, 18(1):3–14. https://doi.org/10.1631/FITEE.1601883
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Zhe-jun KUANG, Hang ZHOU, Dong-dai ZHOU, Jin-peng ZHOU, and Kun YANG declare that they have no conflict of interest.
Additional information
Project supported by the Fundamental Research Funds for the Central Universities, China (No. 2412015KJ005), the Twelfth Five-Year Plan Project of the Education Department of Jilin Province, China (No. 557), and the Thirteenth Five-Year Plan for Scientific Research of the Education Department of Jilin Province, China (No. JJKH20191197KJ)
Rights and permissions
About this article
Cite this article
Kuang, Zj., Zhou, H., Zhou, Dd. et al. A non-group parallel frequent pattern mining algorithm based on conditional patterns. Frontiers Inf Technol Electronic Eng 20, 1234–1245 (2019). https://doi.org/10.1631/FITEE.1800467
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1631/FITEE.1800467