Abstract
Frequent itemset mining (FIM) is one of the cornerstones of data mining. Although the FIM problem has been thoroughly studied, few solutions, standard or improved, scale. This is mainly the case when (i) the amount of data is very large and/or (ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). PATD makes the mining of very large databases (up to terabytes of data) simple and compact. Its mining process consists of a single parallel job, which dramatically reduces the mining runtime, the communication cost, and the energy consumption overhead on a distributed computational platform. Based on an efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), PATD mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated on real-world data sets. Our experimental results suggest that PATD is significantly more efficient and scalable than alternative approaches.
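To make the distinction between a relative and an absolute minimum support concrete, the following is a minimal toy sketch in Python. It is not the PATD or IBDP implementation: the brute-force miner, the function names (`to_absolute_support`, `mine`), the tiny database, and the naive "all transactions containing item `a`" partition are all assumptions introduced purely for illustration. It only shows why a partition mined on its own should be checked against a global count threshold rather than a fraction of the partition size.

```python
import math
from collections import Counter
from itertools import combinations


def to_absolute_support(relative_min_sup, n_transactions):
    """Turn a relative threshold (a fraction) into a transaction count."""
    return math.ceil(relative_min_sup * n_transactions)


def mine(transactions, abs_min_sup, max_size=2):
    """Brute-force miner (hypothetical helper): count every itemset of up to
    max_size items and keep those seen in at least abs_min_sup transactions."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    return {itemset for itemset, count in counts.items() if count >= abs_min_sup}


# Toy database, invented for this example.
database = [
    ["a", "b", "c"],
    ["a", "b"],
    ["a", "c"],
    ["b", "c"],
    ["a", "b", "d"],
    ["d"],
]

# A relative MinSup of 50% over the whole database, expressed as a count.
a_min_sup = to_absolute_support(0.5, len(database))

# Reference result: frequent itemsets containing "a", mined on the full database.
global_with_a = {s for s in mine(database, a_min_sup) if "a" in s}

# Naive item-based partition (an assumption, not IBDP): every transaction with "a".
partition_a = [t for t in database if "a" in t]

# Mining the partition alone with the absolute threshold matches the reference.
local_absolute = {s for s in mine(partition_a, a_min_sup) if "a" in s}

# Re-deriving the threshold from the partition size lowers the bar and
# admits an itemset that is not frequent in the full database.
local_relative_threshold = to_absolute_support(0.5, len(partition_a))
local_relative = {s for s in mine(partition_a, local_relative_threshold) if "a" in s}
```

In this toy setting the partition holds every transaction that contains `a`, so local counts of itemsets containing `a` equal their global counts, and the absolute threshold can be applied without a second pass over the data. How IBDP actually builds its partitions is described in the paper itself, not in this sketch.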
S. Salah: This work has been partially supported by the Inria Project Lab Hemera.
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Salah, S., Akbarinia, R., Masseglia, F. (2015). Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_21
DOI: https://doi.org/10.1007/978-3-319-22849-5_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)