Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments

Salah, Saber; Akbarinia, Reza; Masseglia, Florent

doi:10.1007/978-3-319-22849-5_21

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9261)

Abstract

Frequent itemset mining (FIM) is one of the fundamental cornerstones of data mining. While the problem of FIM has been thoroughly studied, few standard or improved solutions scale. This is mainly the case when (i) the amount of data is very large and/or (ii) the minimum support (MinSup) threshold is very low. In this paper, we propose a highly scalable parallel frequent itemset mining (PFIM) algorithm, namely Parallel Absolute Top Down (PATD). The PATD algorithm renders the mining process of very large databases (up to terabytes of data) simple and compact. Its mining process consists of only one parallel job, which dramatically reduces the mining runtime, the communication cost and the energy consumption overhead in a distributed computational platform. Based on a clever and efficient data partitioning strategy, namely Item Based Data Partitioning (IBDP), the PATD algorithm mines each data partition independently, relying on an absolute minimum support (AMinSup) instead of a relative one. PATD has been extensively evaluated using real-world data sets. Our experimental results suggest that the PATD algorithm is significantly more efficient and scalable than alternative approaches.

S. Salah: This work has been partially supported by the Inria Project Lab Hemera.
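
The abstract's idea of mining each partition independently against an absolute threshold can be illustrated with a toy sketch. This is not the paper's IBDP or PATD, whose details are not given in the abstract. The item-to-partition rule (round-robin over sorted items), the suffix projection of transactions, the brute-force counting, and all function names below are assumptions chosen only to show why an item-based split lets each partition apply a global absolute count without a second pass.

```python
from collections import Counter
from itertools import combinations
from math import ceil


def absolute_minsup(relative_minsup, n_transactions):
    """Convert a relative threshold (a fraction) into an absolute count."""
    return ceil(relative_minsup * n_transactions)


def partition_by_item(transactions, n_parts):
    """Assumed partitioning rule, not the paper's IBDP.

    Each item is owned by one partition. A transaction is sent to every
    partition that owns one of its items, truncated to the suffix that
    starts at the first item owned by that partition.
    """
    items = sorted({item for t in transactions for item in t})
    owner = {item: idx % n_parts for idx, item in enumerate(items)}
    parts = [[] for _ in range(n_parts)]
    for t in transactions:
        ordered = sorted(set(t))
        seen = set()
        for pos, item in enumerate(ordered):
            p = owner[item]
            if p not in seen:
                seen.add(p)
                parts[p].append(ordered[pos:])
    return owner, parts


def mine_partition(part_id, part, owner, amin):
    """Brute-force count of itemsets whose smallest item is owned here.

    Exponential in transaction length, so only suitable for toy data.
    """
    counts = Counter()
    for t in part:
        for size in range(1, len(t) + 1):
            for combo in combinations(t, size):
                # combo is sorted, so combo[0] is its smallest item
                if owner[combo[0]] == part_id:
                    counts[combo] += 1
    return {itemset: c for itemset, c in counts.items() if c >= amin}


def mine_all(transactions, n_parts, amin):
    """Mine every partition independently and merge the results."""
    owner, parts = partition_by_item(transactions, n_parts)
    result = {}
    for part_id, part in enumerate(parts):
        result.update(mine_partition(part_id, part, owner, amin))
    return result
```

Under this assumed scheme, every transaction containing an itemset reaches the partition that owns the itemset's smallest item, so the local count equals the global support and the partitions never need to exchange counts. With an arbitrary split of the transactions, local counts would only be partial and could not be compared to the absolute threshold directly.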

Author information

Authors and Affiliations

Zenith Team, INRIA and LIRMM, University of Montpellier, Montpellier, France
Saber Salah, Reza Akbarinia & Florent Masseglia

Corresponding author

Correspondence to Florent Masseglia.

Editor information

Editors and Affiliations

Hewlett-Packard Enterprise, Sunnyvale, California, USA
Qiming Chen
Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
Blaise Pascal University, Aubiere, France
Farouk Toumani
University of Linz, Linz, Austria
Roland Wagner
Universidad Politécnica de Valencia, Valencia, Spain
Hendrik Decker

About this paper

Cite this paper

Salah, S., Akbarinia, R., Masseglia, F. (2015). Data Partitioning for Fast Mining of Frequent Itemsets in Massively Distributed Environments. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_21

DOI: https://doi.org/10.1007/978-3-319-22849-5_21
Published: 11 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22848-8
Online ISBN: 978-3-319-22849-5
eBook Packages: Computer Science, Computer Science (R0)
