Article

Toward terabyte pattern mining: an architecture-conscious solution

Authors: Gregory Buehrer, Srinivasan Parthasarathy, Shirish Tatikonda, Joel Saltz

PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 2 - 12

https://doi.org/10.1145/1229428.1229432

Published: 14 March 2007

Abstract

We present a strategy for mining frequent item sets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious data mining, taking into account the capabilities of the processor, the memory hierarchy, and the available network interconnects. Optimizations have been designed for lowering communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory, and I/O utilization using pruning and tiling techniques, and smart data placement strategies, are also employed. We leverage the extended memory space and computational resources of a distributed message-passing cluster to design a scalable solution, where each node can extend its metastructures beyond main memory by leveraging 64-bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy results in near-linear scaleup on up to 48 nodes.
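
For readers unfamiliar with FPGrowth, the following is a minimal single-machine sketch of its first stage, building the FP-tree (a prefix tree of transactions whose items are sorted by descending support). It is an illustration only and not the distributed, architecture-conscious implementation the abstract describes; the names `FPNode`, `build_fp_tree`, and `count_nodes` are hypothetical, and the alphabetical tie-break between equally frequent items is an assumption made to keep the result deterministic.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item, its count, and child links."""

    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree in two passes over the transactions (illustrative only)."""
    # Pass 1: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    root = FPNode(None, None)
    # Pass 2: insert each transaction, keeping only frequent items and
    # ordering them by descending support so common prefixes are shared.
    for t in transactions:
        items = sorted(
            (i for i in set(t) if i in frequent),
            key=lambda i: (-frequent[i], i),  # assumed tie-break: item order
        )
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of nodes below `node` (excluding `node` itself)."""
    return sum(1 + count_nodes(c) for c in node.children.values())
```

Because transactions that share frequent items share tree paths, the tree is usually much smaller than the raw data, which is why a compact encoding of this structure matters when it must be held in memory or sent between nodes.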

Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining

Published In

PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming

March 2007, 284 pages

ISBN: 9781595936028

DOI: 10.1145/1229428

General Chair: Katherine Yelick, UC Berkeley and Lawrence Berkeley National Lab., USA

Program Chair: John Mellor-Crummey, Rice University, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers: Article

Conference

PPoPP07: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

March 14 - 17, 2007

San Jose, California, USA

Acceptance Rates

PPoPP '07 paper acceptance rate: 22 of 65 submissions, 34%

Overall acceptance rate: 230 of 1,014 submissions, 23%

Bibliometrics & Citations

Article Metrics

Total Citations: 31
Total Downloads: 793
Downloads (last 12 months): 6
Downloads (last 6 weeks): 0

Reflects downloads up to 13 Feb 2025

Cited By

Essam A, Abdel-Fattah M, Abdelhamid L (2022). Towards Enhancing the Performance of Parallel FP-Growth on Spark. IEEE Access 10 (286-296). Online publication date: 2022
https://doi.org/10.1109/ACCESS.2021.3137789
Renz-Wieland A, Bertsch M, Gemulla R (2019). Scalable Frequent Sequence Mining with Flexible Subsequence Constraints. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (1490-1501). Online publication date: Apr-2019
https://doi.org/10.1109/ICDE.2019.00134
Lin C, Chung S, Chen J, Yu Y, Lin K (2018). A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments. Distributed and Parallel Databases 36:4 (613-641). Online publication date: 1-Dec-2018
https://dl.acm.org/doi/10.1007/s10619-018-7221-9
Shahbazi N, Soltani R, Gryz J (2018). Memory Efficient Frequent Itemset Mining. Machine Learning and Data Mining in Pattern Recognition (16-27). Online publication date: 8-Jul-2018
https://doi.org/10.1007/978-3-319-96133-0_2
Saeed A, Rauf A, Khusro S, Mahfooz S (2016). Compressed Bitmaps Based Frequent Itemsets Mining on Hadoop. Proceedings of the 10th International Conference on Informatics and Systems (159-165). Online publication date: 9-May-2016
https://dl.acm.org/doi/10.1145/2908446.2908457
Shohdy S, Vishnu A, Agrawal G (2016). Fault Tolerant Frequent Pattern Mining. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC) (12-21). Online publication date: Dec-2016
https://doi.org/10.1109/HiPC.2016.012
Beedkar K, Berberich K, Gemulla R, Miliaraki I (2015). Closing the Gap. ACM Transactions on Database Systems 40:2 (1-44). Online publication date: 30-Jun-2015
https://dl.acm.org/doi/10.1145/2757217
Vishnu A, Agarwal K (2015). Large Scale Frequent Pattern Mining Using MPI One-Sided Model. Proceedings of the 2015 IEEE International Conference on Cluster Computing (138-147). Online publication date: 8-Sep-2015
https://dl.acm.org/doi/10.1109/CLUSTER.2015.30
Wang J, Zeng Y (2014). The Optimization of Parallel Frequent Pattern Growth Algorithm Based on Mahout in Cloud Manufacturing Environment. Proceedings of the 2014 Seventh International Symposium on Computational Intelligence and Design - Volume 02 (420-423). Online publication date: 13-Dec-2014
https://dl.acm.org/doi/10.1109/ISCID.2014.258
Rathi S, Dhote C (2014). Using parallel approach in pre-processing to improve frequent pattern growth algorithm. 2014 International Conference on Information Systems and Computer Networks (ISCON) (72-76). Online publication date: Mar-2014
https://doi.org/10.1109/ICISCON.2014.6965221
