DOI: 10.1145/1229428.1229432
Article

Toward terabyte pattern mining: an architecture-conscious solution

Published: 14 March 2007

Abstract

We present a strategy for mining frequent item sets from terabyte-scale data sets on cluster systems. The algorithm embraces the holistic notion of architecture-conscious data mining, taking into account the capabilities of the processor, the memory hierarchy, and the available network interconnects. Optimizations have been designed for lowering communication costs using compressed data structures and a succinct encoding. Optimizations for improving cache, memory, and I/O utilization, using pruning and tiling techniques and smart data placement strategies, are also employed. We leverage the extended memory space and computational resources of a distributed message-passing cluster to design a scalable solution, where each node can extend its metastructures beyond main memory by leveraging 64-bit architecture support. Our solution strategy is presented in the context of FPGrowth, a well-studied and rather efficient frequent pattern mining algorithm. Results demonstrate that the proposed strategy achieves near-linear scaleup on up to 48 nodes.
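
For readers unfamiliar with FPGrowth, the base algorithm stores the data set in a prefix tree (the FP-tree) built in two scans: one to count items, one to insert each transaction with its frequent items in descending frequency order. The sketch below shows only that textbook, single-machine construction. It is not the distributed, out-of-core variant the abstract describes, and the names `FPNode`, `build_fp_tree`, and `count_nodes` are hypothetical, chosen for illustration.

```python
from collections import Counter


class FPNode:
    """One node of an FP-tree: an item, its count, and links to children."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    """Build an FP-tree in two scans over an in-memory list of transactions."""
    # Scan 1: count how many transactions contain each item,
    # then keep only the items that meet the support threshold.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_support}

    # Scan 2: insert each transaction with its frequent items sorted by
    # descending frequency (ties broken by item name), so transactions
    # that share a prefix also share tree nodes.
    root = FPNode(None)
    for t in transactions:
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root, frequent


def count_nodes(node):
    """Number of item nodes below `node`, a rough measure of tree size."""
    return sum(1 + count_nodes(child) for child in node.children.values())


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "d"]]
    tree, frequent_items = build_fp_tree(data, min_support=2)
    print(frequent_items, count_nodes(tree))
```

Sorting by descending frequency is what keeps the tree compact: the most common items sit near the root, where many transactions can share them.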

    Published In

    PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming
    March 2007
    284 pages
    ISBN:9781595936028
    DOI:10.1145/1229428

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. itemset mining
    2. out of core
    3. parallel

    Qualifiers

    • Article

    Conference

    PPoPP07

    Acceptance Rates

    PPoPP '07 Paper Acceptance Rate 22 of 65 submissions, 34%;
    Overall Acceptance Rate 230 of 1,014 submissions, 23%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (last 12 months): 6
    • Downloads (last 6 weeks): 0
    Reflects downloads up to 13 Feb 2025

    Cited By

    • (2022) Towards Enhancing the Performance of Parallel FP-Growth on Spark. IEEE Access 10, pp. 286-296. DOI: 10.1109/ACCESS.2021.3137789. Online publication date: 2022
    • (2019) Scalable Frequent Sequence Mining with Flexible Subsequence Constraints. 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1490-1501. DOI: 10.1109/ICDE.2019.00134. Online publication date: Apr-2019
    • (2018) A fast and low idle time method for mining frequent patterns in distributed and many-task computing environments. Distributed and Parallel Databases 36:4, pp. 613-641. DOI: 10.1007/s10619-018-7221-9. Online publication date: 1-Dec-2018
    • (2018) Memory Efficient Frequent Itemset Mining. Machine Learning and Data Mining in Pattern Recognition, pp. 16-27. DOI: 10.1007/978-3-319-96133-0_2. Online publication date: 8-Jul-2018
    • (2016) Compressed Bitmaps Based Frequent Itemsets Mining on Hadoop. Proceedings of the 10th International Conference on Informatics and Systems, pp. 159-165. DOI: 10.1145/2908446.2908457. Online publication date: 9-May-2016
    • (2016) Fault Tolerant Frequent Pattern Mining. 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), pp. 12-21. DOI: 10.1109/HiPC.2016.012. Online publication date: Dec-2016
    • (2015) Closing the Gap. ACM Transactions on Database Systems 40:2, pp. 1-44. DOI: 10.1145/2757217. Online publication date: 30-Jun-2015
    • (2015) Large Scale Frequent Pattern Mining Using MPI One-Sided Model. Proceedings of the 2015 IEEE International Conference on Cluster Computing, pp. 138-147. DOI: 10.1109/CLUSTER.2015.30. Online publication date: 8-Sep-2015
    • (2014) The Optimization of Parallel Frequent Pattern Growth Algorithm Based on Mahout in Cloud Manufacturing Environment. Proceedings of the 2014 Seventh International Symposium on Computational Intelligence and Design - Volume 02, pp. 420-423. DOI: 10.1109/ISCID.2014.258. Online publication date: 13-Dec-2014
    • (2014) Using parallel approach in pre-processing to improve frequent pattern growth algorithm. 2014 International Conference on Information Systems and Computer Networks (ISCON), pp. 72-76. DOI: 10.1109/ICISCON.2014.6965221. Online publication date: Mar-2014
