Skip to main content
Log in

An integrated, generic approach to pattern mining: data mining template library

  • Published:
Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Abstract

Frequent pattern mining (FPM) is an important data mining paradigm to extract informative patterns like itemsets, sequences, trees, and graphs. However, no practical framework for integrating the FPM tasks has been attempted. In this paper, we describe the design and implementation of the Data Mining Template Library (DMTL) for FPM. DMTL utilizes a generic data mining approach, where all aspects of mining are controlled via a set of properties. It uses a novel pattern property hierarchy to define and mine different pattern types. This property hierarchy can be thought of as a systematic characterization of the pattern space, i.e., a meta-pattern specification that allows the analyst to specify new pattern types, by extending this hierarchy. Furthermore, in DMTL all aspects of mining are controlled by a set of different mining properties. For example, the kind of mining approach to use, the kind of data types and formats to mine over, the kind of back-end storage manager to use, are all specified as a list of properties. This provides tremendous flexibility to customize the toolkit for various applications. Flexibility of the toolkit is exemplified by the ease with which support for a new pattern can be added. Experiments on synthetic and public dataset are conducted to demonstrate the scalability provided by the persistent back-end in the library. DMTL been publicly released as open-source software (http://dmtl.sourceforge.net/), and has been downloaded by numerous researchers from all over the world.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  • Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: ACM SIGMOD conference on management of data

  • Agrawal R, Mannila H, Srikant R, Toivonen H, Verkamo AI (1996) Fast discovery of association rules. In: Advances in knowledge discovery and data mining. AAAI Press, Menlo Park, CA, pp 307–328

  • Agrawal R, Srikant R (1995) Mining sequential patterns. In: 11th International conference on data engineering

  • Antunes C, Oliveira AL (2004) Sequential pattern mining algorithms: Trade-offs between speed and memory. In: 2nd International workshop on mining graphs, trees and sequences with ECML/PKDD

  • Asai T, Abe K, Kawasoe S, Arimura H, Satamoto H, Arikawa S (2002) Efficient substructure discovery from large semi-structured data. In: 2nd SIAM international conference on data mining

  • Asai T, Arimura H, Uno T, Nakano S (2003) Discovering frequent substructures in large unordered trees. In: 6th International conference on discovery science

  • Ayres J, Flannick J, Gehrke JE, Yiu T (2002) Sequential pattern mining using a bitmap representation. In: ACM SIGKDD international conference on knowledge discovery and data mining

  • Balcazar JL, Casas-Garriga G (2005) On horn axiomatizations for sequential data. In: 10th International confererence on database theory

  • Bayardo RJ Jr (1998) Efficiently mining long patterns from databases. In: SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. ACM, New York, USA, pp 85–93

  • Brin S, Motwani R, Ullman J, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: ACM SIGMOD conference on management of data

  • Buehrer G, Parthasarathy S, Ghoting A (2006). Out-of-core frequent pattern mining on a commodity PC. In: ACM SIGKDD international conference on knowledge discovery and data mining

  • Burdick D, Calimlim M, Gehrke J (2001a) MAFIA: a maximal frequent itemset algorithm for transactional databases. In: IEEE international conference on data engineering

  • Burdick D, Calimlim M, Gehrke J (2001b) MAFIA: a maximal frequent itemset algorithm for transactional databases. In: 17th International conference on data engineering

  • Chi Y, Yang Y, Muntz RR (2003) Indexing and mining free trees. In: 3rd IEEE international conference on data mining

  • Chi Y, Yang Y, Muntz RR (2004a) HybridTreeMiner: an efficient algorihtm for mining frequent rooted trees and free trees using canonical forms. In: 16th International conference on scientific and statistical database management

  • Chi Y, Yang Y, Xia Y, Muntz RR (2004b) CMTreeMiner: mining both closed and maximal frequent subtrees. In: 8th Pacific-Asia conferernce on knowledge discovery and data mining

  • Cook D, Holder L (1994) Substructure discovery using minimal description length and background knowledge. J Arti Intell Res 1: 231–255

    Google Scholar 

  • Dehaspe L, Toivonen H, King R (1998) Finding frequent substructures in chemical compounds. In: 4th ACM SIGKDD international conference knowledge discovery and data mining

  • Ganter B, Wille R (1999) Formal concept analysis: mathematical foundations. Springer-Verlag

  • Garofalakis M, Rastogi R, Shim K (1999) SPIRIT: sequential pattern mining with regular expression constraints. In: 25th International conference on very large data bases

  • Ghoting A, Buehrer G, Parthasarathy S, Kim D, Nguyen A, Chen Y-K, Dubey P (2005) Cache-conscious frequent pattern mining on a modern processor. In: 31st International conference on very large data bases

  • Goethals B, Zaki MJ (2003) Advances in frequent itemset mining implementations: report on FIMI’03. SIGKDD Explor 6: 109–117

    Article  Google Scholar 

  • Gschwind T (2001) PSTL—A C++ Persistent Standard Template Library. In: 6th USENIX conference on object-oriented technologies and systems

  • Han J, Pei J, Yin Y (2000a) Mining frequent patterns without candidate generation. In: ACM SIGMOD conference on management of data

  • Han J, Pei J, Yin Y (2000b). Mining frequent patterns without candidate generation. In: ACM SIGMOD conference on management of data

  • Hasan MA, Chaoji V, Salem S, Zaki M (2005) DMTL: A generic Data Mining Template Library. In: 1st Workshop on library-centric software design (with OOPSLA)

  • Horváth T, Ramon J, Wrobel S (2006) Frequent subgraph mining in outerplanar graphs. In: 12th ACM SIGKDD international conference on knowledge discovery and data mining

  • Huan J, Wang W, Prins J (2003a) Efficient mining of frequent subgraphs in the presence of isomorphism. In: IEEE international conference on data mining

  • Huan J, Wang W, Prins J (2003b) Efficient mining of frequent subgraphs in the presence of isomorphism (Technical report TR03-021). University of North Carolina

  • Inokuchi A, Washio T, Motoda H (2000) An apriori-based algorithm for mining frequent substructures from graph data. In: 4th European conference on principles of knowledge discovery and data mining

  • Inokuchi A, Washio T, Motoda H (2003) Complete mining of frequent patterns from graphs: Mining graph data. Machine Learn 50: 321–354

    Article  MATH  Google Scholar 

  • Kramer S, Raedt LD, Helma C (2001) Molecular feature mining in HIV data. In: ACM SIGKDD international conference on knowledge discovery and data mining

  • Kuramochi M, Karypis G (2001). Frequent subgraph discovery. In: 1st IEEE international conference on data mining

  • Kuramochi M, Karypis G (2004) An efficient algorithm for discovering frequent subgraphs. IEEE Transactions on Knowledge and Data Engineering 16: 1038–1051

    Article  Google Scholar 

  • Mannila H, Toivonen H. (1996) Discovering generalized episodes using minimal occurences. In: 2nd International conference knowledge discovery and data mining

  • Mannila H, Toivonen H. (1997) Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery 1: 241–258

    Article  Google Scholar 

  • Mannila H, Toivonen H, Verkamo I (1995) Discovering frequent episodes in sequences. In: 1st International conference knowledge discovery and data mining

  • Musser D, Derge G, Saini A. (2001) STL tutorial and reference guide, 2nd edition. Addison-Wesley

  • Nijssen S, Kok J (2003) Efficient discovery of frequent unordered trees. In: 1st Internationall workshop on mining graphs, trees and sequences

  • Nijssen S, Kok J (2004) A quickstart in frequent structure mining can make a difference. In: ACM SIGKDD international conference on knowledge discovery and data mining

  • Oates T, Schmill MD, Jensen D, Cohen PR (1997) A family of algorithms for finding temporal structure in data. In: 6th International workshop on AI and statistics

  • Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: 7th International conference on database theory

  • Pei J, Han J, Mortazavi-Asl B, Pinto H, Chen Q, Dayal U, Hsu M-C (2001). PrefixSpan: mining sequential patterns efficiently by prefixprojected pattern growth. In: IEEE international conference on data engineering

  • Savasere A, Omiecinski E, Navathe S (1995) An efficient algorithm for mining association rules in large databases. In: 21st International conference on very large data bases

  • Shasha D, Wang J, Zhang S (2004) Unordered tree mining with applications to phylogeny. In: IEEE international conference on data engineering

  • Siek J, Lee L, Lumsdaine A (2002). The boost graph library. Addison-Wesley

  • Srikant R, Agrawal R (1996) Mining sequential patterns: Generalizations and performance improvements. In: 5th International conference extending database technology

  • Termier A, Rousset M-C, Sebag M (2002) TreeFinder: a first step towards xml data mining. In: IEEE international conference on data mining

  • Termier A, Rousset M-C, Sebag M (2004) Dryade: a new approach for discovering closed frequent trees in heterogeneous tree databases. In: IEEE international conference on data mining

  • Ullmann JR (1976) An algorithm for subgraph isomorphism. J ACM 23: 31–

    Article  MathSciNet  Google Scholar 

  • Wang C, Hong M, Pei J, Zhou H, Wang W, Shi B (2004) Efficient pattern-growth methods for frequent tree pattern mining. In: Pacific-Asia conference on knowledge discovery and data mining

  • Wang J, Han J (2004) BIDE: efficient mining of frequent closed sequences. In: IEEE international conference on data engineering

  • Wang J, Han J, Pei J. (2003). CLOSET+: searching for the best strategies for mining frequent closed itemsets. In: ACM SIGKDD international conference on knowledge discovery and data mining

  • Wang K, Liu H (1998) Discovering typical structures of documents: a road map approach. In: ACM SIGIR international conference on information retrieval

  • Witten I, Frank E (1999) Data mining: practical machine learning tools and techniques with java implementations. Morgan Kauffman

  • Xiao Y, Yao J-F, Li Z, Dunham MH (2003) Efficient data mining for maximal frequent subtrees. In: IEEE international conference on data mining

  • Yan X, Han J (2002a) gSpan: graph-based substructure pattern mining. In: IEEE international conference on data mining

  • Yan X, Han J (2002b) gSpan: graph-based substructure pattern mining (Technical report UIUCDCS-R-2002-2296). University of Illinois at Urbana-Champaign

  • Yan X, Han J (2003) CloseGraph: mining closed frequent graph patterns In: ACM SIGKDD international conference on knowledge discovery and data mining

  • Yoshida K, Motoda H (1995) CLIP: concept learning from inference patterns. Artif Intel 75: 63–92

    Article  Google Scholar 

  • Zaki MJ (2000a) Scalable algorithms for association mining. IEEE Trans Knowl Data Eng 12: 372–390

    Article  Google Scholar 

  • Zaki MJ (2000b) Sequences mining in categorical domains: Incorporating constraints. In: 9th International conference on information and knowledge management

  • Zaki MJ (2001) SPADE: An efficient algorithm for mining frequent sequences. Machine Learn J 42: 31–60

    Article  MATH  Google Scholar 

  • Zaki MJ (2002) Efficiently mining frequent trees in a forest. In: 8th ACM SIGKDD international conference on knowledge discovery and data mining

  • Zaki MJ (2005a) Efficiently mining frequent embedded unordered trees. Fundamenta Informaticae 66: 33–52

    MATH  MathSciNet  Google Scholar 

  • Zaki MJ (2005b) Efficiently mining frequent trees in a forest: algorithms and applications. IEEE Trans Knowledge Data Eng 17: 1021–1035

    Article  Google Scholar 

  • Zaki MJ, Gouda K (2003) Fast vertical mining using Diffsets. In: 9th ACM SIGKDD international conference on knowledge discovery and data mining, pp 326–335

  • Zaki MJ, Hsiao C-J (2002) ChARM: an efficient algorithm for closed itemset mining. In: 2nd SIAM international conference on data mining

  • Zaki MJ, Hsiao C-J (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17: 462–478

    Article  Google Scholar 

  • Zaki MJ, Parimi N, De N, Gao F, Phoophakdee B, Urban J, Chaoji V, Hasan M, Salem S (2004) Towards generic pattern mining. In: International conference on formal concept analysis (Invited paper)

  • Zaki MJ, Parthasarathy S, Ogihara M, Li W (1997) New algorithms for fast discovery of association rules. In: 3rd International conference on knowledge discovery and data mining

  • Zou B, Ma X, Kemme B, Newton G, Precu D (2006) Data mining using relational database management systems. In: Pacific-asia conference on knowledge discovery and data mining

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mohammed J. Zaki.

Additional information

Responsible editor: Hannu Toivonen.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Chaoji, V., Al Hasan, M., Salem, S. et al. An integrated, generic approach to pattern mining: data mining template library. Data Min Knowl Disc 17, 457–495 (2008). https://doi.org/10.1007/s10618-008-0098-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10618-008-0098-x

Keywords

Navigation