A new classification of datasets for frequent itemsets

Flouvat, Frédéric; De Marchi, Fabien; Petit, Jean-Marc

doi:10.1007/s10844-008-0077-0


Published: 21 January 2009

Volume 34, pages 1–19, (2010)

Journal of Intelligent Information Systems


Abstract

The discovery of frequent patterns is a well-known problem in data mining. While plenty of algorithms have been proposed during the last decade, only a few contributions have tried to understand the influence of datasets on the behavior of these algorithms. Explaining why certain algorithms are likely to perform very well or very poorly on some datasets is still an open question. In this setting, we describe a thorough experimental study of datasets with respect to frequent itemsets. We study the distribution of frequent itemsets with respect to itemset size, together with the distribution of three concise representations: frequent closed, frequent free and frequent essential itemsets. For each of them, we also study the distribution of their positive and negative borders whenever possible. The main outcome of these experiments is a new classification of datasets that is invariant with respect to minsup variations and robust enough to explain the efficiency of several implementations.
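
As an illustration of the quantities named in the abstract, the following minimal sketch computes, by brute force, the distribution of frequent itemsets by size together with the positive border (maximal frequent itemsets) and the negative border (minimal infrequent itemsets). The toy transactions, the threshold `MINSUP = 2` and all function names are assumptions made for this example; they are not taken from the article, whose experiments use real benchmark datasets and dedicated implementations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy dataset (not from the article): five transactions over items a-d.
TRANSACTIONS = [frozenset(t) for t in ("abc", "abc", "ab", "ad", "bc")]
MINSUP = 2  # absolute support threshold, chosen only for illustration


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def is_frequent(itemset):
    # The empty itemset is contained in every transaction, so it is frequent here.
    return support(itemset) >= MINSUP


def all_itemsets():
    """Every non-empty itemset over the items that occur in the data."""
    items = sorted(set().union(*TRANSACTIONS))
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


frequent = {x for x in all_itemsets() if is_frequent(x)}

# Distribution of frequent itemsets with respect to itemset size.
size_distribution = Counter(len(x) for x in frequent)

# Positive border: frequent itemsets that have no frequent proper superset.
positive_border = {x for x in frequent if not any(x < y for y in frequent)}

# Negative border: infrequent itemsets whose one-item-smaller subsets are all frequent.
negative_border = {
    x for x in all_itemsets()
    if not is_frequent(x) and all(is_frequent(x - {i}) for i in x)
}

print(dict(sorted(size_distribution.items())))              # {1: 3, 2: 3, 3: 1}
print(sorted("".join(sorted(x)) for x in positive_border))  # ['abc']
print(sorted("".join(sorted(x)) for x in negative_border))  # ['d']
```

Exhaustive enumeration is exponential in the number of items and is only workable on toy data; it is used here because it mirrors the definitions directly.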

Notes

The set of closed itemsets is not closed downwards, and thus the notion of borders does not apply.
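
The note can be checked on a small example. The sketch below, which assumes a hypothetical five-transaction dataset and helper names invented for this illustration, tests whether the closed itemsets and the free itemsets each form a downward-closed family. An itemset is taken to be closed when adding any item strictly lowers its support, and free when removing any item strictly raises it.

```python
from itertools import combinations

# Hypothetical toy dataset, chosen only for illustration.
TRANSACTIONS = [frozenset(t) for t in ("abc", "abc", "ab", "ad", "bc")]
ITEMS = sorted(set().union(*TRANSACTIONS))


def support(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def is_closed(itemset):
    """Closed: adding any further item strictly lowers the support."""
    return all(support(itemset | {i}) < support(itemset)
               for i in ITEMS if i not in itemset)


def is_free(itemset):
    """Free: removing any single item strictly raises the support."""
    return all(support(itemset - {i}) > support(itemset) for i in itemset)


def downward_closed(family):
    """True if dropping one item from any member (of size > 1) yields a member."""
    return all(x - {i} in family for x in family if len(x) > 1 for i in x)


itemsets = [frozenset(c) for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)]
closed = {x for x in itemsets if support(x) > 0 and is_closed(x)}
free = {x for x in itemsets if support(x) > 0 and is_free(x)}

# {b, c} is closed, but its subset {c} is not: {c} and {b, c} have equal support.
print(frozenset("bc") in closed, frozenset("c") in closed)  # True False
print(downward_closed(closed))  # False
print(downward_closed(free))    # True
```

On this data the closed family contains {b, c} but not {c}, so maximal and minimal elements do not delimit it the way borders delimit a downward-closed collection, whereas the free itemsets do form a downward-closed family.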

Author information

Authors and Affiliations

University of New Caledonia, PPME, BP R4, 98851, Nouméa, New Caledonia
Frédéric Flouvat
Université de Lyon, Université Lyon 1, LIRIS, UMR5205 CNRS, 69621, Lyon, France
Fabien De Marchi
Université de Lyon, INSA-Lyon, LIRIS, UMR5205 CNRS, 69621, Lyon, France
Jean-Marc Petit

Corresponding author

Correspondence to Frédéric Flouvat.

About this article

Cite this article

Flouvat, F., De Marchi, F. & Petit, JM. A new classification of datasets for frequent itemsets. J Intell Inf Syst 34, 1–19 (2010). https://doi.org/10.1007/s10844-008-0077-0

Received: 30 August 2007
Revised: 22 December 2008
Accepted: 22 December 2008
Published: 21 January 2009
Issue Date: February 2010
DOI: https://doi.org/10.1007/s10844-008-0077-0
