A new classification of datasets for frequent itemsets

Journal of Intelligent Information Systems

Abstract

The discovery of frequent patterns is a well-known problem in data mining. While plenty of algorithms have been proposed during the last decade, only a few contributions have tried to understand the influence of datasets on the algorithms' behavior. Explaining why certain algorithms are likely to perform very well or very poorly on some datasets is still an open question. In this setting, we describe a thorough experimental study of datasets with respect to frequent itemsets. We study the distribution of frequent itemsets with respect to itemset size, together with the distribution of three concise representations: frequent closed, frequent free and frequent essential itemsets. For each of them, we also study the distribution of their positive and negative borders whenever possible. The main outcome of these experiments is a new classification of datasets that is invariant w.r.t. minsup variations and robust enough to explain the efficiency of several implementations.
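To make the quantities named in the abstract concrete, the following is a minimal brute-force sketch in Python. It is not the authors' implementation. It assumes the standard definitions: an itemset is frequent if its support reaches minsup, closed if no proper superset has the same support, and free if every immediate subset has strictly larger support. The positive border is taken as the maximal frequent itemsets and the negative border as the minimal infrequent ones. The toy transactions, the `MINSUP` value and all function names are hypothetical, and essential itemsets are left out.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy database; the paper's datasets are far larger.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c"},
]
MINSUP = 2  # absolute support threshold, chosen for illustration


def support(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def all_itemsets(transactions):
    """Every non-empty subset of the item universe (exponential, toy use only)."""
    items = sorted(set().union(*transactions))
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)


def mine(transactions, minsup):
    """Return (frequent, closed, free, positive border, negative border)."""
    n = len(transactions)
    supp = {x: support(x, transactions) for x in all_itemsets(transactions)}
    frequent = {x for x, s in supp.items() if s >= minsup}

    def sub_supp(x, item):
        # Support of x without `item`; the empty set is in every transaction.
        rest = x - {item}
        return supp[rest] if rest else n

    # Closed: no proper superset has the same support.
    closed = {x for x in frequent
              if not any(x < y and supp[y] == supp[x] for y in supp)}
    # Free: every immediate subset has strictly larger support.
    free = {x for x in frequent
            if all(sub_supp(x, i) > supp[x] for i in x)}
    # Positive border: maximal frequent itemsets.
    positive = {x for x in frequent if not any(x < y for y in frequent)}
    # Negative border: infrequent itemsets whose immediate subsets are all frequent.
    negative = {x for x in supp if x not in frequent
                and all(len(x) == 1 or (x - {i}) in frequent for i in x)}
    return frequent, closed, free, positive, negative


def size_distribution(itemsets):
    """Map itemset size to the number of itemsets of that size."""
    return dict(sorted(Counter(len(x) for x in itemsets).items()))


if __name__ == "__main__":
    names = ("frequent", "closed", "free", "positive border", "negative border")
    for name, family in zip(names, mine(TRANSACTIONS, MINSUP)):
        print(name, size_distribution(family))
```

Exhaustive enumeration is only workable for a handful of items. The point of the sketch is the shape of the output, one size histogram per collection, which is the kind of distribution the study compares across datasets.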

Notes

  1. The set of closed itemsets is not closed downwards, and thus the notion of borders does not apply.
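The note above can be checked on a tiny example. The sketch below is an illustration under assumed data, not material from the paper: it uses a hypothetical three-transaction database and the usual characterization that an itemset is closed when adding any further item strictly lowers its support.

```python
# Hypothetical toy database, chosen so that {a, b} is closed but {a} is not.
TRANSACTIONS = [{"a", "b"}, {"a", "b"}, {"a", "b", "c"}]
ITEMS = set().union(*TRANSACTIONS)


def support(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in TRANSACTIONS if itemset <= t)


def is_closed(itemset):
    """True if every single-item extension has strictly smaller support."""
    return all(support(itemset | {i}) < support(itemset)
               for i in ITEMS - itemset)


if __name__ == "__main__":
    print(is_closed(frozenset("ab")))  # closed: adding c lowers the support
    print(is_closed(frozenset("a")))   # not closed: {a, b} has the same support
```

Here {a, b} is closed while its subset {a} is not, so the collection of closed itemsets is not downward closed.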

Author information

Corresponding author

Correspondence to Frédéric Flouvat.

About this article

Cite this article

Flouvat, F., De Marchi, F. & Petit, JM. A new classification of datasets for frequent itemsets. J Intell Inf Syst 34, 1–19 (2010). https://doi.org/10.1007/s10844-008-0077-0

  • DOI: https://doi.org/10.1007/s10844-008-0077-0
