On efficiently summarizing categorical databases

  • Research Article
  • Knowledge and Information Systems

Abstract

Frequent itemset mining was initially proposed, and has been studied extensively, in the context of association rule mining. In recent years, several studies have extended its application to transaction or document clustering. However, most frequent-itemset-based clustering algorithms must first mine a large intermediate set of frequent itemsets in order to identify the subset of most promising ones that can be used for clustering. In this paper, we study how to directly find a subset of high-quality frequent itemsets that can serve as a concise summary of the transaction database and can be used to cluster categorical data. By exploring key properties of the itemsets of interest, we propose several search-space pruning methods and design an efficient algorithm called SUMMARY. Our empirical results show that SUMMARY runs very fast even when the minimum support is extremely low and scales very well with respect to the database size. Surprisingly, although it is a pure frequent itemset mining algorithm, it is very effective at clustering categorical data and summarizing dense transaction databases.
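To make the terms in the abstract concrete, the following is a minimal brute-force sketch of frequent itemset mining with a minimum support threshold, followed by a simple grouping of transactions by a frequent itemset they contain. It is not the SUMMARY algorithm and uses none of its pruning methods. The function names `frequent_itemsets` and `summarize` are hypothetical, and the choice to represent each transaction by one of the longest frequent itemsets it contains is an assumption made for illustration, since the abstract does not define "high quality".

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset contained in at least
    `min_support` transactions (exhaustive enumeration, toy data only)."""
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        # Enumerate every non-empty subset of the transaction.
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                key = frozenset(combo)
                counts[key] = counts.get(key, 0) + 1
    return {s: c for s, c in counts.items() if c >= min_support}


def summarize(transactions, min_support):
    """Group transaction indices by one longest frequent itemset each contains.

    Assumption (not from the abstract): "longest" is the quality criterion;
    ties are broken by support, then lexicographically for determinism.
    """
    frequent = frequent_itemsets(transactions, min_support)
    clusters = {}
    for idx, t in enumerate(transactions):
        contained = [s for s in frequent if s <= set(t)]
        if not contained:
            continue  # no frequent itemset covers this transaction
        best = max(contained, key=lambda s: (len(s), frequent[s], sorted(s)))
        clusters.setdefault(best, []).append(idx)
    return clusters


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "b", "c", "d"], ["a", "b"],
           ["x", "y"], ["x", "y", "z"]]
    for itemset, members in summarize(toy, min_support=2).items():
        print(sorted(itemset), members)
```

The exhaustive enumeration is exponential in transaction length, which is the cost that motivates mining the summary itemsets directly rather than generating every frequent itemset first.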

Author information

Corresponding author

Correspondence to George Karypis.

Additional information

Jianyong Wang received the Ph.D. degree in computer science in 1999 from the Institute of Computing Technology, the Chinese Academy of Sciences. Since then, he has worked as an assistant professor in the Department of Computer Science and Technology, Peking (Beijing) University, in the areas of distributed systems and Web search engines, and has visited the School of Computing Science at Simon Fraser University, the Department of Computer Science at the University of Illinois at Urbana-Champaign, and the Digital Technology Center and the Department of Computer Science at the University of Minnesota, mainly working in the area of data mining. He is currently an associate professor in the Department of Computer Science and Technology at Tsinghua University, P.R. China.

George Karypis received his Ph.D. degree in computer science from the University of Minnesota and is currently an associate professor in the Department of Computer Science and Engineering at the University of Minnesota. His research interests span the areas of parallel algorithm design, data mining, bioinformatics, information retrieval, applications of parallel processing in scientific computing and optimization, sparse matrix computations, parallel preconditioners, and parallel programming languages and libraries. His research has resulted in the development of software libraries for serial and parallel graph partitioning (METIS and ParMETIS), hypergraph partitioning (hMETIS), parallel Cholesky factorization (PSPASES), collaborative filtering-based recommendation algorithms (SUGGEST), clustering high-dimensional datasets (CLUTO), and finding frequent patterns in diverse datasets (PAFI). He has coauthored over ninety journal and conference papers on these topics and a book titled “Introduction to Parallel Computing” (Publ. Addison Wesley, 2003, 2nd edition). In addition, he serves on the program committees of many conferences and workshops on these topics and is an associate editor of the IEEE Transactions on Parallel and Distributed Systems.

About this article

Cite this article

Wang, J., Karypis, G. On efficiently summarizing categorical databases. Knowl Inf Syst 9, 19–37 (2006). https://doi.org/10.1007/s10115-005-0216-7

  • DOI: https://doi.org/10.1007/s10115-005-0216-7