Summarizing categorical data by clustering attributes

Data Mining and Knowledge Discovery

Abstract

For a book, the title and abstract give a good first impression of what to expect. For a database, obtaining such a first impression is typically less straightforward. Low-order statistics provide only very limited insight, whereas mining the data outright quickly yields too much detail for a quick glance. In this paper we propose a middle ground and introduce a parameter-free method for constructing high-quality descriptive summaries of binary and categorical data. Our approach builds a summary by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering, without requiring a distance measure between attributes. Besides providing a practical overview of which attributes interact most strongly, these summaries can be used as surrogates for the data and can easily be queried. Extensive experimentation shows that our method discovers high-quality results: correlated attributes are correctly grouped, which is verified both objectively and subjectively. As an example of their use as surrogates, we show that we can quickly and accurately query the estimated supports of frequent generalized itemsets.
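To make the idea of MDL-driven attribute clustering concrete, the following is a minimal sketch and not the authors' algorithm or encoding, neither of which is specified in the abstract. It assumes a crude two-part code: the data cost of a cluster is the number of rows times the empirical joint entropy of its attributes, and the model cost is a BIC-style penalty of 0.5·log2(n) bits per free parameter. The greedy bottom-up merging strategy and the function names `cluster_cost` and `cluster_attributes` are likewise hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log2


def cluster_cost(rows, cluster):
    """Bits needed to encode the columns in `cluster` jointly (assumed code)."""
    n = len(rows)
    counts = Counter(tuple(row[a] for a in cluster) for row in rows)
    # Data cost: n times the empirical joint entropy of the cluster.
    data_bits = -sum(c * log2(c / n) for c in counts.values())
    # Model cost (assumption, BIC-style): 0.5 * log2(n) bits per free
    # parameter of the joint distribution over all value combinations.
    params = 1
    for a in cluster:
        params *= len({row[a] for row in rows})
    model_bits = 0.5 * log2(n) * (params - 1)
    return data_bits + model_bits


def cluster_attributes(rows):
    """Greedily merge attribute clusters while the total cost decreases."""
    clusters = [(a,) for a in range(len(rows[0]))]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for x, y in combinations(clusters, 2):
            separate = cluster_cost(rows, x) + cluster_cost(rows, y)
            gain = separate - cluster_cost(rows, x + y)
            if gain > best_gain:
                best_gain, best_pair = gain, (x, y)
        if best_pair is None:
            break  # no merge shortens the description, so stop
        x, y = best_pair
        merged = tuple(sorted(x + y))
        clusters = [c for c in clusters if c not in (x, y)] + [merged]
    return sorted(clusters)


# Toy data: attribute 1 copies attribute 0; attribute 2 is independent.
toy_rows = [(i % 2, i % 2, (i // 2) % 2) for i in range(200)]
print(cluster_attributes(toy_rows))  # [(0, 1), (2,)]
```

On the toy data the two identical attributes end up in one cluster because encoding them jointly costs about half as many bits as encoding them separately, while merging in the independent attribute only adds model cost. No distance measure between attributes and no user-set threshold is involved: the stopping point follows from the description length alone.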

References

  • Au W, Chan K, Wong A, Wang Y (2005) Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Trans Comput Biol Bioinform 2(2): 83–101

  • Baumgartner C, Böhm C, Baumgartner D (2005) Modelling of classification rules on metabolic patterns including machine learning and expert knowledge. J Biomed Inform 38(2): 89–98

  • Bringmann B, Zimmermann A (2007) The chosen few: on identifying valuable patterns. In: Proceedings of the IEEE international conference on data mining (ICDM’07), IEEE, pp 63–72

  • Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1): 171–206

  • Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 79–88

  • Chandola V, Kumar V (2005) Summarization—compressing data into an informative representation. In: Proceedings of the IEEE international conference on data mining (ICDM’05), IEEE, pp 98–105

  • Coenen F (2003) The LUCS-KDD discretised/normalised ARM and CARM data library. http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html. Accessed October 2010

  • Cover TM, Thomas JA (2006) Elements of information theory, 2nd edn. Wiley, New York

  • Das G, Mannila H, Ronkainen P (1997) Similarity of attributes by external probes. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’97), pp 23–29

  • De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3): 407–446

  • Dhillon I, Mallela S, Kumar R (2003) A divisive information theoretic feature clustering algorithm for text classification. J Mach Learn Res 3: 1265–1287

  • Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed March 2011

  • Garriga GC, Junttila E, Mannila H (2011) Banded structure in binary matrices. Knowl Inf Syst (KAIS) 28(1): 197–226

  • Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3): article 14

  • Goethals B, Zaki MJ (2003) Frequent itemset mining implementations repository (FIMI). http://fimi.ua.ac.be. Accessed October 2010

  • Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge

  • Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1): 55–86

  • Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’09). ACM, New York, pp 379–388

  • Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07). ACM, New York, pp 350–359

  • Heikinheimo H, Vreeken J, Siebes A, Mannila H (2009) Low-entropy set selection. In: Proceedings of the SIAM international conference on data mining (SDM’09). SIAM, New York, pp 569–579

  • Kirkpatrick S (1984) Optimization by simulated annealing: quantitative studies. J Stat Phys 34(5): 975–986

  • Knobbe AJ, Ho EKY (2006) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06). ACM, New York, pp 237–244

  • Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding noisy tiles in binary databases. In: Proceedings of the SIAM international conference on data mining (SDM’10). SIAM, New York, pp 153–164

  • Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, New York

  • Mampaey M, Vreeken J (2010) Summarising data by clustering items. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD’10). Springer, New York, pp 321–336

  • Mampaey M, Tatti N, Vreeken J (2011) Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11). ACM, New York, pp 573–581

  • Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders PH, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London

  • Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55): 7324–7332

  • Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT international conference on database theory, pp 398–416

  • Pensa R, Robardet C, Boulicaut JF (2005) A bi-clustering framework for categorical data. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases (PKDD’05). Springer, New York, pp 643–650

  • Rissanen J (1978) Modeling by shortest data description. Automatica 14(5): 465–471

  • Rissanen J (2007) Information and complexity in statistical modeling. Springer, New York

  • Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27: 379–423

  • Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM international conference on data mining (SDM’06). SIAM, New York, pp 393–404

  • Vanden Bulcke T, Vanden Broucke P, Van Hoof V, Wouters K, Vanden Broucke S, Smits G, Smits E, Proesmans S, Van Genechten T, Eyskens F (2011) Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data. J Biomed Inform 44(2): 319–325

  • Vereshchagin N, Vitányi P (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12): 3265–3290

  • Vreeken J, van Leeuwen M, Siebes A (2007) Preserving privacy through data generation. In: Proceedings of the IEEE international conference on data mining (ICDM’07), IEEE, pp 685–690

  • Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214

  • Wallace C (2005) Statistical and inductive inference by minimum message length. Springer, New York

  • Wang J, Karypis G (2004) SUMMARY: efficiently summarizing transactions for clustering. In: Proceedings of the IEEE international conference on data mining (ICDM’04), IEEE, pp 241–248

  • Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06). ACM, New York, pp 730–735

  • Yan X, Cheng H, Han J, Xin D (2005) Summarizing itemset patterns: a profile-based approach. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05). ACM, New York, pp 314–323

Author information

Corresponding author

Correspondence to Michael Mampaey.

Additional information

Responsible editor: M.J. Zaki.

The research described in this paper builds upon and extends the work appearing in ECML PKDD’10 as Mampaey and Vreeken (2010).

About this article

Cite this article

Mampaey, M., Vreeken, J. Summarizing categorical data by clustering attributes. Data Min Knowl Disc 26, 130–173 (2013). https://doi.org/10.1007/s10618-011-0246-6

  • DOI: https://doi.org/10.1007/s10618-011-0246-6
