Summarizing categorical data by clustering attributes

Mampaey, Michael; Vreeken, Jilles

doi:10.1007/s10618-011-0246-6

Summarizing categorical data by clustering attributes

Published: 26 November 2011

Volume 26, pages 130–173, (2013)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

Michael Mampaey¹ &
Jilles Vreeken¹

962 Accesses
Explore all metrics

Abstract

For a book, its title and abstract provide a good first impression of what to expect from it. For a database, obtaining a good first impression is typically not so straightforward. While low-order statistics only provide very limited insight, downright mining the data rapidly provides too much detail for such a quick glance. In this paper we propose a middle ground, and introduce a parameter-free method for constructing high-quality descriptive summaries of binary and categorical data. Our approach builds a summary by clustering attributes that strongly correlate, and uses the Minimum Description Length principle to identify the best clustering—without requiring a distance measure between attributes. Besides providing a practical overview of which attributes interact most strongly, these summaries can also be used as surrogates for the data, and can easily be queried. Extensive experimentation shows that our method discovers high-quality results: correlated attributes are correctly grouped, which is verified both objectively and subjectively. Our models can also be employed as surrogates for the data; as an example of this we show that we can quickly and accurately query the estimated supports of frequent generalized itemsets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Sequentially Grouping Items into Clusters of Unspecified Number

The Difference and the Norm — Characterising Similarities and Differences Between Databases

What to Read Next? Challenges and Preliminary Results in Selecting Representative Documents

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

References

Au W, Chan K, Wong A, Wang Y (2005) Attribute clustering for grouping, selection, and classification of gene expression data. IEEE/ACM Trans Comput Biol Bioinform 2(2): 83–101
Article Google Scholar
Baumgartner C, Böhm C, Baumgartner D (2005) Modelling of classification rules on metabolic patterns including machine learning and expert knowledge. Biomed Inform 38(2): 89–98
Article Google Scholar
Bringmann B, Zimmermann A (2007) The chosen few: on identifying valuable patterns. In: Proceedings of the IEEE international conference on data mining (ICDM’07), IEEE, pp 63–72
Calders T, Goethals B (2007) Non-derivable itemset mining. Data Min Knowl Discov 14(1): 171–206
Article MathSciNet Google Scholar
Chakrabarti D, Papadimitriou S, Modha DS, Faloutsos C (2004) Fully automatic cross-associations. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 79–88
Chandola V, Kumar V (2005) Summarization—compressing data into an informative representation. In: Proceedings of the IEEE international conference on data mining (ICDM’05), IEEE, pp 98–105
Coenen F (2003) The LUCS-KDD discretised/normalised ARM and CARM data library. http://www.csc.liv.ac.uk/~frans/KDD/Software/LUCS-KDD-DN/DataSets/dataSets.html. Accessed October 2010
Cover TM, Thomas JA (2006) Elements of information theory, 2nd edn. Wiley, New York
MATH Google Scholar
Das G, Mannila H, Ronkainen P (1997) Similarity of attributes by external probes. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’97), pp 23–29
De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3): 407–446
Article MathSciNet MATH Google Scholar
Dhillon I, Mallela S, Kumar R (2003) A divisive information theoretic feature clustering algorithm for text classification. J Mach Learn Res 3: 1265–1287
MATH Google Scholar
Frank A, Asuncion A (2010) UCI machine learning repository. http://archive.ics.uci.edu/ml. Accessed March 2011
Garriga GC, Junttila E, Mannila H (2011) Banded structure in binary matrices. Knowl Inf Syst (KAIS) 28(1): 197–226
Article Google Scholar
Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. Trans Knowl Discov Data 1(3): 1556–4681
Google Scholar
Goethals B, Zaki MJ (2003) Frequent itemset mining implementations repository (FIMI). http://fimi.ua.ac.be. Accessed October 2010
Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
Google Scholar
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1): 55–86
Article MathSciNet Google Scholar
Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’09). ACM, New York, pp 379–388
Heikinheimo H, Hinkkanen E, Mannila H, Mielikäinen T, Seppänen JK (2007) Finding low-entropy sets and trees from binary data. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’07). ACM, New York, pp 350–359
Heikinheimo H, Vreeken J, Siebes A, Mannila H (2009) Low-entropy set selection. In: Proceedings of the SIAM international conference on data mining (SDM’09). SIAM, New York, pp 569–579
Kirkpatrick S (1984) Optimization by simulated annealing: quantitative studies. Stat Phys 34(5): 975–986
Article MathSciNet Google Scholar
Knobbe AJ, Ho EKY (2006) Maximally informative k-itemsets and their efficient discovery. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06). ACM, New York, pp 237–244
Kontonasios KN, De Bie T (2010) An information-theoretic approach to finding noisy tiles in binary databases. In: Proceedings of the SIAM international conference on data mining (SDM’10). SIAM, New York, pp 153–164
Li M, Vitányi P (1993) An introduction to Kolmogorov complexity and its applications. Springer, New York
MATH Google Scholar
Mampaey M, Vreeken J (2010) Summarising data by clustering items. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD’10). Springer, New York, pp 321–336
Mampaey M, Tatti N, Vreeken J (2011) Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11). ACM, New York, pp 573–581
Mitchell-Jones A, Amori G, Bogdanowicz W, Krystufek B, Reijnders PH, Spitzenberger F, Stubbe M, Thissen J, Vohralik V, Zima J (1999) The atlas of European mammals. Academic Press, London
Google Scholar
Myllykangas S, Himberg J, Böhling T, Nagy B, Hollmén J, Knuutila S (2006) DNA copy number amplification profiling of human neoplasms. Oncogene 25(55): 7324–7332
Article Google Scholar
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the ICDT international conference on database theory, pp 398–416
Pensa R, Robardet C, Boulicaut JF (2005) A bi-clustering framework for categorical data. In: Proceedings of the European conference on principles and practice of knowledge discovery in databases (PKDD’05). Springer, New York, pp 643–650
Rissanen J (1978) Modeling by shortest data description. Automatica 14(1): 465–471
Article MATH Google Scholar
Rissanen J (2007) Information and complexity in statistical modeling. Springer, New York
MATH Google Scholar
Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27: 379–423
MathSciNet MATH Google Scholar
Siebes A, Vreeken J, van Leeuwen M (2006) Item sets that compress. In: Proceedings of the SIAM international conference on data mining (SDM’06). SIAM, New York, pp 393–404
Vanden Bulcke T, Vanden Broucke P, Van Hoof V, Wouters K, Vanden Broucke S, Smits G, Smits E, Proesmans S, Van Genechten T, Eyskens F (2011) Data mining methods for classification of Medium-Chain Acyl-CoA dehydrogenase deficiency (MCADD) using non-derivatized tandem MS neonatal screening data. J Biomed Inform 44(2): 319–325
Article Google Scholar
Vereshchagin N, Vitanyi P (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12): 3265–3290
Article MathSciNet Google Scholar
Vreeken J, van Leeuwen M, Siebes A (2007) Preserving privacy through data generation. In: Proceedings of the IEEE international conference on data mining (ICDM’07), IEEE, pp 685–690
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Discov 23(1): 169–214
Article MathSciNet MATH Google Scholar
Wallace C (2005) Statistical and inductive inference by minimum message length. Springer, New York
MATH Google Scholar
Wang J, Karypis G (2004) SUMMARY: efficiently summarizing transactions for clustering. In: Proceedings of the IEEE international conference on data mining (ICDM’04), IEEE, pp 241–248
Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06). ACM, New York, pp 730–735
Yan X, Cheng H, Han J, Xin D (2005) Summarizing itemset patterns: a profile-based approach. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05). ACM, New York, pp 314–323

Download references

Author information

Authors and Affiliations

Advanced Database Research and Modelling, Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
Michael Mampaey & Jilles Vreeken

Authors

Michael Mampaey
View author publications
You can also search for this author inPubMed Google Scholar
Jilles Vreeken
View author publications
You can also search for this author inPubMed Google Scholar

Corresponding author

Correspondence to Michael Mampaey.

Additional information

Responsible editor: M.J. Zaki.

The research described in this paper builds upon and extends the work appearing in ECML PKDD’10 as Mampaey and Vreeken (2010).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mampaey, M., Vreeken, J. Summarizing categorical data by clustering attributes. Data Min Knowl Disc 26, 130–173 (2013). https://doi.org/10.1007/s10618-011-0246-6

Download citation

Received: 27 April 2011
Accepted: 16 November 2011
Published: 26 November 2011
Issue Date: January 2013
DOI: https://doi.org/10.1007/s10618-011-0246-6

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Summarizing categorical data by clustering attributes

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Sequentially Grouping Items into Clusters of Unspecified Number

The Difference and the Norm — Characterising Similarities and Differences Between Databases

What to Read Next? Challenges and Preliminary Results in Selecting Representative Documents

Explore related subjects

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now