Cluster It! Semiautomatic Splitting and Naming of Classification Concepts

Stork, Dominik; Eckert, Kai; Stuckenschmidt, Heiner

doi:10.1007/978-3-319-00035-0_37

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

In this paper, we present a semiautomatic approach that splits overpopulated classification concepts (i.e., classes) into subconcepts and proposes suitable names for the new concepts. Our approach consists of three steps. First, meaningful term clusters are created and presented to the user for further curation and selection of possible new subconcepts; a graph representation and simple tf-idf weighting are used to create the cluster suggestions. Second, the term clusters are used as seeds for the subsequent content-based clustering of the documents using k-means. Finally, the resulting clusters are evaluated based on their correlation with the preselected term clusters, and proper terms for naming the clusters are proposed. We show that this approach efficiently supports the maintainer while avoiding the usual quality problems of fully automatic clustering approaches, especially with respect to the handling of outliers and the determination of the number of target clusters. The documents of the parent concept are directly assigned to the new subconcepts, favoring high precision.
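
The second and third steps can be illustrated with a minimal sketch. It is not the authors' implementation: the graph-based suggestion of term clusters is skipped and the seeds are simply given by hand, the tf-idf variant (smoothed idf, L2 normalisation) is an assumption, and the cosine between a final centroid and its seed is only a stand-in for the correlation measure mentioned in the abstract. All function names and the toy documents are hypothetical.

```python
import math
from collections import Counter


def normalise(vec):
    """Scale a sparse {term: weight} vector to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """One unit-length tf-idf vector per document (smoothed idf; an assumed variant)."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for toks in tokenised for term in set(toks))
    vectors = []
    for toks in tokenised:
        tf = Counter(toks)
        vectors.append(normalise(
            {t: tf[t] * (math.log((1 + n) / (1 + df[t])) + 1.0) for t in tf}))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def seeded_kmeans(vectors, seeds, iterations=10):
    """k-means whose initial centroids come from curated term clusters (seeds)."""
    centroids = [normalise({t: 1.0 for t in seed}) for seed in seeds]
    labels = []
    for _ in range(iterations):
        # Assignment step: each document goes to the most similar centroid.
        labels = [max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Update step: recompute each centroid from its current members.
        for c in range(len(centroids)):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    total.update(v)
            if total:  # keep the seed centroid if the cluster is empty
                centroids[c] = normalise(total)
    return labels, centroids


def propose_names(centroids, top=2):
    """Suggest the highest-weighted centroid terms as candidate concept names."""
    return [sorted(c, key=c.get, reverse=True)[:top] for c in centroids]


def seed_agreement(centroids, seeds):
    """Cosine between each final centroid and its seed (stand-in quality check)."""
    return [cosine(c, normalise({t: 1.0 for t in s})) for c, s in zip(centroids, seeds)]


docs = [
    "engine car wheel engine",
    "car driver race engine",
    "wheel car race",
    "rocket orbit launch",
    "orbit satellite launch rocket",
    "satellite rocket orbit",
]
seeds = [["car", "engine"], ["rocket", "orbit"]]  # hand-curated term clusters

vectors = tfidf_vectors(docs)
labels, centroids = seeded_kmeans(vectors, seeds)
names = propose_names(centroids)
agreement = seed_agreement(centroids, seeds)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the seeds fix both the number of clusters and their starting positions, the user's curation replaces the usual guess at k. A low agreement value for a cluster would suggest that it has drifted away from the term cluster it was seeded with.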

Notes

1. http://people.csail.mit.edu/jrennie/20Newsgroups/

Author information

Authors and Affiliations

KR & KM Research Group, University of Mannheim, Mannheim, Germany
Dominik Stork, Kai Eckert & Heiner Stuckenschmidt

Corresponding author

Correspondence to Dominik Stork.

Editor information

Editors and Affiliations

University of Essex Department of Mathematical Sciences, Colchester, United Kingdom
Berthold Lausen
Ghent University Department of Marketing, Ghent, Belgium
Dirk Van den Poel
University of Marburg Databionics, FB 12, Marburg, Germany
Alfred Ultsch

About this paper

Cite this paper

Stork, D., Eckert, K., Stuckenschmidt, H. (2013). Cluster It! Semiautomatic Splitting and Naming of Classification Concepts. In: Lausen, B., Van den Poel, D., Ultsch, A. (eds) Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_37

DOI: https://doi.org/10.1007/978-3-319-00035-0_37
Published: 16 July 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-00034-3
Online ISBN: 978-3-319-00035-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
