Abstract
In this paper, we present a semiautomatic approach for splitting overpopulated classification concepts (i.e., classes) into subconcepts and for proposing suitable names for the new concepts. Our approach consists of three steps. In the first step, meaningful term clusters are created and presented to the user for further curation and selection of possible new subconcepts; a graph representation and simple tf-idf weighting are used to create the cluster suggestions. In the second step, the term clusters are used as seeds for the subsequent content-based clustering of the documents using k-Means. Finally, the resulting clusters are evaluated based on their correlation with the preselected term clusters, and suitable terms for naming the clusters are proposed. We show that this approach efficiently supports the maintainer while avoiding the usual quality problems of fully automatic clustering approaches, especially with respect to the handling of outliers and the determination of the number of target clusters. The documents of the parent concept are directly assigned to the new subconcepts, favoring high precision.
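The second and third steps can be illustrated with a minimal sketch: curated term clusters serve as initial centroids for k-Means over tf-idf document vectors, and the highest-weighted centroid terms are offered as name candidates. This is not the authors' implementation. The function names, the idf formula, the use of cosine similarity, and the rule that leaves documents without any overlap unassigned (label -1) are all assumptions made for illustration, and the toy documents are invented.

```python
import math
from collections import Counter

def normalize(vec):
    """Scale a sparse vector (dict) to unit length; empty if all-zero."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}

def dot(a, b):
    """Dot product of two sparse vectors (cosine if both are unit length)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())

def tfidf_vectors(docs):
    """One unit-length tf-idf vector per tokenized document.
    Assumed weighting: tf * (log(N / df) + 1)."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    return [
        normalize({t: c * (math.log(n / df[t]) + 1.0)
                   for t, c in Counter(doc).items()})
        for doc in docs
    ]

def seeded_kmeans(vectors, seed_term_clusters, max_iter=20, min_sim=0.0):
    """k-Means whose initial centroids are the user-selected term clusters.
    Documents whose best similarity does not exceed min_sim stay at -1
    (an assumed way to favor precision and set outliers aside)."""
    centroids = [normalize({t: 1.0 for t in terms})
                 for terms in seed_term_clusters]
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for v in vectors:
            best, best_sim = -1, min_sim
            for k, c in enumerate(centroids):
                s = dot(v, c)
                if s > best_sim:
                    best, best_sim = k, s
            new_labels.append(best)
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == k]
            if members:  # keep the old centroid if a cluster is empty
                total = Counter()
                for v in members:
                    total.update(v)
                centroids[k] = normalize(dict(total))
    return labels, centroids

def propose_names(centroids, top_n=2):
    """Name candidates: the top_n highest-weighted terms per centroid."""
    return [[t for t, _ in sorted(c.items(), key=lambda x: -x[1])[:top_n]]
            for c in centroids]

# Invented toy data: five tokenized documents of one parent concept.
docs = [
    ["neural", "network", "training"],
    ["network", "layer", "training", "gradient"],
    ["database", "query", "index"],
    ["query", "index", "transaction"],
    ["cooking", "recipe"],
]
seeds = [["neural", "network"], ["database", "query"]]  # curated term clusters

labels, centroids = seeded_kmeans(tfidf_vectors(docs), seeds)
print(labels)                    # [0, 0, 1, 1, -1]
print(propose_names(centroids))
```

Because the seeds fix both the number of clusters and their starting positions, the sketch sidesteps choosing k automatically, and the document that shares no vocabulary with any seed is left unassigned rather than forced into a subconcept.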
Copyright information
© 2013 Springer International Publishing Switzerland
About this paper
Cite this paper
Stork, D., Eckert, K., Stuckenschmidt, H. (2013). Cluster It! Semiautomatic Splitting and Naming of Classification Concepts. In: Lausen, B., Van den Poel, D., Ultsch, A. (eds) Algorithms from and for Nature and Life. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Cham. https://doi.org/10.1007/978-3-319-00035-0_37
DOI: https://doi.org/10.1007/978-3-319-00035-0_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-00034-3
Online ISBN: 978-3-319-00035-0
eBook Packages: Mathematics and Statistics (R0)