Abstract
In this paper we investigate the automatic construction of conceptual taxonomies and evaluate the achievable results. The hierarchy is built with Ward's algorithm, guided by Goodman-Kruskal τ as the proximity measure. We then provide a concise description of each cluster through a representative keyword selected by PageRank.
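As an illustration of the proximity measure named above, the following minimal sketch computes Goodman-Kruskal τ for two categorical variables as the proportional reduction in prediction error of Y once X is known. The function names `goodman_kruskal_tau` and `tau_dissimilarity`, the presence/absence encoding of keywords across documents, and the symmetrised dissimilarity are assumptions made for this example; the abstract does not state how the measure is turned into a distance for Ward's algorithm.

```python
from collections import Counter


def goodman_kruskal_tau(xs, ys):
    """Return tau(Y|X): the relative reduction in the error of predicting Y when X is known."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    joint_counts = Counter(zip(xs, ys))

    # Prediction error for Y without knowing X: 1 - sum_j p(y_j)^2
    err_y = 1.0 - sum((c / n) ** 2 for c in y_counts.values())
    if err_y == 0.0:
        # Y is constant, so there is no error to reduce (convention chosen for this sketch).
        return 0.0

    # Prediction error for Y knowing X: 1 - sum_i sum_j p(x_i, y_j)^2 / p(x_i)
    err_y_given_x = 1.0 - sum(
        (c / n) ** 2 / (x_counts[x] / n) for (x, _), c in joint_counts.items()
    )
    return (err_y - err_y_given_x) / err_y


def tau_dissimilarity(xs, ys):
    """Hypothetical symmetric dissimilarity: 1 minus the mean of tau(Y|X) and tau(X|Y)."""
    return 1.0 - 0.5 * (goodman_kruskal_tau(xs, ys) + goodman_kruskal_tau(ys, xs))


if __name__ == "__main__":
    # Two keywords encoded as presence (1) / absence (0) over four documents (toy data).
    keyword_a = [1, 1, 0, 0]
    keyword_b = [1, 1, 0, 0]
    keyword_c = [1, 0, 1, 0]
    print(goodman_kruskal_tau(keyword_a, keyword_b))
    print(goodman_kruskal_tau(keyword_a, keyword_c))
```

τ is asymmetric in general, which is why the sketch averages the two directions before using it as a dissimilarity; other symmetrisations would be equally plausible.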
The resulting hierarchy has the same advantages, both descriptive and operative, as keyword indices that partition a set of documents according to their content.
We performed experiments on a real case: the abstracts of the papers published in ACM TODS, which have been manually classified into the ACM Computing Taxonomy (CT). We evaluated the generated hierarchy (KH) objectively by two methods, the Jaccard measure and entropy, and obtained good results with both. Finally, we evaluated how easily documents can be classified into the categories of the two taxonomies, showing that classification is easier with KH than with CT.
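The two evaluation measures mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' evaluation protocol: the abstract does not say which sets are compared or how entropy is aggregated, so the function names `jaccard` and `cluster_entropy`, the use of document-id sets, and the base-2 logarithm are assumptions.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard measure |A ∩ B| / |A ∪ B| between two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets are treated as identical (convention chosen for this sketch).
        return 1.0
    return len(a & b) / len(union)


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the reference-class labels found inside one cluster."""
    n = len(labels)
    if n == 0:
        raise ValueError("labels must be non-empty")
    counts = Counter(labels)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


if __name__ == "__main__":
    # Toy data: documents assigned to a generated cluster vs. a manual category.
    generated_cluster = {"doc1", "doc2", "doc3"}
    manual_category = {"doc2", "doc3", "doc4"}
    print(jaccard(generated_cluster, manual_category))

    # Manual category labels of the documents that fall in one generated cluster.
    print(cluster_entropy(["H.2", "H.2", "H.3", "H.3"]))
```

A Jaccard value near 1 indicates that a generated cluster and a manual category cover nearly the same documents, while an entropy near 0 indicates that a cluster is dominated by a single manual category.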
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ienco, D., Meo, R. (2008). Towards the Automatic Construction of Conceptual Taxonomies. In: Song, IY., Eder, J., Nguyen, T.M. (eds) Data Warehousing and Knowledge Discovery. DaWaK 2008. Lecture Notes in Computer Science, vol 5182. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85836-2_31
DOI: https://doi.org/10.1007/978-3-540-85836-2_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85835-5
Online ISBN: 978-3-540-85836-2
eBook Packages: Computer Science (R0)