Abstract
We describe a semantic clustering method designed to address shortcomings in the common bag-of-words document representation for functional semantic classification tasks. The method uses WordNet-based distance metrics to construct a similarity matrix, and expectation maximization to find and represent clusters of semantically related terms. Using these clusters as features for machine learning helps maintain performance across distinct, domain-specific vocabularies while reducing the size of the document representation. We present promising results along these lines, and evaluate several algorithms and parameters that influence machine learning performance. We discuss limitations of the study and future work for optimizing and evaluating the method.
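The pipeline summarized in the abstract (taxonomy-based similarity matrix, expectation maximization over that matrix, clusters used as document features) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy `PARENT` taxonomy stands in for WordNet, a Wu-Palmer-style measure stands in for whichever WordNet metrics the authors evaluated, and the EM step is a simplified Gaussian mixture with a fixed shared variance and hand-picked initial means.

```python
import math
from collections import Counter

# Hypothetical toy taxonomy (child -> parent), a stand-in for WordNet hypernym links.
PARENT = {
    "entity": None,
    "artifact": "entity", "person": "entity",
    "painting": "artifact", "sculpture": "artifact", "fresco": "painting",
    "artist": "person", "painter": "artist", "sculptor": "artist",
}

def path_to_root(term):
    """Return the chain [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Depth in the taxonomy, counting the root as depth 1."""
    return len(path_to_root(term))

def wu_palmer(a, b):
    """Wu-Palmer-style similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking upward from a, the first shared node is the lowest common subsumer.
    lcs = next(t for t in path_to_root(a) if t in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def em_cluster(rows, init_means, sigma2=0.05, iters=20):
    """Simplified EM for a spherical Gaussian mixture with fixed variance sigma2.

    Only the component means and mixing weights are re-estimated.
    Returns the most probable component index for each row.
    """
    k, dims = len(init_means), len(rows[0])
    means = [list(m) for m in init_means]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibilities, computed in log space for stability.
        resp = []
        for x in rows:
            logp = [
                math.log(weights[j])
                - sum((xi - mi) ** 2 for xi, mi in zip(x, means[j])) / (2 * sigma2)
                for j in range(k)
            ]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: update mixing weights and means from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(rows)
            means[j] = [
                sum(r[j] * x[d] for r, x in zip(resp, rows)) / nj
                for d in range(dims)
            ]
    return [max(range(k), key=lambda j: r[j]) for r in resp]

def cluster_features(tokens, term_to_cluster):
    """Map a bag of words to counts over cluster ids, skipping unknown tokens."""
    return Counter(term_to_cluster[t] for t in tokens if t in term_to_cluster)

terms = ["painting", "sculpture", "fresco", "painter", "sculptor", "artist"]
# Each term is represented by its row of similarities to every other term.
similarity = [[wu_palmer(a, b) for b in terms] for a in terms]
# Assumed initialization: seed one component on "painting", one on "painter".
labels = em_cluster(similarity, [similarity[0], similarity[3]])
term_to_cluster = dict(zip(terms, labels))

doc = ["the", "fresco", "painter", "painting"]
features = cluster_features(doc, term_to_cluster)
print(term_to_cluster)
print(features)
```

In this toy setting a document is reduced from one feature per vocabulary term to one feature per cluster, which is the size reduction the abstract refers to. A real replication would need actual WordNet lookups, word sense handling, and a principled choice of the number of clusters, none of which this sketch attempts.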
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lippincott, T., Passonneau, R. (2009). Semantic Clustering for a Functional Text Classification Task. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_41
DOI: https://doi.org/10.1007/978-3-642-00382-0_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science (R0)