Semantic Clustering for a Functional Text Classification Task

Lippincott, Thomas; Passonneau, Rebecca

doi:10.1007/978-3-642-00382-0_41

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

We describe a semantic clustering method designed to address shortcomings in the common bag-of-words document representation for functional semantic classification tasks. The method uses WordNet-based distance metrics to construct a similarity matrix, and expectation maximization to find and represent clusters of semantically-related terms. Using these clusters as features for machine learning helps maintain performance across distinct, domain-specific vocabularies while reducing the size of the document representation. We present promising results along these lines, and evaluate several algorithms and parameters that influence machine learning performance. We discuss limitations of the study and future work for optimizing and evaluating the method.
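The pipeline outlined in the abstract (taxonomy-based similarity, a similarity matrix, EM clustering, clusters as features) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the chapter: the toy `PARENT` taxonomy stands in for WordNet, a Wu-Palmer style measure stands in for whichever distance metrics the authors evaluated, and the EM step fits a simple fixed-variance spherical Gaussian mixture to the rows of the similarity matrix. The function names, `sigma2`, and the iteration count are likewise hypothetical.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
PARENT = {
    "animal": "entity", "vehicle": "entity",
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "truck": "vehicle",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def depth(node):
    """Depth in the taxonomy, counting the root as depth 1."""
    return len(path_to_root(node))

def wu_palmer(a, b):
    """Wu-Palmer style similarity: 2*depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # Walking upward from a, the first shared node is the deepest common ancestor.
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

def similarity_matrix(terms):
    """Pairwise term similarities; row i describes term i."""
    return [[wu_palmer(a, b) for b in terms] for a in terms]

def em_cluster(rows, k, sigma2=0.1, iters=20):
    """EM for a spherical Gaussian mixture with fixed variance (an assumed model).

    Returns the most probable cluster index for each row.
    """
    n, dims = len(rows), len(rows[0])
    # Deterministic initialisation: spread the initial means over the rows.
    means = [list(rows[(j * (n - 1)) // max(k - 1, 1)]) for j in range(k)]
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: responsibility of each cluster for each row.
        resp = []
        for x in rows:
            scores = []
            for w, m in zip(weights, means):
                sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, m))
                scores.append(w * math.exp(-sq_dist / (2 * sigma2)))
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate means and mixing weights from responsibilities.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, rows)) / mass
                        for d in range(dims)]
            weights[j] = mass / n
    return [max(range(k), key=lambda j: r[j]) for r in resp]

def cluster_features(tokens, term_to_cluster, k):
    """Replace bag-of-words counts with one count per semantic cluster."""
    counts = [0] * k
    for tok in tokens:
        if tok in term_to_cluster:
            counts[term_to_cluster[tok]] += 1
    return counts

terms = ["dog", "cat", "car", "truck"]
labels = em_cluster(similarity_matrix(terms), k=2)
term_to_cluster = dict(zip(terms, labels))
features = cluster_features(["dog", "cat", "truck", "the"], term_to_cluster, k=2)
```

On this toy input the two animal terms share one cluster and the two vehicle terms share the other, so a document is represented by two cluster counts instead of one count per vocabulary item. That is the kind of reduction in representation size the abstract refers to, although the real method operates on WordNet and on a much larger vocabulary.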

Author information

Authors and Affiliations

Department of Computer Science Center for Computational Learning Systems, Columbia University, New York, NY, USA
Thomas Lippincott & Rebecca Passonneau

Editor information

Editors and Affiliations

National Polytechnic Institute, Center for Computing Research, 07738, Mexico City, Mexico
Alexander Gelbukh

About this paper

Cite this paper

Lippincott, T., Passonneau, R. (2009). Semantic Clustering for a Functional Text Classification Task. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_41

DOI: https://doi.org/10.1007/978-3-642-00382-0_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00381-3
Online ISBN: 978-3-642-00382-0
eBook Packages: Computer Science, Computer Science (R0)
