Semantic Clustering for a Functional Text Classification Task

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2009)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5449)

Abstract

We describe a semantic clustering method designed to address shortcomings in the common bag-of-words document representation for functional semantic classification tasks. The method uses WordNet-based distance metrics to construct a similarity matrix, and expectation maximization to find and represent clusters of semantically-related terms. Using these clusters as features for machine learning helps maintain performance across distinct, domain-specific vocabularies while reducing the size of the document representation. We present promising results along these lines, and evaluate several algorithms and parameters that influence machine learning performance. We discuss limitations of the study and future work for optimizing and evaluating the method.
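The pipeline outlined in the abstract (taxonomy-based similarity matrix, EM clustering of terms, clusters used as document features) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy taxonomy `PARENT` stands in for WordNet, Wu-Palmer similarity is just one possible WordNet-style metric, and the EM step fits a mixture of spherical Gaussians with a fixed variance rather than whatever configuration the authors used. All function and variable names are hypothetical.

```python
import math

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet hypernym links.
PARENT = {
    "entity": None,
    "artifact": "entity", "person": "entity",
    "painting": "artifact", "sculpture": "artifact", "fresco": "painting",
    "artist": "person", "painter": "artist", "sculptor": "artist",
}

def path_to_root(term):
    """Return [term, parent, ..., root]."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def wu_palmer(a, b):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(a) + depth(b)), root depth = 1."""
    pa, pb = path_to_root(a), path_to_root(b)
    lcs = next(t for t in pa if t in pb)  # lowest common ancestor
    return 2.0 * len(path_to_root(lcs)) / (len(pa) + len(pb))

def em_cluster(rows, k, iters=50, var=0.05):
    """EM for k spherical Gaussians with fixed variance (an assumed simplification)."""
    step = max(len(rows) // k, 1)
    means = [list(rows[i * step]) for i in range(k)]  # deterministic initialisation
    weights = [1.0 / k] * k
    resp = []
    for _ in range(iters):
        # E-step: posterior responsibility of each cluster for each term.
        resp = []
        for x in rows:
            logp = [math.log(weights[j])
                    - sum((xi - mi) ** 2 for xi, mi in zip(x, means[j])) / (2 * var)
                    for j in range(k)]
            top = max(logp)
            p = [math.exp(v - top) for v in logp]
            total = sum(p)
            resp.append([v / total for v in p])
        # M-step: re-estimate mixing weights and means from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12  # guard against an empty cluster
            weights[j] = nj / len(rows)
            means[j] = [sum(r[j] * x[d] for r, x in zip(resp, rows)) / nj
                        for d in range(len(rows[0]))]
    return resp

def cluster_features(tokens, assignment, k):
    """Represent a document as counts of its terms per cluster."""
    feats = [0] * k
    for tok in tokens:
        if tok in assignment:
            feats[assignment[tok]] += 1
    return feats

TERMS = ["painting", "sculpture", "fresco", "artist", "painter", "sculptor"]
K = 2
# Each term is represented by its row of similarities to every other term.
SIM = [[wu_palmer(a, b) for b in TERMS] for a in TERMS]
RESP = em_cluster(SIM, K)
ASSIGN = {t: max(range(K), key=lambda j: r[j]) for t, r in zip(TERMS, RESP)}

doc = "the painter finished the fresco and a second painting".split()
FEATS = cluster_features(doc, ASSIGN, K)
print(ASSIGN)
print(FEATS)
```

In this sketch a document is reduced from one feature per vocabulary term to one feature per cluster, which is the sense in which cluster features shrink the representation; two documents that use different terms from the same cluster receive the same feature.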

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lippincott, T., Passonneau, R. (2009). Semantic Clustering for a Functional Text Classification Task. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2009. Lecture Notes in Computer Science, vol 5449. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00382-0_41

  • DOI: https://doi.org/10.1007/978-3-642-00382-0_41

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-00381-3

  • Online ISBN: 978-3-642-00382-0

  • eBook Packages: Computer Science, Computer Science (R0)
