Finding the Optimal Number of Clusters for Word Sense Disambiguation

  • Conference paper
Text, Speech and Dialogue (TSD 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6836)

Abstract

Ambiguity is an inherent problem in many Natural Language Processing tasks. Unsupervised and semi-supervised approaches to ambiguity resolution are appealing because they lower the cost of manual labour. Typically, however, these methods struggle to estimate the number of senses without supervision. This paper investigates stopping functions applied to clustering algorithms as a way of estimating the number of senses. The experiments were performed for Polish and English. We found that estimation based on the PK2 stopping function gives encouraging results, but only when coarse-grained sense distinctions are used.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Broda, B., Kędzia, P. (2011). Finding the Optimal Number of Clusters for Word Sense Disambiguation. In: Habernal, I., Matoušek, V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol 6836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23538-2_49

  • DOI: https://doi.org/10.1007/978-3-642-23538-2_49

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23537-5

  • Online ISBN: 978-3-642-23538-2

  • eBook Packages: Computer Science, Computer Science (R0)
