Finding the Optimal Number of Clusters for Word Sense Disambiguation

Broda, Bartosz; Kędzia, Paweł

doi:10.1007/978-3-642-23538-2_49

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6836)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Ambiguity is an inherent problem for many tasks in Natural Language Processing. Unsupervised and semi-supervised approaches to ambiguity resolution are appealing because they lower the cost of manual labour. Typically, however, these methods struggle to estimate the number of senses without supervision. This paper investigates stopping functions applied to clustering algorithms as a means of estimating the number of senses. The experiments were performed for Polish and English. We found that estimation based on the PK2 stopping function gives encouraging results, but only when coarse-grained distinctions between senses are used.
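
The abstract does not define PK2, so the following is only a minimal sketch of how a ratio-based stopping function of this kind might be applied. It assumes the form usually attributed to Pedersen and Kulkarni: PK2(k) is the ratio of the clustering criterion value at k clusters to the value at k - 1 clusters, and the chosen k is the one whose score lies closest to, but above, 1 plus the standard deviation of all scores. The function names `pk2_scores` and `pk2_select`, the fallback to a single cluster, and the sample criterion values are hypothetical and not taken from the chapter.

```python
from statistics import pstdev


def pk2_scores(crfun):
    """Return {k: crfun(k) / crfun(k - 1)} for k = 2..K.

    crfun[i] is assumed to hold the (positive, maximised) clustering
    criterion value obtained with i + 1 clusters.
    """
    if len(crfun) < 2:
        raise ValueError("need criterion values for at least k = 1 and k = 2")
    return {k: crfun[k - 1] / crfun[k - 2] for k in range(2, len(crfun) + 1)}


def pk2_select(crfun):
    """Pick k whose PK2 score is closest to, but above, 1 + std of the scores.

    Assumption: if no score exceeds the threshold (e.g. a flat criterion),
    fall back to a single cluster.
    """
    scores = pk2_scores(crfun)
    threshold = 1.0 + pstdev(list(scores.values()))
    above = {k: s for k, s in scores.items() if s > threshold}
    if not above:
        return 1
    # Smallest distance above the threshold wins; ties go to the smaller k.
    return min(above, key=lambda k: (above[k] - threshold, k))


# Made-up criterion values for k = 1..5 (illustration only).
values = [10.0, 16.0, 20.0, 21.0, 21.5]
print(pk2_select(values))  # 3
```

In a word sense setting, each candidate k would correspond to a hypothesised number of senses for the target word, and the criterion values would come from clustering its occurrence contexts once per k.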

Author information

Authors and Affiliations

Institute of Informatics, Wrocław University of Technology, Poland
Bartosz Broda & Paweł Kędzia

Editor information

Editors and Affiliations

Department of Computer Sciences, University of West Bohemia, Univerzitní 22, 306 14, Pilsen, Czech Republic
Ivan Habernal
Faculty of Applied Sciences, Dept. of Computer Science and Engineering, University of West Bohemia, Univerzitni 8, 306 14, Pilsen, Czech Republic
Václav Matoušek

About this paper

Cite this paper

Broda, B., Kędzia, P. (2011). Finding the Optimal Number of Clusters for Word Sense Disambiguation. In: Habernal, I., Matoušek, V. (eds) Text, Speech and Dialogue. TSD 2011. Lecture Notes in Computer Science, vol 6836. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23538-2_49

DOI: https://doi.org/10.1007/978-3-642-23538-2_49
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23537-5
Online ISBN: 978-3-642-23538-2
eBook Packages: Computer Science, Computer Science (R0)
