
Multiplicity and word sense: evaluating and learning from multiply labeled word sense annotations

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

Supervised machine learning methods to model word sense often rely on human labelers to provide a single, ground truth label for each word in its context. We examine issues in establishing ground truth word sense labels using a fine-grained sense inventory from WordNet. Our data consist of a corpus of 1,000 sentences: 100 for each of ten moderately polysemous words. Each word occurrence was given multiple sense labels—a multilabel—by trained and untrained annotators. The multilabels give a nuanced representation of the degree of agreement on instances. A suite of assessment metrics is used to analyze the sets of multilabels, such as comparisons of sense distributions across annotators. Our assessment indicates that the general annotation procedure is reliable, but that words differ in how reliably annotators can assign WordNet sense labels to them, independent of the number of senses. We also investigate the performance of an unsupervised machine learning method that infers ground truth labels from various combinations of labels from the trained and untrained annotators. We find tentative support for the hypothesis that performance depends on the quality of the set of multilabels, independent of the number of labelers or their training.
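The abstract mentions comparisons of sense distributions across annotators. As a purely illustrative sketch (not code from the paper; the sense inventory and label counts below are invented), the following computes the Jensen–Shannon divergence between two annotators' sense distributions for one word:

```python
"""Illustrative sketch: Jensen-Shannon divergence between two annotators'
sense distributions for one word. All labels and counts are invented."""
from collections import Counter
from math import log2

def distribution(labels, senses):
    # Relative frequency of each sense in one annotator's labels.
    counts = Counter(labels)
    return [counts[s] / len(labels) for s in senses]

def kl(p, q):
    # Kullback-Leibler divergence, skipping zero-probability terms of p.
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    # Jensen-Shannon divergence: symmetric and, with log base 2, bounded by 1.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

senses = ["sense_1", "sense_2", "sense_3"]              # hypothetical senses
annotator_a = ["sense_1"] * 60 + ["sense_2"] * 30 + ["sense_3"] * 10
annotator_b = ["sense_1"] * 45 + ["sense_2"] * 45 + ["sense_3"] * 10
print(jsd(distribution(annotator_a, senses), distribution(annotator_b, senses)))
```

Identical distributions give 0 and disjoint distributions give 1, so low values indicate that two annotators apply the senses with similar overall frequencies, even if they disagree on individual instances.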


Notes

  1. http://www.anc.org.

  2. See downloads link at http://www.anc.org/MASC/Home.html.

  3. One annotator dropped out during the round.

  4. The remaining 50 were those used in round 2.1, and are not discussed further here.

  5. Anveshan is available at http://vikas-bhardwaj.com/tools/Anveshan.zip.

  6. To compute α, we use Ron Artstein’s Perl script, available at http://ron.artstein.org/resources/calculate-alpha.perl. (A small illustrative computation of α appears after these notes.)

  7. Perfect disagreement can arise for two annotators on binary labels: on every instance, the two annotators select different values. Square brackets represent an interval that includes the endpoints; a parenthesis indicates the endpoint is not included in the interval.

  8. Note that the annotation tool allowed annotators to expand the context before and after a sentence, to determine whether a larger context clarified which sense to choose. Also note that we preserve the annotator ids that appear in the data releases.

  9. Due to ties in the data, the p value computation is not exact.

  10. In the interest of space, we present full Leverage, \(\overline{JSD}\), and KLD′ values across trained annotators for only two of the eight words (Tables 2a–3a).

  11. Due to lack of resources and time, we could not include all round 2.2 words.

  12. GLAD is available from http://mplab.ucsd.edu/~jake. (A simplified label-aggregation sketch, not GLAD itself, appears after these notes.)

  13. Note that the learning performance results for AMT subsets avg are averages over fifty iterations.

  14. This table reports averages for Leverage, JSD and KLD′. Note that the assessment results for AMT subsets avg are averages of averages over fifty iterations.

  15. Relatively new venues include the Linguistic Annotation Workshops (LAW) and the Resources/Evaluation track added to recent annual meetings of the Association for Computational Linguistics.
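Notes 6 and 7 concern Krippendorff's α. The following is a minimal illustrative computation of α for nominal labels (a sketch, not Artstein's script; it assumes every item carries at least two labels). The last line reproduces note 7's perfect-disagreement case on binary labels, for which α approaches −1:

```python
"""Illustrative sketch: Krippendorff's alpha for nominal labels,
assuming every item is labeled by at least two annotators."""
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(items):
    """items: one list of labels per item (unit)."""
    coincidences = Counter()              # o_ck: ordered value pairs within a unit
    for labels in items:
        m = len(labels)
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1 / (m - 1)
    n_total = sum(coincidences.values())  # total number of pairable values
    marginals = Counter()
    for (c, _k), count in coincidences.items():
        marginals[c] += count
    observed = sum(v for (c, k), v in coincidences.items() if c != k)
    expected = sum(marginals[c] * marginals[k]
                   for c in marginals for k in marginals if c != k)
    return 1 - (n_total - 1) * observed / expected

# Perfect disagreement on binary labels (note 7): with 100 items, alpha = -0.99.
print(krippendorff_alpha_nominal([["0", "1"]] * 100))
```

In general, α is 1 for perfect agreement and 0 for chance-level agreement; for two annotators who disagree on every one of N binary-labeled items, the value is −1 + 1/N, hence close to −1 for large N.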
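Note 12 refers to GLAD, the unsupervised model used to infer ground truth labels from multiple annotators. The sketch below is not GLAD (which additionally models item difficulty) but a much simpler "one-coin" EM aggregator in the spirit of Dawid and Skene, shown only to illustrate how a latent label can be inferred by weighting annotators according to estimated reliability; all data structures and names here are invented:

```python
"""Illustrative sketch: a simplified 'one-coin' EM label aggregator (not GLAD)."""
from collections import Counter

def aggregate(labels, senses, iterations=20, eps=1e-6):
    """labels: one dict per item mapping annotator -> sense. Returns inferred senses."""
    n_items, k = len(labels), len(senses)
    # Initialize each item's posterior over senses from its (smoothed) vote shares.
    post = []
    for item in labels:
        votes = Counter(item.values())
        post.append({s: (votes[s] + 0.01) / (len(item) + 0.01 * k) for s in senses})
    annotators = sorted({a for item in labels for a in item})
    for _ in range(iterations):
        # M-step: estimate each annotator's accuracy and the overall sense prior.
        acc = {}
        for a in annotators:
            pairs = [(post[i], item[a]) for i, item in enumerate(labels) if a in item]
            acc[a] = min(max(sum(p[lab] for p, lab in pairs) / len(pairs), eps), 1 - eps)
        prior = {s: sum(p[s] for p in post) / n_items for s in senses}
        # E-step: recompute each item's posterior given accuracies and the prior.
        for i, item in enumerate(labels):
            scores = {}
            for s in senses:
                score = prior[s]
                for a, lab in item.items():
                    score *= acc[a] if lab == s else (1 - acc[a]) / (k - 1)
                scores[s] = score
            z = sum(scores.values()) or 1.0
            post[i] = {s: v / z for s, v in scores.items()}
    return [max(p, key=p.get) for p in post]

# Tiny invented example: three annotators, two items, three senses.
senses = ["sense_1", "sense_2", "sense_3"]
labels = [{"a1": "sense_1", "a2": "sense_1", "a3": "sense_2"},
          {"a1": "sense_3", "a2": "sense_3", "a3": "sense_2"}]
print(aggregate(labels, senses))      # -> ['sense_1', 'sense_3']
```

Majority voting, the full Dawid and Skene model, or GLAD itself could be substituted here; the shared idea is that the inferred label weights each annotator by how reliable the model estimates that annotator to be.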



Acknowledgments

This work was supported by NSF award CRI-0708952, including a supplement to fund co-author Vikas Bhardwaj as a Graduate Research Assistant for one semester. The authors thank the annotators for their excellent work and thoughtful comments on sense inventories. We thank Bob Carpenter for discussions about data from multiple annotators, and for his generous and insightful comments on drafts of the paper. Finally, we thank the anonymous reviewers who provided deep and thoughtful critiques, as well as very careful proofreading.

Author information

Corresponding author

Correspondence to Rebecca J. Passonneau.


Cite this article

Passonneau, R.J., Bhardwaj, V., Salleb-Aouissi, A. et al. Multiplicity and word sense: evaluating and learning from multiply labeled word sense annotations. Lang Resources & Evaluation 46, 219–252 (2012). https://doi.org/10.1007/s10579-012-9188-x
