Abstract
We present methods for evaluating human and automatic taggers that extend current practice in three ways. First, we show how to evaluate taggers that assign multiple tags to each test instance, even if they do not assign probabilities. Second, we show how to accommodate a common property of manually constructed “gold standards” that are typically used for objective evaluation, namely that there is often more than one correct answer. Third, we show how to measure performance when the set of possible tags is tree-structured in an IS-A hierarchy. To illustrate how our methods can be used to measure inter-annotator agreement, we show how to compute the kappa coefficient over hierarchical tag sets.
Cite this article
Melamed, I.D., Resnik, P. Tagger Evaluation Given Hierarchical Tag Sets. Computers and the Humanities 34, 79–84 (2000). https://doi.org/10.1023/A:1002402902356