Evaluation Metrics

Chapter in: Anaphora Resolution

Abstract

This chapter discusses how to evaluate anaphora or coreference resolution systems. The problem is non-trivial in that it involves a multitude of sub-problems, such as: (1) What is the evaluation unit, entities or links? If entities, is entity alignment needed? If links, how are single-mention entities handled? (2) How should one deal with the fact that the response mention set may differ from the key mention set? We will review the prevailing metrics proposed over the last two decades, including MUC, B-cubed, CEAF and BLANC. We will give illustrative examples showing how they are computed, and the scenarios in which they are intended to be used. We will present their strengths and weaknesses, and clarify some misunderstandings of these metrics found in the recent literature.
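As one concrete illustration of the kind of computation the chapter walks through, the B-cubed metric of Bagga and Baldwin [2] averages a per-mention precision and recall over all mentions. A minimal sketch, assuming both partitions cover the same (gold) mention set and representing each entity as a set of mention ids (function name is ours, for illustration):

```python
def b_cubed(key, response):
    """B-cubed precision, recall and F1 (Bagga & Baldwin, 1998).

    key, response: lists of sets of mention ids (entity partitions).
    Assumes both partitions cover the same mention set, i.e. the
    response uses gold mentions.
    """
    # map each mention to the entity (set of mentions) containing it
    key_of = {m: e for e in key for m in e}
    resp_of = {m: e for e in response for m in e}
    mentions = list(key_of)
    # per-mention precision: overlap of the two entities / response entity size
    p = sum(len(resp_of[m] & key_of[m]) / len(resp_of[m]) for m in mentions) / len(mentions)
    # per-mention recall: overlap / key entity size
    r = sum(len(resp_of[m] & key_of[m]) / len(key_of[m]) for m in mentions) / len(mentions)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

For example, with key entities {1, 2, 3}, {4, 5} and response entities {1, 2}, {3, 4, 5}, both precision and recall come out to 11/15.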


Notes

  1. http://conll.github.io/reference-coreference-scorers

  2. Links for computing MUC-F form the minimum set of links needed to connect the mentions within an entity; an entity with n mentions therefore contributes n − 1 links. This contrasts with how links are counted in BLANC, where all pairs of mentions within an entity are counted.
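The contrast between the two counting schemes can be sketched as follows (function names are ours, for illustration; the input is a list of entity sizes):

```python
def muc_links(entity_sizes):
    # MUC: minimum spanning links, n - 1 per entity
    # (singleton entities contribute no links)
    return sum(n - 1 for n in entity_sizes)

def blanc_links(entity_sizes):
    # BLANC: all within-entity mention pairs, n * (n - 1) / 2 per entity
    return sum(n * (n - 1) // 2 for n in entity_sizes)
```

For entities of sizes 4, 1 and 3, MUC counts 3 + 0 + 2 = 5 links, while BLANC counts 6 + 0 + 3 = 9 coreferent pairs.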

  3. We use the same symbols ϕ3(⋅) and ϕ4(⋅) as in [8].
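For reference, [8] defines the entity similarity ϕ4(K, R) = 2|K ∩ R| / (|K| + |R|), and entity-based CEAF scores the best one-to-one alignment of key and response entities under this similarity. A toy sketch, using brute-force alignment in place of the Kuhn-Munkres algorithm of [12] and with function names of our own choosing:

```python
from itertools import permutations

def phi4(k, r):
    # phi_4 entity similarity from Luo (2005): 2|K ∩ R| / (|K| + |R|)
    return 2 * len(k & r) / (len(k) + len(r))

def ceaf_e(key, response):
    """Entity-based CEAF via brute-force one-to-one alignment.

    key, response: lists of sets of mention ids. Permutations stand in
    for the Kuhn-Munkres algorithm; fine for tiny illustrative inputs.
    """
    nk, nr = len(key), len(response)
    if nk >= nr:
        best = max(sum(phi4(k, r) for k, r in zip(perm, response))
                   for perm in permutations(key, nr))
    else:
        best = max(sum(phi4(k, r) for k, r in zip(key, perm))
                   for perm in permutations(response, nk))
    recall = best / nk        # phi4(K, K) = 1 for every key entity
    precision = best / nr     # phi4(R, R) = 1 for every response entity
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f
```

On the key {1, 2, 3}, {4, 5} against the response {1, 2}, {3, 4, 5}, the best alignment pairs {1, 2, 3} with {1, 2} and {4, 5} with {3, 4, 5}, giving precision = recall = 0.8.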

References

  1. Amigó, E., Gonzalo, J., Artiles, J., Verdejo, F.: A comparison of extrinsic clustering evaluation metrics based on formal constraints. Inf. Retr. J. (2008). http://link.springer.com/journal/10791

  2. Bagga, A., Baldwin, B.: Algorithms for scoring coreference chains. In: Proceedings of the Linguistic Coreference Workshop at The First International Conference on Language Resources and Evaluation (LREC’98), Granada, pp. 563–566 (1998)

  3. Balas, E., Miller, D., Pekny, J., Toth, P.: A parallel shortest augmenting path algorithm for the assignment problem. J. ACM (JACM) 38 (4), 985–1007 (1991)

  4. Bourgeois, F., Lassalle, J.C.: An extension of the Munkres algorithm for the assignment problem to rectangular matrices. Commun. ACM 14 (12), 802–804 (1971)

  5. Cai, J., Strube, M.: Evaluation metrics for end-to-end coreference resolution systems. In: Proceedings of SIGDIAL, Tokyo, pp. 28–36 (2010)

  6. Gupta, A., Ying, L.: Algorithms for finding maximum matchings in bipartite graphs. Technical report, RC 21576 (97320), IBM T.J. Watson Research Center (1999)

  7. Kuhn, H.: The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 2 (83), 83–97 (1955)

  8. Luo, X.: On coreference resolution performance metrics. In: Proceedings of Human Language Technology (HLT)/Empirical Methods in Natural Language Processing (EMNLP), Vancouver (2005)

  9. Luo, X., Pradhan, S., Recasens, M., Hovy, E.: An extension of BLANC to system mentions. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 2, Short Papers, pp. 24–29. Association for Computational Linguistics, Baltimore (2014). http://www.aclweb.org/anthology/P14-2005

  10. MUC-6: Proceedings of the Sixth Message Understanding Conference (MUC-6). Morgan Kaufmann, San Francisco (1995)

  11. MUC-7: Proceedings of the Seventh Message Understanding Conference (MUC-7), Fairfax (1998)

  12. Munkres, J.: Algorithms for the assignment and transportation problems. J. SIAM 5, 32–38 (1957)

  13. NIST: The ACE evaluation plan. www.nist.gov/speech/tests/ace/index.htm (2003)

  14. NIST: ACE 2005 evaluation. www.nist.gov/speech/tests/ace/ace05/index.htm (2005)

  15. NIST: ACE 2008 evaluation. http://www.itl.nist.gov/iad/mig//tests/ace/2008 (2008)

  16. Pradhan, S., Luo, X., Recasens, M., Hovy, E., Ng, V., Strube, M.: Scoring coreference partitions of predicted mentions: a reference implementation. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 2, Short Papers, pp. 30–35. Association for Computational Linguistics, Baltimore (2014). http://www.aclweb.org/anthology/P14-2006

  17. Rahman, A., Ng, V.: Supervised models for coreference resolution. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, pp. 968–977. Association for Computational Linguistics, Singapore (2009). http://www.aclweb.org/anthology/D/D09/D09-1101

  18. Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66 (336), 846–850 (1971)

  19. Recasens, M., Hovy, E.: BLANC: implementing the Rand index for coreference evaluation. Nat. Lang. Eng. 17, 485–510 (2011). doi:10.1017/S135132491000029X. http://journals.cambridge.org/article_S135132491000029X

  20. Stoyanov, V., Gilbert, N., Cardie, C., Riloff, E.: Conundrums in noun phrase coreference resolution: making sense of the state-of-the-art. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL’09, vol. 2, pp. 656–664. Association for Computational Linguistics, Stroudsburg (2009). http://dl.acm.org/citation.cfm?id=1690219.1690238

  21. Vilain, M., Burger, J., Aberdeen, J., Connolly, D., Hirschman, L.: A model-theoretic coreference scoring scheme. In: Proceedings of MUC6, Columbia, pp. 45–52 (1995)

Author information

Correspondence to Xiaoqiang Luo.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Luo, X., Pradhan, S. (2016). Evaluation Metrics. In: Poesio, M., Stuckardt, R., Versley, Y. (eds) Anaphora Resolution. Theory and Applications of Natural Language Processing. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-47909-4_5

  • DOI: https://doi.org/10.1007/978-3-662-47909-4_5

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-47908-7

  • Online ISBN: 978-3-662-47909-4
