Comparison of Methods to Annotate Named Entity Corpora

Published: 21 July 2018

Abstract

The authors compared two methods for annotating a corpus for the named entity (NE) recognition task using non-expert annotators: (i) revising the output of an existing NE recognizer (semi-automatic annotation) and (ii) annotating the NEs fully manually. Annotation time, inter-annotator agreement, and performance against a gold standard were evaluated. Because two annotators processed each text under each method, two performance measures were computed: the average performance of the two annotators and the performance when at least one annotator is correct. The experiments reveal that semi-automatic annotation is faster, achieves better agreement, and performs better on average. However, they also indicate that fully manual annotation should be preferred for texts whose document types differ substantially from those of the recognizer's training data. In addition, machine learning experiments using the semi-automatically and fully manually annotated corpora as training data indicate that for some texts, the F-measures can be higher when manual rather than semi-automatic annotation is used. Finally, experiments using the annotated corpora as additional training data show that (i) NE recognition performance does not always correspond to the quality of the NE tag annotation and (ii) the system trained with the manually annotated corpus outperforms the system trained with the semi-automatically annotated corpus on newswires, even though the existing NE recognizer was mainly trained on newswires.
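The two evaluation settings described above (average performance of the two annotators, and performance when at least one annotator is correct) can be sketched with entity-level F-measure over gold-standard spans. This is a minimal illustrative sketch, not the paper's actual evaluation code: the span representation `(start, end, ne_type)` and the toy data are assumptions.

```python
# Hedged sketch of entity-level evaluation against a gold standard.
# An NE annotation is modeled as a set of (start, end, ne_type) spans;
# this representation and the toy data below are illustrative assumptions.

def f_measure(gold, predicted):
    """Entity-level F1: a prediction counts only on an exact span+type match."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 5, "PERSON"), (10, 18, "LOCATION"), (22, 30, "ORG")}
annotator_a = {(0, 5, "PERSON"), (10, 18, "LOCATION")}
annotator_b = {(0, 5, "PERSON"), (22, 30, "ORG"), (35, 40, "DATE")}

# Setting 1: average performance of the two annotators.
avg_f1 = (f_measure(gold, annotator_a) + f_measure(gold, annotator_b)) / 2

# Setting 2: "at least one annotator correct" — a gold entity is credited
# if either annotator produced it (sketched here as recall over the union).
union_recall = len(gold & (annotator_a | annotator_b)) / len(gold)
```

In this toy example annotator A misses one entity but adds nothing spurious, while annotator B both misses and over-generates, so the union scoring credits all three gold entities even though neither annotator alone is perfect.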



Published in

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 17, Issue 4
December 2018, 193 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3229525

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

        Publication History

        • Published: 21 July 2018
        • Accepted: 1 May 2018
        • Revised: 1 March 2018
        • Received: 1 October 2017
Published in TALLIP Volume 17, Issue 4


        Qualifiers

        • research-article
        • Research
        • Refereed
