Abstract
The authors compared two methods of annotating a corpus for the named entity (NE) recognition task using non-expert annotators: (i) revising the output of an existing NE recognizer (semi-automatic annotation) and (ii) annotating the NEs fully manually. Annotation time, inter-annotator agreement, and performance against a gold standard were evaluated. Because each text was annotated by two annotators per method, two performance figures were computed: the average performance of the two annotators and the performance when at least one annotator was correct. The experiments reveal that semi-automatic annotation is faster, achieves higher agreement, and performs better on average. However, they also indicate that fully manual annotation should sometimes be used for texts whose document types differ substantially from those of the training data. In addition, machine learning experiments using the semi-automatically and fully manually annotated corpora as training data indicate that, for some texts, the F-measures could be higher when manual rather than semi-automatic annotation was used. Finally, experiments using the annotated corpora as additional training data show that (i) NE recognition performance does not always correspond to the quality of the NE tag annotation and (ii) the system trained with the manually annotated corpus outperforms the system trained with the semi-automatically annotated corpus on newswires, even though the existing NE recognizer was trained mainly on newswires.
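The two evaluation measures mentioned above can be made concrete with a small sketch. This is not the paper's evaluation code; it is a minimal illustration, assuming exact-match entity scoring with entities modeled as (start, end, type) tuples, and it treats "at least one annotator is correct" as the fraction of gold entities produced by either annotator (one plausible reading of that measure).

```python
# Hedged sketch (not from the paper): scoring a pair of annotators
# against a gold standard, per the abstract's two measures.
# Entities are (start, end, type) tuples; exact-match comparison.

def f1(pred, gold):
    """Exact-match F-measure between an annotator's entity set and the gold set."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def average_performance(ann_a, ann_b, gold):
    """Measure 1: mean F-measure of the two annotators."""
    return (f1(ann_a, gold) + f1(ann_b, gold)) / 2

def at_least_one_correct(ann_a, ann_b, gold):
    """Measure 2 (one interpretation): fraction of gold entities
    that at least one of the two annotators produced."""
    found = set(gold) & (set(ann_a) | set(ann_b))
    return len(found) / len(gold) if gold else 0.0

# Toy example: three gold entities, two imperfect annotators.
gold  = [(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "ORG")]
ann_a = [(0, 2, "PER"), (5, 7, "ORG")]   # one type error, one miss
ann_b = [(0, 2, "PER"), (5, 7, "LOC")]   # one miss

print(round(average_performance(ann_a, ann_b, gold), 3))
print(round(at_least_one_correct(ann_a, ann_b, gold), 3))
```

The point of the second measure is that it bounds what the pair could achieve with perfect adjudication: here the union recovers two of the three gold entities even though neither annotator alone does.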