Abstract
The authors compared two methods of annotating a corpus for the named entity (NE) recognition task using non-expert annotators: (i) revising the output of an existing NE recognizer (semi-automatic annotation) and (ii) annotating the NEs fully manually. Annotation time, inter-annotator agreement, and performance against a gold standard were evaluated. Because each text was annotated by two annotators per method, two performance figures were computed: the average performance of the two annotators and the performance when at least one annotator was correct. The experiments reveal that semi-automatic annotation is faster, achieves higher agreement, and performs better on average. However, they also indicate that fully manual annotation should sometimes be used for texts whose document types differ substantially from those of the training data. In addition, machine learning experiments using the semi-automatically and fully manually annotated corpora as training data indicate that, for some texts, the F-measures could be higher when manual rather than semi-automatic annotation was used. Finally, experiments using the annotated corpora as additional training data show that (i) NE recognition performance does not always correspond to the quality of the NE tag annotation and (ii) the system trained with the manually annotated corpus outperforms the system trained with the semi-automatically annotated corpus on newswires, even though the existing NE recognizer was trained mainly on newswires.
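The two evaluation measures mentioned above can be made concrete with a small sketch. This is not the paper's evaluation code; it is a minimal illustration, assuming exact-match entity scoring with entities modeled as (start, end, type) tuples, and it treats "at least one annotator is correct" as the fraction of gold entities produced by either annotator (one plausible reading of that measure).

```python
# Hedged sketch (not from the paper): scoring a pair of annotators
# against a gold standard, per the abstract's two measures.
# Entities are (start, end, type) tuples; exact-match comparison.

def f1(pred, gold):
    """Exact-match F-measure between an annotator's entity set and the gold set."""
    pred, gold = set(pred), set(gold)
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def average_performance(ann_a, ann_b, gold):
    """Measure 1: mean F-measure of the two annotators."""
    return (f1(ann_a, gold) + f1(ann_b, gold)) / 2

def at_least_one_correct(ann_a, ann_b, gold):
    """Measure 2 (one interpretation): fraction of gold entities
    that at least one of the two annotators produced."""
    found = set(gold) & (set(ann_a) | set(ann_b))
    return len(found) / len(gold) if gold else 0.0

# Toy example: three gold entities, two imperfect annotators.
gold  = [(0, 2, "PER"), (5, 7, "LOC"), (9, 11, "ORG")]
ann_a = [(0, 2, "PER"), (5, 7, "ORG")]   # one type error, one miss
ann_b = [(0, 2, "PER"), (5, 7, "LOC")]   # one miss

print(round(average_performance(ann_a, ann_b, gold), 3))
print(round(at_least_one_correct(ann_a, ann_b, gold), 3))
```

The point of the second measure is that it bounds what the pair could achieve with perfect adjudication: here the union recovers two of the three gold entities even though neither annotator alone does.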