Abstract
We describe a multifaceted approach to named entity recognition that can be deployed with minimal data resources and only a few hours of non-expert annotation. We show how this approach was applied in the 2016 LoReHLT evaluation and demonstrate that both its statistical and rule-based components contribute to performance. We also demonstrate, across many languages, the value of selecting which sentences to annotate when training on small amounts of data.


Notes
IL3_dictionary.xml; LDC-provided.
xinjiang_places.pdf with link to Wikipedia; LDC-provided.
link in CategoryII_list.pdf; LDC-provided.
parallel_grammar.pdf; LDC-provided.
This is the numerical stability parameter typically used in AdaGrad implementations.
The word shape feature collapsed all consecutive letters in a name to a single letter to attempt to identify punctuation patterns. For example, the name Bob would have the shape a, while @Bob would have the shape @a.
Arabic and Mandarin were also provided, but we exclude them from our experiments here due to data processing issues. Yoruba is excluded because it had too little data for meaningful experiments. Hausa is excluded because its data did not annotate the gpe type. LDC catalog numbers were 2014E115, 2015E70, and 2016E{29,87,91,93,95,97,99,103}.
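The word shape feature described in the notes above can be illustrated with a minimal Python sketch. The function name `word_shape` is hypothetical, and the handling of non-letter characters (copied through unchanged) is an assumption inferred from the @Bob example; the feature extractor actually used in the system may differ in its details.

```python
def word_shape(token: str) -> str:
    """Collapse each run of consecutive letters in `token` to a single 'a'.

    Assumption: every non-letter character (punctuation, digits, symbols)
    is kept as-is, so that punctuation patterns remain visible.
    """
    shape = []
    in_letter_run = False
    for ch in token:
        if ch.isalpha():
            # Emit one placeholder letter per run, regardless of case or script.
            if not in_letter_run:
                shape.append("a")
            in_letter_run = True
        else:
            # Non-letters pass through unchanged and end the current run.
            shape.append(ch)
            in_letter_run = False
    return "".join(shape)
```

Under these assumptions, `word_shape("Bob")` returns `a` and `word_shape("@Bob")` returns `@a`, matching the two examples given in the note. Because `str.isalpha` is Unicode-aware, the same collapsing applies to names written in non-Latin scripts.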
Acknowledgements
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0113. The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. (Approved for Public Release by DARPA on Aug 29, 2017 (DISTAR Approval #28392), Distribution Unlimited)
Additional information
All work described in this article was performed at Raytheon BBN Technologies. Authors Freedman, Gabbard, Lignos, and Weischedel are currently affiliated with the University of Southern California Information Sciences Institute, 4676 Admiralty Way, Suite 1001, Marina del Rey, 90292, USA.
Cite this article
Gabbard, R., DeYoung, J., Lignos, C. et al. Combining rule-based and statistical mechanisms for low-resource named entity recognition. Machine Translation 32, 31–43 (2018). https://doi.org/10.1007/s10590-017-9208-0
DOI: https://doi.org/10.1007/s10590-017-9208-0