Abstract
Named entity recognition (NER) systems are often built with supervised methods such as CRF and LSTM-CRF. However, supervised methods typically require large amounts of training data, and in low-resource languages annotated data is often hard to obtain. The annotation projection method obtains annotated data automatically from high-resource languages, but the data obtained this way contains a lot of noise. In this paper, we propose a new data selection schema to select the high-quality sentences from the projected annotated data. The schema computes a score for each sentence from the number of entity tags occurring in it and the minimum score among those entity tags. The selected sentences can be used as auxiliary annotated data for low-resource languages. Experiments show that our data selection schema outperforms previous methods.
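The selection idea described above can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so everything concrete here is an assumption made for illustration: the `(token, tag, score)` layout, the function names `sentence_stats` and `select_sentences`, and the use of two simple thresholds (`min_entity_tags`, `min_tag_score`) in place of the paper's actual sentence score.

```python
from typing import List, Tuple

# Hypothetical layout: (token, projected tag, score of that tag).
Token = Tuple[str, str, float]


def sentence_stats(sentence: List[Token]) -> Tuple[int, float]:
    """Return (number of entity tags, minimum entity-tag score).

    Tokens tagged "O" are not entities and are ignored.
    A sentence without entity tags yields (0, 0.0).
    """
    entity_scores = [score for _, tag, score in sentence if tag != "O"]
    if not entity_scores:
        return 0, 0.0
    return len(entity_scores), min(entity_scores)


def select_sentences(
    sentences: List[List[Token]],
    min_entity_tags: int = 1,    # assumed threshold, not from the paper
    min_tag_score: float = 0.5,  # assumed threshold, not from the paper
) -> List[List[Token]]:
    """Keep sentences with enough entity tags whose weakest tag is still reliable."""
    selected = []
    for sentence in sentences:
        count, lowest = sentence_stats(sentence)
        if count >= min_entity_tags and lowest >= min_tag_score:
            selected.append(sentence)
    return selected


if __name__ == "__main__":
    projected = [
        [("John", "B-PER", 0.9), ("lives", "O", 0.99), ("in", "O", 0.99), ("Paris", "B-LOC", 0.8)],
        [("Acme", "B-ORG", 0.3), ("grew", "O", 0.9)],
        [("It", "O", 0.9), ("rained", "O", 0.9)],
    ]
    for kept in select_sentences(projected):
        print(" ".join(token for token, _, _ in kept))
```

Using the minimum rather than the average means a single unreliable entity tag is enough to reject a sentence, which fits the stated goal of filtering projection noise. How the paper combines the two quantities into one score is not stated in the abstract.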
The work is supported by both National scientific and Technological Innovation Zero (No. 17-H863-01-ZT-005-005-01) and State’s Key Project of Research and Development Plan (No. 2016QY03D0505).
Notes
- 1.
We consider English as a high-resource language. Although Chinese is not truly a low-resource language, we simulate a low-resource setting by limiting the size of the training data, similar to [14].
- 2.
- 3.
Before using the data from annotation projection in Chinese, we convert the word-level tag schema to a character-level tag schema. For example, the word ‘
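The word-to-character tag conversion mentioned in note 3 might look like the following sketch. It assumes a BIO tagging scheme (`B-X`, `I-X`, `O`), which the note does not specify, and the function name `word_tags_to_char_tags` and the example words are hypothetical, chosen for illustration only.

```python
from typing import List, Tuple


def word_tags_to_char_tags(
    words: List[str], tags: List[str]
) -> Tuple[List[str], List[str]]:
    """Expand word-level BIO tags to one tag per character.

    Assumed rule: the first character of a word tagged B-X keeps B-X,
    every other character of an entity word gets I-X, and "O" is copied
    to each character.
    """
    if len(words) != len(tags):
        raise ValueError("words and tags must have the same length")

    chars: List[str] = []
    char_tags: List[str] = []
    for word, tag in zip(words, tags):
        for i, ch in enumerate(word):
            chars.append(ch)
            if tag == "O":
                char_tags.append("O")
            elif tag.startswith("B-") and i == 0:
                char_tags.append(tag)
            else:
                # Later characters of a B- word, and all characters of an I- word.
                char_tags.append("I-" + tag[2:])
    return chars, char_tags


if __name__ == "__main__":
    chars, char_tags = word_tags_to_char_tags(["北京", "很", "大"], ["B-LOC", "O", "O"])
    for ch, tag in zip(chars, char_tags):
        print(ch, tag)
```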
References
Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (2005). https://aclweb.org/anthology/H05-1091
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
Dong, C., Zhang, J., Zong, C., Hattori, M., Di, H.: Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In: Lin, C.-Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) ICCPOL/NLPCC -2016. LNCS (LNAI), vol. 10102, pp. 239–250. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50496-4_20
Ehrmann, M., Turchi, M.: Building multilingual named entity annotated corpora exploiting parallel corpora. In: Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) (2010)
Kim, S., Jeong, M., Lee, J., Lee, G.G.: A cross-lingual annotation projection approach for relation detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 564–571. Coling 2010 Organizing Committee (2010). https://www.aclweb.org/anthology/C10-1064
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. Computer Science (2014)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1030, https://www.aclweb.org/anthology/N16-1030
Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117. Association for Computational Linguistics (2006). https://www.aclweb.org/anthology/W06-0115
Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1470–1480. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1135, https://www.aclweb.org/anthology/P17-1135
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1) (2003). https://www.aclweb.org/anthology/J03-1002
Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (2003). https://www.aclweb.org/anthology/W03-0419
Täckström, O., McDonald, R., Uszkoreit, J.: Cross-lingual word clusters for direct transfer of linguistic structure. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 477–487. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/N12-1052
Wang, M., Manning, C.D.: Cross-lingual projected expectation regularization for weakly supervised learning. Trans. Assoc. Comput. Linguist. 2, 55–66 (2014). https://www.aclweb.org/anthology/Q14-1005
Yang, Z., Salakhutdinov, R., Cohen, W.W.: Transfer learning for sequence tagging with hierarchical recurrent networks (2016)
Yao, X., Van Durme, B.: Information extraction over structured data: question answering with Freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 956–966. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/P14-1090, https://aclweb.org/anthology/P14-1090
Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the First International Conference on Human Language Technology Research (2001). https://www.aclweb.org/anthology/H01-1035
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, Y., Liao, M., Lv, P., Zheng, C. (2023). An Empirical Data Selection Schema in Annotation Projection Approach. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_2
DOI: https://doi.org/10.1007/978-3-031-24340-0_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0
eBook Packages: Computer Science, Computer Science (R0)