An Empirical Data Selection Schema in Annotation Projection Approach

Hu, Yun; Liao, Mingxue; Lv, Pin; Zheng, Changwen

doi:10.1007/978-3-031-24340-0_2


Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13452)

Included in the following conference series:

International Conference on Computational Linguistics and Intelligent Text Processing


Abstract

Named entity recognition (NER) systems are often built with supervised methods such as CRF and LSTM-CRF. However, supervised methods typically require large amounts of training data, and in low-resource languages annotated data is often hard to obtain. The annotation projection method obtains annotated data automatically from high-resource languages, but the data obtained this way contains a lot of noise. In this paper, we propose a new data selection schema that selects the high-quality sentences from the projected data. The schema computes a score for each sentence from the occurrence counts of its entity-tags and the minimum entity-tag score in the sentence. The selected sentences can be used as auxiliary annotated data for low-resource languages. Experiments show that our data selection schema outperforms previous methods.
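The abstract does not give the scoring formula, so the following is only a minimal sketch of one way such a selection step could look. Everything in it is an assumption made for illustration: the function names, the representation of a sentence as a list of (entity, tag) pairs, the definition of an entity-tag score as a relative frequency, and the `min_count` and `threshold` parameters. It should not be read as the authors' actual method.

```python
from collections import Counter


def entity_tag_statistics(sentences):
    """Count (entity, tag) pairs and derive a score for each pair.

    Assumption: the score of a pair is its relative frequency among all
    tags projected onto the same entity string in the corpus.
    """
    pair_counts = Counter()
    entity_counts = Counter()
    for entities in sentences:
        for entity, tag in entities:
            pair_counts[(entity, tag)] += 1
            entity_counts[entity] += 1
    pair_scores = {
        pair: count / entity_counts[pair[0]]
        for pair, count in pair_counts.items()
    }
    return pair_counts, pair_scores


def sentence_score(entities, pair_counts, pair_scores, min_count=2):
    """Score a sentence by its weakest entity-tag pair.

    Assumption: pairs seen fewer than `min_count` times are treated as
    unreliable and contribute a score of 0; sentences without entities
    also score 0.
    """
    if not entities:
        return 0.0
    scores = []
    for pair in entities:
        if pair_counts[pair] < min_count:
            scores.append(0.0)
        else:
            scores.append(pair_scores.get(pair, 0.0))
    return min(scores)


def select_sentences(sentences, threshold=0.6, min_count=2):
    """Return the indices of sentences whose score reaches `threshold`."""
    pair_counts, pair_scores = entity_tag_statistics(sentences)
    return [
        index
        for index, entities in enumerate(sentences)
        if sentence_score(entities, pair_counts, pair_scores, min_count) >= threshold
    ]


# Hypothetical projected corpus: each sentence is reduced to its entity-tag pairs.
corpus = [
    [("Beijing", "LOC")],
    [("Beijing", "LOC"), ("Apple", "ORG")],
    [("Beijing", "LOC")],
    [("Beijing", "PER")],  # a likely projection error
    [("Apple", "ORG")],
]
print(select_sentences(corpus))
```

Taking the minimum rather than the mean means a single doubtful entity-tag is enough to reject a sentence, which matches the abstract's emphasis on the minimum score; the occurrence threshold is one possible way to account for how often an entity-tag was observed.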

The work is supported by both National scientific and Technological Innovation Zero (No. 17-H863-01-ZT-005-005-01) and State’s Key Project of Research and Development Plan (No. 2016QY03D0505).


Notes

1.
We treat English as the high-resource language. Although Chinese is not truly a low-resource language, we simulate a low-resource setting by limiting the size of the training data, similar to [14].
2.
http://nlp.nju.edu.cn/cwmt-wmt/.
3.
Before using the data from annotation projection in Chinese, we change the word tag schema to character tag schema. For example, the word ‘
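The conversion from word-level to character-level tags mentioned in note 3 can be sketched as follows. This is an illustrative sketch only: it assumes a BIO tagging scheme, which the note does not specify, and the function name and example words are hypothetical rather than taken from the paper.

```python
def word_tags_to_char_tags(words, tags):
    """Expand word-level BIO tags to character-level BIO tags.

    Assumption: tags are "O" or "B-X" / "I-X". The first character of a
    "B-X" word keeps "B-X"; every other character of an entity word
    receives "I-X".
    """
    char_tags = []
    for word, tag in zip(words, tags):
        if tag == "O":
            char_tags.extend((char, "O") for char in word)
            continue
        prefix, label = tag.split("-", 1)
        for position, char in enumerate(word):
            if position == 0 and prefix == "B":
                char_tags.append((char, "B-" + label))
            else:
                char_tags.append((char, "I-" + label))
    return char_tags


# Hypothetical segmented sentence with word-level tags.
words = ["中国", "银行", "在", "北京"]
tags = ["B-ORG", "I-ORG", "O", "B-LOC"]
for char, tag in word_tags_to_char_tags(words, tags):
    print(char, tag)
```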

References

Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (2005). https://aclweb.org/anthology/H05-1091
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)
Dong, C., Zhang, J., Zong, C., Hattori, M., Di, H.: Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In: Lin, C.-Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) ICCPOL/NLPCC-2016. LNCS (LNAI), vol. 10102, pp. 239–250. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50496-4_20
Ehrmann, M., Turchi, M.: Building multilingual named entity annotated corpora exploiting parallel corpora. In: Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) (2010)
Kim, S., Jeong, M., Lee, J., Lee, G.G.: A cross-lingual annotation projection approach for relation detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 564–571. Coling 2010 Organizing Committee (2010). https://www.aclweb.org/anthology/C10-1064
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. Computer Science (2014)
Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1030, https://www.aclweb.org/anthology/N16-1030
Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117. Association for Computational Linguistics (2006). https://www.aclweb.org/anthology/W06-0115
Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1470–1480. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1135, https://www.aclweb.org/anthology/P17-1135
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1) (2003). https://www.aclweb.org/anthology/J03-1002
Sang, E.F.T.K., Meulder, F.D.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (2003). https://www.aclweb.org/anthology/W03-0419
Täckström, O., McDonald, R., Uszkoreit, J.: Cross-lingual word clusters for direct transfer of linguistic structure. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 477–487. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/N12-1052
Wang, M., Manning, C.D.: Cross-lingual projected expectation regularization for weakly supervised learning. Trans. Assoc. Comput. Linguist. 2, 55–66 (2014). https://www.aclweb.org/anthology/Q14-1005
Yang, Z., Salakhutdinov, R., Cohen, W.W.: Transfer learning for sequence tagging with hierarchical recurrent networks (2016)
Yao, X., Van Durme, B.: Information extraction over structured data: question answering with Freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 956–966. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/P14-1090, https://aclweb.org/anthology/P14-1090
Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the First International Conference on Human Language Technology Research (2001). https://www.aclweb.org/anthology/H01-1035


Author information

Authors and Affiliations

University of Chinese Academy of Sciences, Beijing, China
Yun Hu
Institute of Software, Chinese Academy of Sciences, Beijing, China
Yun Hu, Mingxue Liao, Pin Lv & Changwen Zheng


Corresponding author

Correspondence to Yun Hu.

Editor information

Editors and Affiliations

Instituto Politécnico Nacional, Mexico City, Mexico
Alexander Gelbukh


About this paper

Cite this paper

Hu, Y., Liao, M., Lv, P., Zheng, C. (2023). An Empirical Data Selection Schema in Annotation Projection Approach. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_2


DOI: https://doi.org/10.1007/978-3-031-24340-0_2
Published: 26 February 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0
eBook Packages: Computer Science, Computer Science (R0)
