An Empirical Data Selection Schema in Annotation Projection Approach

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13452)

Abstract

Named entity recognition (NER) systems are often built with supervised methods such as CRF and LSTM-CRF. However, supervised methods typically require large amounts of training data, and in low-resource languages annotated data is often hard to obtain. The annotation projection method obtains annotated data automatically from high-resource languages, but the resulting data contains a lot of noise. In this paper, we propose a new data selection schema that selects the high-quality sentences from the projected annotated data. The schema computes a score for each sentence from the occurrence counts of its entity-tags and the minimum score among the entity-tags in the sentence. The selected sentences can be used as auxiliary annotated data in low-resource languages. Experiments show that our data selection schema outperforms previous methods.
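The scoring idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything below is an assumption made for illustration: each entity-tag pair is scored by its relative frequency among all projected tags for the same entity text, a sentence is scored by the minimum of its entity-tag scores, and sentences are kept when that score reaches a hypothetical `threshold`. The function names and the treatment of entity-free sentences are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One projected entity: (entity text, projected tag), e.g. ("London", "LOC").
Entity = Tuple[str, str]


def entity_tag_scores(corpus: List[List[Entity]]) -> Dict[Entity, float]:
    """Score each (text, tag) pair by its share of all occurrences of text.

    Assumption: relative frequency stands in for the paper's entity-tag score.
    """
    pair_counts: Counter = Counter()
    text_counts: Counter = Counter()
    for sentence in corpus:
        for text, tag in sentence:
            pair_counts[(text, tag)] += 1
            text_counts[text] += 1
    return {pair: n / text_counts[pair[0]] for pair, n in pair_counts.items()}


def sentence_score(sentence: List[Entity], scores: Dict[Entity, float]) -> float:
    """Sentence score = minimum entity-tag score among its entities.

    Assumption: sentences without entities get 0.0 and are never selected.
    """
    if not sentence:
        return 0.0
    return min(scores[pair] for pair in sentence)


def select_sentences(corpus: List[List[Entity]], threshold: float = 0.6) -> List[int]:
    """Return indices of sentences whose score is at least the threshold."""
    scores = entity_tag_scores(corpus)
    return [
        i for i, sentence in enumerate(corpus)
        if sentence_score(sentence, scores) >= threshold
    ]
```

Taking the minimum means a single unreliable projected entity is enough to reject a sentence, which matches the goal of keeping only high-quality sentences; a mean would let one confident entity mask a noisy one.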

The work is supported by both National scientific and Technological Innovation Zero (No. 17-H863-01-ZT-005-005-01) and State’s Key Project of Research and Development Plan (No. 2016QY03D0505).

Notes

  1. We consider English a high-resource language. Although Chinese is not truly a low-resource language, we simulate a low-resource environment by limiting the size of the training data, similar to [14].

  2. http://nlp.nju.edu.cn/cwmt-wmt/.

  3. Before using the data from annotation projection in Chinese, we convert the word-level tag schema to a character-level tag schema. For example, the word ‘
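The word-to-character conversion mentioned in note 3 can be sketched as follows. This is a minimal illustration, assuming BIO-style tags (the tag scheme is not stated in this excerpt): the first character of a word keeps the word's tag and the remaining characters receive the matching `I-` tag. The function name and the example words are hypothetical and are not the example from the paper.

```python
from typing import List, Tuple


def word_tags_to_char_tags(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Expand word-level BIO tags to character-level BIO tags.

    Assumption: tags are "O", "B-<label>", or "I-<label>".
    """
    char_tags: List[Tuple[str, str]] = []
    for word, tag in zip(words, tags):
        if not word:
            continue  # skip empty tokens
        if tag == "O":
            char_tags.extend((ch, "O") for ch in word)
            continue
        label = tag.split("-", 1)[1]
        # The first character keeps the word's tag (B- or I-).
        char_tags.append((word[0], tag))
        # Every later character continues the same entity.
        char_tags.extend((ch, "I-" + label) for ch in word[1:])
    return char_tags
```

For instance, with the hypothetical input words `["北京", "大学"]` and tags `["B-ORG", "I-ORG"]`, the four characters are tagged `B-ORG`, `I-ORG`, `I-ORG`, `I-ORG`.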

References

  1. Bunescu, R., Mooney, R.: A shortest path dependency kernel for relation extraction. In: Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (2005). https://aclweb.org/anthology/H05-1091

  2. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12(Aug), 2493–2537 (2011)

  3. Dong, C., Zhang, J., Zong, C., Hattori, M., Di, H.: Character-based LSTM-CRF with radical-level features for Chinese named entity recognition. In: Lin, C.-Y., Xue, N., Zhao, D., Huang, X., Feng, Y. (eds.) ICCPOL/NLPCC 2016. LNCS (LNAI), vol. 10102, pp. 239–250. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-50496-4_20

  4. Ehrmann, M., Turchi, M.: Building multilingual named entity annotated corpora exploiting parallel corpora. In: Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC) (2010)

  5. Kim, S., Jeong, M., Lee, J., Lee, G.G.: A cross-lingual annotation projection approach for relation detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), pp. 564–571. Coling 2010 Organizing Committee (2010). https://www.aclweb.org/anthology/C10-1064

  6. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  7. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Eighteenth International Conference on Machine Learning, pp. 282–289 (2001)

  8. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C.: Neural architectures for named entity recognition. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 260–270. Association for Computational Linguistics (2016). https://doi.org/10.18653/v1/N16-1030, https://www.aclweb.org/anthology/N16-1030

  9. Levow, G.A.: The third international Chinese language processing bakeoff: word segmentation and named entity recognition. In: Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing, pp. 108–117. Association for Computational Linguistics (2006). https://www.aclweb.org/anthology/W06-0115

  10. Ni, J., Dinu, G., Florian, R.: Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 1470–1480. Association for Computational Linguistics (2017). https://doi.org/10.18653/v1/P17-1135, https://www.aclweb.org/anthology/P17-1135

  11. Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1) (2003). https://www.aclweb.org/anthology/J03-1002

  12. Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003 (2003). https://www.aclweb.org/anthology/W03-0419

  13. Täckström, O., McDonald, R., Uszkoreit, J.: Cross-lingual word clusters for direct transfer of linguistic structure. In: Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 477–487. Association for Computational Linguistics (2012). https://www.aclweb.org/anthology/N12-1052

  14. Wang, M., Manning, C.D.: Cross-lingual projected expectation regularization for weakly supervised learning. Trans. Assoc. Comput. Linguist. 2, 55–66 (2014). https://www.aclweb.org/anthology/Q14-1005

  15. Yang, Z., Salakhutdinov, R., Cohen, W.W.: Transfer learning for sequence tagging with hierarchical recurrent networks (2016)

  16. Yao, X., Van Durme, B.: Information extraction over structured data: question answering with Freebase. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, pp. 956–966. Association for Computational Linguistics (2014). https://doi.org/10.3115/v1/P14-1090, https://aclweb.org/anthology/P14-1090

  17. Yarowsky, D., Ngai, G., Wicentowski, R.: Inducing multilingual text analysis tools via robust projection across aligned corpora. In: Proceedings of the First International Conference on Human Language Technology Research (2001). https://www.aclweb.org/anthology/H01-1035

Author information

Corresponding author

Correspondence to Yun Hu.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Hu, Y., Liao, M., Lv, P., Zheng, C. (2023). An Empirical Data Selection Schema in Annotation Projection Approach. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_2

  • DOI: https://doi.org/10.1007/978-3-031-24340-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24339-4

  • Online ISBN: 978-3-031-24340-0

  • eBook Packages: Computer Science, Computer Science (R0)
