Semi-supervised Learning for Mongolian Morphological Segmentation

  • Conference paper
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data (NLP-NABD 2016, CCL 2016)

Abstract

Unlike previous Mongolian morphological segmentation methods, which rely on large labeled training data or complicated rules compiled by linguists, we explore a novel semi-supervised method aimed at a practical application, statistical machine translation (SMT), in a low-resource learning setting where a small amount of labeled data and a large amount of unlabeled data are available. First, CRF-based supervised learning is used to predict morpheme boundaries from the small labeled data set. Then, a lexicon-based segmentation model, using the small labeled data as heuristic information, compensates for the weaknesses of the first step by exploiting the abundant unlabeled data. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves segmentation results compared with purely supervised learning. In addition, we integrate the morphological segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
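The first step, predicting morpheme boundaries with a CRF, is commonly framed as character-level sequence labeling. The sketch below illustrates that framing only; it is not the authors' implementation. The two-tag scheme (`B` starts a morpheme, `M` continues it), the feature template, the function names, and the English toy word are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List, Tuple


def to_labels(morphemes: List[str]) -> Tuple[str, List[str]]:
    """Convert a gold segmentation into one label per character.

    'B' marks the first character of a morpheme, 'M' any other character.
    (Assumed tag set; the paper's actual label scheme is not given here.)
    """
    labels: List[str] = []
    kept = [m for m in morphemes if m]  # ignore empty morphemes
    for m in kept:
        labels.extend(["B"] + ["M"] * (len(m) - 1))
    return "".join(kept), labels


def from_labels(word: str, labels: List[str]) -> List[str]:
    """Rebuild morphemes from per-character labels (inverse of to_labels)."""
    morphemes: List[str] = []
    for ch, lab in zip(word, labels):
        if lab == "B" or not morphemes:
            morphemes.append(ch)   # open a new morpheme
        else:
            morphemes[-1] += ch    # extend the current morpheme
    return morphemes


def char_features(word: str, i: int, window: int = 2) -> Dict[str, str]:
    """Hypothetical feature template: the character and its neighbors.

    A CRF toolkit would consume one such dict per character position.
    """
    feats = {"char": word[i]}
    for k in range(1, window + 1):
        feats[f"prev{k}"] = word[i - k] if i - k >= 0 else "<s>"
        feats[f"next{k}"] = word[i + k] if i + k < len(word) else "</s>"
    return feats


# Toy example (English, purely illustrative; not Mongolian data).
word, labels = to_labels(["walk", "ing"])
features = [char_features(word, i) for i in range(len(word))]
```

Per-character feature dicts paired with the label sequence are the usual training input for CRF toolkits; at prediction time, `from_labels` turns the predicted tags back into a segmentation that later stages (such as a lexicon-based model or error correction) could revise.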

Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grants No. 61572462 and No. 61502445, and by the National Key Technology R&D Program under Grant No. 2014BAD10B03.

Author information

Corresponding author

Correspondence to Miao Li.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Yang, Z., Li, M., Chen, L., Zeng, W., Gao, Y., Fu, S. (2016). Semi-supervised Learning for Mongolian Morphological Segmentation. In: Sun, M., Huang, X., Lin, H., Liu, Z., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2016, CCL 2016. Lecture Notes in Computer Science, vol 10035. Springer, Cham. https://doi.org/10.1007/978-3-319-47674-2_13

  • DOI: https://doi.org/10.1007/978-3-319-47674-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-47673-5

  • Online ISBN: 978-3-319-47674-2

  • eBook Packages: Computer Science (R0)
