Semi-supervised Learning for Mongolian Morphological Segmentation

Yang, Zhenxin; Li, Miao; Chen, Lei; Zeng, Weihui; Gao, Yi; Fu, Sha

doi:10.1007/978-3-319-47674-2_13


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10035)


Abstract

Unlike previous Mongolian morphological segmentation methods, which rely on large labeled training sets or on complicated rules compiled by linguists, we explore a novel semi-supervised method aimed at a practical application, statistical machine translation (SMT), in a low-resource setting where a small amount of labeled data and a large amount of unlabeled data are available. First, a CRF-based supervised model is trained on the small labeled set to predict morpheme boundaries. Then, a lexicon-based segmentation model, which uses the small labeled set as heuristic information, exploits the abundant unlabeled data to compensate for the weaknesses of the first step. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves segmentation over purely supervised learning. In addition, we integrate the morphological segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
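
As a rough illustration of the first two steps, the sketch below frames boundary prediction as per-character labeling (a common way to prepare data for a CRF tagger) and shows a greedy suffix-lexicon segmenter. It is not the authors' implementation: the B/I label scheme, the function names, the `min_stem` parameter, and the transliterated toy word are all assumptions made for illustration, and no CRF is actually trained here.

```python
from typing import Iterable, List


def morphemes_to_labels(morphemes: List[str]) -> List[str]:
    """Encode a segmented word as per-character labels.

    'B' marks the first character of a morpheme, 'I' every other character.
    (Assumed label scheme; the paper's exact tag set is not given here.)
    """
    labels: List[str] = []
    for m in morphemes:
        labels.extend(["B"] + ["I"] * (len(m) - 1))
    return labels


def labels_to_morphemes(word: str, labels: List[str]) -> List[str]:
    """Decode per-character labels back into a list of morphemes."""
    morphemes: List[str] = []
    for ch, lab in zip(word, labels):
        if lab == "B" or not morphemes:
            morphemes.append(ch)      # start a new morpheme
        else:
            morphemes[-1] += ch       # extend the current morpheme
    return morphemes


def lexicon_segment(word: str, suffixes: Iterable[str], min_stem: int = 2) -> List[str]:
    """Greedy longest-suffix stripping against a suffix lexicon.

    Hypothetical stand-in for a lexicon-based segmenter: it repeatedly
    removes the longest known suffix while at least `min_stem`
    characters remain for the stem.
    """
    ordered = sorted(suffixes, key=len, reverse=True)
    parts: List[str] = []
    rest = word
    while True:
        match = next(
            (s for s in ordered if rest.endswith(s) and len(rest) - len(s) >= min_stem),
            None,
        )
        if match is None:
            break
        parts.insert(0, match)
        rest = rest[: -len(match)]
    return [rest] + parts


# Toy transliterated example (illustrative only).
gold = ["nom", "uud", "aas"]
labels = morphemes_to_labels(gold)
decoded = labels_to_morphemes("nomuudaas", labels)
lexicon_result = lexicon_segment("nomuudaas", {"uud", "aas"})
```

In a pipeline like the one described, labels of this kind would be what the supervised tagger predicts, and a lexicon-based pass could then be applied to words where the tagger is unreliable; how the two are actually combined is specified in the paper, not in this sketch.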



Acknowledgement

This work is supported by the National Natural Science Foundation of China under Grants No. 61572462 and No. 61502445, and by the National Key Technology R&D Program under Grant No. 2014BAD10B03.

Author information

Authors and Affiliations

Institute of Intelligent Machines, Chinese Academy of Sciences, Hefei, 230031, China
Zhenxin Yang, Miao Li, Lei Chen & Weihui Zeng
University of Science and Technology of China, Hefei, 230026, China
Zhenxin Yang
Yunnan Agricultural Expert System Leading Group Office, Kunming, 650000, China
Yi Gao & Sha Fu


Corresponding author

Correspondence to Miao Li.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Maosong Sun
Fudan University, Shanghai, China
Xuanjing Huang
Dalian University of Technology, Dalian, China
Hongfei Lin
Tsinghua University, Beijing, China
Zhiyuan Liu
Tsinghua University, Beijing, China
Yang Liu


About this paper

Cite this paper

Yang, Z., Li, M., Chen, L., Zeng, W., Gao, Y., Fu, S. (2016). Semi-supervised Learning for Mongolian Morphological Segmentation. In: Sun, M., Huang, X., Lin, H., Liu, Z., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2016, CCL 2016. Lecture Notes in Computer Science, vol 10035. Springer, Cham. https://doi.org/10.1007/978-3-319-47674-2_13


DOI: https://doi.org/10.1007/978-3-319-47674-2_13
Published: 10 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47673-5
Online ISBN: 978-3-319-47674-2
eBook Packages: Computer Science, Computer Science (R0)
