Abstract
Unlike previous Mongolian morphological segmentation methods, which rely on large labeled training sets or on complicated rules compiled by linguists, we explore a novel semi-supervised method aimed at a practical application, statistical machine translation (SMT), in a low-resource learning setting where a small amount of labeled data and a large amount of unlabeled data are available. First, a CRF-based supervised model is trained on the small labeled set to predict morpheme boundaries. Then, a lexicon-based segmentation model, which uses the small labeled set as heuristic information, exploits the abundant unlabeled data to compensate for the weaknesses of the first step. Finally, we present several error correction models to revise the segmentation results. Experimental results show that our method improves segmentation results compared with purely supervised learning. In addition, we integrate the morphological segmentation results into Chinese-Mongolian SMT and achieve satisfactory performance compared with the baseline.
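To make the first step more concrete: predicting morpheme boundaries with a CRF is commonly framed as labeling each character of a word. The sketch below shows one possible encoding, assuming a two-label scheme in which B marks the first character of a morpheme and M marks any other character. The abstract does not state which label set the authors use, so the scheme, the function names `to_labels` and `from_labels`, and the placeholder word are all illustrative assumptions rather than the paper's method.

```python
from typing import List, Tuple


def to_labels(morphemes: List[str]) -> Tuple[str, List[str]]:
    """Convert a segmented word into (word, per-character labels).

    Assumed scheme (not taken from the paper): "B" marks the first
    character of a morpheme, "M" marks every other character.
    """
    labels: List[str] = []
    for morpheme in morphemes:
        if not morpheme:
            raise ValueError("empty morpheme in segmentation")
        labels.extend(["B"] + ["M"] * (len(morpheme) - 1))
    return "".join(morphemes), labels


def from_labels(word: str, labels: List[str]) -> List[str]:
    """Rebuild the morpheme list from a word and its predicted labels."""
    if len(word) != len(labels):
        raise ValueError("word and labels must have the same length")
    morphemes: List[str] = []
    for char, label in zip(word, labels):
        # Start a new morpheme on "B"; also on the first character,
        # so a prediction that wrongly starts with "M" is still decodable.
        if label == "B" or not morphemes:
            morphemes.append(char)
        else:
            morphemes[-1] += char
    return morphemes


# Placeholder segmentation, not a real Mongolian word.
word, labels = to_labels(["abc", "de", "f"])
restored = from_labels(word, labels)
```

Under this framing, the labeled data supplies (word, labels) pairs for training, and decoding a predicted label sequence with `from_labels` yields the segmentation that later steps could revise.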
Acknowledgement
This work is supported by the National Natural Science Foundation of China under Grants No. 61572462 and No. 61502445, and by the National Key Technology R&D Program under Grant No. 2014BAD10B03.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Yang, Z., Li, M., Chen, L., Zeng, W., Gao, Y., Fu, S. (2016). Semi-supervised Learning for Mongolian Morphological Segmentation. In: Sun, M., Huang, X., Lin, H., Liu, Z., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. NLP-NABD 2016, CCL 2016. Lecture Notes in Computer Science, vol 10035. Springer, Cham. https://doi.org/10.1007/978-3-319-47674-2_13
DOI: https://doi.org/10.1007/978-3-319-47674-2_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47673-5
Online ISBN: 978-3-319-47674-2
eBook Packages: Computer Science, Computer Science (R0)