Abstract
A lexical gap easily leads to word alignment errors, which impair translation quality. This paper proposes some simple ideas for handling lexical gaps. In morphologically rich languages, a predicate has a complex structure consisting of many morphemes, so we mainly address how to group the component morphemes by employing morpho-syntactic filters and statistical information from the SMT phrase table. In addition, we abstract the grouping results according to the lexical choice on the target side to improve translation probabilities. In the experiments, we investigate the effect of each method on Korean-to-Vietnamese SMT and show a promising improvement in BLEU score.
Acknowledgment
This work was partly supported by the ICT R&D program of MSIP/IITP [R7119-16-1001, Core technology development of the real-time simultaneous speech translation based on knowledge enhancement], the ICT Consilience Creative Program of MSIP/IITP [R0346-16-1007] and SYSTRAN.
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Cho, S.W., Lee, EH., Lee, JH. (2018). Phrase-Level Grouping for Lexical Gap Resolution in Korean-Vietnamese SMT. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_11
DOI: https://doi.org/10.1007/978-981-10-8438-6_11
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8437-9
Online ISBN: 978-981-10-8438-6
eBook Packages: Computer Science, Computer Science (R0)