Phrase-Level Grouping for Lexical Gap Resolution in Korean-Vietnamese SMT

Cho, Seung Woo; Lee, Eui-Hyeon; Lee, Jong-Hyeok

doi:10.1007/978-981-10-8438-6_11

Part of the book series: Communications in Computer and Information Science (CCIS, volume 781)

Included in the following conference series:

International Conference of the Pacific Association for Computational Linguistics

Abstract

A lexical gap easily leads to word alignment errors, which impair translation quality. This paper proposes several simple ideas for handling lexical gaps. In morphologically rich languages, a predicate has a complex structure consisting of many morphemes, so we mainly address how to group the component morphemes by employing morpho-syntactic filters and statistical information from the SMT phrase table. In addition, we abstract the grouping results depending on the lexical choice on the target side to enhance translation probabilities. In the experiments, we investigate how each method affects Korean-to-Vietnamese SMT and show a promising improvement in BLEU score.
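
The abstract gives no algorithmic detail, so the following is only an illustrative sketch of what grouping predicate morphemes with a morpho-syntactic filter plus a phrase-level statistic might look like. It is not the authors' method. The tag sets, the score dictionary (standing in for statistics drawn from a phrase table), the threshold, and the function name group_predicates are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

Morpheme = Tuple[str, str]  # (surface form, POS tag)

# Hypothetical filter: tags that may start a group and tags that may extend it.
HEAD_TAGS = {"VV", "VA"}              # predicate stems (assumed tag names)
TAIL_TAGS = {"EP", "EC", "EF", "VX"}  # endings and auxiliaries (assumed tag names)


def group_predicates(
    sentence: List[Morpheme],
    phrase_scores: Dict[Tuple[str, ...], float],
    threshold: float = 0.5,
) -> List[str]:
    """Greedily merge a predicate stem with the morphemes that follow it.

    A group is extended one morpheme at a time while the next tag passes
    the filter and the extended sequence scores at least `threshold`.
    Sequences missing from `phrase_scores` are treated as scoring 0.0.
    """
    tokens: List[str] = []
    i = 0
    while i < len(sentence):
        form, tag = sentence[i]
        group = [form]
        j = i + 1
        if tag in HEAD_TAGS:
            while j < len(sentence) and sentence[j][1] in TAIL_TAGS:
                candidate = tuple(group + [sentence[j][0]])
                if phrase_scores.get(candidate, 0.0) < threshold:
                    break
                group.append(sentence[j][0])
                j += 1
        tokens.append("_".join(group))
        i = j
    return tokens


# Toy input: "I want to eat rice", already split into morphemes.
sentence = [
    ("나", "NP"), ("는", "JX"), ("밥", "NNG"), ("을", "JKO"),
    ("먹", "VV"), ("고", "EC"), ("싶", "VX"), ("다", "EF"),
]
# Made-up scores; a real system would estimate them from data.
scores = {
    ("먹", "고"): 0.8,
    ("먹", "고", "싶"): 0.7,
    ("먹", "고", "싶", "다"): 0.2,
}
grouped = group_predicates(sentence, scores)
print(grouped)
```

With these toy scores the stem 먹 absorbs 고 and 싶 but not the final ending 다, whose extended sequence falls below the threshold. The grouped unit 먹_고_싶 would then be aligned as a single source token.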

Acknowledgment

This work was partly supported by the ICT R&D program of MSIP/IITP [R7119-16-1001, Core technology development of the real-time simultaneous speech translation based on knowledge enhancement], the ICT Consilience Creative Program of MSIP/IITP [R0346-16-1007] and SYSTRAN.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Pohang University of Science and Technology, Pohang, Republic of Korea
Seung Woo Cho, Eui-Hyeon Lee & Jong-Hyeok Lee

Corresponding author

Correspondence to Seung Woo Cho.

Editor information

Editors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Kôiti Hasida
Natural Language Processing Lab, University of Computer Studies, Yangon, Yangon, Myanmar
Win Pa Pa

About this paper

Cite this paper

Cho, S.W., Lee, EH., Lee, JH. (2018). Phrase-Level Grouping for Lexical Gap Resolution in Korean-Vietnamese SMT. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_11

DOI: https://doi.org/10.1007/978-981-10-8438-6_11
Published: 04 March 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8437-9
Online ISBN: 978-981-10-8438-6
eBook Packages: Computer Science; Computer Science (R0)