Phrase-Level Grouping for Lexical Gap Resolution in Korean-Vietnamese SMT

  • Conference paper
Computational Linguistics (PACLING 2017)

Abstract

A lexical gap easily leads to word alignment errors, which impair translation quality. This paper proposes several simple ideas for handling lexical gaps. In morphologically rich languages, a predicate has a complex structure consisting of many morphemes, so we mainly address how to group the component morphemes by employing morpho-syntactic filters and statistical information from the SMT phrase table. In addition, we abstract the grouping results depending on the lexical choice on the target side to enhance translation probabilities. In the experiments, we investigate how each method affects Korean-to-Vietnamese SMT and show a promising improvement in BLEU score.
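The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the tag set `PREDICATE_TAGS`, the function `group_morphemes`, the toy `phrase_scores` table, and the threshold are all hypothetical names and values chosen for illustration. The sketch assumes a morpho-syntactic filter that only lets predicate-related morphemes merge, and a score per merged unit standing in for statistics taken from a phrase table.

```python
# Hypothetical sketch of phrase-level grouping; names, tags, and scores
# are illustrative assumptions, not values from the paper.

# Assumed Sejong-style POS tags for predicate stems and verbal endings.
PREDICATE_TAGS = {"VV", "VA", "VX", "EP", "EF", "EC"}


def group_morphemes(tagged, phrase_scores, threshold=0.5):
    """Greedily merge adjacent morphemes into one token.

    A merge happens only if both neighbours pass the POS filter and the
    merged unit's (assumed) phrase-table score reaches the threshold.
    """
    groups = []  # list of (surface, tag of last morpheme in the group)
    for surface, tag in tagged:
        if groups:
            prev_surface, prev_tag = groups[-1]
            candidate = prev_surface + "_" + surface
            passes_filter = tag in PREDICATE_TAGS and prev_tag in PREDICATE_TAGS
            if passes_filter and phrase_scores.get(candidate, 0.0) >= threshold:
                groups[-1] = (candidate, tag)
                continue
        groups.append((surface, tag))
    return [surface for surface, _ in groups]


# Toy input: a morpheme-segmented Korean sentence with assumed tags.
tagged = [("학교", "NNG"), ("에", "JKB"), ("가", "VV"),
          ("고", "EC"), ("싶", "VX"), ("다", "EF")]
# Toy scores standing in for phrase-table statistics.
phrase_scores = {"가_고": 0.8, "가_고_싶": 0.7, "가_고_싶_다": 0.2}

grouped = group_morphemes(tagged, phrase_scores)
print(grouped)
```

In this toy run the three predicate morphemes with high scores merge into one unit, while the final ending stays separate because its merged score falls below the threshold. The abstraction step mentioned in the abstract is not covered by this sketch.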

Acknowledgment

This work was partly supported by the ICT R&D program of MSIP/IITP [R7119-16-1001, Core technology development of the real-time simultaneous speech translation based on knowledge enhancement], the ICT Consilience Creative Program of MSIP/IITP [R0346-16-1007] and SYSTRAN.

Author information

Corresponding author

Correspondence to Seung Woo Cho.

Copyright information

© 2018 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Cho, S.W., Lee, EH., Lee, JH. (2018). Phrase-Level Grouping for Lexical Gap Resolution in Korean-Vietnamese SMT. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_11

  • DOI: https://doi.org/10.1007/978-981-10-8438-6_11

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-8437-9

  • Online ISBN: 978-981-10-8438-6

  • eBook Packages: Computer Science, Computer Science (R0)
