An Improved English-Thai Translation Framework for Non-timing Aligned Parallel Corpora Using Bleualign with Explicit Feedback

Published: 10 December 2018

ABSTRACT

Parallel corpora are a key resource for Statistical Machine Translation (SMT). SMT models work well with timing-aligned parallel corpora, but imperfectly aligned sentences in a bilingual corpus typically lead to poorer translations once the model is trained. How to use non-timing-aligned parallel corpora effectively in SMT has not been thoroughly researched. The goal of this paper is to improve the accuracy of an English-to-Thai SMT model by improving the sentence alignment of the parallel corpora. This work proposes an improved English-Thai translation framework for non-timing-aligned parallel corpora that uses an improved alignment algorithm: Bleualign with explicit user feedback. The generated model can then be applied to the Moses SMT training system to produce English-Thai translations. The experiment builds its parallel corpora from English and Thai subtitles obtained from TED (www.ted.com). The TED corpora sentences are not timing aligned, and this research generates an alignment model for them to be applied to the Moses SMT training system. The results show that the model built with the proposed algorithm outperforms two traditional alignment models, Gale-Church and Bleualign, achieving the highest BLEU score of 0.36.
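To make the role of BLEU in alignment more concrete, the following is a minimal Python sketch of the general idea behind BLEU-based sentence matching: score a machine-translated source sentence against each candidate target sentence and keep the best-scoring pair. It is not the paper's algorithm and not the Bleualign implementation, which additionally enforces a monotone alignment and fills gaps between anchor pairs. The function names (`sentence_bleu`, `best_matches`), the add-one smoothing, and the bigram limit are all assumptions made for illustration, and the input is assumed to be already tokenized (Thai text would first need word segmentation).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=2):
    """Smoothed sentence-level BLEU between two token lists (hypothetical helper)."""
    if not candidate or not reference:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        # Add-one smoothing (an assumption) keeps short sentences from scoring zero.
        log_precision_sum += math.log((matches + 1) / (total + 1))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


def best_matches(translated_source, targets):
    """For each translated source sentence, return (source_idx, target_idx, score)."""
    results = []
    for i, sentence in enumerate(translated_source):
        scores = [sentence_bleu(sentence, target) for target in targets]
        j = max(range(len(targets)), key=scores.__getitem__)
        results.append((i, j, scores[j]))
    return results


if __name__ == "__main__":
    # Toy, pre-tokenized data; the sentences are in a different order on each side.
    translated = [["the", "cat", "sat"], ["hello", "world"]]
    targets = [["hello", "world"], ["the", "cat", "sat", "down"]]
    for src_idx, tgt_idx, score in best_matches(translated, targets):
        print(src_idx, tgt_idx, round(score, 3))
```

A greedy best match like this can pair two source sentences with the same target sentence; a real aligner would resolve such conflicts, for example with dynamic programming over the score matrix.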

            Published in

            IAIT '18: Proceedings of the 10th International Conference on Advances in Information Technology
            December 2018
            145 pages
            ISBN: 9781450365680
            DOI: 10.1145/3291280

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            IAIT '18 paper acceptance rate: 20 of 47 submissions, 43%. Overall acceptance rate: 20 of 47 submissions, 43%.
