ABSTRACT
Parallel corpora are a key resource for Statistical Machine Translation (SMT). SMT models work well with timing-aligned parallel corpora. However, imperfectly aligned sentences in a bilingual corpus typically lead to poorer translations once the SMT model has been trained. How to apply non-timing aligned parallel corpora effectively in an SMT model has not been thoroughly researched. The goal of this paper is to improve the accuracy of an English-to-Thai SMT model by improving the sentence alignment of the parallel corpora. This work proposes an improved English-Thai translation framework for non-timing aligned parallel corpora using an improved alignment algorithm: Bleualign with explicit user feedback. The generated model can then be applied to the Moses SMT training system to generate English-Thai translations. The experiment uses English and Thai subtitles obtained from TED (www.ted.com) to build the parallel corpora. The TED corpora sentences are not timing aligned, and this research generates an alignment model to be applied to the Moses SMT training system. The results show that the model using our proposed algorithm outperforms two traditional alignment models, Gale-Church and Bleualign, achieving the highest BLEU score of 0.36.
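To illustrate the general idea of BLEU-based sentence alignment combined with explicit user feedback, a minimal Python sketch follows. It is not the Bleualign implementation and not the framework evaluated in this paper. It assumes the source sentences have already been machine-translated into the target language and tokenized, uses a simplified BLEU-like score (mean clipped n-gram precision, no brevity penalty), and aligns greedily rather than searching for an optimal path. The names `ngram_overlap`, `align`, `confirmed`, and `threshold` are hypothetical and chosen only for this illustration.

```python
from collections import Counter


def ngram_overlap(candidate, reference, max_n=2):
    """Simplified BLEU-like similarity: mean clipped n-gram precision.

    Assumption: no brevity penalty and a plain arithmetic mean, unlike BLEU.
    """
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        total = sum(cand.values())
        if total == 0:
            # Candidate is too short to contain any n-gram of this order.
            precisions.append(0.0)
            continue
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions)


def align(translated_source, target, confirmed=None, threshold=0.3):
    """Greedily map each source index to its best-scoring unused target index.

    `confirmed` is a hypothetical dict of user-approved pairs
    (source index -> target index) that override the automatic choice.
    Pairs scoring at or below `threshold` are left unaligned.
    """
    confirmed = confirmed or {}
    alignment = dict(confirmed)
    used = set(confirmed.values())
    for i, src in enumerate(translated_source):
        if i in alignment:
            continue  # the user already fixed this sentence
        best_j, best_score = None, threshold
        for j, tgt in enumerate(target):
            if j in used:
                continue
            score = ngram_overlap(src, tgt)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            alignment[i] = best_j
            used.add(best_j)
    return alignment


if __name__ == "__main__":
    translated = [["the", "cat", "sat"], ["hello", "world"]]
    target = [["hello", "world"], ["the", "cat", "sat", "down"]]
    print(align(translated, target))
    print(align(translated, target, confirmed={0: 0}))
```

In this sketch, user feedback enters only as fixed pairs that the automatic step must respect. Sentences with no sufficiently similar counterpart stay unaligned and could be dropped before training.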
Index Terms
- An Improved English-Thai Translation Framework for Non-timing Aligned Parallel Corpora Using Bleualign with Explicit Feedback