research-article

An Improved English-Thai Translation Framework for Non-timing Aligned Parallel Corpora Using Bleualign with Explicit Feedback

Authors:
Ryan Coughlin
Department of Computer Science, Assumption University, Bangkok, Thailand

Rachsuda Setthawong
Department of Computer Science, Assumption University, Bangkok, Thailand

Pisal Setthawong
Department of Management Information System, Assumption University, Bangkok, Thailand
IAIT '18: Proceedings of the 10th International Conference on Advances in Information Technology, December 2018, Article No. 14, Pages 1–8
https://doi.org/10.1145/3291280.3291794

Published: 10 December 2018

ABSTRACT

Parallel corpora are a key resource for language translation using Statistical Machine Translation (SMT). SMT models work well with timing aligned parallel corpora. However, imperfectly aligned sentences in the bilingual corpus typically lead to poorer translations from the trained SMT model. How to apply non-timing aligned parallel corpora effectively in an SMT model has not been thoroughly researched. The goal of this paper is to improve the accuracy of an English-to-Thai SMT model by improving the sentence alignment of the parallel corpora. This work proposes an improved English-Thai translation framework for non-timing aligned parallel corpora that uses an improved alignment algorithm: Bleualign with explicit user feedback. The resulting alignment model can then be applied to the Moses SMT training system to generate English-Thai translations. The experiment uses English and Thai subtitles obtained from TED (www.ted.com) to build the parallel corpora. The sentences in the TED corpora are not timing aligned, so this research generates an alignment model to be applied to the Moses SMT training system. The results show that the model using the proposed algorithm outperforms two traditional alignment models, Gale-Church and Bleualign, achieving the highest BLEU score of 0.36.
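To make the idea of BLEU-guided sentence alignment with user feedback concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the source sentences have already been machine-translated and tokenized into the target language, which is how Bleualign compares the two sides. The names `sentence_bleu`, `align`, `confirmed`, `rejected`, and `threshold` are hypothetical, and the way feedback is applied here (confirmed pairs are forced in, rejected pairs are forced out) is an assumption, since the abstract does not describe the mechanism.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, reference, max_n=2):
    """Simplified sentence-level BLEU-style score in [0, 1].

    Assumption: orders above 1 use add-one smoothing, and a candidate with
    no unigram overlap scores 0. This is not the exact Bleualign scoring.
    """
    if not candidate or not reference:
        return 0.0
    log_sum, orders = 0.0, 0
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        total = sum(cand.values())
        if total == 0:  # candidate is shorter than n tokens
            break
        ref = ngrams(reference, n)
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if n == 1:
            if overlap == 0:
                return 0.0
            precision = overlap / total
        else:
            precision = (overlap + 1) / (total + 1)
        log_sum += math.log(precision)
        orders += 1
    # Brevity penalty: penalize candidates shorter than the reference.
    brevity = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return brevity * math.exp(log_sum / orders)


def align(translated_src, tgt, confirmed=frozenset(), rejected=frozenset(),
          threshold=0.2):
    """Monotonic 1-to-1 alignment that maximizes the summed pair scores.

    confirmed / rejected are sets of (source_index, target_index) pairs
    standing in for explicit user feedback (a hypothetical interface).
    """
    n, m = len(translated_src), len(tgt)
    scores = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if (i, j) in confirmed:
                scores[i][j] = 2.0  # above any BLEU score, so it is preferred
            elif (i, j) not in rejected:
                s = sentence_bleu(translated_src[i], tgt[j])
                scores[i][j] = s if s >= threshold else 0.0

    # best[i][j]: best total score using the first i source and j target sentences.
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = best[i - 1][j - 1] + scores[i - 1][j - 1]
            best[i][j] = max(best[i - 1][j], best[i][j - 1], diagonal)

    # Trace back from the end to recover the chosen pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = scores[i - 1][j - 1]
        if s > 0 and best[i][j] == best[i - 1][j - 1] + s:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


# Toy data: English tokens stand in for machine-translated source sentences.
src_mt = [["the", "cat", "sat"], ["hello", "world"],
          ["good", "morning", "everyone"]]
tgt = [["the", "cat", "sat", "down"], ["unrelated", "text", "here"],
       ["good", "morning", "everyone"]]

print(align(src_mt, tgt))
print(align(src_mt, tgt, confirmed={(1, 1)}))
```

In this toy run the second sentence pair shares no words, so automatic scoring leaves it unaligned; passing `confirmed={(1, 1)}` shows how a user judgment could recover a pair that the BLEU-style score alone would miss. The aligned pairs would then form the training corpus for a toolkit such as Moses.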

Index Terms

Computing methodologies

Index terms have been assigned to the content through auto-classification.

Published in

IAIT '18: Proceedings of the 10th International Conference on Advances in Information Technology
December 2018
145 pages
ISBN: 9781450365680
DOI: 10.1145/3291280

Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
English-Thai translation framework
nontiming parallel corpora
phrase alignment
statistical machine translation (SMT)
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
IAIT '18 paper acceptance rate: 20 of 47 submissions (43%)
Overall acceptance rate: 20 of 47 submissions (43%)