Leveraging Arabic-English Bilingual Corpora with Crowd Sourcing-Based Annotation for Arabic-Hebrew SMT

Gaurav, Manish; Saikumar, Guruprasad; Srivastava, Amit; Natarajan, Premkumar; Ananthakrishnan, Shankar; Matsoukas, Spyros

doi:10.1007/978-3-642-37256-8_25

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7817)

Included in the following conference series:

International Conference on Intelligent Text Processing and Computational Linguistics

Abstract

Recent studies in Statistical Machine Translation (SMT) have focused on translating foreign languages into English. As SMT systems have matured, however, demand has grown for translation from one foreign language to another. Unfortunately, parallel training corpora for a pair of morphologically complex languages such as Arabic and Hebrew are very scarce. This paper uses active-learning-based data selection and crowd sourcing through Amazon Mechanical Turk to create Arabic-Hebrew parallel corpora. It then explores two techniques for building an Arabic-Hebrew SMT system. The first is the traditional cascade of two SMT systems, with English as the pivot language. The second trains a direct Arabic-Hebrew SMT system using sentence pivoting. Finally, we use a phrase generalization approach to further improve performance.
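The difference between the two pivoting strategies can be illustrated with a minimal sketch. It is not the authors' implementation: the word-lookup "translators" are hypothetical stand-ins for trained SMT systems, the tokens `ar_*` and `he_*` are placeholders rather than real Arabic or Hebrew, and the function names are assumptions made for illustration. The abstract does not say how the English side is rendered into Hebrew for sentence pivoting, so the sketch simply takes any English-to-Hebrew function.

```python
from typing import Callable, Dict, List, Tuple

Translator = Callable[[str], str]


def make_toy_translator(table: Dict[str, str]) -> Translator:
    """Word-by-word lookup standing in for a trained SMT system (hypothetical)."""
    def translate(sentence: str) -> str:
        # Unknown tokens are passed through unchanged.
        return " ".join(table.get(tok, tok) for tok in sentence.split())
    return translate


def cascade(src_to_pivot: Translator, pivot_to_tgt: Translator) -> Translator:
    """Cascading: at test time, translate source -> pivot, then pivot -> target."""
    return lambda sentence: pivot_to_tgt(src_to_pivot(sentence))


def sentence_pivot_corpus(
    src_pivot_pairs: List[Tuple[str, str]],
    pivot_to_tgt: Translator,
) -> List[Tuple[str, str]]:
    """Sentence pivoting: replace the pivot side of each training pair with a
    target-language rendering, giving source-target pairs on which a direct
    system could be trained."""
    return [(src, pivot_to_tgt(piv)) for src, piv in src_pivot_pairs]


# Placeholder vocabularies (assumed, not taken from the paper).
ar_en = make_toy_translator({"ar_the": "the", "ar_book": "book"})
en_he = make_toy_translator({"the": "he_the", "book": "he_book"})

# Approach 1: two systems chained through English at translation time.
ar_he_cascaded = cascade(ar_en, en_he)
cascaded_output = ar_he_cascaded("ar_the ar_book")

# Approach 2: build Arabic-Hebrew training pairs from Arabic-English ones.
arabic_english = [("ar_the ar_book", "the book")]
arabic_hebrew = sentence_pivot_corpus(arabic_english, en_he)
```

In the cascade, errors from the first system feed directly into the second at translation time. With sentence pivoting, the pivot step happens once, when the training data is built, and the resulting system translates Arabic to Hebrew in a single pass.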

Author information

Authors and Affiliations

Raytheon BBN Technologies, 10 Moulton Street, Cambridge, MA, 02138, USA
Manish Gaurav, Guruprasad Saikumar, Amit Srivastava, Premkumar Natarajan, Shankar Ananthakrishnan & Spyros Matsoukas

Editor information

Editors and Affiliations

Center for Computing Research, National Polytechnic Institute, Mexico D.F., Mexico
Alexander Gelbukh

About this paper

Cite this paper

Gaurav, M., Saikumar, G., Srivastava, A., Natarajan, P., Ananthakrishnan, S., Matsoukas, S. (2013). Leveraging Arabic-English Bilingual Corpora with Crowd Sourcing-Based Annotation for Arabic-Hebrew SMT. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2013. Lecture Notes in Computer Science, vol 7817. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-37256-8_25

DOI: https://doi.org/10.1007/978-3-642-37256-8_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-37255-1
Online ISBN: 978-3-642-37256-8
eBook Packages: Computer Science, Computer Science (R0)
