Abstract
This chapter describes how parallel and semi-parallel data extracted from comparable corpora can be used to enhance machine translation (MT) systems. It covers the methods used for this task in statistical and rule-based MT systems and the showcases that illustrate how such enhanced systems are used. The impact of data extracted from comparable corpora on MT quality is evaluated for 17 language pairs, and detailed studies involving human evaluation are carried out for 11 language pairs. First, baseline statistical machine translation (SMT) systems were built using traditional SMT techniques. These were then improved by integrating additional data extracted from the comparable corpora, and a comparative evaluation was performed to measure the improvements. Comparable corpora were also used to enrich the linguistic knowledge of rule-based machine translation (RBMT) systems by applying terminology extraction technology. Finally, SMT systems were adapted to a narrow domain by including domain-specific knowledge such as terminology, named entities (NEs), and domain-specific language models (LMs).
Notes
- 2.
The News Commentary corpus is from the training data released for the shared tasks of the last few workshops for statistical machine translation (SMT).
- 3.
Apache OpenNLP (available at: http://opennlp.apache.org/).
- 4.
EuroTermBank (http://www.eurotermbank.com/).
- 5.
WordPress: http://www.wordpress.com
- 6.
Blogger: http://www.blogger.com/
- 7.
Twitter: http://twitter.com
- 8.
Tumblr: http://www.tumblr.com
- 9.
MediaWiki: http://www.mediawiki.org/
- 10.
https://www.tumblr.com/about, accessed in January, 2016.
- 11.
https://about.twitter.com/company, all numbers approximate as of September 30, 2015.
- 12.
http://stats.wikimedia.org/EN/ReportCardTopWikis.htm, accessed in January, 2016.
- 13.
Translatewiki project: http://translatewiki.net/wiki/
- 14.
Translate Toolkit & Pootle: http://translate.sourceforge.net/wiki/
- 15.
Yandex Translate: http://company.yandex.com/technologies/translation.xml
Additional information
Chapter editor: Inguna Skadiņa
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Babych, B. et al. (2019). Training, Enhancing, Evaluating and Using MT Systems with Comparable Data. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_6
DOI: https://doi.org/10.1007/978-3-319-99004-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science (R0)