Integrating Specialized Bilingual Lexicons of Multiword Expressions for Domain Adaptation in Statistical Machine Translation

Semmar, Nasredine; Laib, Meriama

doi:10.1007/978-981-10-8438-6_9

Part of the book series: Communications in Computer and Information Science (CCIS, volume 781)

Included in the following conference series:

International Conference of the Pacific Association for Computational Linguistics

Abstract

Domain adaptation consists of adapting Machine Translation (MT) systems designed for one domain to work in another. Multiword expressions generally characterize domain-specific vocabularies. Translating multiword expressions is a challenge for current Statistical Machine Translation (SMT) systems because corpus-based approaches are effective only when large amounts of parallel corpora are available. However, parallel corpora are available for only a limited number of language pairs and domains, and building corpora for several language pairs and domains is time-consuming and expensive. This paper describes an experimental evaluation of the impact of using a specialized bilingual lexicon of multiword expressions to obtain better domain adaptation for the state-of-the-art statistical machine translation system Moses. Our study concerns the English-French language pair and two kinds of texts: in-domain texts from Europarl (European Parliament Proceedings) and out-of-domain texts from Emea (European Medicines Agency Documents). We introduce three methods to integrate extracted bilingual multiword expressions into Moses. We show experimentally that integrating specialized bilingual lexicons of multiword expressions improves the translation quality of Moses for both in-domain and out-of-domain texts.

Acknowledgments

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 700381.

Author information

Authors and Affiliations

CEA, LIST, Vision and Content Engineering Laboratory, 91191, Gif-sur-Yvette, France
Nasredine Semmar & Meriama Laib

Corresponding author

Correspondence to Nasredine Semmar.

Editor information

Editors and Affiliations

Graduate School of Information Science and Technology, The University of Tokyo, Tokyo, Japan
Kôiti Hasida
Natural Language Processing Lab, University of Computer Studies, Yangon, Yangon, Myanmar
Win Pa Pa

About this paper

Cite this paper

Semmar, N., Laib, M. (2018). Integrating Specialized Bilingual Lexicons of Multiword Expressions for Domain Adaptation in Statistical Machine Translation. In: Hasida, K., Pa, W. (eds) Computational Linguistics. PACLING 2017. Communications in Computer and Information Science, vol 781. Springer, Singapore. https://doi.org/10.1007/978-981-10-8438-6_9

DOI: https://doi.org/10.1007/978-981-10-8438-6_9
Published: 04 March 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8437-9
Online ISBN: 978-981-10-8438-6
eBook Packages: Computer Science, Computer Science (R0)
