Introduction

Abstract

This book addresses the full set of questions that arise when exploiting comparable corpora to overcome the shortage of parallel corpora, a bottleneck for any data-driven machine translation approach, particularly for under-resourced languages and narrow domains. It describes methods and tools for identifying and assessing comparability, for gathering comparable corpora from the Web, and for extracting translation equivalents from comparable texts. It also discusses the evaluation of this pipeline of methods and tools by incorporating their outputs into a machine translation system and assessing its performance in real application settings.

Author information

Corresponding author

Correspondence to Inguna Skadiņa.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Skadiņa, I., Gaizauskas, R., Vasiļjevs, A., Paramita, M.L. (2019). Introduction. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_1

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science, Computer Science (R0)
