Cross-Language Comparability and Its Applications for MT

Using Comparable Corpora for Under-Resourced Areas of Machine Translation

Abstract

The concept of comparability, or linguistic relatedness, or closeness between textual units or corpora has many possible applications in computational linguistics. Consequently, measuring comparability has increasingly become a core technological challenge in the field, and metrics for it need to be developed and evaluated systematically. Many practical applications require corpora with controlled levels of comparability, which are established by comparability metrics. From this perspective, it is important to understand the linguistic and technological mechanisms and implications of comparability and to establish a systematic methodology for developing, evaluating and using comparability metrics. This chapter presents our approach to developing and using such metrics for machine translation (MT), especially for under-resourced languages. We address three core areas: (1) systematic meta-evaluation (or calibration) of the metrics on the basis of parallel corpora; (2) the development of feature-selection techniques for the metrics on the basis of aligned comparable texts, such as Wikipedia articles; and (3) applying the developed metrics to MT tasks for under-resourced languages and measuring their effectiveness for corpora with unknown degrees of comparability. This has led to redefining the vague linguistic concept of comparability in terms of the task-specific performance of the tools that extract phrase-level translation equivalents from comparable texts.

Notes

  1. Data downloaded March 2010: http://dumps.wikimedia.org/

  2. Providing translation resources for under-resourced languages is the goal of the ACCURAT project (http://www.accurat-project.eu/), within which this study was carried out.

  3. Bing Translate was used to translate all document pairs apart from HR–EN, which was translated using Google Translate.

  4. Note: the text in bold that appears with a ‘|’ character separating terms represents the referred article title and the document text as it appears to the user.

  5. When this was not possible (i.e. fewer than 10 document pairs were found in a bin), all document pairs in that bin were chosen for the evaluation set, and more document pairs were chosen from the lower bins to reach the total of 100 document pairs.

  6. The questions were based on a prior pilot study in which 10 assessors assessed 5 document pairs and gave comments on the evaluation scheme and decisions regarding their assigned similarity score.

  7. Data and judgements are available for download at the website: ir.shef.ac.uk/cloughie/resources/similarity_corpus.html

  8. Agreement for the five similarity levels is calculated using a weighted version of Cohen’s Kappa, in which the order of classes is taken into account, e.g. similarity scores of 1 and 2 are in better agreement than scores 1 and 5.

  9. In these experiments, we used the Weka Toolkit (version 3.4.13).

  10. From the Weka Toolkit, we used weka.attributeSelection.InfoGainAttributeEval for feature selection and weka.attributeSelection.Ranker to rank the features.

  11. Available at http://www.lemurproject.org/

  12. The JRC-Acquis covers 22 European languages and provides large-scale parallel corpora for all the 231 language pairs.

  13. Manual inspection of the word alignment results shows that alignments with a probability higher than 0.3 are more reliable.

  14. Generally, in JRC-Acquis, the size of parallel corpora for most non-English language pairs is much smaller than that of language pairs that contain English. Therefore, the resulting bilingual dictionaries for language pairs that contain English have better word coverage, as they have many more dictionary entries.

  15. We use WordNet (Fellbaum 1998) for word lemmatization.

  16. Available at http://code.google.com/p/microsoft-translator-java-api/

  17. Available at http://nlp.stanford.edu/software/CRF-NER.shtml

  18. For the correlation measure, we calibrate the comparability degrees numerically: ‘parallel’, ‘strongly-comparable’ and ‘weakly-comparable’ are converted to 3, 2 and 1, respectively. The correlation is then computed between the numerical comparability levels and the corresponding average comparability scores automatically derived from the metrics.

  19. Remember that in our experiment, English is used as the pivot language for non-English language pairs.

  20. A manual evaluation of a small set of extracted data shows that parallel phrases with parallelism score SC ≥ 0.4 are more reliable.

  21. For the purpose of the correlation measure, the three intervals are numerically calibrated as ‘1’, ‘2’ and ‘3’, respectively.

  22. Alternatively, we can also train MT systems for text translation by using the available SMT toolkits (e.g. Moses) on large-scale parallel corpora.
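
The weighted agreement measure mentioned in note 8 can be illustrated with a short sketch. The note does not say which weighting scheme was used, so the code below assumes linear weights; the function name weighted_kappa and the sample scores are hypothetical and are not taken from the chapter.

```python
from collections import Counter


def weighted_kappa(ratings_a, ratings_b, categories=(1, 2, 3, 4, 5)):
    """Linearly weighted Cohen's kappa for two assessors on an ordinal scale.

    Assumption: linear disagreement weights |i - j| / (k - 1); the source
    only says that the order of the classes is taken into account.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}

    def weight(i, j):
        # 0 for identical scores, 1 for the two extremes of the scale.
        return abs(i - j) / (k - 1)

    # Joint counts of score pairs and each assessor's marginal counts.
    observed = Counter((index[a], index[b]) for a, b in zip(ratings_a, ratings_b))
    marg_a = Counter(index[a] for a in ratings_a)
    marg_b = Counter(index[b] for b in ratings_b)

    observed_dis = sum(weight(i, j) * c for (i, j), c in observed.items()) / n
    expected_dis = sum(
        weight(i, j) * marg_a[i] * marg_b[j] for i in range(k) for j in range(k)
    ) / (n * n)
    if expected_dis == 0:
        return 1.0  # both assessors used the same single score throughout
    return 1.0 - observed_dis / expected_dis


if __name__ == "__main__":
    reference = [1, 1, 5, 5]
    near_miss = [2, 1, 5, 5]  # one pair scored 2 instead of 1
    far_miss = [5, 1, 5, 5]   # one pair scored 5 instead of 1
    print(weighted_kappa(reference, near_miss))
    print(weighted_kappa(reference, far_miss))
```

With these hypothetical scores, the near miss yields a higher kappa than the far miss, which is the behaviour the note describes for scores 1 and 2 versus scores 1 and 5.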

References

  • Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.

  • Babych, B., Hartley, A., Sharoff, S., & Mudraya, O. (2007). Assisting translators in indirect lexical transfer. Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics (pp. 136–143).

  • Babych, B., Sharoff, S., & Hartley, A. (2008). Generalising lexical translation strategies for MT using comparable corpora. Proceedings of LREC 2008, Marrakech, Morocco.

  • Bharadwaj, R. G., & Varma, V. (2011). Language independent identification of parallel sentences using Wikipedia. Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11 (pp. 11–12).

  • Chiao, Y.-Ch., & Zweigenbaum, P. (2002). Looking for candidate translational equivalents in specialized, comparable corpora. Proceedings of COLING 2002, Taipei, Taiwan.

  • Daille, B., & Morin, E. (2005). French-English terminology extraction from comparable corpora. IJCNLP (pp. 707–718).

  • Eisele, A., & Xu, J. (2010). Improving machine translation performance using comparable corpora. Proceedings of the LREC Workshop on Building and Using Comparable Corpora, Malta, May 2010.

  • Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., & Chen, Y. (2008). Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. Proceedings of the Third Workshop on Statistical Machine Translation (pp. 179–182).

  • Erdmann, M., Nakayama, K., Hara, T., & Nishio, S. (2008). Extraction of bilingual terminology from a multilingual web-based encyclopedia. Journal of Information Processing, 16, 67–79.

  • Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

  • Filatova, E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3 '09).

  • Finkel, J., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of ACL 2005, University of Michigan, Ann Arbor, MI.

  • Frank, E., Paynter, G., & Witten, I. (1999). Domain-specific keyphrase extraction. Proceedings of IJCAI 1999, Stockholm, Sweden.

  • Fung, P. (1998). A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora. Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas (AMTA’98) (pp. 1–16). Springer.

  • Fung, P., & Cheung, P. (2004a). Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of EMNLP 2004, Barcelona, Spain.

  • Fung, P., & Cheung, P. (2004b). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. Proceedings of COLING 2004, Geneva, Switzerland.

  • Fung, P., & Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. COLING ’98: Proceedings of the 17th International Conference on Computational Linguistics (pp. 414–420).

  • Gamallo, P. O., & López, I. G. (2010). Wikipedia as multilingual source of comparable corpora. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC (pp. 21–25). http://www.fb06.unimainz.de/lk/bucc2010/documents/Proceedings-BUCC-2010.pdf#page=29

  • Hatzivassiloglou, V., Klavans, J. L., & Eskin, E. (1999). Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 203–212).

  • Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. Proceedings of EMNLP 2003, Sapporo, Japan.

  • Ion, R. (2012). PEXACC: A parallel data mining algorithm from comparable corpora. Proceedings of LREC 2012, Istanbul, Turkey.

  • Kanaris, I., & Stamatatos, E. (2009). Learning to recognize webpage genres. Information Processing and Management, 45, 499–512.

  • Kessler, B., Nunberg, G., & Schuetze, H. (1998). Automatic detection of text genre. ACL '98: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 32–38).

  • Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 1–37 (Reprinted in Teubert & Krishnamurthy (Eds.), Corpus linguistics: Critical concepts in linguistics. Routledge. 2007.) Retrieved from http://www.kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf.

  • Kilgarriff, A., & Rose, T. (1998). Measures for corpus similarity and homogeneity. Proceedings of EMNLP 1998, Granada, Spain.

  • Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society (pp. 1254–1259).

  • Li, B., & Gaussier, E. (2010). Improving corpus comparability for bilingual lexicon extraction from comparable corpora. Proceedings of COLING 2010, Beijing, China.

  • Li, Y., McLean, D., Bandar, Z., O’Shea, J., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138–1150.

  • Lin, W., Snover, M., & Ji, H. (2011). Unsupervised language-independent name translation mining from Wikipedia infoboxes. Proceedings of EMNLP 2011, Conference on Empirical Methods in Natural Language Processing (pp. 43–52), Edinburgh, Scotland.

  • Liu, F., Pennell, D., Liu, F., & Liu, Y. (2009). Unsupervised approaches for automatic keyword extraction using meeting transcripts. Proceedings of NAACL 2009, Boulder, Colorado.

  • Lu, Y., Huang, J., & Liu, Q. (2007). Improving statistical machine translation performance by training data selection and optimization. Proceedings of the 2007 EMNLP-CoNLL (pp. 343–350).

  • Maia, B. (2003). What are comparable corpora? Proceedings of the Corpus Linguistics Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives, Lancaster.

  • McEnery, A., & Xiao, Z. (2007). Parallel and comparable corpora? Incorporating Corpora: Translation and the Linguist. Translating Europe. Multilingual Matters, Clevedon.

  • Morin, E., Daille, B., Takeuchi, K., & Kageura, K. (2007). Bilingual terminology mining – using brain, not brawn comparable corpora. Proceedings of ACL 2007 (pp. 664–671), Prague, Czech Republic.

  • Munteanu, D., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.

  • Munteanu, D. S., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. ACL-2006: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88), Sydney, Australia.

  • Munteanu, D. S., Fraser, A., & Marcu, D. (2004). Improved machine translation performance via parallel sentence extraction from comparable corpora. HLT-NAACL 2004: Main Proceedings (pp. 265–272).

  • Och, F., & Ney, H. (2000). Improved statistical alignment models. Proceedings of ACL 2000, Hong Kong, China.

  • Och, F., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Otero, P. G., & López, I. G. (2010). Wikipedia as multilingual source of comparable corpora. Proceedings of the LREC Workshop on BUCC (pp. 30–37).

  • Patry, A., & Langlais, P. (2011). Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. Proceedings of the 4th Workshop on Building and Using Comparable Corpora (pp. 87–95).

  • Prochasson, E., & Fung, P. (2011). Rare word translation extraction from aligned comparable documents. Proceedings of ACL-HLT 2011, Portland, OR.

  • Rapp, R. (1995). Identifying word translations in non-parallel texts. ACL ’95: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (pp. 320–322), Cambridge, MA.

  • Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. ACL ’99: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 519–526), College Park, MD.

  • Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. WCC ’00: Proceedings of the Workshop on Comparing Corpora (pp. 1–6).

  • Saralegi, X., Vicente, I., & Gurrutxaga, A. (2008). Automatic extraction of bilingual terms from comparable corpora in a popular science domain. Proceedings of the Workshop on Comparable Corpora, LREC 2008, Marrakech, Morocco.

  • Sharoff, S. (2007). Classifying Web corpora into domain and genre using automatic feature identification. Proceedings of 3rd Web as Corpus Workshop, Louvain-la-Neuve, Belgium.

  • Sharoff, S., Babych, B., & Hartley, A. (2006). Using comparable corpora to solve problems difficult for human translators. COLING/ACL 2006 Main Conference Poster Sessions (pp. 739–746).

  • Skadiņa, I., Vasiļjevs, A., Skadiņš, R., Gaizauskas, R., Tufiş, D., & Gornostay, T. (2010). Analysis and evaluation of comparable corpora for under resourced areas of machine translation. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities (pp. 6–14), Valletta, Malta.

  • Smith, J., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. Proceedings of NAACL 2010, Los Angeles, CA.

  • Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., & Tufiş, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of LREC 2006, Genoa, Italy.

  • Teubert, W. (1996). Comparable or parallel corpora? International Journal of Lexicography, 9, 238–264.

  • Tomás, J., Bataller, J., Casacuberta, F., & Lloret, J. (2008). Mining Wikipedia as a parallel and comparable corpus. Language Forum, 1, 34.

  • Vidulin, V., Lustrek, M., & Gams, M. (2007). Using genres to improve search engines. Proceedings of the International Workshop Towards Genre-Enabled Search Engines: The Impact of Natural Language Processing (pp. 45–51).

  • Wu, D., & Fung, P. (2005). Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. Natural Language Processing IJCNLP 2005, 3651, 257–268.

  • Wu, Z., Markert, K., & Sharoff, S. (2010). Fine-grained genre classification using structural learning algorithms. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 749–759).

  • Xu, J., Deng, Y., Gao, Y., & Ney, H. (2007). Domain dependent machine translation. Proceedings of the Machine Translation Summit XI, Copenhagen, Denmark.

  • Yu, K., & Tsujii, J. (2009). Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of HLT-NAACL 2009, Boulder, CO.

  • Zesch, T., Müller, C., & Gurevych, I. (2008). Extracting lexical semantic knowledge from Wikipedia and Wiktionary. Proceedings of LREC 2008, Marrakech, Morocco.

Corresponding author

Correspondence to Bogdan Babych.

Additional information

Chapter editors: Bogdan Babych and Robert Gaizauskas

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Babych, B. et al. (2019). Cross-Language Comparability and Its Applications for MT. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_2

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science (R0)
