Cross-Language Comparability and Its Applications for MT

Babych, Bogdan; Su, Fangzhong; Hartley, Anthony; Aker, Ahmet; Paramita, Monica Lestari; Clough, Paul; Gaizauskas, Robert

doi:10.1007/978-3-319-99004-0_2

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

The concept of comparability (linguistic relatedness, or closeness, between textual units or corpora) has many possible applications in computational linguistics. Consequently, measuring comparability has increasingly become a core technological challenge in the field, and methods for it need to be developed and evaluated systematically. Many practical applications require corpora with controlled levels of comparability, which are established by comparability metrics. It is therefore important to understand the linguistic and technological mechanisms and implications of comparability, and to establish a systematic methodology for developing, evaluating and using comparability metrics. This chapter presents our approach to developing and using such metrics for machine translation (MT), especially for under-resourced languages. We address three core areas: (1) systematic meta-evaluation (or calibration) of the metrics on the basis of parallel corpora; (2) the development of feature-selection techniques for the metrics on the basis of aligned comparable texts, such as Wikipedia articles; and (3) the application of the developed metrics to MT tasks for under-resourced languages, and the measurement of their effectiveness on corpora with unknown degrees of comparability. This work has led to redefining the vague linguistic concept of comparability in terms of the task-specific performance of tools that extract phrase-level translation equivalents from comparable texts.

Notes

1. Data downloaded March 2010: http://dumps.wikimedia.org/
2. Providing translation resources for under-resourced languages is the goal of the ACCURAT (http://www.accurat-project.eu/) project, within which this study was carried out.
3. Bing Translate was used to translate all document pairs apart from HR–EN, which was translated using Google Translate.
4. Note: the text in bold that appears with a ‘|’ character separating terms represents the referred article title and the document text as it appears to the user.
5. When this was not possible (i.e. fewer than 10 document pairs were found in a bin), all document pairs available in that bin were chosen for the evaluation set, and a higher number of documents was chosen from the lower bins to reach the total of 100 document pairs.
6. The questions were based on a prior pilot study in which 10 assessors assessed 5 document pairs and commented on the evaluation scheme and on the decisions behind their assigned similarity scores.
7. Data and judgements are available for download at the website: ir.shef.ac.uk/cloughie/resources/similarity_corpus.html
8. Agreement for the five similarity levels is calculated using a weighted version of Cohen’s Kappa, in which the order of classes is taken into account, e.g. similarity scores of 1 and 2 are in better agreement than scores of 1 and 5.
9. In these experiments, we used the Weka Toolkit (version 3.4.13).
10. We used weka.attributeSelection.InfoGainAttributeEval for feature selection and weka.attributeSelection.Ranker to rank the features, both from the Weka Toolkit.
11. Available at http://www.lemurproject.org/
12. The JRC-Acquis covers 22 European languages and provides large-scale parallel corpora for all 231 language pairs.
13. Manual inspection of the word alignment results shows that alignments with a probability higher than 0.3 are more reliable.
14. Generally, in JRC-Acquis, the parallel corpora for most non-English language pairs are much smaller than those for language pairs that contain English. Therefore, the resulting bilingual dictionaries that contain English have better word coverage, as they have many more dictionary entries.
15. We use WordNet (Fellbaum 1998) for word lemmatization.
16. Available at http://code.google.com/p/microsoft-translator-java-api/
17. Available at http://nlp.stanford.edu/software/CRF-NER.shtml
18. For the correlation measure, the comparability degrees are calibrated numerically: ‘parallel’, ‘strongly-comparable’ and ‘weakly-comparable’ are converted to 3, 2 and 1, respectively. The correlation is then computed between the numerical comparability levels and the corresponding average comparability scores automatically derived from the metrics.
19. Recall that in our experiment, English is used as the pivot language for non-English language pairs.
20. A manual evaluation of a small set of extracted data shows that parallel phrases with parallelism score SC ≥ 0.4 are more reliable.
21. For the purpose of the correlation measure, the three intervals are numerically calibrated as ‘1’, ‘2’ and ‘3’, respectively.
22. Alternatively, we can also train MT systems for text translation by using the available SMT toolkits (e.g. Moses) on large-scale parallel corpora.
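
Notes 8 and 18 each describe a small computation: a weighted Cohen's Kappa over five ordinal similarity levels, and a correlation between numerically calibrated comparability labels and averaged metric scores. The following is only a minimal Python sketch of both. It assumes linear disagreement weights and Pearson correlation, neither of which is specified in the notes; the function names are hypothetical and the ratings and scores in the usage section are made-up placeholders, not data from the chapter.

```python
from collections import Counter
from math import sqrt

# Calibration from Note 18: labels are mapped to 3, 2 and 1.
CALIBRATION = {"parallel": 3, "strongly-comparable": 2, "weakly-comparable": 1}


def weighted_kappa(rater_a, rater_b, levels=(1, 2, 3, 4, 5)):
    """Cohen's kappa with linear weights (assumed) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty rating lists")
    n = len(rater_a)
    span = max(levels) - min(levels)

    def penalty(x, y):
        # 0 for identical scores, 1 for the two extremes of the scale,
        # so scores 1 and 2 disagree less than scores 1 and 5.
        return abs(x - y) / span

    # Mean disagreement actually observed between the two raters.
    observed = sum(penalty(a, b) for a, b in zip(rater_a, rater_b)) / n

    # Mean disagreement expected by chance from each rater's marginals.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        penalty(x, y) * freq_a[x] * freq_b[y] for x in levels for y in levels
    ) / (n * n)

    if expected == 0:
        # Both raters always gave the same single score.
        return 1.0
    return 1.0 - observed / expected


def calibrated_correlation(labels, metric_scores):
    """Pearson correlation (assumed) between calibrated labels and scores."""
    xs = [CALIBRATION[label] for label in labels]
    ys = list(metric_scores)
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two label/score pairs")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Placeholder ratings: one near miss (1 vs 2) and one far miss (1 vs 5).
    reference = [1, 2, 3, 4, 5]
    print(weighted_kappa(reference, [2, 2, 3, 4, 5]))
    print(weighted_kappa(reference, [5, 2, 3, 4, 5]))

    # Placeholder average metric scores for the three comparability levels.
    labels = ["parallel", "strongly-comparable", "weakly-comparable"]
    print(calibrated_correlation(labels, [0.9, 0.6, 0.3]))
```

With linear weights, a near miss lowers agreement less than a far miss, which is the ordering property Note 8 describes. Quadratic weights or a rank correlation could be substituted without changing the structure of the sketch.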

References

Adafre, S. F., & de Rijke, M. (2006). Finding similar sentences across multiple languages in Wikipedia. Proceedings of the EACL Workshop on New Text, Trento, Italy.
Babych, B., Hartley, A., Sharoff, S., & Mudraya, O. (2007). Assisting translators in indirect lexical transfer. Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 136–143).
Babych, B., Sharoff, S., & Hartley, A. (2008). Generalising lexical translation strategies for MT using comparable corpora. Proceedings of LREC 2008, Marrakech, Morocco.
Bharadwaj, R. G., & Varma, V. (2011). Language independent identification of parallel sentences using Wikipedia. Proceedings of the 20th International Conference Companion on World Wide Web, WWW ’11 (pp. 11–12).
Chiao, Y.-Ch., & Zweigenbaum, P. (2002). Looking for candidate translational equivalents in specialized, comparable corpora. Proceedings of COLING 2002, Taipei, Taiwan.
Daille, B., & Morin, E. (2005). French-English terminology extraction from comparable corpora. IJCNLP (pp. 707–718).
Eisele, A., & Xu, J. (2010). Improving machine translation performance using comparable corpora. Proceedings of the LREC Workshop on Building and Using Comparable Corpora, Malta, May 2010.
Eisele, A., Federmann, C., Saint-Amand, H., Jellinghaus, M., Herrmann, T., & Chen, Y. (2008). Using Moses to integrate multiple rule-based machine translation engines into a hybrid system. Proceedings of the Third Workshop on Statistical Machine Translation (pp. 179–182).
Erdmann, M., Nakayama, K., Hara, T., & Nishio, S. (2008). Extraction of bilingual terminology from a multilingual web-based encyclopedia. Journal of Information Processing, 16, 67–79.
Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Filatova, E. (2009). Directions for exploiting asymmetries in multilingual Wikipedia. Proceedings of the Third International Workshop on Cross Lingual Information Access: Addressing the Information Need of Multilingual Societies (CLIAWS3 '09).
Finkel, J., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. Proceedings of ACL 2005, University of Michigan, Ann Arbor, MI.
Frank, E., Paynter, G., & Witten, I. (1999). Domain-specific keyphrase extraction. Proceedings of IJCAI 1999, Stockholm, Sweden.
Fung, P. (1998). A statistical view on bilingual lexicon extraction: From parallel corpora to non-parallel corpora. Proceedings of the 3rd Conference of the Association for Machine Translation in the Americas (AMTA’98) (pp. 1–16). Springer.
Fung, P., & Cheung, P. (2004a). Mining very non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of EMNLP 2004, Barcelona, Spain.
Fung, P., & Cheung, P. (2004b). Multi-level bootstrapping for extracting parallel sentences from a quasi-comparable corpus. Proceedings of COLING 2004, Geneva, Switzerland.
Fung, P., & Yee, L. Y. (1998). An IR approach for translating new words from nonparallel, comparable texts. COLING ’98: Proceedings of the 17th International Conference on Computational Linguistics (pp. 414–420).
Gamallo, P. O., & López, I. G. (2010). Wikipedia as multilingual source of comparable corpora. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora, LREC (pp. 21–25). http://www.fb06.unimainz.de/lk/bucc2010/documents/Proceedings-BUCC-2010.pdf#page=29
Hatzivassiloglou, V., Klavans, J. L., & Eskin, E. (1999). Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 203–212).
Hulth, A. (2003). Improved automatic keyword extraction given more linguistic knowledge. Proceedings of EMNLP 2003, Sapporo, Japan.
Ion, R. (2012). PEXACC: A parallel data mining algorithm from comparable corpora. Proceedings of LREC 2012, Istanbul, Turkey.
Kanaris, I., & Stamatatos, E. (2009). Learning to recognize webpage genres. Information Processing and Management, 45, 499–512.
Kessler, B., Nunberg, G., & Schuetze, H. (1998). Automatic detection of text genre. ACL '98: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (pp. 32–38).
Kilgarriff, A. (2001). Comparing corpora. International Journal of Corpus Linguistics, 6(1), 1–37 (Reprinted in Teubert & Krishnamurthy (Eds.), Corpus linguistics: Critical concepts in linguistics. Routledge. 2007.) Retrieved from http://www.kilgarriff.co.uk/Publications/2001-K-CompCorpIJCL.pdf.
Kilgarriff, A., & Rose, T. (1998). Measures for corpus similarity and homogeneity. Proceedings of EMNLP 1998, Granada, Spain.
Lee, M. D., Pincombe, B., & Welsh, M. (2005). An empirical evaluation of models of text document similarity. Proceedings of the 27th Annual Conference of the Cognitive Science Society (pp. 1254–1259).
Li, B., & Gaussier, E. (2010). Improving corpus comparability for bilingual lexicon extraction from comparable corpora. Proceedings of COLING 2010, Beijing, China.
Li, Y., McLean, D., Bandar, Z., O’Shea, J., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138–1150.
Lin, W., Snover, M., & Ji, H. (2011). Unsupervised language-independent name translation mining from Wikipedia infoboxes. Proceedings of EMNLP 2011, Conference on Empirical Methods in Natural Language Processing (pp. 43–52). Edinburgh, Scotland (pp. 27–31).
Liu, F., Pennell, D., Liu, F., & Liu, Y. (2009). Unsupervised approaches for automatic keyword extraction using meeting transcripts. Proceedings of NAACL 2009, Boulder, Colorado.
Lu, Y., Huang, J., & Liu, Q. (2007). Improving statistical machine translation performance by training data selection and optimization. Proceedings of the 2007 EMNLP-CoNLL (pp. 343–350).
Maia, B. (2003). What are comparable corpora? Proceedings of the Corpus Linguistics Workshop on Multilingual Corpora: Linguistic Requirements and Technical Perspectives, Lancaster.
McEnery, A., & Xiao, Z. (2007). Parallel and comparable corpora? Incorporating Corpora: Translation and the Linguist. Translating Europe. Multilingual Matters, Clevedon.
Morin, E., Daille, B., Takeuchi, K., & Kageura, K. (2007). Bilingual terminology mining – using brain, not brawn comparable corpora. Proceedings of ACL 2007 (pp. 664–671), Prague, Czech Republic.
Munteanu, D., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Munteanu, D. S., & Marcu, D. (2006). Extracting parallel sub-sentential fragments from non-parallel corpora. ACL-2006: Proceedings of the 21st International Conference on Computational Linguistics and the 44th Annual Meeting of the Association for Computational Linguistics (pp. 81–88), Sydney, Australia.
Munteanu, D. S., Fraser, A., & Marcu, D. (2004). Improved machine translation performance via parallel sentence extraction from comparable corpora. HLT-NAACL 2004: Main Proceedings (pp. 265–272).
Och, F., & Ney, H. (2000). Improved statistical alignment models. Proceedings of ACL 2000, Hong Kong, China.
Och, F., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Otero, P. G., & López, I. G. (2010). Wikipedia as multilingual source of comparable corpora. Proceedings of the LREC Workshop on BUCC (pp. 30–37).
Patry, A., & Langlais, P. (2011). Identifying parallel documents from a large bilingual collection of texts: Application to parallel article extraction in Wikipedia. Proceedings of the 4th Workshop on Building and Using Comparable Corpora (pp. 87–95).
Prochasson, E., & Fung, P. (2011). Rare word translation extraction from aligned comparable documents. Proceedings of ACL-HLT 2011, Portland, OR.
Rapp, R. (1995). Identifying word translations in non-parallel texts. ACL ’95: Proceedings of the 33rd Annual Meeting on Association for Computational Linguistics (pp. 320–322), Cambridge, MA.
Rapp, R. (1999). Automatic identification of word translations from unrelated English and German corpora. ACL ’99: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 519–526). College Park, MD.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. WCC ’00: Proceedings of the Workshop on Comparing Corpora (pp. 1–6).
Saralegi, X., Vicente, I., & Gurrutxaga, A. (2008). Automatic extraction of bilingual terms from comparable corpora in a popular science domain. Proceedings of the Workshop on Comparable Corpora, LREC 2008, Marrakech, Morocco.
Sharoff, S. (2007). Classifying Web corpora into domain and genre using automatic feature identification. Proceedings of 3rd Web as Corpus Workshop, Louvain-la-Neuve, Belgium.
Sharoff, S., Babych, B., & Hartley, A. (2006). Using comparable corpora to solve problems difficult for human translators. COLING/ACL 2006 Main Conference Poster Sessions (pp. 739–746).
Skadiņa, I., Vasiļjevs, A., Skadiņš, R., Gaizauskas, R., Tufiş, D., & Gornostay, T. (2010). Analysis and evaluation of comparable corpora for under resourced areas of machine translation. Proceedings of the 3rd Workshop on Building and Using Comparable Corpora. Applications of Parallel and Comparable Corpora in Natural Language Engineering and the Humanities (pp. 6–14), Valletta, Malta.
Smith, J., Quirk, C., & Toutanova, K. (2010). Extracting parallel sentences from comparable corpora using document level alignment. Proceedings of NAACL 2010, Los Angeles, CA.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., & Tufiş, D. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. Proceedings of LREC 2006, Genoa, Italy.
Teubert, W. (1996). Comparable or parallel corpora? International Journal of Lexicography, 9, 238–264.
Tomás, J., Bataller, J., Casacuberta, F., & Lloret, J. (2008). Mining Wikipedia as a parallel and comparable corpus. Language Forum, 1, 34.
Vidulin, V., Lustrek, M., & Gams, M. (2007). Using genres to improve search engines. Proceedings of the International Workshop Towards Genre-Enabled Search Engines: The Impact of Natural Language Processing (pp. 45–51).
Wu, D., & Fung, P. (2005). Inversion transduction grammar constraints for mining parallel sentences from quasi-comparable corpora. Natural Language Processing IJCNLP 2005, 3651, 257–268.
Wu, Z., Markert, K., & Sharoff, S. (2010). Fine-grained genre classification using structural learning algorithms. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (pp. 749–759).
Xu, J., Deng, Y., Gao, Y., & Ney, H. (2007). Domain dependent machine translation. Proceedings of the Machine Translation Summit XI, Copenhagen, Denmark.
Yu, K., & Tsujii, J. (2009). Extracting bilingual dictionary from comparable corpora with dependency heterogeneity. Proceedings of HLT-NAACL 2009, Boulder, CO.
Zesch, T., Müller, C., & Gurevych, I. (2008). Extracting lexical semantic knowledge from Wikipedia and Wiktionary. Proceedings of LREC 2008, Marrakech, Morocco.

Author information

Authors and Affiliations

University of Leeds, Leeds, UK
Bogdan Babych, Fangzhong Su & Anthony Hartley
University of Sheffield, Sheffield, UK
Ahmet Aker, Monica Lestari Paramita, Paul Clough & Robert Gaizauskas

Corresponding author

Correspondence to Bogdan Babych.

Editor information

Editors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs

Additional information

Chapter editors: Bogdan Babych and Robert Gaizauskas

About this chapter

Cite this chapter

Babych, B. et al. (2019). Cross-Language Comparability and Its Applications for MT. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_2

DOI: https://doi.org/10.1007/978-3-319-99004-0_2
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)