Mapping and Aligning Units from Comparable Corpora

Aker, Ahmet; Ceaușu, Alexandru; Feng, Yang; Gaizauskas, Robert; Hunsicker, Sabine; Ion, Radu; Irimia, Elena; Ștefănescu, Dan; Tufiș, Dan

doi:10.1007/978-3-319-99004-0_5

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models is an avenue that has attracted a lot of research in recent years. There are experiments that convincingly show how parallel sentences extracted from comparable corpora are able to improve statistical machine translation (SMT). Yet, the existing body of research on the subject takes into account neither the degree of comparability of the corpus being processed nor the computation time needed to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor crucially depends on the degree of comparability, such that it is more difficult to mine for parallel data in a weakly comparable corpus than in a strongly comparable one.

Most of the research in parallel data mining from comparable corpora focusses on parallel sentence mining, but parallel phrase mining (i.e. sub-sentential fragments) is of equal importance, because it can be more robust in the presence of weakly comparable corpora that usually do not contain whole translated sentences. We will present different approaches to parallel sentence and phrase mining from comparable corpora developed in the ACCURAT project, and we will evaluate them both in terms of absolute measures (e.g., P, R and F1) and with respect to their ability to generate significant improvements of the BLEU scores of a statistical translation system. Comprehensive testing of these algorithms in the context of statistical machine translation will be undertaken in Chap. 6.

Notes

1. With the possible exception of parallelising the computations.
2. Or ‘alignments’ or ‘pairs.’ These terms will be used with the same meaning throughout this section.
3. We did not attempt to find the mathematical maximum of the expression from Eq. (5.7), and we realise that the consequence of this choice and of the greedy search procedure is not finding the true optimum.
4. http://www.accurat-project.eu/
5. We keep functional words lists for all languages.
6. http://incubator.apache.org/projects/lucene.net.html
7. We experimented with different power values for the cohesion score. We had the best results with ½ (the square root).
8. But we acknowledge the fact that the probability of a sentence pair being parallel as computed by the classifier of Munteanu and Marcu is a proper model of parallelism.
9. To obtain the dictionaries mentioned throughout this subsection, we have applied GIZA++ on the JRC Acquis corpus (Steinberger et al. 2006).
10. For two source and target words, if the pair is not in the dictionary, we use a 0 to 1 normalised version of the Levenshtein distance in order to assign a ‘translation probability’ based on string similarity alone. If the source and target words are similar above a certain threshold (experimentally set to 0.7), we consider them to be translations.
11. Mostly from the News domain for all language pairs.
12. When an example occurs multiple times with both labels, we retain all the occurrences of the example with the most frequent label and remove all the conflicting occurrences.
13. http://www.accurat-project.eu/
14. For each parallel sentence, 2 noise sentences were added.
15. http://www.statmt.org/wmt11/translation-task.html
16. http://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages
17. http://en.wikipedia.org/wiki/List_of_Greek_place_names
18. These phrases are extracted with the SVM margin that maximises the F-measure, see the ‘Classifier evaluation’ subsection for details.
19. Koehn (2004) reports that an increase of 1% in BLEU score is a significant improvement.
20. And, if it is a set, no source phrase is repeated.
21. The probability threshold over which all generated parallel pairs are correct depends on the type of document pairs. For the English-Romanian pair of parallel documents on which we tested, at least 0.5 is guaranteed to indicate perfect parallelism (we have determined that by manually inspecting the output).
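
The string-similarity fallback described in Note 10 can be illustrated with a minimal Python sketch. The note does not say how the Levenshtein distance is normalised, so the sketch assumes one common choice, 1 - distance / length of the longer word. The function names, the dictionary layout (a mapping from word pairs to probabilities) and the value 0.0 returned below the threshold are likewise assumptions made for illustration, not the chapter's implementation. Only the 0.7 threshold comes from the note.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(source: str, target: str) -> float:
    """Similarity in [0, 1]; ASSUMED normalisation: 1 - distance / longest length."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(source, target) / longest


# Threshold reported in Note 10 (set experimentally by the authors).
SIMILARITY_THRESHOLD = 0.7


def translation_probability(source: str, target: str, dictionary: dict) -> float:
    """Dictionary probability if the pair is listed, else a string-similarity fallback.

    `dictionary` is a hypothetical mapping {(source_word, target_word): probability}.
    Returning 0.0 below the threshold is an assumption of this sketch.
    """
    if (source, target) in dictionary:
        return dictionary[(source, target)]
    similarity = string_similarity(source.lower(), target.lower())
    return similarity if similarity > SIMILARITY_THRESHOLD else 0.0
```

With an empty dictionary, a cognate pair such as "parlament" and "parliament" differs by a single edit over ten characters and therefore clears the threshold, whereas an unrelated pair such as "casa" and "house" does not.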

References

Aker, A., Kanoulas, E., & Gaizauskas, R. (2012a). A light way to collect comparable corpora from the Web. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012) (pp. 21–27), Istanbul, Turkey.
Aker, A., Feng, Y., & Gaizauskas, R. (2012b). Automatic bilingual phrase extraction from comparable corpora. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), IIT Bombay, Mumbai, India.
Aswani, N., & Gaizauskas, R. (2010). English-Hindi transliteration using multiple similarity metrics. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.
Borman, S. (2009). The expectation maximization algorithm. A short tutorial. http://www.seanborman.com/publications/EM_algorithm.pdf
Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.
Ceauşu, A. (2009). Statistical machine translation for Romanian. PhD Thesis, Romanian Academy (in Romanian).
Chen, S. F. (1993). Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics (pp. 9–16), Columbus, OH.
Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, June 2005 (pp. 263–270), Ann Arbor, MI.
Fellbaum, C. (Ed.) (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.
Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004) (pp. 57–63), Barcelona, Spain.
Gale, W. A., & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.
Gao, Q., & Vogel, S. (2008). Parallel implementations of a word alignment tool. In Proceedings of ACL-08 HLT: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, June 20, 2008 (pp. 49–57), Ohio State University, Columbus, OH.
Hewavitharana, S., & Vogel, S. (2011). Extracting parallel phrases from comparable data. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web (BUCC 2011) (pp. 61–68), Portland, OR.
Ion, R. (2012). PEXACC: A parallel sentence mining algorithm from comparable corpora. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012) (pp. 2181–2188), May 21–27, 2012, Istanbul, Turkey.
Ion, R., Ceauşu, A., & Irimia, E. (2011a). An expectation maximization algorithm for textual unit alignment. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC 2011) (pp. 128–135), June 24th, 2011, Portland, OR.
Ion, R., Zhang, X., Su, F., Paramita, M., & Ștefănescu, D. (2011b). Report on Multi-Level Alignment of Comparable Corpora. Technical report no. D2.2 of the ACCURAT Project (http://www.accurat-project.eu/).
Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004) (pp. 388–395), Barcelona, Spain.
Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, September 12–16, 2005 (pp. 79–86), Phuket, Thailand.
Koehn, P., Och, F., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 48–54), May 27–June 1, 2003, Edmonton, Canada.
Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Cowan, B., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180), Prague, Czech Republic.
Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (Vol. 1). Cambridge: Cambridge University Press.
Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002) (pp. 289–295), July 6–7, 2002, University of Pennsylvania, Philadelphia, PA.
Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.
Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (pp. 160–167), July 7–12, 2003, Sapporo, Japan.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Och, F. J., & Ney, H. (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4), 417–449.
Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 7–12 2002 (pp. 311–318), University of Pennsylvania, Philadelphia, PA.
Quirk, C., Udupa, R., & Menezes, A. (2007). Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the MT Summit XI (pp. 321–327), September, 2007, Copenhagen, Denmark.
Rauf, S. A., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.
Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., et al. (2010). A collection of comparable corpora for under-resourced languages. In Proceedings of the Fourth International Conference Baltic HLT 2010. Frontiers in Artificial Intelligence and Applications (Vol. 219, pp. 161–168), IOS Press.
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA 2006): Visions for the Future of Machine Translation (pp. 223–231), Cambridge, MA.
Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 259–268). Association for Computational Linguistics, Athens, Greece.
Ștefănescu, D., Ion, R., & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012) (pp. 137–144), May 28–30, 2012, Trento, Italy.
Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiș, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), May 24–26, 2006, Genoa, Italy.
Steinberger, R., Eisele, A., Klocek, A., Pilos, S., & Schlüter, P. (2012). DGT-TM: A freely Available Translation Memory in 22 Languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’2012), May 21–27, 2012, Istanbul, Turkey.
Stolcke, A. (2002). SRILM – An extensible language modeling toolkit. In Proceedings of the International Conference of Spoken Language Processing (ICSLP 2002) (pp. 901–904), September 2002, Denver, CO.
Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5), 427–445.
Thi Ngoc Diep, D., Besacier, L., & Castelli, E. (2010). A fully unsupervised approach for mining parallel data from comparable corpora. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT 2010), May 27–28, 2010, Saint-Raphaël, France.
Tillmann, C. (2009). A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 225–228), Suntec, Singapore, August 4th, 2009.
Tsvetkov, Y., & Wintner, S. (2010). Automatic acquisition of parallel corpora from websites with dynamic content. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10) (pp. 3389–3392), Valletta, Malta, May 2010.
Tufiș, D., Ion, R., Ceaușu, A., & Ștefănescu, D. (2006). Improved lexical alignment by combining multiple reified alignments. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006) (pp. 153–160), Trento, Italy, April 3–7 2006.
Tufiș, D., Ion, R., Bozianu, L., Ceaușu, A., & Ștefănescu, D. (2008). Romanian wordnet: Current state, new applications and prospects. In A. Tanacs, D. Csendes, V. Vincze, C. Fellbaum, & P. Vossen (Eds.), Proceedings of 4th Global WordNet Conference, GWC-2008, January 2008 (pp. 441–452). Hungary: University of Szeged.
Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of 28th European Conference on Information Retrieval ECIR 2006, April 10–12, 2006, London.

Author information

Authors and Affiliations

University of Sheffield, Sheffield, UK
Ahmet Aker, Yang Feng & Robert Gaizauskas
Research Institute for Artificial Intelligence of the Romanian Academy (RACAI), Bucharest, Romania
Alexandru Ceaușu, Radu Ion, Elena Irimia, Dan Ștefănescu & Dan Tufiș
The German Research Center for Artificial Intelligence (DFKI), Kaiserslautern, Germany
Sabine Hunsicker

Corresponding author

Correspondence to Robert Gaizauskas.

Editor information

Editors and Affiliations

Tilde, Riga, Latvia
Inguna Skadiņa
Department of Computer Science, University of Sheffield, Sheffield, UK
Robert Gaizauskas
School of Modern Languages & Cultures, University of Leeds, Leeds, UK
Bogdan Babych
Faculty of Humanities & Social Sciences, University of Zagreb, Zagreb, Croatia
Nikola Ljubešić
Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania
Dan Tufiş
Tilde, Riga, Latvia
Andrejs Vasiļjevs

Additional information

Chapter editors: Radu Ion and Dan Tufiș

About this chapter

Cite this chapter

Aker, A. et al. (2019). Mapping and Aligning Units from Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_5

DOI: https://doi.org/10.1007/978-3-319-99004-0_5
Published: 07 February 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99003-3
Online ISBN: 978-3-319-99004-0
eBook Packages: Computer Science, Computer Science (R0)