Mapping and Aligning Units from Comparable Corpora

Chapter in: Using Comparable Corpora for Under-Resourced Areas of Machine Translation

Abstract

Extracting parallel units (e.g. sentences or phrases) from comparable corpora in order to enrich existing statistical translation models has attracted a lot of research in recent years. Experiments have convincingly shown that parallel sentences extracted from comparable corpora can improve statistical machine translation (SMT). Yet the existing body of research takes into account neither the degree of comparability of the corpus being processed nor the computation time needed to extract translationally similar pairs from a corpus of a given size. We will show that the performance of a parallel unit extractor depends crucially on the degree of comparability: it is more difficult to mine parallel data from a weakly comparable corpus than from a strongly comparable one.

Most of the research on parallel data mining from comparable corpora focusses on parallel sentence mining, but parallel phrase mining (i.e. mining sub-sentential fragments) is of equal importance, because it can be more robust on weakly comparable corpora, which usually do not contain whole translated sentences. We will present different approaches to parallel sentence and phrase mining from comparable corpora developed in the ACCURAT project, and we will evaluate them both in terms of absolute measures (e.g. precision P, recall R and F1) and with respect to their ability to generate significant improvements in the BLEU scores of a statistical translation system. Comprehensive testing of these algorithms in the context of statistical machine translation will be undertaken in Chap. 6.
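
As an illustration of the absolute measures mentioned above, the following minimal sketch computes P, R and F1 for a set of extracted pairs against a gold standard. The function name and the representation of a pair as a tuple of (source id, target id) are assumptions made for this example only; they are not taken from the chapter.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(
    extracted: Iterable[Hashable], gold: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Return (P, R, F1) of the extracted pairs measured against the gold pairs."""
    extracted_set, gold_set = set(extracted), set(gold)
    # A pair counts as correct only if it also appears in the gold standard.
    true_positives = len(extracted_set & gold_set)
    precision = true_positives / len(extracted_set) if extracted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: pairs are (source sentence id, target sentence id).
gold_pairs = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)}
extracted_pairs = {(1, 1), (2, 2), (3, 5), (4, 4)}
p, r, f1 = precision_recall_f1(extracted_pairs, gold_pairs)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

In this made-up example, three of the four extracted pairs are correct and three of the five gold pairs are found.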

Notes

  1. With the possible exception of parallelising the computations.

  2. Or ‘alignments’ or ‘pairs.’ These terms will be used with the same meaning throughout this section.

  3. We did not attempt to find the mathematical maximum of the expression from Eq. (5.7), and we realise that the consequence of this choice and of the greedy search procedure is not finding the true optimum.

  4. http://www.accurat-project.eu/

  5. We keep function-word lists for all languages.

  6. http://incubator.apache.org/projects/lucene.net.html

  7. We experimented with different power values for the cohesion score. We had the best results with ½ (the square root).

  8. But we acknowledge the fact that the probability of a sentence pair being parallel as computed by the classifier of Munteanu and Marcu is a proper model of parallelism.

  9. To obtain the dictionaries mentioned throughout this subsection, we have applied GIZA++ to the JRC-Acquis corpus (Steinberger et al. 2006).

  10. For two source and target words, if the pair is not in the dictionary, we use a version of the Levenshtein distance normalised to the range 0 to 1 in order to assign a ‘translation probability’ based on string similarity alone. If the source and target words are similar above a certain threshold (experimentally set to 0.7), we consider them to be translations.

  11. Mostly from the News domain for all language pairs.

  12. When an example occurs multiple times with both labels, we retain all the occurrences of the example with the most frequent label and remove all the conflicting occurrences.

  13. http://www.accurat-project.eu/

  14. For each parallel sentence, 2 noise sentences were added.

  15. http://www.statmt.org/wmt11/translation-task.html

  16. http://en.wikipedia.org/wiki/Names_of_European_cities_in_different_languages

  17. http://en.wikipedia.org/wiki/List_of_Greek_place_names

  18. These phrases are extracted with the SVM margin that maximises the F-measure; see the ‘Classifier evaluation’ subsection for details.

  19. Koehn (2004) reports that an increase of 1% in BLEU score is a significant improvement.

  20. And, if it is a set, no source phrase is repeated.

  21. The probability threshold over which all generated parallel pairs are correct depends on the type of document pairs. For the English-Romanian pair of parallel documents on which we tested, at least 0.5 is guaranteed to indicate perfect parallelism (we have determined that by manually inspecting the output).
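
The string-similarity fallback described in note 10 can be sketched roughly as follows. This is only an illustration under stated assumptions: the note does not say how the Levenshtein distance is normalised, so the sketch divides it by the length of the longer word and subtracts the result from 1; returning 0.0 for pairs at or below the threshold is likewise an assumption, and the function names and the dictionary layout are hypothetical.

```python
from typing import Dict, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(source: str, target: str) -> float:
    """Similarity in [0, 1]; normalising by the longer word is an assumption."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(source, target) / longest


def translation_score(source: str, target: str,
                      dictionary: Dict[Tuple[str, str], float],
                      threshold: float = 0.7) -> float:
    """Dictionary probability if the pair is listed, else the similarity fallback."""
    if (source, target) in dictionary:
        return dictionary[(source, target)]
    similarity = string_similarity(source, target)
    # Assumption: pairs that are not similar above the threshold score 0.0.
    return similarity if similarity > threshold else 0.0


# Hypothetical dictionary entry and word pairs, for illustration only.
toy_dictionary = {("casa", "house"): 0.8}
print(translation_score("casa", "house", toy_dictionary))
print(translation_score("parlament", "parliament", toy_dictionary))
```

The threshold default of 0.7 is the experimentally set value reported in note 10; everything else in the sketch is illustrative.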

References

  • Aker, A., Kanoulas, E., & Gaizauskas, R. (2012a). A light way to collect comparable corpora from the Web. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012) (pp. 21–27), Istanbul, Turkey.

  • Aker, A., Feng, Y., & Gaizauskas, R. (2012b). Automatic bilingual phrase extraction from comparable corpora. In Proceedings of the 24th International Conference on Computational Linguistics (COLING 2012), IIT Bombay, Mumbai, India.

  • Aswani, N., & Gaizauskas, R. (2010). English-Hindi transliteration using multiple similarity metrics. In Proceedings of the 7th Language Resources and Evaluation Conference (LREC 2010), Valletta, Malta.

  • Borman, S. (2009). The expectation maximization algorithm. A short tutorial. http://www.seanborman.com/publications/EM_algorithm.pdf

  • Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., & Mercer, R. L. (1993). The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2), 263–311.

  • Ceauşu, A. (2009). Statistical machine translation for Romanian. PhD Thesis, Romanian Academy (in Romanian).

  • Chen, S. F. (1993). Aligning sentences in bilingual corpora using lexical information. In Proceedings of the 31st Annual Meeting on Association for Computational Linguistics (pp. 9–16), Columbus, OH.

  • Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, June 2005 (pp. 263–270), Ann Arbor, MI.

  • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge, MA: MIT Press.

  • Fung, P., & Cheung, P. (2004). Mining very-non-parallel corpora: Parallel sentence and lexicon extraction via bootstrapping and EM. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004) (pp. 57–63), Barcelona, Spain.

  • Gale, W. A., & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.

  • Gao, Q., & Vogel, S. (2008). Parallel implementations of a word alignment tool. In Proceedings of ACL-08 HLT: Software Engineering, Testing, and Quality Assurance for Natural Language Processing, June 20, 2008 (pp. 49–57), Ohio State University, Columbus, OH.

  • Hewavitharana, S., & Vogel, S. (2011). Extracting parallel phrases from comparable data. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web (BUCC 2011) (pp. 61–68), Portland, OR.

  • Ion, R. (2012). PEXACC: A parallel sentence mining algorithm from comparable corpora. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012) (pp. 2181–2188), May 21–27, 2012, Istanbul, Turkey.

  • Ion, R., Ceauşu, A., & Irimia, E. (2011a). An expectation maximization algorithm for textual unit alignment. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora (BUCC 2011) (pp. 128–135), June 24th, 2011, Portland, OR.

  • Ion, R., Zhang, X., Su, F., Paramita, M., & Ștefănescu, D. (2011b). Report on Multi-Level Alignment of Comparable Corpora. Technical report no. D2.2 of the ACCURAT Project (http://www.accurat-project.eu/).

  • Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP-2004) (pp. 388–395), Barcelona, Spain.

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of the Tenth Machine Translation Summit, September 12–16, 2005 (pp. 79–86), Phuket, Thailand.

  • Koehn, P., Och, F., & Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp. 48–54), May 27–June 1, 2003, Edmonton, Canada.

  • Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Cowan, B., et al. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL Companion Volume Proceedings of the Demo and Poster Sessions (pp. 177–180), Prague, Czech Republic.

  • Manning, C., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval (Vol. 1). Cambridge: Cambridge University Press.

  • Munteanu, D. S., & Marcu, D. (2002). Processing comparable corpora with bilingual suffix trees. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002) (pp. 289–295), July 6–7, 2002, University of Pennsylvania, Philadelphia, PA.

  • Munteanu, D. S., & Marcu, D. (2005). Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4), 477–504.

  • Och, F. J. (2003). Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics (pp. 160–167), July 07–12, 2003, Sapporo, Japan.

  • Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.

  • Och, F. J., & Ney, H. (2004). The alignment template approach to statistical machine translation. Computational Linguistics, 30(4), 417–449.

  • Papineni, K., Roukos, S., Ward, T., & Zhu, W. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, July 7–12 2002 (pp. 311–318), University of Pennsylvania, Philadelphia, PA.

  • Quirk, C., Udupa, R., & Menezes, A. (2007). Generative models of noisy translations with applications to parallel fragment extraction. In Proceedings of the MT Summit XI (pp. 321–327), September, 2007, Copenhagen, Denmark.

  • Rauf, S. A., & Schwenk, H. (2011). Parallel sentence generation from comparable corpora for improved SMT. Machine Translation, 25(4), 341–375.

  • Skadiņa, I., Aker, A., Giouli, V., Tufiş, D., Gaizauskas, R., Mieriņa, M., et al. (2010). A collection of comparable corpora for under-resourced languages. In Proceedings of the Fourth International Conference Baltic HLT 2010. Frontiers in Artificial Intelligence and Applications (Vol. 219, pp. 161–168), IOS Press.

  • Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA 2006): Visions for the Future of Machine Translation (pp. 223–231), Cambridge, MA.

  • Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In Proceedings of the Fourth Workshop on Statistical Machine Translation (pp. 259–268). Association for Computational Linguistics, Athens, Greece.

  • Ștefănescu, D., Ion, R., & Hunsicker, S. (2012). Hybrid parallel sentence mining from comparable corpora. In Proceedings of the 16th Conference of the European Association for Machine Translation (EAMT 2012) (pp. 137–144), May 28–30, 2012, Trento, Italy.

  • Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiș, D., et al. (2006). The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC’2006), May 24–26, 2006, Genoa, Italy.

  • Steinberger, R., Eisele, A., Klocek, A., Pilos, S., & Schlüter, P. (2012). DGT-TM: A freely available translation memory in 22 languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’2012), May 21–27, 2012, Istanbul, Turkey.

  • Stolcke, A. (2002). SRILM – An extensible language modeling toolkit. In Proceedings of the International Conference of Spoken Language Processing (ICSLP 2002) (pp. 901–904), September 2002, Denver, CO.

  • Talvensaari, T., Pirkola, A., Järvelin, K., Juhola, M., & Laurikkala, J. (2008). Focused web crawling in the acquisition of comparable corpora. Information Retrieval, 11(5), 427–445.

  • Thi Ngoc Diep, D., Besacier, L., & Castelli, E. (2010). A fully unsupervised approach for mining parallel data from comparable corpora. In Proceedings of the 14th Annual Conference of the European Association for Machine Translation (EAMT 2010), May 27–28, 2010, Saint-Raphaël, France.

  • Tillmann, C. (2009). A beam-search extraction algorithm for comparable data. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers (pp. 225–228), Suntec, Singapore, August 4th, 2009.

  • Tsvetkov, Y., & Wintner, S. (2010). Automatic acquisition of parallel corpora from websites with dynamic content. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC’10) (pp. 3389–3392), Valletta, Malta, May 2010.

  • Tufiș, D., Ion, R., Ceaușu, A., & Ștefănescu, D. (2006). Improved lexical alignment by combining multiple reified alignments. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2006) (pp. 153–160), Trento, Italy, April 3–7 2006.

  • Tufiș, D., Ion, R., Bozianu, L., Ceaușu, A., & Ștefănescu, D. (2008). Romanian wordnet: Current state, new applications and prospects. In A. Tanacs, D. Csendes, V. Vincze, C. Fellbaum, & P. Vossen (Eds.), Proceedings of 4th Global WordNet Conference, GWC-2008, January 2008 (pp. 441–452). Hungary: University of Szeged.

  • Zhang, Y., Wu, K., Gao, J., & Vines, P. (2006). Automatic acquisition of Chinese-English parallel corpus from the web. In Proceedings of 28th European Conference on Information Retrieval ECIR 2006, April 10–12, 2006, London.

Author information

Corresponding author

Correspondence to Robert Gaizauskas.

Additional information

Chapter editors: Radu Ion and Dan Tufiș

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Aker, A. et al. (2019). Mapping and Aligning Units from Comparable Corpora. In: Skadiņa, I., Gaizauskas, R., Babych, B., Ljubešić, N., Tufiş, D., Vasiļjevs, A. (eds) Using Comparable Corpora for Under-Resourced Areas of Machine Translation. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-99004-0_5

  • DOI: https://doi.org/10.1007/978-3-319-99004-0_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99003-3

  • Online ISBN: 978-3-319-99004-0

  • eBook Packages: Computer Science (R0)
