
Efficient document alignment across scenarios

Machine Translation

Abstract

We present and evaluate an approach to document alignment meant for efficiency and portability, as it relies on automatically extracted lexical translations and simple set-theoretic operations for the computation of document-level similarity. We compare our approach to the state of the art on a variety of alignment scenarios, showing that it outperforms alternative document-alignment methods in the vast majority of cases, on both parallel and comparable corpora. We also explore several forms of simple component optimisation to evaluate the potential for improvement of the core method, and describe several successful optimisation paths that lead to significant improvements over strong baselines. The proposed approach constitutes an effective and easy to deploy method to perform accurate document alignment across scenarios, with the potential to improve the creation of parallel corpora.
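As an illustration of what document-level similarity from lexical translations and set-theoretic operations can look like, the following minimal sketch scores a document pair with the Jaccard index (Jaccard 1901) after mapping source tokens to their top-k lexical translations. It is a sketch under stated assumptions, not the scoring function of the article: the function names, the dictionary layout of the translation table and the one-directional scoring are all hypothetical, and only the default of 5 candidates is taken from the notes.

```python
from __future__ import annotations


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(tokens: set[str], table: dict[str, list[str]], k: int = 5) -> set[str]:
    """Map each token to its k best lexical translations.

    `table` is a hypothetical layout: token -> candidates sorted by
    decreasing translation probability. Tokens without an entry are dropped.
    """
    expanded: set[str] = set()
    for tok in tokens:
        expanded.update(table.get(tok, [])[:k])
    return expanded


def doc_similarity(src: set[str], tgt: set[str],
                   src2tgt: dict[str, list[str]], k: int = 5) -> float:
    """Similarity between the translated source token set and the target token set."""
    return jaccard(expand(src, src2tgt, k), tgt)
```

For instance, with `src = {"casa", "roja"}`, a table mapping `casa` to `["house", "home"]` and `roja` to `["red"]`, and `tgt = {"house", "red", "the"}`, the translated set shares two of four distinct tokens with the target and the score is 0.5.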


Notes

  1. http://www.accurat-project.eu/.

  2. http://www.statmt.org/wmt16/.

  3. In the reported experiments, we used giza++ (Och and Ney 2003) to extract lexical translation tables. Although lexical translation modelling is sometimes based only on ibm model 1 in related work on comparable corpora, it is standard practice in statistical MT to use more sophisticated ibm models, usually up to model 4. We followed the latter approach initially, as the same tables could then be used as components for both comparable corpora exploitation and smt system development. Alternatively, ibm model 2 can be employed as well, for instance with the fastalign toolkit (Dyer et al. 2013). This approach allows for a fast extraction of lexical translations from large datasets and can be favoured if translation tables are only meant for the document alignment process. We measured the impact of using one approach or the other on identical test sets in preliminary experiments and did not find any significant difference in terms of document alignment results.

  4. We used 5 as a default for all language pairs, as a compromise between larger sets with less reliable translation candidates and smaller sets which may miss translation alternatives in comparable corpora. Note that optimal values for k could be empirically determined on domain-specific development sets for each language pair; such document-level tuning sets are, however, not usually available.

  5. For all baseline results presented in Sect. 4, the texts are not truecased either, to keep the number of operations and required models to a minimum. Truecasing would provide a better treatment of sentence-initial words, but its impact on document-level sets needs to be measured; we performed such an evaluation and describe its results in Sect. 5.4.

  6. Checking for their presence in lexical translation tables allows one to distinguish between out-of-vocabulary tokens and entities with an existing translation, e.g. Germany translated into Spanish as Alemania. It also prevents adding all capitalised tokens as named entities in languages such as German, where nouns are capitalised.

  7. Throughout the experiments we describe, n was set to 3, arbitrarily assuming this boundary for minimal stem length.

  8. To further improve the efficiency of the system, we use an implementation based on hash maps with minimal-length prefixes as keys and two sets as values for the original and translated tokens that have a given prefix in common. lcp is then computed on these reduced sets of elements.

  9. Experiments on internally available sets of parallel technical manuals showed improvements when including lcp over the base version of docal.

  10. https://lucene.apache.org.

  11. http://www.statmt.org/europarl/.

  12. Morin et al. (2015) refer to their similar removal of multiple source alignments as the pigeonhole method, following common terminology. To distinguish our version of the process from theirs, for presentation reasons we use the phrase best alignment optimisation (bao).

  13. We used the version available as of November 2015, in the opus repository: http://opus.lingfil.uu.se/JRC-Acquis.php.

  14. We used the versions available as of February 2016 at the address: http://www.statmt.org/europarl/. We refer to version 2 as eu2 and to version 5 as eu5.

  15. We refer to this variant as eu5.2 in the tables.

  16. For all tables in this paper, best results are indicated in bold.

  17. They also indicate that the first sentences of each document were removed, which would directly eliminate the previously mentioned one-liner documents that are part of the version 5 we used.

  18. https://comparable.limsi.fr/bucc2015/.

  19. Text Retrieval Conference, see http://trec.nist.gov/.

  20. Results from the lina system were the only ones available for this language pair.

  21. http://opus.lingfil.uu.se/MultiUN.php.

  22. http://nlp.stanford.edu/software/segmenter.shtml.

  23. See Etchegoyhen et al. (2016) for a detailed description of this corpus, which can be found at the following address: http://metashare.elda.org/repository/search/?q=eitb+documents.

  24. http://www.statmt.org/wmt16/bilingual-task.html.

  25. The documents were processed on the previously mentioned single server with 64 gb of ram and 8 cores.

  26. Note also that the uedin1 system was ranked lower than docal in the soft scoring results, with a score of 89.1.

  27. For all the results in this section, we use as baseline the default and best-performing version of docal, which includes best alignment optimisation.

  28. The stacc system variants that include lexical weighting were the best-performing systems in all language pairs on the bucc 2018 comparable sentence alignment shared task (Azpeitia et al. 2018; Zweigenbaum et al. 2018).

  29. This is the default value used in all the experiments with weighting reported below.

  30. Since we use the version of docal that includes best alignment optimisation, which enforces 1–1 alignments, results on the other metrics (success@5 and mrr) are identical to those obtained on the success@1 metric.

  31. We used the kenlm toolkit (Heafield 2011) to train the language models.

  32. The upper bound was set to 500,000 after considering the relative sizes of the available corpora.

  33. Note that alternative methods may be employed to create balanced generic corpora, with different settings for sub-sampling, for instance. Our aim was to evaluate the impact of balanced generic corpora that provide significantly larger volumes of sentence pairs than the mono-domain jrc corpus, by allowing for some of the selected corpora to provide more sentence pairs than the smallest corpora, while also controlling the over-representation of the largest corpora via perplexity-based sub-sampling.

  34. The versions of the systems that include the larger tables are denoted with .gen. In the results reported here, we used the version of docal augmented with weights, as it gave slightly better results on this task, as previously described.

  35. As previously described, soft scoring results were higher than the ones obtained on our corrected test set. However, only the latter was available to us, and the results reported here are therefore based on it. Note that, since most of the gains under soft scoring may be attributed to the reported errors, the comparative results reported in this table are equally informative.

  36. We refer to this method as ssll.

  37. System variants that include truecasing are indicated with the .tc extension.
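
The hash-map organisation described in notes 7 and 8 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the docal implementation: function and variable names are hypothetical, tokens shorter than the minimal prefix length are simply skipped, and how the resulting matches feed into the similarity score is not specified here.

```python
from collections import defaultdict
from os.path import commonprefix  # character-wise longest common prefix

MIN_PREFIX = 3  # n in note 7: assumed minimal stem length


def bucket_by_prefix(original, translated, n=MIN_PREFIX):
    """Map each n-character prefix to a pair of sets:
    (original tokens, translated tokens) that start with that prefix."""
    buckets = defaultdict(lambda: (set(), set()))
    for tok in original:
        if len(tok) >= n:
            buckets[tok[:n]][0].add(tok)
    for tok in translated:
        if len(tok) >= n:
            buckets[tok[:n]][1].add(tok)
    return buckets


def lcp_matches(original, translated, n=MIN_PREFIX):
    """Return (original, translated, longest common prefix) triples.

    Only tokens in the same bucket are compared, so the pairwise step
    runs on the reduced sets instead of the full cross-product.
    """
    matches = set()
    for orig_set, trans_set in bucket_by_prefix(original, translated, n).values():
        for o in orig_set:
            for t in trans_set:
                matches.add((o, t, commonprefix([o, t])))
    return matches
```

Because two tokens can only share a prefix of length n or more if their first n characters are identical, bucketing on that prefix loses no candidate pair while avoiding comparisons between unrelated tokens.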

References

  • Azpeitia A, Etchegoyhen T (2016) DOCAL—Vicomtech’s participation in the WMT16 shared task on bilingual document alignment. In: Proceedings of the first conference on machine translation, vol 2: Shared Task Papers. Berlin, Germany, pp 666–671

  • Azpeitia A, Etchegoyhen T, Martínez Garcia E (2017) Weighted set-theoretic alignment of comparable sentences. In: Proceedings of the tenth workshop on building and using comparable corpora. Vancouver, Canada, pp 41–45

  • Azpeitia A, Etchegoyhen T, Martínez Garcia E (2018) Extracting parallel sentences from comparable corpora with STACC variants. In: Proceedings of the eleventh workshop on building and using comparable corpora. Miyazaki, Japan, pp 48–52

  • Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. CoRR arXiv:1409.0473, p 15

  • Brown PF, Cocke J, Della Pietra SA, Della Pietra VJ, Jelinek F, Lafferty JD, Mercer RL, Roossin PS (1990) A statistical approach to machine translation. Comput Linguist 16(2):79–85


  • Brown PF, Della Pietra VJ, Della Pietra SA, Mercer RL (1993) The mathematics of statistical machine translation: parameter estimation. Comput Linguist 19(2):263–311


  • Buck C, Koehn P (2016a) Findings of the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 554–563

  • Buck C, Koehn P (2016b) Quick and reliable document alignment via TF/IDF-weighted cosine distance. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 672–678

  • Chen J, Nie JY (2000) Parallel web text mining for cross-language IR. In: Content-based multimedia information access, vol 1. Centre des hautes études internationales d’informatique documentaire, Paris, France, pp 62–77

  • Chen J, Chau R, Yeh CH (2004) Discovering parallel text from the world wide web. In: Proceedings of the second workshop on australasian information security, data mining and web intelligence, and software internationalisation. Dunedin, New Zealand, pp 157–161

  • Dara AA, Lin YC (2016) YODA system for WMT16 shared task: bilingual document alignment. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 679–684

  • Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26(3):297–302

  • Dyer C, Chahuneau V, Smith NA (2013) A simple, fast, and effective reparameterization of IBM model 2. In: Proceedings of the 2013 conference of the north american chapter of the association for computational linguistics: human language technologies. Atlanta, USA, pp 644–648

  • Eisele A, Chen Y (2010) MultiUN: a multilingual corpus from United Nation documents. In: Proceedings of the seventh international conference on language resources and evaluation, European Language Resources Association (ELRA). Valletta, Malta, pp 2868–2872

  • Enright J, Kondrak G (2007) A fast method for parallel document identification. In: Human language technologies 2007: the conference of the north american chapter of the association for computational linguistics; Companion volume, short papers. Rochester, New York, USA, pp 29–32

  • Esplà-Gomis M, Forcada ML (2009) Bitextor, a free/open-source software to harvest translation memories from multilingual websites. In: Proceedings of MT summit XII. Ottawa, Canada, pp 1–8

  • Esplà-Gomis M, Forcada ML, Ortiz-Rojas S, Ferràndez-Tordera J (2016) Bitextor’s participation in WMT’16: shared task on document alignment. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 685–691

  • Etchegoyhen T, Azpeitia A (2016a) A portable method for parallel and comparable document alignment. Baltic J Mod Comput 4(2):243–255


  • Etchegoyhen T, Azpeitia A (2016b) Set-theoretic alignment for comparable corpora. In: Proceedings of the 54th annual meeting of the association for computational linguistics, vol 1: Long Papers. Berlin, Germany, pp 2009–2018

  • Etchegoyhen T, Azpeitia A, Pérez N (2016) Exploiting a large strongly comparable corpus. In: Proceedings of the tenth international conference on language resources and evaluation (LREC 2016). Portorož, Slovenia, pp 3523–3529

  • Fung P, Cheung P (2004) Mining very non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and E.M. In: Proceedings of empirical methods in natural language processing. Barcelona, Spain, pp 57–63

  • Gelbukh A, Sidorov G, Lavin-Villa E, Chanona-Hernandez L (2010) Automatic term extraction using log-likelihood based comparison with general reference corpus. In: Proceedings of the 15th international conference on application of natural language to information systems. Cardiff, Wales, pp 248–255

  • Germann U (2016) Bilingual document alignment with latent semantic indexing. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 692–696

  • Gomes L, Lopes GP (2016) First steps towards coverage-based document alignment. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 697–702

  • Heafield K (2011) KenLM: Faster and smaller language model queries. In: Proceedings of the sixth workshop on statistical machine translation. Edinburgh, Scotland, pp 187–197

  • Ion R, Ceauşu A, Irimia E (2011) An expectation maximization algorithm for textual unit alignment. In: Proceedings of the 4th workshop on building and using comparable corpora: comparable corpora and the web. Portland, Oregon, pp 128–135

  • Ion R, Pinnis M, Ştefānescu D, Aker A, Paramita M, Su F, Irimia E, Zhang X, Ljubešić N (2012) ACCURAT D2.6: toolkit for multi-level alignment and information extraction from comparable corpora, version 3.0. Tech. rep., ACCURAT project. http://www.accurat-project.eu/

  • Jaccard P (1901) Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull de la Soc Vaudoise des Sci Nat 37:241–272


  • Jakubina L, Langlais P (2016) BAD LUC@WMT 2016: a bilingual document alignment platform based on Lucene. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 703–709

  • Koehn P (2005) Europarl: a parallel corpus for statistical machine translation. In: Proceedings of the 10th machine translation summit. Phuket, Thailand, pp 79–86

  • Koehn P (2009) Statistical machine translation. Cambridge University Press, Cambridge


  • Le T, Vu HT, Oberländer J, Bojar O (2016) Using term position similarity and language modeling for bilingual document alignment. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 710–716

  • Levenshtein VI (1966) Binary codes capable of correcting deletions, insertions, and reversals. Soviet Phys Doklady 10(8):707–710


  • Li B, Gaussier E (2013) Exploiting comparable corpora for lexicon extraction: measuring and improving corpus quality. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) Building and using comparable corpora. Springer, Germany, pp 131–149


  • Lohar P, Afli H, Liu CH, Way A (2016) The ADAPT bilingual document alignment system at WMT16. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 717–723

  • Ma X, Liberman M (1999) BITS: a method for bilingual text search over the web. In: Machine translation summit VII. Singapore, pp 538–542

  • Mahata S, Das D, Pal S (2016) WMT2016: a hybrid approach to bilingual document alignment. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 724–727

  • Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT Press, Cambridge


  • Medved M, Jakubícek M, Kovár V (2016) English-French document alignment based on keywords and statistical translation. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 728–732

  • Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. Lake Tahoe, CA, USA, pp 3111–3119

  • Miller GA (1995) WordNet: a lexical database for english. Commun ACM 38(11):39–41


  • Morin E, Hazem A, Boudin F, Clouet EL (2015) LINA: identifying comparable documents from wikipedia. In: Proceedings of the eighth workshop on building and using comparable corpora. Beijing, China, pp 88–91

  • Munteanu DS, Marcu D (2005) Improving machine translation performance by exploiting non-parallel corpora. Comput Linguist 31(4):477–504


  • Och FJ, Ney H (2003) A systematic comparison of various statistical alignment models. Comput Linguist 29(1):19–51


  • Papavassiliou V, Prokopidis P, Piperidis S (2016) The ILSP/ARC submission to the WMT 2016 bilingual document alignment shared task. In: Proceedings of the first conference on machine translation. Berlin, Germany, pp 733–739

  • Paramita ML, Guthrie D, Kanoulas E, Gaizauskas R, Clough P, Sanderson M (2013) Methods for collection and evaluation of comparable documents. In: Sharoff S, Rapp R, Zweigenbaum P, Fung P (eds) Building and using comparable corpora. Springer, Germany, pp 93–112


  • Patry A, Langlais P (2005) Automatic identification of parallel documents with light or without linguistic resources. In: Proceedings of the 18th Canadian society conference on advances in artificial intelligence. Victoria, Canada, pp 354–365

  • Patry A, Langlais P (2011) Identifying parallel documents from a large bilingual collection of texts: application to parallel article extraction in Wikipedia. In: Proceedings of the 4th workshop on building and using comparable corpora: comparable corpora and the web. Portland, Oregon, pp 87–95

  • Prochasson E, Fung P (2011) Rare word translation extraction from aligned comparable documents. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies, vol 1. Portland, Oregon, pp 1327–1335

  • Rapp R (1995) Identifying word translations in non-parallel texts. In: Proceedings of the 33rd annual meeting of the association for computational linguistics. Cambridge, MA, USA, pp 320–322

  • Resnik P, Smith NA (2003) The web as a parallel corpus. Comput Linguist 29(3):349–380


  • Sharoff S, Zweigenbaum P, Rapp R (2015) BUCC shared task: cross-language document similarity. In: Proceedings of the 8th workshop on building and using comparable corpora. Beijing, China, pp 74–78

  • Spärck Jones K, Walker S, Robertson SE (2000) A probabilistic model of information retrieval: development and comparative experiments. Part 2. Inf Proces Manag 36(6):809–840


  • Tiedemann J (2011) Bitext alignment. Synthesis Lectures on human language technologies. Morgan & Claypool Publishers, Williston, VT

  • Tiedemann J (2012) Parallel data, tools and interfaces in OPUS. In: Proceedings of the 8th language resources and evaluation conference. Istanbul, Turkey, pp 2214–2218

  • Tseng H, Chang P, Andrew G, Jurafsky D, Manning C (2005) A conditional random field word segmenter for sighan bakeoff 2005. In: Proceedings of the fourth SIGHAN workshop on chinese language processing. Jeju Island, Korea, pp 168–171

  • Uszkoreit J, Ponte JM, Popat AC, Dubiner M (2010) Large scale parallel document mining for machine translation. In: Proceedings of the 23rd international conference on computational linguistics. Beijing, China, pp 1101–1109

  • Zafarian A, Aghasadeghi A, Azadi F, Ghiasifard S, Alipanahloo Z, Bakhshaei S, Ziabary SMM (2015) AUT document alignment framework for BUCC workshop shared task. In: Proceedings of the 8th workshop on building and using comparable corpora. Beijing, China, pp 79–87

  • Zweigenbaum P, Sharoff S, Rapp R (2018) Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In: Proceedings of the 11th workshop on building and using comparable corpora. Miyazaki, Japan, pp 39–42


Funding

This work was partially funded by the Spanish Ministry of Economy and Competitiveness, via project AdapTA (RTC-2015-3627-7), and the Department of Economic Development and Competitiveness of the Basque Government, via project TRADIN (IG-2015/0000347).

Author information


Corresponding author

Correspondence to Thierry Etchegoyhen.



Cite this article

Azpeitia, A., Etchegoyhen, T. Efficient document alignment across scenarios. Machine Translation 33, 205–237 (2019). https://doi.org/10.1007/s10590-019-09234-9


  • DOI: https://doi.org/10.1007/s10590-019-09234-9
