Parallel fragments: Measuring their impact on translation performance

https://doi.org/10.1016/j.csl.2016.12.002

Highlights

  • Phrase fragments have proved to be a valuable resource for improving translation and natural language generation performance.

  • A novel approach for finding parallel fragments in comparable corpora is presented; it is simple and efficient to run.

  • The difference in translation improvement between fragments extracted from related and non-related corpora is presented.

  • The impact of parallel fragments is compared with that of parallel sentences, highlighting the significance of parallel segments.

  • The proposed approach is compared theoretically with an earlier approach at all phases of the fragment extraction pipeline.

Abstract

The lack of parallel corpora has steered research towards other resources that can fill the gap. Comparable corpora have proved to be a valuable resource in this regard. Interestingly, in addition to the parallel sentences extracted from comparable corpora, parallel phrase fragments have also proved beneficial for statistical machine translation. We present a novel approach based on an efficient framework for parallel fragment extraction from comparable corpora. Using the fragments as an additional corpus for translation, we obtain improvements of 0.88 and 0.89 BLEU points on test data for the Arabic–English and French–English systems, respectively. We also conduct a detailed analysis of the impact of fragments extracted from related vs. non-related corpora. A comparison of the impact of parallel fragments vs. parallel sentences is also presented, highlighting the significance of parallel segments for statistical machine translation. The article concludes with a crude comparative analysis of our approach against an existing fragment extraction technique at various stages of the fragment extraction pipeline.

Introduction

In recent decades, the construction and study of bilingual corpora have become an area of immense importance and interest. As a result, comparable corpora have become a significant object of study. They have proved beneficial in a variety of tasks, such as improving statistical machine translation (SMT) performance using extracted parallel sentences (Munteanu and Marcu, 2005; Abdul-Rauf and Schwenk, 2011), extracting phrasal alignments (Kumano et al., 2007), word sense disambiguation (Kaji, 2003), acquiring synonyms (Shimohata and Sumita, 2005), parallel fragment extraction (Munteanu and Marcu, 2006; Cettolo et al., 2010), extracting lay paraphrases of specialized expressions (Deléger and Zweigenbaum, 2009), and language and translation model adaptation (Snover et al., 2008; Abdul Rauf et al., 2016). They have proved especially valuable for languages and domains that lack parallel corpora.

The world has become a global village, and translation plays a vital role in bridging communication gaps. Translating everything manually is not affordable, so the demand for machine translation (MT) is growing rapidly. In statistical MT, translation knowledge is obtained automatically from parallel corpora; a parallel corpus contains sentence-aligned source texts and their translations. Parallel corpora can be bilingual or multilingual. Once a parallel corpus is available, SMT systems can be developed rapidly for different language pairs: by examining many samples of human-produced translation, SMT algorithms automatically learn how to translate (Brown et al., 1993).

Because SMT depends so heavily on parallel corpora, their quality and quantity are crucial, and the insufficiency of parallel data is a concern for SMT system development. The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles to building diverse, high-quality SMT systems. Reasons for this scarcity include the inherent richness of languages and the diversity of domains. Moreover, languages evolve over time, so SMT training corpora need to be updated accordingly, which is again difficult for parallel corpora.

Many language pairs do not have enough parallel corpora. Building such corpora takes considerable time, and the process is especially slow for less widely spoken languages. Germann (2001) reports that it took 140 translation hours to create a 1300-sentence (24,000-token) Tamil–English parallel corpus, at an average translation rate of 170 words per hour. At this rate, it would require 4–5 full-time translators to translate 100,000 words in a month. This forces us to explore other routes to parallel corpus creation. For many language pairs, comparable corpora usually do exist. A comparable corpus is a collection of texts in two or more languages that have similar content in each language but are not exact translations of each other. Comparable corpora can be gathered and compiled from multilingual newspapers, Wikipedia, and websites that carry articles on the same topic in different languages.
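The translation-effort figures quoted from Germann (2001) above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text; the variable and function names are illustrative, and the even split of work between translators is an assumption made for the calculation.

```python
# Figures quoted from Germann (2001) in the paragraph above.
corpus_tokens = 24_000      # size of the Tamil-English corpus
translation_hours = 140     # hours spent creating it
quoted_rate = 170           # average words per hour, as quoted
target_words = 100_000      # words to translate within one month

# Rate implied by the corpus size and the hours spent (about 171 words/hour).
implied_rate = corpus_tokens / translation_hours

# Total effort needed for the monthly target at the quoted rate (about 588 hours).
total_hours = target_words / quoted_rate


def hours_per_translator(team_size: int) -> float:
    """Monthly workload per person, assuming the work is split evenly."""
    return total_hours / team_size


for team_size in (4, 5):
    print(team_size, "translators:", round(hours_per_translator(team_size)), "hours each")
```

With four translators the load is roughly 147 hours each per month, and with five roughly 118 hours each, which is consistent with the 4–5 full-time translators mentioned above.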

Extracting parallel sentences and segments from comparable corpora is a challenging task. The usual sentence alignment techniques for parallel corpora rely on equivalent sentences and paragraphs appearing in the same order in the two halves of the bitext, which significantly reduces the search space. This is not the case for comparable corpora, where finding matching sentences and phrases remains difficult. Typically, comparable corpora carry no information about document-pair similarity. In general, many documents in one language have no corresponding document in the other language. Even when correspondence information between documents is available, the documents in question are not literal translations of each other. Extracting parallel data from such corpora therefore requires algorithms designed specifically for them.

Depending on the degree of comparability, a comparable corpus may or may not contain parallel sentences, but comparable sentences can still contain parallel fragments in abundance. Parallel fragments have also proved helpful for improving SMT performance (Munteanu and Marcu, 2006; Fu et al., 2013; Rahimi et al., 2014). An interesting aspect of using parallel fragments is that domain dependency is somewhat alleviated: the fragments found are often named entities and everyday phrases, so adding out-of-domain sentence fragments to in-domain parallel data also helps improve MT performance, as shown by Gupta et al. (2013) for the English–Bengali language pair. Beyond MT, fragments have also been helpful in other NLP areas that require parallel data; Belz and Kow (2010) report parallel fragment extraction for improving natural language generation (NLG) systems.

In this article, we present an efficient fragment extraction algorithm that uses information retrieval (IR) and SMT itself to retrieve parallel fragments. First, the foreign-language side of the comparable corpus is translated into English, and potential matching sentences are retrieved from the English side of the comparable corpus (as described in Sections 5 and 6). Parallel phrase fragments are then identified using Levenshtein distance and the phrase alignment information. Our proposed fragment extraction scheme is comparable in efficiency and results to previous work. We also present a detailed analysis of the impact of fragments extracted from related vs. non-related corpora. In addition to the design efficiency of our approach, we compare the SMT improvement obtained using fragments with that obtained using the parallel sentences extracted by Abdul-Rauf and Schwenk (2009a) from the same corpus, thus highlighting the utility of fragments.
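To make the Levenshtein-based matching step more concrete, the following is a minimal sketch, not the authors' implementation. It aligns an SMT translation with a retrieved English sentence at the word level and keeps the contiguous runs of matching words. The function names and the `min_len` threshold are hypothetical, and the phrase alignment information that the actual approach also uses is not modelled here.

```python
from typing import List, Tuple


def levenshtein_ops(a: List[str], b: List[str]) -> List[Tuple[str, int, int]]:
    """Word-level Levenshtein alignment of a against b, as a list of edit operations."""
    n, m = len(a), len(b)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # match or substitution
    # Trace back from the bottom-right cell to recover one optimal alignment.
    ops: List[Tuple[str, int, int]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            ops.append(("match", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("sub", i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("del", i - 1, j))
            i -= 1
        else:
            ops.append(("ins", i, j - 1))
            j -= 1
    ops.reverse()
    return ops


def matched_fragments(translation: List[str], candidate: List[str],
                      min_len: int = 2) -> List[str]:
    """Contiguous runs of matching words that are at least min_len words long."""
    fragments: List[str] = []
    run: List[str] = []
    for op, i, _ in levenshtein_ops(translation, candidate):
        if op == "match":
            run.append(translation[i])
            continue
        if len(run) >= min_len:
            fragments.append(" ".join(run))
        run = []
    if len(run) >= min_len:
        fragments.append(" ".join(run))
    return fragments


# Toy example: an SMT output and a sentence retrieved from the English side.
translation = "the prime minister visited paris on monday".split()
candidate = "yesterday the prime minister arrived in paris on monday evening".split()
print(matched_fragments(translation, candidate))
# prints ['the prime minister', 'paris on monday']
```

In a real pipeline, each matched English run would still have to be mapped back to the source-language words that produced it, which is where the phrase alignment information from the SMT system comes in.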

We start with a brief overview of comparable corpora and the related work in this field, followed by a high-level overview of our parallel sentence extraction system. Section 3 describes our fragment extraction framework, followed by descriptions of the SMT and IR frameworks in Sections 5 and 6, respectively. Sections 7.2 and 7.1 report in detail the results of our experiments with the French–English and Arabic–English language pairs. The paper concludes with a comparison of our approach with one of the earlier approaches, detailed in Section 7.3.

Section snippets

Comparable corpora and fragment extraction literature

The concept of a comparable corpus and its use depend largely on the point of view of the experimenter and the subject of the research; it would be presumptuous to propose a universal definition. Comparable corpora are of various natures, covering a continuum between truly parallel and completely unrelated texts. This leads to the notion of the degree of comparability of document sub-parts of a comparable corpus. This degree of comparability is relative to the number of features (qualitative

Fragment extraction framework

Fig. 2 shows the general architecture of our scheme. We use SMT output and IR in designing our framework. Given comparable corpora in two languages (L1, L2), we start by translating L1 into L2. With both corpora in the same language, we can then identify the contiguous word sequences shared between two sentences. Doing so makes us dependent on the quality of the SMT translation, good or bad. To eliminate this factor, n-best sentences are used from both the SMT and IR outputs. We then

Task description

In this paper, we consider translation from Arabic into English, under the same conditions as the official NIST 2008 evaluation. The bitexts used include various newswire translations as well as some texts from the Gale project. We also added the 2002–2005 test data to the parallel training data (using all reference translations). This corresponds to a total of about 5.8M Arabic words. Our baseline

SMT framework

The goal of statistical machine translation (SMT) is to produce a target sentence e from a source sentence f. It is today common practice to use phrases as translation units (Koehn et al., 2003; Och and Ney, 2003) and a log-linear framework in order to introduce several models explaining the translation process: $e^{*} = \arg\max_{e} p(e \mid f) = \arg\max_{e} \{\exp(\sum_{i} \lambda_{i} h_{i}(e, f))\}$. The feature functions $h_{i}$ are the system models and the $\lambda_{i}$ weights are typically optimized to maximize a scoring function on a development set (Och

Information retrieval framework

For each translated sentence, the information retrieval framework retrieves the best possible matching sentence from the English side of the comparable corpus, if it exists, or in the other case, the nearest matching sentence.
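As a rough illustration of this retrieval step, the sketch below ranks the sentences on the English side by TF-IDF cosine similarity to a translated query and returns the n-best matches. This is an assumption-laden stand-in rather than the IR toolkit or ranking function used in the paper; all names (`tfidf_vectors`, `retrieve_n_best`, `n_best`) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(sentences: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build a TF-IDF vector for every sentence, plus the IDF table."""
    docs = [Counter(s.lower().split()) for s in sentences]
    doc_freq = Counter(tok for d in docs for tok in d)
    n = len(docs)
    # Smoothed IDF (an arbitrary but common choice for this sketch).
    idf = {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in doc_freq.items()}
    vectors = [{tok: tf * idf[tok] for tok, tf in d.items()} for d in docs]
    return vectors, idf


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_n_best(query: str, sentences: List[str],
                    n_best: int = 1) -> List[Tuple[float, str]]:
    """Return the n_best (score, sentence) pairs most similar to the query."""
    vectors, idf = tfidf_vectors(sentences)
    counts = Counter(query.lower().split())
    # Query words unseen on the English side get zero weight.
    q = {tok: tf * idf.get(tok, 0.0) for tok, tf in counts.items()}
    scored = sorted(((cosine(q, v), s) for v, s in zip(vectors, sentences)),
                    reverse=True)
    return scored[:n_best]


# Toy English side of a comparable corpus and one translated query sentence.
english_side = [
    "the prime minister arrived in paris on monday",
    "oil prices fell sharply in early trading",
    "the football final was postponed because of rain",
]
query = "prime minister visited paris monday"
for score, sentence in retrieve_n_best(query, english_side, n_best=2):
    print(round(score, 3), sentence)
```

Because the function always returns the top-scoring sentences, it yields the nearest match even when no true counterpart exists, which mirrors the behaviour described above.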

Since we use a news comparable corpus, we can restrict the search space for a news item reported on day X to the period between day X-5 and day X+5 in the English side of the comparable corpus. Based on this logic, we build search spaces for each day using the

Results and discussion

The main purpose of our proposed technique is to identify parallel fragments between the source language ‘f’ and the target language ‘e’ using phrase-based alignment information. In this section we report the results of our experiments on the Arabic–English and French–English language pairs.

The IR framework allows varying the number of retrieved sentences using the n-best option. By doing so we also vary the number of sentences retrieved per sentence. We experimented with various n-best sizes and built SMT systems

Comparative analysis

A considerable amount of research has gone into exploiting the available comparable corpora. One such recent effort is a combination of Abdul-Rauf and Schwenk (2011) and Munteanu and Marcu (2005), reported in Chu et al. (2013), in which they report improvements from adding their extracted fragments to SMT systems. In this section we conduct a crude comparative analysis of our approach with their technique at various stages of the fragment extraction pipeline.

Table 6 summarizes the

Conclusion and discussion

We have presented an efficient framework for parallel fragment extraction from comparable corpora. Many NLP technologies, such as statistical machine translation and natural language generation, thrive on parallel corpora. The lack of parallel corpora has steered research towards other promising resources that can fill the gap. Comparable corpora have proved to be a valuable resource in this regard. Interestingly, other than parallel data, which are composed of full sentence

Acknowledgments

This work was partially supported by the Higher Education Commission, Pakistan through the HEC Overseas Scholarship 2005 and the French Government under the project Instar (ANR JCJC06 143038).

References (42)

  • S. Abdul-Rauf et al.

    Exploiting comparable corpora with TER and TERp

    Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora

    (2009)
  • S. Abdul-Rauf et al.

    On the use of comparable corpora to improve SMT performance

    Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09

    (2009)
  • S. Abdul-Rauf et al.

    Parallel sentence generation from comparable corpora for improved SMT

    Mach. Transl.

    (2011)
  • S. Abdul Rauf et al.

    Empirical use of information retrieval to build synthetic data for SMT domain adaptation

    IEEE/ACM Trans. Audio Speech Lang. Process.

    (2016)
  • A. Belz et al.

    Extracting parallel fragments from comparable corpora for data-to-text generation

    Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics

    (2010)
  • P. Brown et al.

    The mathematics of statistical machine translation

    Comput. Linguist.

    (1993)
  • C. Callison-Burch et al.

    Further meta-evaluation of machine translation

    Proceedings of the Third Workshop on Statistical Machine Translation, StatMT ’08

    (2008)
  • M. Cettolo et al.

    Mining parallel fragments from comparable texts

    Proceedings of the International Workshop on Spoken Language Translation, IWSLT

    (2010)
  • C. Chu et al.

    Accurate parallel fragment extraction from quasi-comparable corpora using alignment model and translation lexicon

    Proceedings of the Sixth International Joint Conference on Natural Language Processing

    (2013)
  • L. Deléger et al.

    Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora

    Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: From Parallel to Non-parallel Corpora. Association for Computational Linguistics

    (2009)
  • T.N. Do et al.

    A fully unsupervised approach for mining parallel data from comparable corpora

    Proceedings of European Conference on Machine Translation, EAMT 2010

    (2010)
  • Franz Josef Och, D. M., 2003. Statistical Phrase-based...
  • X. Fu et al.

    Phrase-based parallel fragments extraction from comparable corpora

    Proceedings of International Joint Conference on Natural Language Processing, IJCNLP

    (2013)
  • P. Fung et al.

    Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM

    Proceedings of Empirical Methods for Natural Language Processing, EMNLP

    (2004)
  • S. Gahbiche-Braham et al.

    Two ways to use a noisy parallel news corpus for improving statistical machine translation

    Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web

    (2011)
  • U. Germann

    Building a statistical machine translation system from scratch: how much bang for the buck can we expect?

    Proceedings of the Workshop on Data-driven Methods in Machine Translation

    (2001)
  • R. Gupta et al.

    Improving MT system using extracted parallel fragments of text from comparable corpora

    Proceedings of the 6th Workshop on Building and Using Comparable Corpora, BUCC

    (2013)
  • H. Ji

    Mining name translations from comparable corpora by creating bilingual information networks

    Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: From Parallel to Non-parallel Corpora

    (2009)
  • H. Kaji

    Word sense acquisition from bilingual comparable corpora

    Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL)

    (2003)
  • A. Klementiev et al.

    Named entity transliteration and discovery from multilingual comparable corpora

    Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06

    (2006)
  • P. Koehn et al.

    Moses: open source toolkit for statistical machine translation

    Proceedings of Meeting of the Association for Computational Linguistics

    (2007)