Parallel fragments: Measuring their impact on translation performance
Introduction
In recent decades, the construction and study of bilingual corpora have become an area of immense importance and interest. Comparable corpora in particular have become a significant object of study. They have proved beneficial in a variety of tasks, such as improving SMT performance using extracted parallel sentences (Munteanu, Marcu, 2005, Abdul-Rauf, Schwenk, 2011), extracting phrasal alignments (Kumano et al., 2007), word sense disambiguation (Kaji, 2003), acquiring synonyms (Shimohata and Sumita, 2005), parallel fragment extraction (Munteanu, Marcu, 2006, Cettolo, Federico, Bertoldi, 2010), extracting lay paraphrases of specialized expressions (Deléger and Zweigenbaum, 2009), and language and translation model adaptation (Snover, Dorr, Schwartz, 2008, Abdul Rauf, Schwenk, Lambert, Nawaz, 2016). They have proved especially valuable for languages and domains that lack parallel corpora.
The world has become a global village, and translation plays a vital role in bridging communication gaps. Translating everything manually is not affordable, so the demand for machine translation (MT) is growing rapidly. In statistical MT (SMT), translation information is obtained automatically from parallel corpora. A parallel corpus is a corpus that contains sentence-aligned source texts and their translations; it can be bilingual or multilingual. Once a parallel corpus is available, SMT systems for different language pairs can be developed rapidly: by examining many samples of human-produced translation, SMT algorithms automatically learn how to translate (Brown et al., 1993).
Because SMT depends so heavily on parallel corpora, their quality and quantity are crucial, and the insufficiency of parallel data is a concern for SMT system development. The lack of parallel corpora and linguistic resources for many languages and domains is one of the major obstacles to building diverse, good-quality SMT systems. Reasons for this scarcity include the inherent richness of languages and the diversity of domains. Moreover, languages evolve over time, so SMT training corpora need to be updated accordingly, which is again difficult in the case of parallel corpora.
Many language pairs do not have enough parallel corpora, and building such corpora is slow, particularly for less widely spoken languages. Germann (2001) reports that it takes 140 translation hours to create a 1300-sentence (24,000 tokens) Tamil–English parallel corpus, at an average translation rate of 170 words per hour. At this rate it would require 4–5 full-time translators to translate 100,000 words in a month. This forces us to explore other scenarios for parallel corpus creation. For many language pairs, comparable corpora do exist most of the time. A comparable corpus is a collection of texts in two or more languages that have similar content in each language but are not exact translations of each other. Comparable corpora can be gathered and compiled from multilingual newspapers, Wikipedia, and websites that carry articles on the same topic in different languages.
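The figures quoted from Germann (2001) can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are hypothetical, and the only inputs are the rate and sizes stated in the paragraph above.

```python
# Back-of-the-envelope check of the translation effort quoted above.
WORDS_PER_HOUR = 170      # average translation rate reported by Germann (2001)
CORPUS_TOKENS = 24_000    # size of the Tamil-English corpus
MONTHLY_TARGET = 100_000  # words to translate in one month


def translation_hours(words, words_per_hour=WORDS_PER_HOUR):
    """Hours needed for a single translator to translate `words` words."""
    return words / words_per_hour


def hours_per_translator(words, translators, words_per_hour=WORDS_PER_HOUR):
    """Hours each translator must work if the load is shared equally."""
    return translation_hours(words, words_per_hour) / translators


if __name__ == "__main__":
    print(f"corpus: {translation_hours(CORPUS_TOKENS):.0f} hours")
    for n in (4, 5):
        load = hours_per_translator(MONTHLY_TARGET, n)
        print(f"{n} translators: {load:.0f} hours each per month")
```

At 170 words per hour, 24,000 tokens take roughly 141 hours, which agrees with the reported 140. Translating 100,000 words takes about 588 hours, that is, about 147 hours each for four translators or about 118 hours each for five.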
Extracting parallel sentences and segments from comparable corpora is a challenging task. The usual sentence alignment techniques for parallel corpora rely on equivalent sentences and paragraphs that appear in the same order in the two parts of the bitext, which significantly reduces the search space for sentence alignment. This is not the case for comparable corpora, where finding matching sentences and phrases remains difficult. Typically, comparable corpora carry no information about document pair similarity. Many documents in one language generally have no corresponding document in the other language. Even when correspondence information among the documents is available, the documents in question are not literal translations of each other. Thus, extracting parallel data from such corpora requires special algorithms designed for the corpora in question.
Depending on the degree of comparability, a comparable corpus may or may not contain parallel sentences, but comparable sentences may still contain parallel fragments in abundance. Parallel fragments have also proved helpful for improving SMT performance (Munteanu, Marcu, 2006, Fu, Wei, Lu, Chen, Xu, 2013, Rahimi, Samani, Khadivi, 2014). An interesting aspect of using parallel fragments is that domain dependency is somewhat alleviated: the fragments found are often named entities and everyday phrases, so adding out-of-domain sentence fragments to in-domain parallel data also helps improve MT performance, as shown by Gupta et al. (2013) for the English–Bengali language pair. Beyond improving MT performance, fragments have also been helpful in other NLP domains requiring parallel data: Belz and Kow (2010) report parallel fragment extraction for improving Natural Language Generation (NLG) systems.
In this article, we present an efficient fragment extraction algorithm that uses information retrieval (IR) and SMT itself to retrieve parallel fragments. First, the foreign language side of the comparable corpus is translated into English, and potential matching sentences are retrieved from the English side of the comparable corpus (as described in Sections 5 and 6). Parallel phrase fragments are then identified using Levenshtein distance and the phrase alignment information. Our proposed scheme of fragment extraction is comparable in efficiency and results to previous work. We also present a detailed analysis of the impact of fragments extracted from related versus non-related corpora. Beyond the design efficiency of our approach, we also compare the SMT improvement obtained using fragments with that obtained using parallel sentences extracted by Abdul-Rauf and Schwenk (2009a) from the same corpus, thus highlighting the utility of fragments through comparative analysis.
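To make the role of Levenshtein distance concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the translated sentence and a retrieved candidate are both whitespace-tokenized English strings, computes a word-level edit distance, and keeps runs of consecutively matched words as candidate fragments. The function names and the `min_len` threshold are hypothetical, and the phrase alignment step that maps a fragment back to the source language is not shown.

```python
def levenshtein_matches(a, b):
    """Word-level Levenshtein distance between token lists a and b.

    Returns (distance, matches), where matches lists the (i, j) index
    pairs of words left unchanged by one optimal edit path.
    """
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    # Trace one optimal path back from the corner, recording matched words.
    matches, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and d[i][j] == d[i - 1][j - 1]:
            matches.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif d[i][j] == d[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    matches.reverse()
    return d[n][m], matches


def matching_fragments(translated, candidate, min_len=2):
    """Contiguous word runs shared by both sentences along the edit path."""
    a, b = translated.split(), candidate.split()
    _, matches = levenshtein_matches(a, b)
    fragments, run = [], []
    for pair in matches:
        if run and pair == (run[-1][0] + 1, run[-1][1] + 1):
            run.append(pair)
        else:
            if len(run) >= min_len:
                fragments.append(" ".join(a[i] for i, _ in run))
            run = [pair]
    if len(run) >= min_len:
        fragments.append(" ".join(a[i] for i, _ in run))
    return fragments


if __name__ == "__main__":
    mt_output = "officials said the prime minister arrived in paris yesterday"
    retrieved = "the prime minister arrived in paris for talks"
    print(matching_fragments(mt_output, retrieved))
```

In a full pipeline, each English fragment found this way would still have to be paired with its source-language words, which is where the phrase alignment information comes in.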
We start with a brief overview of comparable corpora and the related work in this field, followed by a high-level overview of our parallel sentence extraction system. Section 3 describes our fragment extraction framework, followed by a description of the SMT and IR frameworks in Sections 5 and 6, respectively. Sections 7.2 and 7.1 report in detail the results of our experiments with the French–English and Arabic–English language pairs. The paper concludes with a comparison of our approach with one of the earlier approaches, detailed in Section 7.3.
Section snippets
Comparable corpora and fragment extraction literature
The concept of a comparable corpus and its use depend largely on the point of view of the experimenter and the subject of the research, so it would be presumptuous to propose a universal definition. Comparable corpora are of various natures, covering a continuum between truly parallel and completely unrelated texts. This leads to the notion of the degree of comparability of document sub-parts of a comparable corpus. This degree of comparability is relative to the number of features (qualitative
Fragment extraction framework
Fig. 2 shows the general architecture of our scheme. We use SMT output and IR in designing our framework. Given comparable corpora in two languages (L1, L2), we start by translating L1 into L2. With the two corpora in the same language, we can then identify the contiguous word sequences shared by two sentences. Doing so makes us dependent on the quality of the SMT translation, good or bad. To reduce this dependence, n-best sentences are used from both the SMT and IR outputs. We then
Task description
In this paper, we consider the translation from Arabic into English, under the same conditions as the official NIST 2008 evaluation. The bitexts used include various news wire translations as well as some texts from the Gale project. We also added the 2002–2005 test data to the parallel training data (using all reference translations). This corresponds to a total of about 5.8M Arabic words. Our baseline
SMT framework
The goal of statistical machine translation (SMT) is to produce a target sentence e from a source sentence f. It is today common practice to use phrases as translation units (Koehn, Och, Marcu, Och, Ney, 2003) and a log-linear framework in order to introduce several models explaining the translation process:

\[ e^* = \arg\max_e p(e \mid f) = \arg\max_e \exp\Big( \sum_i \lambda_i h_i(e, f) \Big) \]

The feature functions h_i are the system models and the λ_i weights are typically optimized to maximize a scoring function on a development set (Och
Information retrieval framework
For each translated sentence, the information retrieval framework retrieves the best matching sentence from the English side of the comparable corpus if one exists, or otherwise the nearest matching sentence.
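The retrieval step can be illustrated with a small, self-contained sketch. It is a stand-in rather than the IR toolkit used in the paper: it assumes whitespace tokenization and ranks candidate sentences by TF-IDF cosine similarity to the translated query, returning the `n_best` highest-scoring ones. All function names and the smoothing used in the IDF weights are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_index(sentences):
    """Build TF-IDF vectors (as dicts) for the candidate sentences."""
    docs = [Counter(s.lower().split()) for s in sentences]
    df = Counter(word for doc in docs for word in doc)
    n = len(docs)
    # Smoothed IDF (an assumption of this sketch, not taken from the paper).
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    vectors = [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve(query, sentences, n_best=1):
    """Return the n_best (sentence, score) pairs closest to the query."""
    vectors, idf = tfidf_index(sentences)
    counts = Counter(query.lower().split())
    q = {w: tf * idf.get(w, 0.0) for w, tf in counts.items()}
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    scored.sort(key=lambda t: (-t[0], t[1]))  # best score first, stable ties
    return [(sentences[i], score) for score, i in scored[:n_best]]


if __name__ == "__main__":
    english_side = [
        "the prime minister arrived in paris for talks",
        "stock markets fell sharply on monday",
        "heavy rain flooded the city centre",
    ]
    mt_output = "the prime minister arrived in paris yesterday"
    for sentence, score in retrieve(mt_output, english_side, n_best=2):
        print(f"{score:.3f}  {sentence}")
```

Because the function always returns the highest-scoring candidates, it yields the nearest matching sentence even when no true translation is present, which mirrors the behaviour described above.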
Since we use a news comparable corpus, we can logically restrict the search space for a news item reported on day X to the period between day X-5 and day X+5 in the English side of the comparable corpus. Based on this logic, we build search spaces for each day using the
Results and discussion
The main purpose of our proposed technique is to identify parallel fragments between source language ‘f’ and target language ‘e’ using phrase-based alignment information. In this section we report the results of our experiments on the Arabic–English and French–English language pairs.
The IR framework allows varying the number of retrieved sentences using the n-best option. By doing so we also vary the number of sentences retrieved per sentence. We experimented with various n-best sizes and built SMT systems
Comparative analysis
A considerable amount of research has gone into exploiting the available comparable corpora. One such recent effort is a combination of Abdul-Rauf and Schwenk (2011) and Munteanu and Marcu (2005), reported in Chu et al. (2013), which shows improvements from adding the extracted fragments to SMT systems. In this section we conduct a crude comparative analysis of our approach with their technique at various stages of the fragment extraction pipeline.
Table 6 summarizes the
Conclusion and discussion
We have presented an efficient framework for parallel fragment extraction from comparable corpora. Many NLP technologies, such as statistical machine translation and natural language generation, thrive on parallel corpora. The lack of parallel corpora has steered research towards exploring other promising avenues to fill this gap. Comparable corpora have proved to be a valuable resource in this regard. Interestingly, other than parallel data, which are composed of full sentence
Acknowledgments
This work was partially supported by the Higher Education Commission, Pakistan through the HEC Overseas Scholarship 2005 and the French Government under the project Instar (ANR JCJC06 143038).
References (42)
- et al. Exploiting comparable corpora with TER and TERp. Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: From Parallel to Non-parallel Corpora (2009)
- et al. On the use of comparable corpora to improve SMT performance. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’09 (2009)
- et al. Parallel sentence generation from comparable corpora for improved SMT. Mach. Transl. (2011)
- et al. Empirical use of information retrieval to build synthetic data for SMT domain adaptation. IEEE/ACM Trans. Audio Speech Lang. Process. (2016)
- et al. Extracting parallel fragments from comparable corpora for data-to-text generation. Proceedings of the 6th International Natural Language Generation Conference. Association for Computational Linguistics (2010)
- et al. The mathematics of statistical machine translation. Comput. Linguist. (1993)
- et al. Further meta-evaluation of machine translation. Proceedings of the Third Workshop on Statistical Machine Translation, StatMT ’08 (2008)
- et al. Mining parallel fragments from comparable texts. Proceedings of the International Workshop on Spoken Language Translation, IWSLT (2010)
- et al. Accurate parallel fragment extraction from quasi-comparable corpora using alignment model and translation lexicon. Proceedings of the Sixth International Joint Conference on Natural Language Processing (2013)
- et al. Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora. Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: From Parallel to Non-parallel Corpora. Association for Computational Linguistics (2009)
- A fully unsupervised approach for mining parallel data from comparable corpora. Proceedings of European Conference on Machine Translation, EAMT 2010
- Phrase-based parallel fragments extraction from comparable corpora. Proceedings of International Joint Conference on Natural Language Processing, IJCNLP
- Mining very-non-parallel corpora: parallel sentence and lexicon extraction via bootstrapping and EM. Proceedings of Empirical Methods for Natural Language Processing, EMNLP
- Two ways to use a noisy parallel news corpus for improving statistical machine translation. Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web
- Building a statistical machine translation system from scratch: how much bang for the buck can we expect? Proceedings of the Workshop on Data-driven Methods in Machine Translation
- Improving MT system using extracted parallel fragments of text from comparable corpora. Proceedings of 6th Workshop of Building and Using Comparable Corpora, BUCC
- Mining name translations from comparable corpora by creating bilingual information networks. Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: From Parallel to Non-parallel Corpora
- Word sense acquisition from bilingual comparable corpora. Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (NAACL)
- Named entity transliteration and discovery from multilingual comparable corpora. Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06
- Moses: open source toolkit for statistical machine translation. Proceedings of Meeting of the Association for Computational Linguistics