Syntax-based reordering for statistical machine translation☆
Research highlights
► A syntax-driven approach to handling the problem of word ordering for statistical machine translation.
► The word order challenge is alleviated by including morpho-syntactical and statistical information in the context of a pre-translation reordering framework.
► The results are presented for small and large Chinese-to-English and Arabic-to-English translation tasks.
► The experiments are carried out on phrase-based and N-gram-based statistical machine translation systems.
Introduction
One of the most challenging problems facing MT is how to place the translated words in an order that fits the target language. Some languages, like English or Spanish, have relatively restrictive word orders, and as a pair they follow a more monotone mutual word order than, say, Chinese and English. Others, for example Slavic or Baltic languages, allow more flexibility in word order, which, in many cases, serves to define the relationship between the actions and the entities.
A monotone SMT system often suffers from weakness in the distortion model, even if it is able to generate correct word-by-word translations. The problem here is that a syntactic and semantic unit in the source language might appear in a different position in the target language. This issue is especially important if the distance between the words that should be reordered is large; in this case, the reordering decision is very difficult to make based on statistical information alone, because the search space expands dramatically as the number of words involved in the search process grows.
A monotone SMT system can efficiently handle a different word order if the disparity falls within the limits of a multiword translation unit and the system can implicitly memorize each pair of source and target phrases in the training stage. For the majority of translation tasks, however, word reordering disparity cannot be modeled with standard translation units, and a distortion-restricted rearrangement of translation units, conducted by the decoder or done prior to translation, is usually employed.
In this paper, we develop an approach to handling the fundamental problem of word ordering for SMT. We propose to alleviate the word order challenge by including morpho-syntactical and statistical information in the context of a pre-translation reordering framework.
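To give a rough sense of how quickly the search space grows, the following sketch (an illustration added here, not taken from the paper) compares the number of unrestricted orderings of n words, n!, with the number of orderings in which no word moves more than d positions from where it started. The displacement limit d is only a simplified, hypothetical stand-in for a decoder's distortion limit, and `count_limited` is a name introduced for this example.

```python
from itertools import permutations
from math import factorial


def count_limited(n, d):
    """Count orderings of n words in which no word moves more than d positions.

    Brute force over all permutations, so only suitable for small n.
    """
    return sum(
        1
        for perm in permutations(range(n))
        if all(abs(pos - word) <= d for pos, word in enumerate(perm))
    )


# Compare unrestricted orderings (n!) with orderings under a limit of d = 1.
for n in (4, 6, 8):
    print(n, factorial(n), count_limited(n, 1))
```

For eight words, the unrestricted count is 40,320, against 34 under this one-position limit, which is why long-distance reorderings are hard to recover through search alone.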
In particular, we suggest a word reordering technique which tackles:
1. the long-distance reordering problem in a deterministic way, by converting the source portion of the parallel corpus into an intermediate representation, in which source words are reordered to more closely match the target language;
2. short-range reorderings in a non-deterministic way using POS information and an input graph model, as described in the literature (Crego and Mariño, 2006).
One of the fundamental properties of any natural language is its linguistic typology, defined in terms of the finite verb (V), its subject (S), and its object (O). Typological disparity between languages leads to particularly poor translations from monotone SMT systems and to the need for global reordering models capable of modeling long-range dependencies. Because monotone translation often cannot perform the reordering needed to resolve such disparities, we use a parse tree structure in our work.
The paper is organized as follows: in Section 2 we present a brief review of the known approaches to the word reordering problem, along with the main sources of inspiration for our reordering system. In Section 3 we review the architecture and modeling of the reordering system. Section 4 describes the way to couple multiple word reorderings with a translation system. Section 5 evaluates the contribution of our model to the performance of N-gram-based and phrase-based SMT systems. Finally, Section 6 provides an analysis of the resulting data and highlights the main conclusions drawn.
Section snippets
Related work and sources of inspiration
The word reordering problem has attracted a great deal of attention recently. There have been abundant publications on purely statistical techniques dealing with the word reordering challenge, as well as on approaches involving lexical information (context), or using additional information to reorder the translated words in such a way that they fit the target language.
Syntax-based reordering
This section introduces the Syntax-Based Reordering (SBR) approach. Like other preprocessing methods, it splits translation into two independent stages: S → S′ → T, where a sentence of the source language S is first reordered with respect to the word order of the target language, and then the reordered source sentence S′ is monotonically translated into a target sentence T.
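The two-stage split can be sketched on a toy example. The code below is only an illustration under stated assumptions: the hand-labelled chunks, the single rule (move a prepositional phrase that precedes a verb phrase to follow it), and the four-word lexicon are all hypothetical and are not the rules or models used in the paper.

```python
# Toy illustration only: the chunk labels, the single reordering rule, and the
# lexicon are hypothetical and are not taken from the paper.

def reorder(chunks):
    """Stage 1 (S -> S'): move a PP that directly precedes a VP to follow it."""
    out = list(chunks)
    i = 0
    while i < len(out) - 1:
        if out[i][0] == "PP" and out[i + 1][0] == "VP":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return out


def translate_monotone(chunks, lexicon):
    """Stage 2 (S' -> T): translate word by word, left to right, no reordering."""
    return [lexicon[word] for _, words in chunks for word in words]


# Source sentence S as hand-labelled chunks (literally: "he at home eats").
source = [("NP", ["他"]), ("PP", ["在", "家"]), ("VP", ["吃饭"])]
lexicon = {"他": "he", "在": "at", "家": "home", "吃饭": "eats"}

reordered = reorder(source)                      # S'
target = translate_monotone(reordered, lexicon)  # T
print(" ".join(target))
```

Translating `source` directly with `translate_monotone` keeps the source order ("he at home eats"), whereas translating `reordered` yields "he eats at home", which is the point of reordering before a monotone decoder.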
SBR deals with
Coupling SBR and decoding
To improve the reordering power of the translation system, we implemented an additional reordering step as described in Crego (2008). Multiple word segmentations are encoded in a word lattice containing reordering alternatives consistent with the previously extracted rules, and this lattice is then passed as input to the decoder.
A word lattice is defined as a directed acyclic graph G = (V, E) with one root node n0 ∈ V and one goal node nN ∈ V. V and E are, respectively, the set of nodes and the set of edges of the graph G.
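As a minimal sketch of this definition, a word lattice can be held as an adjacency map with a single root and a single goal node, and each root-to-goal path read off as one candidate word order. The class name `WordLattice`, the choice to put words on the edges, and the toy tokens are assumptions made for illustration; the paper's actual data structure may differ.

```python
from collections import defaultdict


class WordLattice:
    """Directed acyclic graph G = (V, E) with one root and one goal node.

    Assumption for this sketch: each edge carries one source word.
    """

    def __init__(self, root, goal):
        self.root = root
        self.goal = goal
        self.edges = defaultdict(list)  # node -> [(word, next_node), ...]

    def add_edge(self, src, word, dst):
        self.edges[src].append((word, dst))

    def paths(self, node=None):
        """Yield every word sequence on a path from node (default: root) to goal."""
        node = self.root if node is None else node
        if node == self.goal:
            yield []
            return
        for word, nxt in self.edges[node]:
            for rest in self.paths(nxt):
                yield [word] + rest


# Monotone path "a b c" plus one alternative in which "b" and "c" are swapped.
lattice = WordLattice(root=0, goal=3)
lattice.add_edge(0, "a", 1)
lattice.add_edge(1, "b", 2)
lattice.add_edge(2, "c", 3)
lattice.add_edge(1, "c", 4)
lattice.add_edge(4, "b", 3)

for path in lattice.paths():
    print(" ".join(path))
```

Here the lattice encodes two reordering hypotheses, "a b c" and "a c b"; a decoder that reads such a lattice chooses among the paths rather than being committed to a single input order.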
Experiments and results
This section details the experiments carried out to evaluate the performance of the SBR approach. To understand the value in terms of accuracy and efficiency of the proposed reordering framework, two directions and four translation tasks have been employed, namely, small and large Chinese-to-English tasks, and small and large Arabic-to-English tasks.
The main reason that the Chinese-to-English and Arabic-to-English translation tasks were chosen as a main experimental field is because European
Discussion and conclusions
In this paper, we have described a new approach to word reordering in SMT, which successfully integrates syntax-based reordering in phrase-based and N-gram-based SMT. This approach correlates with the human intuitive notion of translation. At its best, a successful translation should read as if it were originally written in the new language. When translating a sentence, a human defines the target-language word order according to an extensive set of grammatical and semantic rules, along with
Acknowledgments
Work was partially supported by the Spanish Ministerio de Educación y Ciencia (TIN2006-12767) under FPU grant and by the Spanish Government under grant TEC2006-13964-C03 (AVIVAVOZ project). The authors want to thank Khalil Sima’an (Universiteit van Amsterdam), Mark Dras (Macquarie University), and María José Castro-Bleda (Universidad Politécnica de Valencia) for their valuable discussions and suggestions.
References (44)
- et al. Distortion models for statistical machine translation
- et al. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics (2000)
- et al. Improved statistical machine translation using paraphrases
- A hierarchical phrase-based model for statistical machine translation
- et al. Clause restructuring for statistical machine translation
- et al. Statistical machine reordering
- et al. Computing multiple weighted reordering hypotheses for a statistical machine translation phrase-based system
- et al. TALP phrase-based system and TALP system combination for IWSLT
- Crego, J.M., 2008. Architecture and modeling for N-gram-based statistical machine translation. Ph.D. thesis....
- et al. Syntax-enhanced N-gram-based SMT
- Reordered search and tuple unfolding for ngram-based SMT
- Improving statistical MT by coupling reordering and decoding. Machine Translation
- Learning non-isomorphic tree mappings for machine translation
- Translation theory today: a call for transfer theory. Poetics Today
- What’s in a translation rule?
- Training tree transducers
- Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop
- Arabic preprocessing schemes for statistical machine translation
- Recent improvements in the CMU large scale Chinese-English SMT system
- Statistical syntax-directed translation with extended domain of locality
- Practical approach to syntax-based statistical machine translation
☆ This paper has been recommended for acceptance by Edward J Briscoe.