Syntax-based reordering for statistical machine translation

https://doi.org/10.1016/j.csl.2011.01.001

Abstract

In this paper, we develop an approach called syntax-based reordering (SBR) to handle the fundamental problem of word ordering in statistical machine translation (SMT). We propose to alleviate the word order challenge by including morpho-syntactical and statistical information in a pre-translation reordering framework aimed at capturing short- and long-distance word distortion dependencies. We examine the proposed approach from both theoretical and experimental points of view, discussing and analyzing its advantages and limitations in comparison with some state-of-the-art reordering methods.

In the final part of the paper, we describe the results of applying the syntax-based model to translation tasks with a great need for reordering (Chinese-to-English and Arabic-to-English). The experiments are carried out on standard phrase-based and alternative N-gram-based SMT systems. We first investigate sparse training data scenarios, in which the translation and reordering models are trained on sparse bilingual data; we then scale the method to a large training set and demonstrate that the improvement in translation quality is maintained.

Research highlights

► A syntax-driven approach to handling the problem of word ordering for statistical machine translation.
► The word order challenge is alleviated by including morpho-syntactical and statistical information in a pre-translation reordering framework.
► Results are presented for small and large Chinese-to-English and Arabic-to-English translation tasks.
► The experiments are carried out on phrase-based and N-gram-based statistical machine translation systems.

Introduction

One of the most challenging problems facing MT is how to place the translated words in an order that fits the target language. Some language pairs, such as English and Spanish, have relatively restrictive word orders and are more monotone with respect to each other than, say, Chinese and English. Other languages, for example Slavic or Baltic languages, allow more flexibility in word order, which, in many cases, serves to define the relationship between the actions and the entities.

A monotone SMT system often suffers from weakness in the distortion model, even if it is able to generate correct word-by-word translations. The problem is that a syntactic and semantic unit in the source language might appear in a different position in the target language. This issue is especially important if the distance between the words to be reordered is large; in this case, the reordering decision is very difficult to make on the basis of statistical information, because the search space expands dramatically as the number of words involved in the search process increases.

A monotone SMT system can efficiently handle differences in word order if the disparity falls within the limits of a multiword translation unit and the system can implicitly memorize each pair of source and target phrases in the training stage. For the majority of translation tasks, however, word reordering disparity cannot be modeled with standard translation units, and a distortion-restricted rearrangement of translation units, conducted by the decoder or done prior to translation, is usually employed.

In this paper, we develop an approach to handling the fundamental problem of word ordering for SMT. We propose to alleviate the word order challenge by including morpho-syntactical and statistical information in a pre-translation reordering framework.

In particular, we suggest a word reordering technique that tackles:

  1. the long-distance reordering problem in a deterministic way, by converting the source portion of the parallel corpus into an intermediate representation, in which source words are reordered to more closely match the target language;

  2. short-range reorderings in a non-deterministic way using POS information and an input graph model, as described in the literature (Crego and Mariño, 2006).
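The non-deterministic, POS-driven side of this scheme can be illustrated with a minimal sketch. It assumes, purely for illustration, that a reordering rule maps a sequence of POS tags to a permutation of the matched positions; the function name `reordering_alternatives`, the tag names, and the toy tokens are hypothetical and are not taken from the paper. Each matching rule contributes one alternative word order, and the monotone order is always kept.

```python
def reordering_alternatives(tagged, rules):
    """Return the monotone word order plus one alternative per rule match.

    tagged: list of (word, pos_tag) pairs for one source sentence.
    rules:  dict mapping a tuple of POS tags to a permutation of its positions
            (an assumed rule format, used here only for illustration).
    """
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    alternatives = [words]  # the monotone order is always an option

    for pattern, permutation in rules.items():
        length = len(pattern)
        for start in range(len(tags) - length + 1):
            if tuple(tags[start:start + length]) == pattern:
                span = words[start:start + length]
                permuted = [span[i] for i in permutation]
                candidate = words[:start] + permuted + words[start + length:]
                if candidate not in alternatives:
                    alternatives.append(candidate)
    return alternatives


# Toy input: placeholder tokens with hypothetical POS tags.
toy_sentence = [("w1", "NN"), ("w2", "JJ"), ("w3", "VB")]
toy_rules = {("NN", "JJ"): (1, 0)}  # swap a noun followed by an adjective
toy_alternatives = reordering_alternatives(toy_sentence, toy_rules)
```

Because every alternative is kept rather than one being chosen up front, the final decision between word orders is left to a later stage, which is what makes this part of the approach non-deterministic.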

One of the fundamental characteristics of any natural language is its linguistic typology, defined in terms of the finite verb (V), its subject (S), and its object (O). Typological disparity between languages leads to particularly poor translations from monotone SMT systems and to the need for global reordering models capable of modeling long-range dependencies. Because a monotone translation often cannot perform the reordering needed to resolve such disparities, we use a parse tree structure in our work.

The paper is organized as follows: in Section 2 we present a brief review of known approaches to the word reordering problem, along with the main sources of inspiration for our reordering system. In Section 3 we review the architecture and modeling of the reordering system. Section 4 describes how multiple word reorderings are coupled with a translation system. Section 5 evaluates the contribution of our model to the performance of N-gram-based and phrase-based SMT systems. Finally, Section 6 provides an analysis of the resulting data and highlights the main conclusions drawn.

Related work and sources of inspiration

The word reordering problem has attracted a great deal of attention recently. There have been abundant publications on purely statistical techniques dealing with the word reordering challenge, as well as on approaches involving lexical information (context), or using additional information to reorder the translated words in such a way that they fit the target language.

Syntax-based reordering

This section introduces the Syntax-Based Reordering (SBR) approach. Like other preprocessing methods, it splits translation into two independent stages:

$$S \rightarrow S' \rightarrow T$$

where a sentence of the source language S is first reordered with respect to the word order of the target language, and then the reordered source sentence S′ is monotonically translated into a target sentence T.
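The two-stage decomposition can be sketched in a few lines. This is a toy illustration only: the fixed permutation stands in for the syntax-driven reordering step, the word-by-word lexicon stands in for a real monotone decoder, and the names `reorder`, `translate_monotone`, and the placeholder tokens are assumptions made for the example.

```python
def reorder(source, order):
    """Stage 1 (S -> S'): apply a permutation of the source positions."""
    if sorted(order) != list(range(len(source))):
        raise ValueError("order must be a permutation of the source positions")
    return [source[i] for i in order]


def translate_monotone(reordered, lexicon):
    """Stage 2 (S' -> T): translate left to right with no further reordering.

    Unknown words are copied through unchanged.
    """
    return [lexicon.get(word, word) for word in reordered]


# Toy verb-initial source sentence with placeholder tokens.
source = ["verb_src", "subj_src", "obj_src"]
order = [1, 0, 2]  # move the subject in front of the verb
lexicon = {"verb_src": "reads", "subj_src": "she", "obj_src": "books"}

reordered = reorder(source, order)                 # S'
target = translate_monotone(reordered, lexicon)    # T
```

Keeping the two stages as separate functions mirrors their independence: the second stage never needs to know how the reordered sentence was produced.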

SBR deals with

Coupling SBR and decoding

To improve the reordering power of the translation system, we implemented an additional reordering as described in Crego (2008). Multiple word segmentations are encoded in a word lattice containing the reordering alternatives consistent with the previously extracted rules, and this lattice is then passed to the decoder as input.

A word lattice is defined as a directed acyclic graph $G = (V, E)$ with one root node $n_0 \in V$ and one goal node $n_N \in V$. $V$ and $E$ are, respectively, the set of nodes and the set of edges of the graph $G$.
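As a minimal sketch of this definition, the lattice can be stored as an adjacency list whose edges carry words, so that every path from the root to the goal spells out one candidate word order. The class name `WordLattice`, its methods, and the toy node numbering are hypothetical choices for illustration, not the data structure used by the authors' decoder.

```python
from collections import defaultdict


class WordLattice:
    """Directed acyclic graph G = (V, E) with one root node and one goal node."""

    def __init__(self, root, goal):
        self.root = root
        self.goal = goal
        self.edges = defaultdict(list)  # node -> list of (word, next_node)

    def add_edge(self, src, word, dst):
        self.edges[src].append((word, dst))

    def paths(self, node=None):
        """Yield every word sequence along a path from the root to the goal."""
        node = self.root if node is None else node
        if node == self.goal:
            yield []
            return
        for word, next_node in self.edges[node]:
            for rest in self.paths(next_node):
                yield [word] + rest


# Toy lattice: the monotone order "w1 w2 w3" plus one alternative "w1 w3 w2".
lattice = WordLattice(root=0, goal=4)
lattice.add_edge(0, "w1", 1)
lattice.add_edge(1, "w2", 2)
lattice.add_edge(2, "w3", 4)
lattice.add_edge(1, "w3", 3)
lattice.add_edge(3, "w2", 4)

word_orders = list(lattice.paths())
```

Enumerating all paths is only practical for small examples such as this one; a decoder would instead search the lattice while scoring partial hypotheses.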

Experiments and results

This section details the experiments carried out to evaluate the performance of the SBR approach. To understand the value in terms of accuracy and efficiency of the proposed reordering framework, two directions and four translation tasks have been employed, namely, small and large Chinese-to-English tasks, and small and large Arabic-to-English tasks.

The main reason that the Chinese-to-English and Arabic-to-English translation tasks were chosen as the main experimental field is that European

Discussion and conclusions

In this paper, we have described a new approach to word reordering in SMT, which successfully integrates syntax-based reordering in phrase-based and N-gram-based SMT. This approach correlates with the human intuitive notion of translation. At its best, a successful translation should read as if it were originally written in the new language. When translating a sentence, a human defines the target-language word order according to an extensive set of grammatical and semantic rules, along with

Acknowledgments

Work was partially supported by the Spanish Ministerio de Educación y Ciencia (TIN2006-12767) under FPU grant and by the Spanish Government under grant TEC2006-13964-C03 (AVIVAVOZ project). The authors want to thank Khalil Sima’an (Universiteit van Amsterdam), Mark Dras (Macquarie University), and María José Castro-Bleda (Universidad Politécnica de Valencia) for their valuable discussions and suggestions.

References (44)

  • Y. Al-Onaizan et al. Distortion models for statistical machine translation.
  • H. Alshawi et al. Learning dependency translation models as collections of finite state head transducers. Computational Linguistics (2000).
  • C. Callison-Burch et al. Improved statistical machine translation using paraphrases.
  • D. Chiang. A hierarchical phrase-based model for statistical machine translation.
  • M. Collins et al. Clause restructuring for statistical machine translation.
  • M.R. Costa-jussà et al. Statistical machine reordering.
  • M.R. Costa-jussà et al. Computing multiple weighted reordering hypotheses for a statistical machine translation phrase-based system.
  • M.R. Costa-jussà et al. TALP phrase-based system and TALP system combination for IWSLT.
  • Crego, J.M., 2008. Architecture and modeling for N-gram-based statistical machine translation. Ph.D. thesis....
  • J.M. Crego et al. Syntax-enhanced N-gram-based SMT.
  • J.M. Crego et al. Reordered search and tuple unfolding for Ngram-based SMT.
  • J.M. Crego et al. Improving statistical MT by coupling reordering and decoding. Machine Translation (2006).
  • J. Eisner. Learning non-isomorphic tree mappings for machine translation.
  • I. Even-Zohar. Translation theory today: a call for transfer theory. Poetics Today (1981).
  • M. Galley et al. What’s in a translation rule?
  • de Gispert, A., 2006. Introducing linguistic knowledge into statistical machine translation. Ph.D. thesis. Universitat...
  • J. Graehl et al. Training tree transducers.
  • N. Habash et al. Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop.
  • N. Habash et al. Arabic preprocessing schemes for statistical machine translation.
  • A.S. Hildebrand et al. Recent improvements in the CMU large scale Chinese-English SMT system.
  • L. Huang et al. Statistical syntax-directed translation with extended domain of locality.
  • K. Imamura et al. Practical approach to syntax-based statistical machine translation.

This paper has been recommended for acceptance by Edward J Briscoe.