word.alignment: an R package for computing statistical word alignment and its evaluation

  • Original paper
  • Journal: Computational Statistics

Abstract

Word alignment has many applications in natural language processing (NLP) tasks. As far as we are aware, there is no word alignment package in the R environment. This paper introduces word.alignment, a new R package that implements a statistical word alignment model as an unsupervised learning method. It uses IBM Model 1 as the machine translation model, estimating its parameters with the EM algorithm and using the Viterbi search to find the best alignment. It also provides symmetric alignment using three heuristic methods: union, intersection, and grow-diag. In addition, it can build an automatic bilingual dictionary by applying an innovative rule. The generated dictionary is suitable for a number of NLP tasks. The package also provides functions for measuring the quality of a word alignment by comparing it with a gold standard alignment using five metrics. It is easy to install and runs on the most widely used platforms. It is also easy to use, and we show that its results are better than those of some other word alignment tools in almost all cases. Finally, some examples illustrating the use of word.alignment are provided.
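
The pipeline summarized in the abstract (IBM Model 1 trained with EM, a Viterbi search for the best alignment, then symmetrization by intersection and union) can be illustrated with a minimal sketch. This is not the word.alignment API and is written in Python rather than R; the function names `train_ibm1`, `viterbi_align`, and `symmetrize`, the `<NULL>` token, and the toy corpus are all assumptions made for illustration. The grow-diag heuristic is omitted.

```python
from collections import defaultdict

NULL = "<NULL>"  # hypothetical placeholder for "aligned to nothing"


def train_ibm1(pairs, iterations=10):
    """EM estimate of lexical probabilities t[(f, e)] = P(f | e).

    pairs: list of (source_tokens, target_tokens) sentence pairs.
    """
    tgt_vocab = {f for _, tgt in pairs for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts c(f, e)
        total = defaultdict(float)  # expected counts c(e)
        # E-step: distribute each target word over candidate source words.
        for src, tgt in pairs:
            src_n = [NULL] + list(src)
            for f in tgt:
                z = sum(t[(f, e)] for e in src_n)
                for e in src_n:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float)
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def viterbi_align(src, tgt, t):
    """Best Model 1 alignment: each target word picks its most probable
    source word; words that prefer NULL are left unaligned."""
    links = set()
    for j, f in enumerate(tgt):
        best_i, best_p = None, t.get((f, NULL), 0.0)
        for i, e in enumerate(src):
            p = t.get((f, e), 0.0)
            if p > best_p:
                best_i, best_p = i, p
        if best_i is not None:
            links.add((best_i, j))
    return links


def symmetrize(src, tgt, t_fwd, t_rev):
    """Return (intersection, union) of the two directional alignments,
    both expressed as (source_index, target_index) links."""
    fwd = viterbi_align(src, tgt, t_fwd)
    rev = {(i, j) for (j, i) in viterbi_align(tgt, src, t_rev)}
    return fwd & rev, fwd | rev


# Toy parallel corpus (illustrative only).
corpus = [
    (["das", "Haus"], ["the", "house"]),
    (["das", "Buch"], ["the", "book"]),
    (["ein", "Buch"], ["a", "book"]),
]
t_fwd = train_ibm1(corpus)
t_rev = train_ibm1([(tgt, src) for src, tgt in corpus])
intersection, union = symmetrize(corpus[0][0], corpus[0][1], t_fwd, t_rev)
```

Because Model 1 ignores word positions, the Viterbi search here reduces to an independent arg-max per target word. The intersection usually yields fewer, higher-precision links and the union more, higher-recall links, which is the trade-off that heuristics such as grow-diag try to balance.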

Acknowledgements

We sincerely thank the Associate Editor for comments that greatly improved the manuscript. We acknowledge that parts of this computation were performed at the HPC center of Ferdowsi University of Mashhad (Grant No. 28609).

Author information

Corresponding author

Correspondence to Majid Sarmad.

Cite this article

Daneshgar, N., Sarmad, M. word.alignment: an R package for computing statistical word alignment and its evaluation. Comput Stat 35, 1597–1619 (2020). https://doi.org/10.1007/s00180-020-00979-z

  • DOI: https://doi.org/10.1007/s00180-020-00979-z
