Abstract
Word alignment has many applications in natural language processing (NLP) tasks. As far as we are aware, there is no word alignment package in the R environment. This paper introduces word.alignment, a new R software package that implements a statistical word alignment model as an unsupervised learning method. It uses IBM Model 1 as the machine translation model, estimating its parameters with the EM algorithm and using the Viterbi search to find the best alignment. It also provides symmetric alignment using three heuristic methods: union, intersection, and grow-diag. In addition, it can build an automatic bilingual dictionary by applying an innovative rule; the generated dictionary is suitable for a number of NLP tasks. The package also provides functions for measuring the quality of a word alignment by comparing it with a gold standard alignment on five metrics. It is easy to install and runs on the most widely used platforms. It is also easy to use, and we show that its results are better than those of some other word alignment tools in almost all cases. Finally, some examples illustrating the use of word.alignment are provided.
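The pipeline named in the abstract (IBM Model 1 trained with EM, a Viterbi alignment, then symmetrization) can be sketched in a few lines. The block below is a minimal illustration in Python, not the word.alignment package's R API: the function names `train_ibm1` and `viterbi_align` and the three-sentence toy corpus are assumptions made for this example. The NULL word is omitted for brevity, and only the union and intersection heuristics are shown.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=10):
    """EM estimate of t(f | e) from (source_tokens, target_tokens) pairs."""
    # Start from a flat table over every co-occurring word pair
    # (the E-step normalises, so the constant 1.0 acts as a uniform start).
    t = {(f, e): 1.0 for src, tgt in pairs for f in tgt for e in src}
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of f-e links
        total = defaultdict(float)  # expected number of links from e
        for src, tgt in pairs:
            for f in tgt:
                # E-step: spread one unit of mass for f over the source words.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalise the expected counts per source word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def viterbi_align(src, tgt, t):
    """Link each target position j to its most probable source position i.

    Under Model 1 the best alignment decomposes per target word, so the
    Viterbi search reduces to an independent argmax for each position.
    """
    return {
        (max(range(len(src)), key=lambda i: t[(tgt[j], src[i])]), j)
        for j in range(len(tgt))
    }


# Hypothetical toy parallel corpus (English source, German target).
corpus = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]

forward = train_ibm1(corpus)
backward = train_ibm1([(tgt, src) for src, tgt in corpus])

src, tgt = corpus[0]
a_fwd = viterbi_align(src, tgt, forward)
# Flip the reverse-direction links so both sets use (source, target) indices.
a_bwd = {(i, j) for j, i in viterbi_align(tgt, src, backward)}

intersection = a_fwd & a_bwd  # high-precision symmetric alignment
union = a_fwd | a_bwd         # high-recall symmetric alignment
```

On this toy corpus both directions link "the" to "das" and "house" to "Haus", so the intersection and the union coincide. On real data the two directional alignments usually differ, which is what heuristics such as grow-diag exploit by starting from the intersection and adding selected links from the union.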
Acknowledgements
We sincerely thank the Associate Editor for comments that greatly improved the manuscript. We hereby acknowledge that parts of this computation were performed at the HPC center of Ferdowsi University of Mashhad (Grant No. 28609).
Cite this article
Daneshgar, N., Sarmad, M. word.alignment: an R package for computing statistical word alignment and its evaluation. Comput Stat 35, 1597–1619 (2020). https://doi.org/10.1007/s00180-020-00979-z
DOI: https://doi.org/10.1007/s00180-020-00979-z