word.alignment: an R package for computing statistical word alignment and its evaluation

Daneshgar, Neda; Sarmad, Majid

doi:10.1007/s00180-020-00979-z

Original paper
Published: 23 March 2020

Volume 35, pages 1597–1619, (2020)

Computational Statistics

Abstract

Word alignment has many applications in natural language processing (NLP) tasks. As far as we are aware, there is no word alignment package in the R environment. This paper introduces word.alignment, a new R software package that implements a statistical word alignment model as an unsupervised learning method. It uses IBM Model 1 as the machine translation model, relying on the EM algorithm and the Viterbi search to find the best alignment. It also provides symmetric alignment using three heuristic methods: union, intersection, and grow-diag. In addition, it can build an automatic bilingual dictionary by applying an innovative rule. The generated dictionary is suitable for a number of NLP tasks. The package also provides functions for measuring the quality of a word alignment by comparing it with a gold standard alignment, based on five metrics. It is easy to install and runs on the most widely used platforms. It is also easy to use, and we show that its results are almost always better than those of some other word alignment tools. Finally, some examples illustrating the use of word.alignment are provided.
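
The pipeline named in the abstract (IBM Model 1 trained with EM, a Viterbi search for the best alignment, and symmetrization by intersection or union) can be illustrated with a minimal sketch. This is not the package's code or API: it is written in Python for illustration only, the function names train_ibm1, viterbi_align and symmetrize and the three-sentence toy corpus are hypothetical, and it simplifies the model by omitting the NULL word and the grow-diag heuristic.

```python
from collections import defaultdict


def train_ibm1(pairs, iterations=20):
    """EM estimate of t(f | e) from (source_tokens, target_tokens) pairs.

    Simplification (assumption): the NULL target word is left out.
    """
    src_vocab = {f for fs, _ in pairs for f in fs}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)  # uniform starting point
    for _ in range(iterations):
        count = defaultdict(float)  # expected counts of (f, e) links
        total = defaultdict(float)  # expected counts of e
        for fs, es in pairs:
            for f in fs:
                z = sum(t[(f, e)] for e in es)  # normaliser for this f
                for e in es:
                    p = t[(f, e)] / z  # E-step: posterior of the link f-e
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise the expected counts per target word
        t = defaultdict(float, {fe: c / total[fe[1]] for fe, c in count.items()})
    return t


def viterbi_align(fs, es, t):
    """Best alignment under Model 1: each source word picks its most
    probable target word. Links are 1-based (source_pos, target_pos)."""
    return {
        (i, max(range(len(es)), key=lambda j: t[(f, es[j])]) + 1)
        for i, f in enumerate(fs, start=1)
    }


def symmetrize(pairs, iterations=20):
    """Align in both directions, then combine by intersection and union."""
    fwd_t = train_ibm1(pairs, iterations)
    rev_t = train_ibm1([(es, fs) for fs, es in pairs], iterations)
    out = []
    for fs, es in pairs:
        fwd = viterbi_align(fs, es, fwd_t)
        # Flip reverse links so both sets use (source_pos, target_pos)
        rev = {(i, j) for j, i in viterbi_align(es, fs, rev_t)}
        out.append({"intersection": fwd & rev, "union": fwd | rev})
    return out


# Hypothetical toy parallel corpus (German source, English target)
corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]
result = symmetrize(corpus)
print(result[0])
```

On this toy corpus both directions agree, so the intersection and the union coincide and link each word to its counterpart in the same position. On real data the two directions usually disagree: the intersection keeps only the links both directions propose, the union keeps every link either direction proposes, and heuristics such as grow-diag start from the intersection and add selected links from the union.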

Acknowledgements

We sincerely thank the Associate Editor for comments that greatly improved the manuscript. We hereby acknowledge that parts of this computation were performed on the HPC center of Ferdowsi University of Mashhad (Grant No. 28609).

Author information

Authors and Affiliations

Department of Statistics, Ferdowsi University of Mashhad, Mashhad, Iran
Neda Daneshgar & Majid Sarmad

Corresponding author

Correspondence to Majid Sarmad.

About this article

Cite this article

Daneshgar, N., Sarmad, M. word.alignment: an R package for computing statistical word alignment and its evaluation. Comput Stat 35, 1597–1619 (2020). https://doi.org/10.1007/s00180-020-00979-z

Received: 05 November 2017
Accepted: 10 March 2020
Published: 23 March 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s00180-020-00979-z
