Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents

Domingo, Miguel; Casacuberta, Francisco

doi:10.1007/978-3-030-30754-7_7

Miguel Domingo¹³ &
Francisco Casacuberta¹³

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11808))

Included in the following conference series:

International Conference on Image Analysis and Processing

1151 Accesses
2 Citations

Abstract

The nature of human language and the lack of a spelling convention make historical documents hard to handle for natural language processing. Spelling normalization tackles this problem by adapting their spelling to modern standards in order to get an orthography consistency. In this work, we compare several character-based machine translation approaches, and propose a method to profit from modern documents to enrich neural machine translation models. We tested our proposal with four different data sets, and observed that the enriched models successfully improved the normalization quality of the neural models. Statistical models, however, yielded a better result.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 59.99; Price excludes VAT (USA)

Softcover Book: USD 74.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2015). arXiv:1409.0473
Baron, A., Rayson, P.: VARD2: a tool for dealing with spelling variation in historical corpora. In: Postgraduate Conference in Corpus Linguistics (2008)
Google Scholar
Bollmann, M.: Normalization of historical texts with neural network models. Ph.D. thesis, Sprachwissenschaftliches Institut, Ruhr-Universität (2018)
Google Scholar
Bollmann, M., Søgaard, A.: Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In: Proceedings of the International Conference on the Computational Linguistics, pp. 131–139 (2016)
Google Scholar
Brown, P.F., Pietra, V.J.D., Pietra, S.A.D., Mercer, R.L.: The mathematics of statistical machine translation: parameter estimation. Comput. Linguist. 19(2), 263–311 (1993)
Google Scholar
Chatterjee, R., Farajian, M.A., Negri, M., Turchi, M., Srivastava, A., Pal, S.: Multi-source neural automatic post-editing: FBK’s participation in the WMT 2017 ape shared task. In: Proceedings of the Second Conference on Machine Translation, pp. 630–638 (2017)
Google Scholar
Chung, J., Cho, K., Bengio, Y.: A character-level decoder without explicit segmentation for neural machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 1693–1703 (2016)
Google Scholar
Costa-Jussà, M.R., Aldón, D., Fonollosa, J.A.: Chinese-Spanish neural machine translation enhanced with character and word bitmap fonts. Mach. Transl. 31, 35–47 (2017)
Article Google Scholar
Costa-Jussà, M.R., Fonollosa, J.A.: Character-based neural machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 357–361 (2016)
Google Scholar
Domingo, M., Casacuberta, F.: Spelling normalization of historical documents by using a machine translation approach. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 129–137 (2018)
Google Scholar
Jehle, F.: Works of Miguel de Cervantes in Old- and Modern-Spelling. Indiana University Purdue University Fort Wayne (2001)
Google Scholar
Gao, Q., Vogel, S.: Parallel implementations of word alignment tool. In: Proceedings of the Association for Computational Linguistics Software Engineering, Testing, and Quality Assurance Workshop, pp. 49–57 (2008)
Google Scholar
Gehring, J., Auli, M., Grangier, D., Yarats, D., Dauphin, Y.N.: Convolutional sequence to sequence learning (2017). arXiv:1705.03122
Gers, F.A., Schmidhuber, J., Cummins, F.: Learning to forget: continual prediction with LSTM. Neural Comput. 12(10), 2451–2471 (2000)
Article Google Scholar
Hämäläinen, M., Säily, T., Rueter, J., Tiedemann, J., Mäkelä, E.: Normalizing early English letters to present-day English spelling. In: Proceedings of the Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature, pp. 87–96 (2018)
Google Scholar
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Klein, G., Kim, Y., Deng, Y., Senellart, J., Rush, A.M.: OpenNMT: open-source toolkit for neural machine translation. In: Proceedings of the Association for Computational Linguistics: System Demonstration, pp. 67–72 (2017)
Google Scholar
Koehn, P., et al.: Moses: open source toolkit for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 177–180 (2007)
Google Scholar
Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pp. 48–54 (2003)
Google Scholar
Korchagina, N.: Normalizing medieval German texts: from rules to deep learning. In: Proceedings of the Nordic Conference on Computational Linguistics Workshop on Processing Historical Language, pp. 12–17 (2017)
Google Scholar
Laing, M.: The linguistic analysis of medieval vernacular texts: Two projects at Edinburgh’. In: Rissanen, M., Kytd, M., Wright, S. (eds.) Corpora across the Centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, vol. 25427, pp. 121–141. St Catharine’s College Cambridge (1993)
Google Scholar
Ling, W., Trancoso, I., Dyer, C., Black, A.W.: Character-based neural machine translation. arXiv preprint arXiv:1511.04586 (2015)
Lison, P., Tiedemann, J.: OpenSubtitles 2016: extracting large parallel corpora from movie and TV subtitles. In: Proceedings of the International Conference on Language Resources Association (2016)
Google Scholar
Ljubešić, N., Zupan, K., Fišer, D., Erjavec, T.: Dataset of normalised Slovene text KonvNormSl 1.0. Slovenian language resource repository CLARIN.SI (2016). http://hdl.handle.net/11356/1068
Ljubešic, N., Zupan, K., Fišer, D., Erjavec, T.: Normalising slovene data: historical texts vs. user-generated content. In: Proceedings of the Conference on Natural Language Processing, pp. 146–155 (2016)
Google Scholar
Nakov, P., Tiedemann, J.: Combining word-level and character-level models for machine translation between closely-related languages. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 301–305 (2012)
Google Scholar
Och, F.J.: Minimum error rate training in statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 160–167 (2003)
Google Scholar
Och, F.J., Ney, H.: Discriminative training and maximum entropy models for statistical machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 295–302 (2002)
Google Scholar
Och, F.J., Ney, H.: A systematic comparison of various statistical alignment models. Comput. Linguist. 29(1), 19–51 (2003)
Article Google Scholar
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 311–318 (2002)
Google Scholar
Poncelas, A., Shterionov, D., Way, A., Maillette de Buy Wenniger, G., Passban, P.: Investigation backtranslation in neural machine translation. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 249–258 (2018)
Google Scholar
Porta, J., Sancho, J.L., Gómez, J.: Edit transducers for spelling variation in old Spanish. In: Proceedings of the Workshop on Computational Historical Linguistics, pp. 70–79 (2013)
Google Scholar
Post, M.: A call for clarity in reporting BLEU scores. In: Proceedings of the Third Conference on Machine Translation, pp. 186–191 (2018)
Google Scholar
Riezler, S., Maxwell, J.T.: On some pitfalls in automatic evaluation and significance testing for MT. In: Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 57–64 (2005)
Google Scholar
Robbins, H., Monro, S.: A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951)
Article MathSciNet Google Scholar
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)
Article Google Scholar
Scherrer, Y., Erjavec, T.: Modernizing historical Slovene words with character-based SMT. In: Proceedings of the Biennial International Workshop on Balto-Slavic Natural Language Processing, pp. 58–62 (2013)
Google Scholar
Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 (2015)
Snover, M., Dorr, B., Schwartz, R., Micciulla, L., Makhoul, J.: A study of translation edit rate with targeted human annotation. In: Proceedings of the Association for Machine Translation in the Americas, pp. 223–231 (2006)
Google Scholar
Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proceedings of the International Conference on Spoken Language Processing, pp. 257–286 (2002)
Google Scholar
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. Proc. Adv. Neural Inf. Process. Syst. 27, 3104–3112 (2014)
Google Scholar
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Google Scholar
Tang, G., Cap, F., Pettersson, E., Nivre, J.: An evaluation of neural machine translation models on historical spelling normalization. In: Proceedings of the International Conference on Computational Linguistics, pp. 1320–1331 (2018)
Google Scholar
Tiedemann, J.: Character-based PSMT for closely related languages. In: Proceedings of the Annual Conference of the European Association for Machine Translation, pp. 12–19 (2009)
Google Scholar
Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems, pp. 5998–6008 (2017)
Google Scholar
Wu, Y., et al.: Google’s neural machine translation system: bridging the gap between human and machine translation (2016). arXiv:1609.08144
Zens, R., Och, F.J., Ney, H.: Phrase-based statistical machine translation. In: Jarke, M., Lakemeyer, G., Koehler, J. (eds.) KI 2002. LNCS (LNAI), vol. 2479, pp. 18–32. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45751-8_2
Chapter Google Scholar

Download references

Acknowledgments

The research leading to these results has received funding from the European Union through Programa Operativo del Fondo Europeo de Desarrollo Regional (FEDER) from Comunitat Valencia (2014–2020) under project Sistemas de frabricación inteligentes para la indústria 4.0 (grant agreement IDIFEDER/2018/025); and from Ministerio de Economía y Competitividad (MINECO) under project MISMIS-FAKEnHATE (grant agreement PGC2018-096212-B-C31). We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used for part of this research.

Author information

Authors and Affiliations

PRHLT Research Center, Universitat Politècnica de València, Valencia, Spain
Miguel Domingo & Francisco Casacuberta

Authors

Miguel Domingo
View author publications
You can also search for this author in PubMed Google Scholar
Francisco Casacuberta
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Miguel Domingo .

Editor information

Editors and Affiliations

University of Verona, Verona, Italy
Marco Cristani
University of Parma, Parma, Italy
Andrea Prati
Fondazione Bruno Kessler, Povo, Italy
Oswald Lanz
Fondazione Bruno Kessler, Povo, Italy
Stefano Messelodi
University of Trento, Povo, Italy
Nicu Sebe

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Domingo, M., Casacuberta, F. (2019). Enriching Character-Based Neural Machine Translation with Modern Documents for Achieving an Orthography Consistency in Historical Documents. In: Cristani, M., Prati, A., Lanz, O., Messelodi, S., Sebe, N. (eds) New Trends in Image Analysis and Processing – ICIAP 2019. ICIAP 2019. Lecture Notes in Computer Science(), vol 11808. Springer, Cham. https://doi.org/10.1007/978-3-030-30754-7_7

Download citation

DOI: https://doi.org/10.1007/978-3-030-30754-7_7
Published: 02 September 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30753-0
Online ISBN: 978-3-030-30754-7
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Association for Pattern Recognition (opens in a new tab)