Token-based spelling variant detection in Middle Low German texts

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

In this paper we present a pipeline for the detection of spelling variants, i.e., different spellings that represent the same word, in non-standard texts. For example, in Middle Low German texts, "in" and "ihn" (among others) are potential spellings of a single word, the personal pronoun ‘him’. Spelling variation is usually addressed by normalization, in which non-standard variants are mapped to a corresponding standard variant, e.g., the Modern German word "ihn" in the case of "in". However, the approach to spelling variant detection presented here does not require such a reference to a standard variant and can therefore be applied to data for which a standard variant is missing. The pipeline first generates spelling variants for a given word using rewrite rules and surface similarity; the generated types are then filtered. We present a new filter that works on the token level, i.e., it takes the context of a word into account. Through this mechanism, ambiguities on the type level can be resolved: for instance, the Middle Low German word "in" can be not only the personal pronoun ‘him’ but also the preposition ‘in’, and each of these has different variants. The detected spelling variants can be used in two settings for Digital Humanities research: on the one hand, to facilitate searching in non-standard texts; on the other hand, to improve the performance of natural language processing tools on the data by reducing the number of unknown words. To evaluate the utility of the pipeline in both applications, we present two evaluation settings and evaluate the pipeline on Middle Low German texts. Compared with previous work, we improve the F1 score from \(0.39\) to \(0.52\) in the search setting and from \(0.23\) to \(0.30\) when detecting spelling variants of unknown words.
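The generate-then-filter design described in the abstract can be sketched in a few lines. The rewrite rules, lexicon, and word forms below are illustrative assumptions of ours, not the authors' actual rule set or data, and the filter shown is a simplified type-level one (lexicon membership plus a Levenshtein bound) rather than the token-level filter the paper contributes:

```python
import re

# Hypothetical GML rewrite rules (pattern -> replacement); purely
# illustrative, not the rules used in the paper.
REWRITE_RULES = [("gh", "g"), ("g", "gh"), ("i", "j"), ("j", "i"),
                 ("n", "hn"), ("hn", "n")]

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def generate_candidates(word: str, lexicon: set, max_dist: int = 2) -> set:
    """Generate stage: apply each rule at every matching position.
    Filter stage (simplified, type-level): keep candidates that are
    attested in the lexicon and within max_dist edits of the input."""
    candidates = set()
    for pat, repl in REWRITE_RULES:
        for m in re.finditer(re.escape(pat), word):
            cand = word[:m.start()] + repl + word[m.end():]
            if cand != word and cand in lexicon and levenshtein(word, cand) <= max_dist:
                candidates.add(cand)
    return candidates

# Toy lexicon of attested types (lowercased, as in the paper's setup).
lexicon = {"in", "ihn", "jn", "gheven", "geven"}
print(sorted(generate_candidates("in", lexicon)))      # ['ihn', 'jn']
print(sorted(generate_candidates("gheven", lexicon)))  # ['geven']
```

Note that for the ambiguous type "in" this sketch returns one candidate pool; the paper's token-level filter would further decide, per occurrence in context, which candidates (pronoun vs. preposition variants) apply.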


Notes

  1. See also the workshop series on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH) by ACL SIGHUM (https://sighum.wordpress.com).

  2. Following ISO 639-3, we use GML as the abbreviation for Middle Low German in this paper.

  3. In the whole paper, we will ignore differences in capitalization. All types are lowercased before applying and evaluating our pipeline.

  4. In this paper, we are only concerned with spelling variation and therefore ignore other aspects of standard and non-standard languages.

  5. When \(L\), \(L_{\text {morph}}\), and \(S\) are induced from a corpus, we have used a slightly different definition (Barteld 2017): \(v\) can be given as a ratio, as \(L_{\text {morph}}\) is finite. For this, we excluded morphological words that are instantiated only once in the corpus, as they cannot exhibit possible variance.

  6. Simplification does not mean that the resulting types are simpler in the sense that they are shorter. The addition of an h after every g that is not already followed by one would also be an example of a simplification.

  7. Compare also the remark by Jurish (2011) that “[t]he range of a canonicalization function need not be restricted to extant forms; in particular a phonetization function mapping arbitrary input strings to unique phonetic forms can be considered a canonicalization function in this sense” (p. 115). With our definitions, a phonetization function would be a simplification. However, we do not restrict (type-based) normalizations and simplifications to map to unique elements but allow for them to map to multiple elements.

  8. Version 0.3. Publication date 2017-06-15. http://hdl.handle.net/11022/0000-0006-473B-9.

  9. The lemmatization in the corpus includes word-sense disambiguation such that homonyms are distinguished.

  10. This definition leads to a broad notion of spelling variation, as words that are not spelling variants in a strict sense might be conflated due to the lemmatization. One example is the pair of adverbs vele ‘a lot’ and mehr ‘more’, which are derived from the positive and the comparative forms of the adjective vele and are therefore lemmatized identically.

  11. The abbreviations follow the Leipzig glossing rules (https://www.eva.mpg.de/lingua/resources/glossing-rules.php).

  12. We define precision to be 1 when no candidates are generated, as in that case there are no falsely generated candidates.
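This convention can be made explicit in code; the helper below is our own illustration of the definition, not the authors' evaluation script:

```python
def precision(generated: set, gold: set) -> float:
    """Precision of generated variant candidates against gold variants.
    When nothing is generated there are no false positives, so the
    value is defined as 1.0, per the convention stated in the note."""
    if not generated:
        return 1.0
    return len(generated & gold) / len(generated)

print(precision(set(), {"ihn"}))          # 1.0 (nothing generated)
print(precision({"ihn", "yn"}, {"ihn"}))  # 0.5
```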

  13. The given distances are always upper bounds. For brevity, we write distance \(2\) instead of the more precise \(\le 2\).

References

  • Adesam, Y., & Bouma, G. (2016). Old Swedish part-of-speech tagging between variation and external knowledge. In Proceedings of the 10th SIGHUM workshop on language technology for cultural heritage, social sciences, and humanities, (pp. 32–42). Berlin, Germany: Association for Computational Linguistics.

  • Barteld, F. (2017). Detecting spelling variants in non-standard texts. In Proceedings of the student research workshop at the 15th conference of the European chapter of the association for computational linguistics, (pp. 11–22). Valencia, Spain: Association for Computational Linguistics.

  • Barteld, F., Schröder, I., & Zinsmeister, H. (2015). Unsupervised regularization of historical texts for POS tagging. In Proceedings of the workshop on corpus-based research in the humanities (CRH), (pp. 3–12). Warsaw, Poland.

  • Barteld, F., Schröder, I., & Zinsmeister, H. (2016). Dealing with word-internal modification and spelling variation in data-driven lemmatization. In Proceedings of the 10th SIGHUM workshop on language technology for cultural heritage, social sciences, and humanities, (pp. 52–62). Berlin, Germany: Association for Computational Linguistics.

  • Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

  • Bollmann, M. (2012). (Semi-)automatic normalization of historical texts using distance measures and the Norma tool. In Mambrini F, Passarotti M, Sporleder C (eds) Proceedings of the second workshop on annotation of corpora for research in the humanities (ACRH-2), (pp. 3–14). Lisbon, Portugal.

  • Bollmann, M. (2013). POS tagging for historical texts with sparse training data. In Proceedings of the 7th linguistic annotation workshop and interoperability with discourse, (pp. 11–18). Sofia, Bulgaria: Association for Computational Linguistics.

  • Bollmann, M., & Søgaard, A. (2016). Improving historical spelling normalization with bi-directional LSTMs and multi-task learning. In Proceedings of COLING 2016, the 26th international conference on computational linguistics: technical papers, (pp. 131–139). Osaka, Japan.

  • Bollmann, M., Petran, F., & Dipper, S. (2011). Applying rule-based normalization to different types of historical texts—An evaluation. In Proceedings of the 5th language & technology conference: human language technologies as a challenge for computer science and linguistics (LTC 2011), (pp. 339–344). Poznan, Poland.

  • Bollmann, M., Bingel, J., & Søgaard, A. (2017). Learning attention for historical text normalization by learning to pronounce. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: long papers), (pp. 332–344). Vancouver, Canada: Association for Computational Linguistics.

  • Chakrabarty, A., Pandit, O.A., & Garain, U. (2017). Context sensitive lemmatization using two successive bidirectional gated recurrent networks. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: long papers), (pp. 1481–1491). Vancouver, Canada: Association for Computational Linguistics.

  • Chollet, F., et al. (2015). Keras. https://github.com/fchollet/keras.

  • Chrupała, G. (2014). Normalizing tweets with edit scripts and recurrent neural embeddings. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 2: short papers), (pp. 680–686). Baltimore, Maryland, USA: Association for Computational Linguistics.

  • Ciobanu, M.A., & Dinu, L.P. (2014). Automatic detection of cognates using orthographic alignment. In Proceedings of the 52nd annual meeting of the association for computational linguistics (volume 2: short papers), (pp. 99–105). Baltimore, Maryland, USA: Association for Computational Linguistics.

  • Costa Bertaglia, T.F., & Volpe Nunes, MdG. (2016). Exploring word embeddings for unsupervised textual user-generated content normalization. In Proceedings of the 2nd workshop on noisy user-generated text (WNUT), (pp. 112–120). Osaka, Japan: The COLING 2016 Organizing Committee.

  • Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, 7(3), 171–176.

  • Derczynski, L., Ritter, A., Clark, S., & Bontcheva, K. (2013). Twitter part-of-speech tagging for all: Overcoming sparse and noisy data. In Proceedings of the international conference recent advances in natural language processing (RANLP 2013), (pp. 198–206). Hissar, Bulgaria.

  • Dipper, S., Lüdeling, A., & Reznicek, M. (2013). NoSta-D: A corpus of German Non-Standard Varieties. In M. Zampieri, S. Diwersy (eds) Non-standard data sources in corpus-based research, ZSM-Studien, vol 5, Shaker, (pp. 69–76).

  • Ernst-Gerlach, A., & Fuhr, N. (2006). Generating search term variants for text collections with historic spellings. In M. Lalmas, A. MacFarlane, S. Rüger, A. Tombros, T. Tsikrika & A. Yavlinsky (Eds.), Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science (Vol 3936, pp. 49–60). Berlin: Springer.

  • Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., & Herrera, F. (2012). A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(4), 463–484.

  • Gomes, L., & Pereira Lopes, J.G. (2011). Measuring spelling similarity for cognate identification. In L. Antunes, H. S. Pinto (Eds.), Progress in artificial intelligence: 15th Portuguese conference on artificial intelligence. EPIA 2011. Lecture Notes in Computer Science (Vol 7026, pp. 624–633). Berlin: Springer.

  • Hamilton, W.L., Leskovec, J., & Jurafsky, D. (2016). Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th annual meeting of the association for computational linguistics (volume 1: long papers), (pp. 1489–1501). Berlin, Germany: Association for Computational Linguistics.

  • Han, B., Cook, P., & Baldwin, T. (2013). Lexical normalization for social media text. ACM Transactions on Intelligent Systems and Technology, 4(1), 5:1–5:27.

  • Hauser, A.W., & Schulz, K.U. (2007). Unsupervised learning of edit distance weights for retrieving historical spelling variations. In Proceedings of the first workshop on finite-state techniques and approximate search, (pp. 1–6). Borovets, Bulgaria.

  • Hogenboom, A., Bal, D., Frasincar, F., Bal, M., De Jong, F., & Kaymak, U. (2015). Exploiting emoticons in polarity classification of text. Journal of Web Engineering, 14(1–2), 22–40.

  • Jurish, B. (2010a). Comparing canonicalizations of historical German text. In Proceedings of the 11th Meeting of the ACL Special Interest Group on Computational Morphology and Phonology, (pp. 72–77). Uppsala, Sweden: Association for Computational Linguistics.

  • Jurish, B. (2010b). More than words: Using token context to improve canonicalization of historical German. Journal for Language Technology and Computational Linguistics (JLCL), 25(1), 23–39.

  • Jurish, B. (2011). Finite-state canonicalization techniques for historical German. Ph.D. thesis, University of Potsdam, Germany.

  • Jurish, B., Thomas, C., & Wiegand, F. (2014). Querying the Deutsches Textarchiv. In U. Kruschwitz , F. Hopfgartner & C. Gurrin (eds) MindTheGap2014 (pp. 25–30). Berlin, Germany.

  • Kestemont, M., Daelemans, W., & Pauw, G. D. (2010). Weigh your words–memory-based lemmatization for Middle Dutch. Literary and Linguistic Computing, 25(3), 287–301.

  • Kim, Y. (2014). Convolutional neural networks for sentence classification. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), (pp. 1746–1751). Doha, Qatar: Association for Computational Linguistics.

  • Kingma, D.P., & Ba, J. (2014). Adam: A method for stochastic optimization. CoRR abs/1412.6980.

  • Kobus, C., Yvon, F., & Damnati, G. (2008). Normalizing SMS: Are two metaphors better than one? In Proceedings of the 22nd international conference on computational linguistics (Coling 2008), (pp. 441–448). Manchester, United Kingdom.

  • Koleva, M., Farasyn, M., Desmet, B., Breitbarth, A., & Hoste, V. (2017). An automatic part-of-speech tagger for Middle Low German. International Journal of Corpus Linguistics, 22(1), 107–140.

  • Lemaitre, G., Nogueira, F., & Aridas, C. K. (2017). Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17), 1–5.

  • Levenshtein, V. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, 10(8), 707–710.

  • Li, X. L., & Liu, B. (2005). Learning from positive and unlabeled examples with different data distributions. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge & L. Torgo (Eds.), Machine Learning: ECML 2005. Lecture Notes in Computer Science (Vol 3720, pp. 218–229). Berlin: Springer.

  • Ljubešić, N., Zupan, K., Fišer, D., & Erjavec, T. (2016). Normalising Slovene data: historical texts vs. user-generated content. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS 2016), (pp. 146–155). Bochum, Germany.

  • Logačev, P., Goldschmidt, K., & Demske, U. (2014). POS-tagging historical corpora: The case of Early New High German. In Proceedings of the thirteenth international workshop on treebanks and linguistic theories (TLT-13), (pp. 103–112). Tübingen, Germany.

  • Mihov, S., & Schulz, K. U. (2004). Fast approximate search in large dictionaries. Computational Linguistics, 30(4), 451–477.

  • Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. CoRR abs/1301.3781.

  • Mordelet, F., & Vert, J. P. (2014). A bagging SVM to learn from positive and unlabeled examples. Pattern Recognition Letters, 37(Supplement C), 201–209.

  • Niebaum, H. (2000). Phonetik und Phonologie, Graphetik und Graphemik des Mittelniederdeutschen. In Sprachgeschichte: Ein Handbuch zur Geschichte der deutschen Sprache und ihrer Erforschung, 2nd edn (pp. 1422–1430). Berlin, Boston: DeGruyter.

  • Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

  • Pettersson, E., Megyesi, B., & Nivre, J. (2012). Rule-based normalisation of historical text — A diachronic study. In Proceedings of KONVENS 2012 (LThist 2012 workshop), (pp. 333–341). Vienna, Austria.

  • Pettersson, E., Megyesi, B., & Nivre, J. (2013a). Normalisation of historical text using context-sensitive weighted Levenshtein distance and compound splitting. In Proceedings of the 19th Nordic conference of computational linguistics (NODALIDA 2013), (pp. 163–179). Oslo, Norway.

  • Pettersson, E., Megyesi, B., & Tiedemann, J. (2013b). An SMT approach to automatic annotation of historical text. In Proceedings of the workshop on computational historical linguistics at NODALIDA 2013, (pp. 54–69). Oslo, Norway.

  • Pilz, T., Luther, W., Fuhr, N., & Ammon, U. (2006). Rule-based search in text databases with nonstandard orthography. Literary and Linguistic Computing, 21(2), 179–186.

  • Piotrowski, M. (2012). Natural language processing for historical texts. Synthesis lectures on human language technologies 17, Morgan & Claypool Publishers.

  • Schulz, K., & Mihov, S. (2001). Fast string correction with Levenshtein-automata. CIS-Report 01-127. Tech. rep., CIS, University of Munich.

  • Schulz, K., & Mihov, S. (2002). Fast string correction with Levenshtein-automata. International Journal on Document Analysis and Recognition, 5, 67–85.

  • Tjong Kim Sang, E., Bollmann, M., Boschker, R., Casacuberta, F., Dietz, F., Dipper, S., et al. (2017). The CLIN27 shared task: Translating historical text to contemporary language for improving automatic linguistic annotation. Computational Linguistics in the Netherlands Journal, 7, 53–64.

  • van der Goot, R., Plank, B., & Nissim, M. (2017). To normalize, or not to normalize: The impact of normalization on part-of-speech tagging. In Proceedings of the 3rd workshop on noisy user-generated text, (pp. 31–39). Copenhagen, Denmark: Association for Computational Linguistics.

  • Weissweiler, L., & Fraser, A. (2018). Developing a stemmer for German based on a comparative analysis of publicly available stemmers. In G. Rehm, T. Declerck (Eds.), Language Technologies for the Challenges of the Digital Age. GSCL 2017. Lecture Notes in Computer Science (Vol 10713, pp. 81–94). Cham: Springer.

  • Yang, Y., & Eisenstein, J. (2016). Part-of-speech tagging for historical English. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies (NAACL-HLT 2016), (pp. 1318–1328). San Diego, California, USA: Association for Computational Linguistics.

  • Yujian, L., & Bo, L. (2007). A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), 1091–1095.

Acknowledgements

The first author was supported by the German Research Foundation (DFG), grant SCHR 999/5-2. We would like to thank the anonymous reviewers for their helpful remarks and Adam Roussel for improving our English. All remaining errors are ours.

Author information

Correspondence to Fabian Barteld.


About this article

Cite this article

Barteld, F., Biemann, C. & Zinsmeister, H. Token-based spelling variant detection in Middle Low German texts. Lang Resources & Evaluation 53, 677–706 (2019). https://doi.org/10.1007/s10579-018-09441-5
