Analysis of Part-Of-Speech Tagging of Historical German Texts

Authors:

Marco Büchler

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage

Pages 41 - 46

https://doi.org/10.1145/3078081.3078111

Published: 01 June 2017

Abstract

The amount of data in contemporary digital corpora is too large to be processed manually, which increases the need for computational-linguistic tools in the humanities. However, processing natural language is a challenge for automatic tools, because languages are used heterogeneously. To process a text, taggers are often used that were trained on a standardized language variety (e.g. recent newspaper articles). Unfortunately, these training data often differ from the target texts (i.e. the texts to which a trained model is later applied) in terms of language variety and register, which is especially the case for historical texts. Therefore, additional manual analyses are usually unavoidable. Training tools on the target language variety, however, can improve their results to the point that manual post-processing could be avoided. Thus, the need to process large datasets of diachronic texts and to obtain accurate results in a short time span requires an adaptable approach.

The present paper proposes such an adaptable approach: training taggers on the target language variety in order to improve the accuracy of part-of-speech tagging (hereafter POS-tagging) of historical German corpora.

We trained four taggers (Perceptron tagger [26], Hidden Markov Model (HMM) [1], Conditional Random Fields (CRF) [13], and Unigram [21]) each on data from three different literary periods: Baroque (1600-1700), Romanticism (1790-1840) and Modernism (1880-1930). Compared with pre-tagged data, we obtained a maximum accuracy in POS-tagging of 98.3% for a single period (Modernism with Perceptron trained on Modernism) and a maximum mean accuracy for all three periods of 94.3% (Perceptron trained on Romanticism). Compared with manually tagged data, we obtained a maximum accuracy for one period of 96.8% (Romanticism with CRF and HMM trained on Romanticism) and a maximum mean accuracy for all three periods of 92.3% (Perceptron trained on Romanticism).
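The evaluation scheme described above (train a tagger on one variety, then score it per period and as a mean over periods) can be illustrated with a minimal sketch. The abstract does not describe the authors' implementation, so everything below is an assumption made for illustration: it uses only the simplest of the four tagger types (a unigram tagger), written from scratch in plain Python, and the function names, the tag labels, the period names `period_a` and `period_b`, and the toy sentences are hypothetical rather than taken from the paper's data.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Learn the most frequent tag per word, plus a fallback tag for unknown words."""
    counts = defaultdict(Counter)
    all_tags = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
            all_tags[tag] += 1
    model = {word: tags.most_common(1)[0][0] for word, tags in counts.items()}
    default_tag = all_tags.most_common(1)[0][0]  # overall most frequent tag
    return model, default_tag


def tag_words(model, default_tag, words):
    """Tag each word with its learned tag, or the fallback tag if unseen."""
    return [(word, model.get(word, default_tag)) for word in words]


def accuracy(model, default_tag, gold_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag_words(model, default_tag, words)
        for (_, pred_tag), (_, gold_tag) in zip(predicted, sentence):
            correct += pred_tag == gold_tag
            total += 1
    return correct / total if total else 0.0


def mean_accuracy(model, default_tag, gold_by_period):
    """Per-period accuracies and their unweighted mean over all periods."""
    scores = {
        period: accuracy(model, default_tag, sentences)
        for period, sentences in gold_by_period.items()
    }
    return scores, sum(scores.values()) / len(scores)


# Toy data, invented for illustration only (not from the paper's corpora).
train = [
    [("der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN")],
    [("die", "ART"), ("Katze", "NN"), ("und", "KON"), ("Maus", "NN"), ("schlafen", "VVFIN")],
]
gold_by_period = {
    "period_a": [[("der", "ART"), ("Hund", "NN"), ("bellt", "VVFIN")]],
    "period_b": [[("die", "ART"), ("Thür", "NN"), ("knarrt", "VVFIN")]],
}

model, default_tag = train_unigram(train)
scores, mean_score = mean_accuracy(model, default_tag, gold_by_period)
print(scores, mean_score)
```

In this toy setup the tagger is scored against gold tags for each period separately, and the mean is unweighted across periods, which is one plausible reading of "mean accuracy for all three periods"; a token-weighted mean would be an equally simple variant.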

Despite the heterogeneity of literary data, these results show that the POS-taggers perform well when the models are trained on the target language variety. This adaptable approach therefore provides reliable data, allowing taggers to be used for the analysis of different historical texts.

References

[1] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
