DOI: 10.1145/3078081.3078111

Analysis of Part-Of-Speech Tagging of Historical German Texts

Published: 01 June 2017

Abstract

The amount of data in contemporary digital corpora is too large to be processed manually, which increases the need for computational linguistic tools in the humanities. However, natural language is a challenge for automatic tools because languages are used heterogeneously. To process a text, taggers are often used that were trained on a standardized language variety (e.g. recent newspaper articles). Unfortunately, these training data often differ from the target texts (i.e. the texts to which a trained model is later applied) in language variety and register, which is especially the case for historical texts. Additional manual analyses are therefore usually unavoidable. Training tools on the target language variety, however, can improve their results to the point where manual post-processing could be avoided. Thus, the need to process large datasets of diachronic texts and to obtain accurate results within a short time span requires an adaptable approach.
The present paper proposes such an adaptable approach: training taggers on the target language variety to improve the accuracy of the annotation of historical German corpora at the level of part-of-speech tagging (hereafter POS-tagging).
We trained four taggers (Perceptron tagger [26], Hidden Markov Model (HMM) [1], Conditional Random Fields (CRF) [13], and Unigram [21]) each on data from three different literary periods: Baroque (1600-1700), Romanticism (1790-1840) and Modernism (1880-1930). Compared with pre-tagged data, we obtained a maximum accuracy in POS-tagging of 98.3% for a single period (Modernism with Perceptron trained on Modernism) and a maximum mean accuracy for all three periods of 94.3% (Perceptron trained on Romanticism). Compared with manually tagged data, we obtained a maximum accuracy for one period of 96.8% (Romanticism with CRF and HMM trained on Romanticism) and a maximum mean accuracy for all three periods of 92.3% (Perceptron trained on Romanticism).
In spite of the heterogeneity of literary data, these results demonstrate a high performance of the POS-taggers if the models are trained on target language varieties. Therefore, this adaptable approach provides reliable data allowing the use of taggers for analysis of different historical texts.
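The evaluation scheme described above (train one model per period, score it on every period, then average) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it uses only the simplest of the four tagger types (a unigram tagger), the sentences in `CORPORA` are invented toy examples rather than data from the paper, the tag labels are STTS-style, and the choice of `NN` as the fallback tag for unknown words is an assumption.

```python
from collections import Counter, defaultdict

# Invented toy sentences (NOT taken from the paper's corpora).
# Tags are STTS-style labels; spellings imitate period variation.
CORPORA = {
    "Baroque": {
        "train": [
            [("vnd", "KON"), ("die", "ART"), ("Seel", "NN"),
             ("ist", "VAFIN"), ("frey", "ADJD")],
            [("der", "ART"), ("Todt", "NN"), ("kompt", "VVFIN")],
        ],
        "test": [[("die", "ART"), ("Seel", "NN"), ("kompt", "VVFIN")]],
    },
    "Modernism": {
        "train": [
            [("und", "KON"), ("die", "ART"), ("Seele", "NN"),
             ("ist", "VAFIN"), ("frei", "ADJD")],
            [("der", "ART"), ("Tod", "NN"), ("kommt", "VVFIN")],
        ],
        "test": [[("die", "ART"), ("Seele", "NN"), ("kommt", "VVFIN")]],
    },
}


def train_unigram(tagged_sents):
    """Map each lowercased word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def accuracy(model, gold_sents, default_tag="NN"):
    """Share of tokens whose predicted tag equals the gold tag.

    Unknown words receive default_tag (an assumption of this sketch).
    """
    correct = total = 0
    for sent in gold_sents:
        for word, gold in sent:
            predicted = model.get(word.lower(), default_tag)
            correct += predicted == gold
            total += 1
    return correct / total


# One model per training period.
models = {period: train_unigram(data["train"]) for period, data in CORPORA.items()}

# matrix[trained_on][tested_on] = token-level accuracy.
matrix = {
    trained_on: {
        tested_on: accuracy(model, CORPORA[tested_on]["test"])
        for tested_on in CORPORA
    }
    for trained_on, model in models.items()
}

# Mean accuracy over all test periods for each training period.
mean_accuracy = {
    period: sum(row.values()) / len(row) for period, row in matrix.items()
}

for trained_on, row in matrix.items():
    print(trained_on, row, "mean:", round(mean_accuracy[trained_on], 3))
```

In this toy setup a model scores higher on its own period than on the other one only because the spellings were constructed that way; the numbers say nothing about the accuracies reported in the abstract.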

References

[1]
L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
[2]
Berlin-Brandenburgische Akademie der Wissenschaften. Deutsches Textarchiv. http://www.deutschestextarchiv.de/. Online; accessed 24-May-2016.
[3]
M. Bollmann. POS tagging for historical texts with sparse training data. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 11–18, 2013.
[4]
M. A. Boukhaled and J.-G. Ganascia. Extraction et REcherche de MOtifs Syntaxiques. http://eremos.lip6.fr/index_DE.php. Online; accessed 24-May-2016.
[5]
S. Brants, S. Dipper, P. Eisenberg, S. Hansen-Schirra, E. König, W. Lezius, C. Rohrer, G. Smith, and H. Uszkoreit. TIGER: Linguistic interpretation of a German corpus. Research on Language and Computation, 2(4):597–620, 2004.
[6]
Compiled by W. N. Francis and H. Kučera. A standard corpus of present-day edited American English, for use with digital computers (Brown). Providence, Brown University, 1964, 1971, 1979.
[7]
S. Dipper. POS-tagging of historical language data: First experiments. In KONVENS, pages 117–121, 2010.
[8]
S. Dipper. Morphological and part-of-speech tagging of historical language data: A comparison. In Workshop on Annotation of Corpora, 2012.
[9]
D. Freeborn, P. French, and D. Langford. Varieties of English. Macmillan Press, Basingstoke [et al.], 1993.
[10]
F. Frontini, M. A. Boukhaled, and J.-G. Ganascia. Linguistic pattern extraction and analysis for classic French plays. In Presentation at the CONSCILA Workshop, Paris, 2015.
[11]
E. Giesbrecht and S. Evert. Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In Proceedings of the Fifth Web as Corpus Workshop, pages 27–35, 2009.
[12]
H. Göttner-Abendroth and J. Jacobs. Der logische Bau von Literaturtheorien. 1978.
[13]
J. M. Hammersley and P. Clifford. Markov fields on finite graphs and lattices. 1971.
[14]
B. Herrmann and G. Lauer. Das "Was-bisher-geschah" von KOLIMO. Ein Update zum Korpus der Literarischen Moderne. 4th International Conference "Digital Humanities im deutschsprachigen Raum" (DHd), Bern, February 2017.
[15]
B. Jurish. Efficient online k-best lookup in weighted finite-state cascades. Language and Logos: Studies in Theoretical and Computational Linguistics, 72:313–327, 2010.
[16]
B. Jurish. Canonicalizing the Deutsches Textarchiv. 2013.
[17]
B. Jurish and K.-M. Würzner. Word and sentence tokenization with hidden Markov models. JLCL, 28(2):61–83, 2013.
[18]
T. Köppe and S. Winko. Neuere Literaturtheorien. Metzler, 2013.
[19]
I. Lancashire, J. Bradley, W. McCarty, M. Stairs, and T. Wooldridge. Using Tact with Electronic Texts: A Guide to Text-Analysis Computing Tools: Version 2.1 for MS-DOS and PC DOS. Modern Language Association of America, 1996.
[20]
I. Lancashire and G. Hirst. Vocabulary changes in Agatha Christie's mysteries as an indication of dementia: a case study. In 19th Annual Rotman Research Institute Conference, Cognitive Aging: Research and Practice, pages 8--10, 2009.
[21]
C. D. Manning, P. Raghavan, H. Schütze, et al. Introduction to Information Retrieval, volume 1. Cambridge University Press, Cambridge, 2008.
[22]
W. O'Grady, J. Archibald, M. Aronoff, and J. Rees-Miller. Contemporary Linguistics. St. Martin's Press, 2001.
[23]
H. J. Ottenheimer. The Anthropology of Language. Thomson Wadsworth, 2006.
[24]
G. Pasternack. Theoriebildung in der Literaturwissenschaft, volume 2. W. Fink, 1975.
[25]
J. W. Pennebaker, R. L. Boyd, K. Jordan, and K. Blackburn. The development and psychometric properties of LIWC2015. UT Faculty/Researcher Works, 2015.
[26]
F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386, 1958.
[27]
H. Schmid. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, volume 12, pages 44–49. Citeseer, 1994.
[28]
H. Schmid and F. Laws. Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of the 22nd International Conference on Computational Linguistics, Volume 1, pages 777–784. Association for Computational Linguistics, 2008.
[29]
D. Steding. EnsembleForest: A Classifier Combination Method on the example of Part-of-Speech Tagging. Bachelor's thesis, University of Goettingen, Germany, 2017.
[30]
H. Tausch. Literatur um 1800: klassisch-romantische Moderne. Walter de Gruyter, 2011.
[31]
I. H. Witten and E. Frank. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

Published In

DATeCH2017: Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
June 2017
179 pages
ISBN:9781450352659
DOI:10.1145/3078081
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Natural language processing
  2. diachronic texts
  3. historical German data
  4. reliability

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

DATeCH2017

Acceptance Rates

DATeCH2017 Paper Acceptance Rate 29 of 37 submissions, 78%;
Overall Acceptance Rate 60 of 86 submissions, 70%

Article Metrics

  • Total Citations: 0
  • Total Downloads: 167
  • Downloads (Last 12 months): 2
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 14 Feb 2025