Application of Stacked Methods to Part-of-Speech Tagging of Polish

Kuta, Marcin; Wójcik, Wojciech; Wrzeszcz, Michał; Kitowski, Jacek

doi:10.1007/978-3-642-14390-8_35

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6067)

Included in the following conference series:

International Conference on Parallel Processing and Applied Mathematics

Abstract

We compare the accuracy of several single and combination part-of-speech tagging methods applied to Polish and evaluated on the modified corpus of the Frequency Dictionary of Contemporary Polish (m-FDCP). Three well-known combination methods (weighted voting, distributed voting, and stacked) are analyzed, and two new weighted voting methods, MorphCatPrecision and AmbClassPrecision, are proposed. The MorphCatPrecision method achieves the highest accuracy among all considered weighted voting methods. The best combination method achieves an 11.9% error reduction with respect to the best baseline tagger.
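To make the ideas of weighted voting and relative error reduction concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how MorphCatPrecision or AmbClassPrecision compute their weights, so the weights here are hypothetical per-tagger values, and the tagger names and tags are invented for illustration.

```python
from collections import defaultdict


def weighted_vote(proposals, weights):
    """Pick the tag with the largest total weight for a single token.

    proposals: dict mapping tagger name -> tag proposed for the token
    weights:   dict mapping tagger name -> non-negative weight
               (hypothetical values; the paper's weighting schemes are
               not described in the abstract)
    """
    scores = defaultdict(float)
    for tagger, tag in proposals.items():
        scores[tag] += weights[tagger]
    # Iterate over tags in sorted order so ties resolve deterministically.
    return max(sorted(scores), key=lambda tag: scores[tag])


def error_reduction(baseline_accuracy, combined_accuracy):
    """Relative error reduction of a combined tagger over a baseline."""
    baseline_error = 1.0 - baseline_accuracy
    combined_error = 1.0 - combined_accuracy
    return (baseline_error - combined_error) / baseline_error


# Illustrative use with made-up taggers, tags, and weights.
proposals = {"tagger_a": "subst", "tagger_b": "adj", "tagger_c": "adj"}
weights = {"tagger_a": 0.9, "tagger_b": 0.4, "tagger_c": 0.4}
chosen = weighted_vote(proposals, weights)
```

In this toy case the single heavily weighted tagger outvotes the two lighter ones, which is the behaviour that separates weighted voting from simple majority voting. Under the error-reduction definition assumed here, moving from 90% to 92% accuracy would be a 20% reduction; the 11.9% figure reported above is the authors' result and is not reproduced by this sketch.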

We also report the statistical significance of the differences in accuracy between the methods, measured by means of the McNemar test.
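As an illustration of how a McNemar test compares two taggers on the same tokens, the sketch below counts the tokens on which exactly one tagger is correct and computes a continuity-corrected chi-squared statistic with one degree of freedom. The abstract does not state which variant of the test the authors used, so the continuity correction and the function name `mcnemar` are assumptions.

```python
import math


def mcnemar(gold, tags_a, tags_b):
    """Continuity-corrected McNemar test for two taggers on the same tokens.

    Returns (statistic, p_value). Only discordant tokens matter:
    only_a = tokens tagger A gets right and tagger B gets wrong,
    only_b = tokens tagger B gets right and tagger A gets wrong.
    """
    only_a = only_b = 0
    for g, a, b in zip(gold, tags_a, tags_b):
        a_ok, b_ok = (a == g), (b == g)
        if a_ok and not b_ok:
            only_a += 1
        elif b_ok and not a_ok:
            only_b += 1
    discordant = only_a + only_b
    if discordant == 0:
        # The taggers never disagree in correctness: no evidence of a difference.
        return 0.0, 1.0
    # Clamp at zero so equal discordant counts give a statistic of 0.
    statistic = max(abs(only_a - only_b) - 1, 0) ** 2 / discordant
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Toy data: tagger A is right on 10 tokens where B is wrong, B on 2 where A is wrong.
gold = ["x"] * 12
tags_a = ["x"] * 10 + ["y"] * 2
tags_b = ["y"] * 10 + ["x"] * 2
statistic, p_value = mcnemar(gold, tags_a, tags_b)
```

Tokens that both taggers get right, or both get wrong, do not affect the statistic, which is why the test suits paired comparisons on a shared test corpus.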

Selection of the best algorithms was conducted on a multiprocessor supercomputer due to the high time and memory requirements of most of these algorithms.

Author information

Authors and Affiliations

Institute of Computer Science, AGH-UST, al. Mickiewicza 30, Kraków, Poland
Marcin Kuta, Wojciech Wójcik, Michał Wrzeszcz & Jacek Kitowski

Editor information

Editors and Affiliations

Institute of Computational and Information Sciences, Czestochowa University of Technology,
Roman Wyrzykowski
Department of Electrical Engineering and Computer Science, University of Tennessee, TN 37996-3450, Knoxville, USA
Jack Dongarra
Institute of Computer and Information Science, Czestochowa University of Technology, Dabrowskiego 73, PL-42-200, Czestochowa, Poland
Konrad Karczewski
Department of Informatics and Mathematical Modeling, Technical University of Denmark, Richard Petersens Plads, Building 321, 2800, Kongens Lyngby, Denmark
Jerzy Wasniewski

About this paper

Cite this paper

Kuta, M., Wójcik, W., Wrzeszcz, M., Kitowski, J. (2010). Application of Stacked Methods to Part-of-Speech Tagging of Polish. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Wasniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2009. Lecture Notes in Computer Science, vol 6067. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14390-8_35

DOI: https://doi.org/10.1007/978-3-642-14390-8_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14389-2
Online ISBN: 978-3-642-14390-8
eBook Packages: Computer Science, Computer Science (R0)
