Skip to main content

Large-Scale Experiments with NP Chunking of Polish

  • Conference paper
Text, Speech and Dialogue (TSD 2012)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 7499))

Included in the following conference series:

Abstract

The published experiments with shallow parsing for Slavic languages are characterised with small size of the corpora used. With the publication of the National Corpus of Polish (NCP), a new opportunity was opened: to test several chunking algorithms on the 1-million token manually annotated subcorpus of the NCP. We test three Machine Learning techniques: Decision Tree induction, Memory-Based Learning and Conditional Random Fields. We also investigate the influence of tagging errors on the overall chunker performance, which happens to be quite substantial.

This work was financed by the National Centre for Research and Development (NCBiR) project SP/I/1/77065/10.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Abney, S.: Parsing by chunks. In: Principle-Based Parsing, pp. 257–278. Kluwer Academic Publishers (1991)

    Google Scholar 

  2. Ramshaw, L.A., Marcus, M.P.: Text chunking using transformation-based learning. In: Proceedings of the Third ACL Workshop on Very Large Corpora, Cambridge, MA, USA, pp. 82–94 (1995)

    Google Scholar 

  3. Osenova, P.: Bulgarian nominal chunks and mapping strategies for deeper syntactic analyses. In: Proceedings of the Workshop on Treebanks and Linguistic Theories, TLT 2002, Sozopol, Bulgaria, September 20–21 (2002)

    Google Scholar 

  4. Przepiórkowski, A.: Powierzchniowe przetwarzanie języka polskiego. Akademicka Oficyna Wydawnicza EXIT, Warsaw (2008)

    Google Scholar 

  5. Vučković, K., Tadić, M., Dovedan, Z.: Rule-based chunker for croatian. In: Proceedings of the Sixth International Language Resources and Evaluation (LREC 2008). ELRA, Marrakech, Morocco (2008)

    Google Scholar 

  6. Vučković, K.: Model parsera za Hrvatski jezik. Ph.D. thesis, Department of Information Sciences, Faculty of Humanities and Social Sciences, University of Zagreb, Croatia (2009)

    Google Scholar 

  7. Grác, M., Jakubíček, M., Kovář, V.: Through low-cost annotation to reliable parsing evaluation. In: Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation, Tokio, Waseda University, pp. 555–562 (2010)

    Google Scholar 

  8. Radziszewski, A., Piasecki, M.: A preliminary noun phrase chunker for Polish. In: Proceedings of the Intelligent Information Systems (2010)

    Google Scholar 

  9. Maziarz, M., Radziszewski, A., Wieczorek, J.: Chunking of Polish: guidelines, discussion and experiments with Machine Learning. In: Proceedings of the 5th Language & Technology Conference, LTC 2011, Poznań, Poland (2011)

    Google Scholar 

  10. Nenadić, G.: Local Grammars and Parsing Coordination of Nouns in Serbo-Croatian. In: Sojka, P., Kopeček, I., Pala, K. (eds.) TSD 2000. LNCS (LNAI), vol. 1902, pp. 57–62. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  11. Vučković, K., Agić, Ž., Tadić, M.: Improving chunking accuracy on Croatian texts by morphosyntactic tagging. In: Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), European Language Resources Association (ELRA), Valletta (2010)

    Google Scholar 

  12. Przepiórkowski, A., Górski, R.L., Łaziński, M., Pęzik, P.: Recent developments in the National Corpus of Polish. In: Proceedings of the Seventh International Conference on Language Resources and Evaluation, LREC 2010, Valletta, Malta, ELRA (2010)

    Google Scholar 

  13. Waszczuk, J., Głowińska, K., Savary, A., Przepiórkowski, A.: Tools and methodologies for annotating syntax and named entities in the National Corpus of Polish. In: Proceedings of the International Multiconference on Computer Science and Information Technology (IMCSIT 2010): Computational Linguistics – Applications, CLA 2010, Wisła, Poland, PTI, pp. 531–539 (2010)

    Google Scholar 

  14. Sang, E.F.T.K., Veenstra, J.: Representing text chunks. In: Proceedings of the Ninth Conference on European Chapter of the Association for Computational Linguistics, pp. 173–179. Association for Computational Linguistics, Morristown (1999)

    Chapter  Google Scholar 

  15. Bird, S., Loper, E.: Nltk: The natural language toolkit. In: Proceedings of the ACL Demonstration Session, pp. 214–217. Association for Computational Linguistics, Barcelona (2004)

    Google Scholar 

  16. Daelemans, W., Zavrel, J., van der Sloot, K., van den Bosch, A.: TiMBL: Tilburg Memory Based Learner, version 6.3, reference guide. Technical Report 10-01, ILK (2010)

    Google Scholar 

  17. Sha, F., Pereira, F.C.N.: Shallow parsing with conditional random fields. In: Proceedings of the 2003 Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics, HLT/NAACL 2003 (2003)

    Google Scholar 

  18. Wallach, H.M.: Conditional random fields: An introduction. Technical report, Department of Computer and Information Science, University of Pennsylvania (2004)

    Google Scholar 

  19. Radziszewski, A., Śniatowski, T.: A memory-based tagger for Polish. In: Proceedings of the 5th Language & Technology Conference, Poznań, Poland (2011)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Radziszewski, A., Pawlaczek, A. (2012). Large-Scale Experiments with NP Chunking of Polish. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science(), vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_17

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-32790-2_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32789-6

  • Online ISBN: 978-3-642-32790-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics