
Word Sense Disambiguation of Czech Texts

  • Conference paper
Text, Speech and Dialogue (TSD 1999)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1692)


Abstract

This contribution describes a project of BYLL Software Ltd. that uses human-aided word sense disambiguation (WSD) to annotate ASPI, a full-text database of the Czech legal system. We used about 3 million words of annotated texts from the legal system of the Czech Republic, dating from the 1960s onward. The annotated law corpus exhibits a certain textual regularity, but at the same time it covers a wide range of subjects. The goal has been to save as much human intervention during text indexing as possible, measured by the number of queries posed to the human annotator, while retaining a truly minimal error rate (∼0.5%) in the automatically disambiguated cases. A combination of Naive Bayes, Decision Lists, and a minimal number of manually written rules has been used. The statistical methods proved appropriate for our purpose. The results show that we have saved 80% of queries to the human annotator, which proved to be enough to warrant the inclusion of the software in a production system.
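The trade-off described in the abstract (answer automatically only when the error risk is very low, otherwise query the human annotator) can be illustrated with a minimal sketch. The abstract does not give the features, thresholds, or the way Naive Bayes, Decision Lists, and manual rules were combined, so everything below is an assumption: the class name `AbstainingNaiveBayes`, the confidence threshold, the add-one smoothing, and the toy English training data with the sense labels `STATUTE` and `PRINCIPLE` are hypothetical and cover only the Naive Bayes part.

```python
import math
from collections import Counter, defaultdict


class AbstainingNaiveBayes:
    """Toy Naive Bayes sense classifier that defers to a human when unsure.

    Hypothetical illustration only; not the system described in the paper.
    """

    def __init__(self, threshold=0.995, alpha=1.0):
        self.threshold = threshold   # minimum posterior needed to answer automatically
        self.alpha = alpha           # additive (Laplace) smoothing constant
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (context_words, sense) pairs."""
        for context, sense in examples:
            self.sense_counts[sense] += 1
            for word in context:
                self.word_counts[sense][word] += 1
                self.vocab.add(word)

    def posteriors(self, context):
        """Return P(sense | context) for every sense seen in training."""
        total = sum(self.sense_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for sense, n in self.sense_counts.items():
            denom = sum(self.word_counts[sense].values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for word in context:
                if word in self.vocab:  # words never seen in training are ignored
                    score += math.log((self.word_counts[sense][word] + self.alpha) / denom)
            log_scores[sense] = score
        # Normalise in log space to avoid underflow on long contexts.
        top = max(log_scores.values())
        exp_scores = {s: math.exp(x - top) for s, x in log_scores.items()}
        norm = sum(exp_scores.values())
        return {s: e / norm for s, e in exp_scores.items()}

    def disambiguate(self, context):
        """Return the best sense, or None to signal a query to the human annotator."""
        post = self.posteriors(context)
        sense, prob = max(post.items(), key=lambda kv: kv[1])
        return sense if prob >= self.threshold else None


# Hypothetical training data for one ambiguous word with two senses.
EXAMPLES = [
    (["paragraph", "amendment", "collection"], "STATUTE"),
    (["paragraph", "collection", "effective"], "STATUTE"),
    (["physics", "gravity", "nature"], "PRINCIPLE"),
    (["nature", "gravity", "science"], "PRINCIPLE"),
]

if __name__ == "__main__":
    clf = AbstainingNaiveBayes(threshold=0.9)
    clf.train(EXAMPLES)
    print(clf.disambiguate(["paragraph", "collection", "amendment"]))
    print(clf.disambiguate([]))  # no evidence, so the case goes to the annotator
```

Raising `threshold` sends more cases to the annotator and lowers the error rate among the automatically resolved ones. That is the same trade-off the abstract reports as 80% of queries saved at an error rate of about 0.5%, although the sketch makes no claim to reproduce those figures.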




Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cikhart, O., Hajič, J. (1999). Word Sense Disambiguation of Czech Texts. In: Matousek, V., Mautner, P., Ocelíková, J., Sojka, P. (eds) Text, Speech and Dialogue. TSD 1999. Lecture Notes in Computer Science, vol 1692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48239-3_20


  • DOI: https://doi.org/10.1007/3-540-48239-3_20


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66494-9

  • Online ISBN: 978-3-540-48239-0

  • eBook Packages: Springer Book Archive
