Abstract
This contribution describes a project of BYLL Software Ltd. that uses human-aided word sense disambiguation (WSD) to annotate ASPI, a full-text database of the Czech legal system. We used about 3 million words of annotated texts from the legal system of the Czech Republic, dating from the 1960s onward. The annotated law corpus exhibits a certain textual regularity, yet it covers a wide range of subjects. The goal was to save as much human intervention during text indexing as possible, measured by the number of queries posed to the human annotator, while retaining a truly minimal error rate (∼0.5%) in the automatically disambiguated cases. A combination of Naive Bayes, decision lists, and a minimal number of manually written rules was used. The statistical methods proved appropriate for our purpose. The results show that we saved 80% of the queries to the human annotator, which proved sufficient to warrant including the software in a production system.
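The trade-off described in the abstract, automatic decisions only when the classifier is confident and a query to the human annotator otherwise, can be illustrated with a minimal sketch. This is not the authors' system: it covers only a Naive Bayes component (no decision lists or manual rules), and the class name `DeferringNaiveBayes`, the `threshold` parameter, the add-alpha smoothing, and the toy training data are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class DeferringNaiveBayes:
    """Toy Naive Bayes sense classifier that defers to a human when unsure.

    Hypothetical illustration only; not the method used in the paper.
    """

    def __init__(self, threshold=0.95, alpha=1.0):
        self.threshold = threshold   # assumed confidence cut-off
        self.alpha = alpha           # add-alpha smoothing constant
        self.sense_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (context_words, sense) pairs."""
        for context, sense in examples:
            self.sense_counts[sense] += 1
            for word in context:
                self.word_counts[sense][word] += 1
                self.vocab.add(word)

    def posteriors(self, context):
        """Return P(sense | context) for every sense seen in training."""
        total = sum(self.sense_counts.values())
        vocab_size = len(self.vocab)
        log_scores = {}
        for sense, n in self.sense_counts.items():
            denom = sum(self.word_counts[sense].values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for word in context:
                if word in self.vocab:  # ignore words never seen in training
                    count = self.word_counts[sense][word]
                    score += math.log((count + self.alpha) / denom)
            log_scores[sense] = score
        # Normalise in log space to avoid underflow on long contexts.
        top = max(log_scores.values())
        exp = {s: math.exp(x - top) for s, x in log_scores.items()}
        z = sum(exp.values())
        return {s: e / z for s, e in exp.items()}

    def decide(self, context):
        """Return a sense, or None to signal a query to the human annotator."""
        post = self.posteriors(context)
        best = max(post, key=post.get)
        return best if post[best] >= self.threshold else None


# Toy data with made-up sense labels for one ambiguous word.
examples = [
    (["river", "water"], "shore"),
    (["water", "fishing"], "shore"),
    (["money", "loan"], "finance"),
    (["loan", "interest"], "finance"),
]
clf = DeferringNaiveBayes(threshold=0.9)
clf.train(examples)
confident = clf.decide(["money", "loan", "interest"])  # resolved automatically
deferred = clf.decide(["water", "loan"])               # evidence is balanced
```

Raising `threshold` lowers the error rate among automatic decisions at the cost of more queries to the annotator; the fraction of `None` results corresponds to the query count that the abstract uses as its measure of saved effort.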
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Cikhart, O., Hajič, J. (1999). Word Sense Disambiguation of Czech Texts. In: Matousek, V., Mautner, P., Ocelíková, J., Sojka, P. (eds) Text, Speech and Dialogue. TSD 1999. Lecture Notes in Computer Science, vol 1692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48239-3_20
DOI: https://doi.org/10.1007/3-540-48239-3_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66494-9
Online ISBN: 978-3-540-48239-0
eBook Packages: Springer Book Archive