ABSTRACT
This paper describes the processing of linguistic data obtained through OCR, specifically their use in constructing dictionary databases and in subsequent lemmatization. The process is demonstrated on Czech prints from the 19th century.