Data processing and lemmatization in digitized 19th-century Czech texts

Authors:
Karel Kučera, Charles University in Prague, Czech Republic, Praha
Martin Stluka, Charles University in Prague, Czech Republic, Praha

DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, May 2014, Pages 193–196
https://doi.org/10.1145/2595188.2595220

Published: 19 May 2014

ABSTRACT

The paper describes the processing of linguistic data obtained through OCR, specifically their use in constructing dictionary databases and in subsequent lemmatization. The process is demonstrated on Czech printed texts from the 19th century.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
    1. Document representation
      1. Content analysis and feature selection
      2. Dictionaries

Published in
DATeCH '14: Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage
May 2014
200 pages
ISBN: 9781450325882
DOI: 10.1145/2595188
Program Chairs:
Apostolos Antonacopoulos, University of Salford
Klaus U. Schulz, Ludwig-Maximilians-Universität München
Copyright © 2014 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Czech
dictionary database
hyperlemma
lemmatization
lexica
retrieval
Qualifiers
- research-article
Acceptance Rates
DATeCH '14 Paper Acceptance Rate: 31 of 49 submissions, 63%
Overall Acceptance Rate: 60 of 86 submissions, 70%