Enabling Annotation of Historical Corpora in an Asynchronous Collaborative Environment
Current research in Corpus Linguistics and related disciplines within the multi-disciplinary field of Digital Humanities involves computer-aided manual processing of large text corpora. Typically, corpus instances are retrieved with the help of ...
Allegro: User-centered Design of a Tool for the Crowdsourced Transcription of Handwritten Music Scores
In this paper, we describe the challenge of transcribing a large corpus of handwritten music scores. We conducted an evaluation study of three existing optical music recognition (OMR) tools. The evaluation results indicate that OMR approaches do not ...
The RetroC challenge: how to guess the publication year of a text?
This article describes research in automatic content-based temporal classification of texts. Experiments are carried out on a set of texts coming from Polish digital libraries, dating between the years 1814 and 2013. Following successful research in the ...
Parsing Romanian Specialized Dictionaries Structured in Nests
This paper presents a tool for processing dictionaries in Word format and for obtaining the XML format which can be used in various applications. DEPAR (Dictionary Entry Parser) permits the introduction of a specific set of rules to describe the ...
Analysis of Part-Of-Speech Tagging of Historical German Texts
The amount of data in contemporary digital corpora is too large to be processed manually, which increases the need for computational linguistic tools in the humanities. However, the processing of natural languages is a challenge for automatic tools, because ...
Dependency Parsing on Late-18th-Century German Aesthetic Writings: A Preliminary Inquiry into Schiller and F. Schlegel
Data-driven syntactic parsers are usually trained, tested and developed on web-news data. Little has been done to evaluate them on literary genres of different ages, which are still low-resource varieties in terms of syntactic annotation. In this paper, ...
In search of Poetic Rhythm: Poetry retrieval through text and metre
In this paper a search service developed for the exploitation of a TEI-based Spanish poetry corpus is presented. Besides a textual retrieval, the search service takes advantage of the metrical annotation to retrieve verses and poems with specific ...
Profiling of OCR'ed Historical Texts Revisited
In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a ...
Clear-cut methodology for Arabic OCR and post-correction with low technical skilled annotators
This paper describes an efficient and straightforward methodology for OCR-ing and post-correcting Arabic text material on Islamic embryology collected for the COBHUNI project. As the target texts of the project include diverse diachronic stages of the ...
Poor Man's OCR Post-Correction: Unsupervised Recognition of Variant Spelling Applied to a Multilingual Document Collection
The accuracy of Optical Character Recognition (OCR) sets the limit for the success of subsequent applications used in a text analysis pipeline. Recent models of OCR postprocessing significantly improve the quality of OCR-generated text but require ...
OCR of a Mixed Corpus: Early Printings and Manuscripts of Martianus Capella
This paper deals with the application of the digitization methods designed by the LMU CIS team and with the encoding of the data obtained, with the aim of building an edition of a Latin author based on the first printed editions and on two manuscripts of the ...
The Impact of Unassimilated Loanwords on the Latin Lexicon. A Qualitative and Quantitative Analysis
The recent enhancement of the morphological analyser for Latin Lemlat with a large Onomasticon enables us to analyse both the morphology and the distribution of loanwords in the Latin lexicon. In this paper, first we describe the categories of proper ...
A Memory-Based Lemmatizer for Ancient Greek
In this paper we present the lemmatizer that we developed for Ancient Greek: GLEM. As far as we know, GLEM is the first publicly available lemmatizer for Ancient Greek that uses POS information to disambiguate and that also assigns output to unseen ...
Implementation of a Latin Grammar in Grammatical Framework
In this paper we present work in developing a computerized grammar for the Latin language. It demonstrates the principles and challenges in developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a ...
Node Formation: Using Networks to Inspect Productivity in Affixal Derivation in Classical Latin
This paper investigates the distribution of word formation data through network visualisation, as an entry point for the exploration / analysis of productivity in affixal derivation in Classical Latin. This study uses data from the Word Formation Latin ...
Towards an extensible measurement of metadata quality
This paper describes the structure of an extensible metadata quality assessment framework, which supports multiple metadata schemas, and is flexible enough to work with new schemas. The software has to be scalable to be able to process a huge amount of ...
Converting Latin Treebank Data into an SQL Database for Query Purposes
This paper describes how to turn a Latin dependency treebank into queryable information so that it can be browsed online using a tree query engine and its web interface. The annotation layers of the treebank are first introduced, then the query system ...
Porting past Classification Schemes for Narratives to a Linked Data Framework
In this paper we give an overview of a number of completed and ongoing efforts to port electronic versions of past classification schemes in the field of folktale narratives to the Linked Data framework. Three of those schemes are in the ...
A Software Pipeline for the Reception of Italian Literature in Nineteenth-Century England: Preliminary Testing
This paper presents and discusses a project design aimed at producing synthetic and intuitive visualizations of the reception of Italian literature in nineteenth-century England. In the first part, a processing pipeline is described which combines ...
LAREX: A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books
A semi-automatic open-source tool for layout analysis on early printed books is presented. LAREX uses a rule-based connected components approach which is very fast, easily comprehensible for the user and allows an intuitive manual correction if ...
Digitization of Old Romanian Texts Printed in the Cyrillic Script
The paper discusses recognition of Romanian texts of the 17th–20th centuries printed in the Cyrillic script, and their conversion to the modern Latin script. The elaborated technology and a tool pack include historical alphabets, sets of recognition ...
Unearthing the Recent Past: Digitising and Understanding Statistical Information from Census Tables
Censuses comprise a wealth of information at a large (national) scale that allows governments (who commission them) and the public to have a detailed snapshot of how people live (geographical distribution and characteristics). In addition to underpinning ...
Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)
This paper provides the first thorough documentation of a high-quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR-related workflow including preprocessing, layout analysis and text ...
The Ancient Graffiti Project: Geo-Spatial Visualization and Search Tools for Ancient Handwritten Inscriptions
This paper discusses how the Ancient Graffiti Project publishes the digital content of ancient epigraphic material and makes handwritten inscriptions from the first century AD more accessible through the use of geo-referenced, spatial interfaces, ...
Semantic Enrichment on Cultural Heritage collections: A case study using geographic information
Cultural heritage institutions have recently started to explore the added value of sharing their data, using Linked Open Data to integrate and enrich metadata of their collections. The catalogue of the Biblioteca Virtual Miguel de Cervantes contains ...
Toponym disambiguation in historical documents using semantic and geographic features
Historians are often interested in the locations mentioned in digitized collections. However, place names are highly ambiguous and may change over time, which makes it especially hard to automatically ground mentions of places in historical texts to ...
Names, Right or Wrong: Named Entities in an OCRed Historical Finnish Newspaper Collection
Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and ...
Index Terms
- Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage
Acceptance Rates
Year | Submitted | Accepted | Rate |
DATeCH '17 | 37 | 29 | 78% |
DATeCH '14 | 49 | 31 | 63% |
Overall | 86 | 60 | 70% |