Proceedings of the 2nd International Conference on Digital Access to Textual Cultural Heritage

June 2017

2017 Proceeding

Publisher: Association for Computing Machinery, New York, NY, United States

Conference: DATeCH2017: 2nd International Conference on Digital Access to Textual Cultural Heritage, Göttingen, Germany, June 1-2, 2017

ISBN: 978-1-4503-5265-9

Published: 01 June 2017

Abstract

No abstract available.

Proceeding Downloads

Front matter (PDF): Title, Copyright, Contents, Foreword by the programme chairs, Foreword by the organisation chairs, DATeCH 2017 organisation, Programme Committee

Back matter (PDF): List of authors

SESSION: Transcription

research-article

750 Volunteers Transcribing 31,000 Pages with 8.5 million Entries Online: an Evaluation

Pages 3–8 https://doi.org/10.1145/3078081.3078086

research-article

Enabling Annotation of Historical Corpora in an Asynchronous Collaborative Environment

Pages 9–14 https://doi.org/10.1145/3078081.3078089

Current research in Corpus Linguistics and related disciplines within the multi-disciplinary field of Digital Humanities involves computer-aided manual processing of large text corpora. Typically, corpus instances are retrieved with the help of ...

research-article

Allegro: User-centered Design of a Tool for the Crowdsourced Transcription of Handwritten Music Scores

Pages 15–20 https://doi.org/10.1145/3078081.3078101

In this paper, we describe the challenge of transcribing a large corpus of handwritten music scores. We conducted an evaluation study of three existing optical music recognition (OMR) tools. The evaluation results indicate that OMR approaches do not ...

research-article

Enhancing Human-Transcribed Records by Using OCR

Pages 21–26 https://doi.org/10.1145/3078081.3078094

SESSION: Natural Language Processing

research-article

The RetroC challenge: how to guess the publication year of a text?

Pages 29–34 https://doi.org/10.1145/3078081.3078095

This article describes research in automatic content-based temporal classification of texts. Experiments are carried out on a set of texts coming from Polish digital libraries, dating between the years 1814 and 2013. Following successful research in the ...

research-article

Parsing Romanian Specialized Dictionaries Structured in Nests

Pages 35–40 https://doi.org/10.1145/3078081.3078088

This paper presents a tool for processing dictionaries in Word format and for obtaining the XML format which can be used in various applications. DEPAR (Dictionary Entry Parser) permits the introduction of a specific set of rules to describe the ...

research-article

Analysis of Part-Of-Speech Tagging of Historical German Texts

Pages 41–46 https://doi.org/10.1145/3078081.3078111

The amount of data in contemporary digital corpora is too large to be processed manually, which increases the necessity for computer linguistic tools in humanities. However, the processing of natural languages is a challenge for automatic tools, because ...

research-article

Dependency Parsing on Late-18th-Century German Aesthetic Writings: A Preliminary Inquiry into Schiller and F. Schlegel

A. Salomoni

Pages 47–52 https://doi.org/10.1145/3078081.3078091

Data-driven syntactic parsers are usually trained, tested and developed on web-news data. Little has been done to evaluate them on literary genres of different ages, which are still low-resource varieties in terms of syntactic annotation. In this paper, ...

research-article

In search of Poetic Rhythm: Poetry retrieval through text and metre

Pages 53–57 https://doi.org/10.1145/3078081.3078085

In this paper a search service developed for the exploitation of a TEI-based Spanish poetry corpus is presented. Besides a textual retrieval, the search service takes advantage of the metrical annotation to retrieve verses and poems with specific ...

SESSION: OCR & Postprocessing

research-article

Profiling of OCR'ed Historical Texts Revisited

Pages 61–66 https://doi.org/10.1145/3078081.3078096

In the absence of ground truth it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet, for interactive postcorrection of OCR'ed historical printings it is extremely useful to have a ...

research-article

Clear-cut methodology for Arabic OCR and post-correction with low technical skilled annotators

Pages 67–70 https://doi.org/10.1145/3078081.3078103

This paper describes an efficient and straightforward methodology for OCR-ing and post-correcting Arabic text material on Islamic embryology collected for the COBHUNI project. As the target texts of the project include diverse diachronic stages of the ...

research-article

Poor Man's OCR Post-Correction: Unsupervised Recognition of Variant Spelling Applied to a Multilingual Document Collection

Pages 71–75 https://doi.org/10.1145/3078081.3078107

The accuracy of Optical Character Recognition (OCR) sets the limit for the success of subsequent applications used in a text analysis pipeline. Recent models of OCR postprocessing significantly improve the quality of OCR-generated text but require ...

research-article

OCR of a Mixed Corpus: Early Printings and Manuscripts of Martianus Capella

Manuel Ayuso

Pages 77–82 https://doi.org/10.1145/3078081.3078110

This paper deals with the application of the digitization methods designed by the LMU CIS team and with the encoding of the data obtained, with the aim of building an edition of a Latin author based on the first printed editions and on two manuscripts of the ...

SESSION: Natural Language Processing on Latin and Greek

research-article

The Impact of Unassimilated Loanwords on the Latin Lexicon. A Qualitative and Quantitative Analysis

Pages 85–90 https://doi.org/10.1145/3078081.3078083

The recent enhancement of the morphological analyser for Latin Lemlat with a large Onomasticon enables us to analyse both the morphology and the distribution of loanwords in the Latin lexicon. In this paper, first we describe the categories of proper ...

research-article

Open Access

A Memory-Based Lemmatizer for Ancient Greek

Pages 91–95 https://doi.org/10.1145/3078081.3078100

In this paper we present the lemmatizer that we developed for Ancient Greek: GLEM. As far as we know, GLEM is the first publicly available lemmatizer for Ancient Greek that uses POS information to disambiguate and that also assigns output to unseen ...

research-article

Implementation of a Latin Grammar in Grammatical Framework

Herbert Lange

Pages 97–102 https://doi.org/10.1145/3078081.3078108

In this paper we present work in developing a computerized grammar for the Latin language. It demonstrates the principles and challenges in developing a grammar for a natural language in a modern grammar formalism. The grammar presented here provides a ...

research-article

Node Formation: Using Networks to Inspect Productivity in Affixal Derivation in Classical Latin

Pages 103–108 https://doi.org/10.1145/3078081.3078092

This paper investigates the distribution of word formation data through network visualisation, as an entry point for the exploration / analysis of productivity in affixal derivation in Classical Latin. This study uses data from the Word Formation Latin ...

SESSION: Infrastructure & Linked Open Data

research-article

Towards an extensible measurement of metadata quality

Péter Király

Pages 111–115 https://doi.org/10.1145/3078081.3078109

This paper describes the structure of an extensible metadata quality assessment framework, which supports multiple metadata schemas, and is flexible enough to work with new schemas. The software has to be scalable to be able to process a huge amount of ...

research-article

Converting Latin Treebank Data into an SQL Database for Query Purposes

Pages 117–122 https://doi.org/10.1145/3078081.3078087

This paper describes how to turn a Latin dependency treebank into queryable information so that it can be browsed online using a tree query engine and its web interface. The annotation layers of the treebank are first introduced, then the query system ...

research-article

Porting past Classification Schemes for Narratives to a Linked Data Framework

Pages 123–127 https://doi.org/10.1145/3078081.3078105

In this paper we give an overview of a number of completed and ongoing efforts to port electronic versions of past classification schemes in the field of folktale narratives to the Linked Data framework. Three of those schemes are in the ...

research-article

A Software Pipeline for the Reception of Italian Literature in Nineteenth-Century England: Preliminary Testing

S. Rebora

Pages 129–134 https://doi.org/10.1145/3078081.3078102

This paper presents and discusses a project design aimed at producing synthetic and intuitive visualizations of the reception of Italian literature in nineteenth-century England. In the first part, a processing pipeline is described which combines ...

SESSION: Digitisation & Layout Analysis

research-article

LAREX: A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books

Pages 137–142 https://doi.org/10.1145/3078081.3078097

A semi-automatic open-source tool for layout analysis on early printed books is presented. LAREX uses a rule based connected components approach which is very fast, easily comprehensible for the user and allows an intuitive manual correction if ...

research-article

Digitization of Old Romanian Texts Printed in the Cyrillic Script

Pages 143–148 https://doi.org/10.1145/3078081.3078093

The paper discusses recognition of Romanian texts of the 17th–20th centuries printed in the Cyrillic script, and their conversion to the modern Latin script. The elaborated technology and a tool pack include historical alphabets, sets of recognition ...

research-article

Unearthing the Recent Past: Digitising and Understanding Statistical Information from Census Tables

Pages 149–154 https://doi.org/10.1145/3078081.3078106

Censuses comprise a wealth of information at a large (national) scale that allow governments (who commission them) and the public to have a detailed snapshot of how people live (geographical distribution and characteristics). In addition to underpinning ...

research-article

Case Study of a highly automated Layout Analysis and OCR of an incunabulum: 'Der Heiligen Leben' (1488)

Pages 155–160 https://doi.org/10.1145/3078081.3078098

This paper provides the first thorough documentation of a high quality digitization process applied to an early printed book from the incunabulum period (1450-1500). The entire OCR related workflow including preprocessing, layout analysis and text ...

SESSION: Spatial Analysis

research-article

Open Access

The Ancient Graffiti Project: Geo-Spatial Visualization and Search Tools for Ancient Handwritten Inscriptions

Pages 163–168 https://doi.org/10.1145/3078081.3078104

This paper discusses how the Ancient Graffiti Project publishes the digital content of ancient epigraphic material and makes handwritten inscriptions from the first century AD more accessible through the use of geo-referenced, spatial interfaces, ...

research-article

Semantic Enrichment on Cultural Heritage collections: A case study using geographic information

Pages 169–174 https://doi.org/10.1145/3078081.3078090

Cultural heritage institutions have recently started to explore the added value of sharing their data, using Linked Open Data to integrate and enrich metadata of their collections. The catalogue of the Biblioteca Virtual Miguel de Cervantes contains ...

research-article

Toponym disambiguation in historical documents using semantic and geographic features

Pages 175–180 https://doi.org/10.1145/3078081.3078099

Historians are often interested in the locations mentioned in digitized collections. However, place names are highly ambiguous and may change over time, which makes it especially hard to automatically ground mentions of places in historical texts to ...

research-article

Names, Right or Wrong: Named Entities in an OCRed Historical Finnish Newspaper Collection

Pages 181–186 https://doi.org/10.1145/3078081.3078084

Named Entity Recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and ...

Acceptance Rates

DATeCH2017 paper acceptance rate: 29 of 37 submissions (78%)

Overall acceptance rate: 60 of 86 submissions (70%)

Year	Submitted	Accepted	Rate
DATeCH2017	37	29	78%
DATeCH '14	49	31	63%
Overall	86	60	70%
