Pattern Recognition Letters

Volume 129, January 2020, Pages 198-204

What is the minimum training data size to reliably identify writers in medieval manuscripts?

https://doi.org/10.1016/j.patrec.2019.11.030

Highlights

  • We developed a tool to support writer identification in medieval documents.

  • We investigated what is the minimum training data size.

  • We have also introduced a multi-expert classification architecture.

  • We have tested the proposed approach on the two medieval bibles.

Abstract

One of the most important research topics in the field of palaeography is the identification of the different scribes who participated in the writing process of a medieval book. Using traditional palaeographic tools, a palaeographer spends a lot of time reading, measuring and comparing thousands of letters or graphic signs. The aim is to evaluate different characteristics, such as height or width of letters, distance between characters, angles of inclination, number and type of abbreviations etc., which allow a reliable identification of the scribes who contributed to the production of a given manuscript. Despite the growing scientific interest that has been observed in recent years in the use of computer techniques applied to palaeographic research, a general agreement has not yet been reached among researchers, either about the effectiveness of automatic analysis tools, or on the features that should be considered to perform such an analysis. However, in the context of a highly standardized school, the use of some basic page layout features can be very useful for automatically identifying the presence of different hands. In this context, the aim of our study is to verify whether it is possible to strongly reduce the amount of data a palaeographer must analyse manually, in an attempt to answer the following question: what is the minimum size of the training set that allows a classification system to identify the different scribal hands reliably? To this purpose, we have considered two well-known and highly efficient classification techniques, progressively varying the size of the training set and comparing the corresponding classification results. To improve the classification reliability, we have also introduced a multi-expert classification architecture, enabling an easy implementation of a reject option. 
The experimental results, obtained on two large sets of digital images extracted from two entire 12th-century Bibles, show that, using only a few pages of these Bibles as a training set, it is possible to automatically identify the scribal hands in the remaining pages with great reliability.

Introduction

The aim of palaeography is to study ancient documents, in an attempt to ascertain the place and the period in which a manuscript was produced, as well as the methodologies and criteria by which the work was shared among the scribes [23]. In this context, one of the most important problems is certainly the identification of the different hands that contributed to the production of a single medieval book.

Over the years, experts in palaeography have followed different approaches to the analysis of ancient manuscripts, mainly based on the use of traditional tools to measure quantities such as the height and width of letters, distances between characters, angles of inclination, and the number and types of abbreviations. In this context, computer-based techniques have become widely established as a standard tool for carrying out the measurements and comparisons required in the analysis of ancient documents in a more efficient and objective way [4]. By contrast, the debate on the use of computer vision and pattern recognition techniques as a support to traditional palaeographic analysis is still open and characterized by conflicting positions. These techniques, which have become increasingly common in recent years, are based on the application of very effective pattern recognition algorithms to high-quality digital images of medieval manuscripts.

Although there has been a growing scientific interest in the use of computer-based techniques in palaeographic research, there is no general consensus either on the type of techniques to be adopted or on their efficacy [5]. This situation is probably due to the fact that, however promising, these approaches have not yet produced widely accepted results, both because of immaturity in the use of these new technologies and because of the lack of genuinely interdisciplinary research: palaeography experts are often unfamiliar with the basic concepts of pattern recognition and are not able to interact effectively with computer experts to adapt these techniques to the analysis of ancient documents. On the other hand, computer experts, not knowing the peculiarities of medieval writing, tend to apply the techniques commonly used for modern handwriting.

In this context, a further problem is due to the enormous heterogeneity of ancient documents, which have very different characteristics according to the different historical periods, language and styles in which they were produced. This makes the application of standard techniques extremely difficult and requires a specific study for each ancient manuscript considered. However, in the context of a highly standardized school, the selection of some basic features, directly derived from page layout analysis, can be very helpful for automatically distinguishing the different scribes who produced the text. These features can be easily extracted by using standard image processing algorithms.

Moving from these considerations, in previous papers [8], [9] we proposed two different architectures for implementing a classification system able to distinguish the different scribal hands present in a medieval Latin book. In these studies, we designed a set of features, directly derived from page layout analysis, following the suggestions of palaeographic and codicological researchers. The experimental results obtained on two giant medieval Bibles were very interesting and confirmed the effectiveness of our approach. Such results, however, were obtained by exploiting an analysis performed manually by palaeographers on these Bibles: in particular, we randomly selected about half of the labelled data as a training set and the remainder as a test set.

This implies that, to be practically applicable, the proposed approach requires the manual labelling of a significant part of the ancient manuscript to be processed, which is very time-consuming work for palaeographers. In this framework, the aim of the present study is to verify whether it is possible to strongly reduce the amount of data manually processed by palaeographers, in an attempt to answer the following question: what is the minimum amount of training data that allows a classification system to identify the different scribal hands reliably? To this end, we have considered both the same classification architectures and the same data used in the previous studies, and built an experimental protocol in which the size of the training set is progressively increased and the corresponding classification results are evaluated each time.
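The protocol of growing the training set and re-evaluating each time can be illustrated with a small sketch. Everything here is an assumption made for illustration only: the data are synthetic two-dimensional points standing in for page-layout features, the classifier is a plain 1-nearest-neighbour rule rather than the classifiers used in the study, and the test half is held fixed while the training prefix grows.

```python
import random

def nearest_neighbor_predict(train, query):
    """Return the label of the training sample closest to `query` (1-NN)."""
    best_label, best_dist = None, float("inf")
    for features, label in train:
        dist = sum((a - b) ** 2 for a, b in zip(features, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

def accuracy(train, test):
    """Fraction of test samples whose predicted label matches the true one."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in test)
    return hits / len(test)

def make_synthetic_samples(n_per_writer, rng):
    """Hypothetical stand-in data: two 'writers' as 2-D Gaussian clusters."""
    centers = {"writer_A": (0.0, 0.0), "writer_B": (3.0, 3.0)}
    samples = []
    for label, (cx, cy) in centers.items():
        for _ in range(n_per_writer):
            samples.append(((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label))
    rng.shuffle(samples)
    return samples

def learning_curve(samples, train_sizes):
    """Hold out a fixed test half; train on growing prefixes of the other half."""
    half = len(samples) // 2
    pool, test = samples[:half], samples[half:]
    return [(n, accuracy(pool[:n], test)) for n in train_sizes]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = make_synthetic_samples(100, rng)
curve = learning_curve(samples, [2, 5, 10, 25, 50, 100])
for n, acc in curve:
    print(f"train size {n:3d}: accuracy {acc:.2f}")
```

Reading the printed curve from top to bottom shows where accuracy stops improving, which is the kind of evidence needed to pick a minimum training size.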

The experiments, performed on two large sets of digital images extracted from two entire 12th-century Bibles, show that using only a few pages of these bibles as a training set, it is possible to identify the scribal hands automatically in the remaining pages with high reliability. Finally, we want to point out that, to the best of our knowledge, there are no other studies addressing this problem.

The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents the system architecture, and Section 4 shows the experimental results. Discussion and conclusions are left to Section 5.

Section snippets

Related work

The contributions of pattern recognition experts to the field of palaeography can be broadly subdivided into two categories: techniques using information derived from the “local” characterization of the handwritten trace, and those using information extracted by the observation of the entire handwritten page.

The first approach is based on the analysis of individual letters and signs as well as of their composing strokes. In this context, run-length-based features have been proposed in the

The system architecture

The proposed system receives as input RGB images of single pages of the manuscript to be processed, and performs the following steps for each page (see Fig. 1): pre-processing, segmentation, feature extraction, and writer identification. These steps are detailed in the following subsections.
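For the writer identification step, the abstract mentions a multi-expert architecture with a reject option. The combining rule is not given in this excerpt, so the sketch below assumes a simple agreement rule: a page is assigned to a scribe only if enough experts vote for the same label, and is otherwise rejected and left for manual inspection. The `REJECT` marker, the function name, and the `min_agreement` parameter are hypothetical.

```python
from collections import Counter

REJECT = "reject"  # hypothetical marker for pages left to the palaeographer

def combine_with_reject(expert_labels, min_agreement=None):
    """Return the most voted label, or REJECT if too few experts agree.

    By default all experts must agree (a unanimity rule, assumed here).
    """
    if not expert_labels:
        return REJECT
    needed = len(expert_labels) if min_agreement is None else min_agreement
    label, votes = Counter(expert_labels).most_common(1)[0]
    return label if votes >= needed else REJECT

# Two experts agree: the page is assigned.
print(combine_with_reject(["scribe_1", "scribe_1"]))
# Two experts disagree: the page is rejected.
print(combine_with_reject(["scribe_1", "scribe_2"]))
# Three experts, two votes suffice: the page is assigned.
print(combine_with_reject(["scribe_1", "scribe_1", "scribe_2"], min_agreement=2))
```

A stricter agreement threshold trades coverage for reliability: more pages are rejected, but those that are classified are less likely to be wrong.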

Experimental results

As anticipated in the Introduction, in our experiments we considered two datasets of digital images obtained from two medieval bibles, namely the “Avila Bible” and the “Trento Bible”. For both datasets, the extracted data were normalized by using the Z-normalization method.
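Z-normalization rescales each feature to zero mean and unit standard deviation, z = (x - mean) / std. The following is a minimal sketch using only the Python standard library. The feature values are made up, the population standard deviation is assumed, and constant features are left unscaled to avoid division by zero. These choices are not stated in the source.

```python
from statistics import fmean, pstdev

def fit_z_params(rows):
    """Compute per-feature mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(c) for c in columns]
    # A constant feature has std 0; fall back to 1.0 so it maps to 0.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds

def z_normalize(rows, means, stds):
    """Apply z = (x - mean) / std to every feature of every row."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

# Hypothetical feature vectors (one row per sample, one column per feature).
rows = [[10.0, 200.0], [12.0, 220.0], [14.0, 240.0]]
means, stds = fit_z_params(rows)
normalized = z_normalize(rows, means, stds)
for row in normalized:
    print([round(v, 3) for v in row])
```

In a train/test setting, the means and standard deviations would usually be fitted on the training data only and then reused for the test data, so that no information from the test set leaks into training.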

In order to investigate what is the minimum amount of data to train the considered classifiers effectively, so as to allow them to distinguish the different scribal hands reliably, we randomly selected half of the available

Discussion and conclusions

In the context of a highly standardized school, where the use of some basic page layout features can be very useful for automatically identifying the presence of different hands, we tried to verify whether it is possible to strongly reduce the amount of training data that must be manually processed by palaeographers. This aspect, in fact, constitutes one of the main drawbacks for these kinds of applications. In previous studies we obtained very interesting results, but using about half of the

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (25)

  • T. Cover et al., Nearest neighbor pattern classification, IEEE Trans. Inf. Theor. (2006)