What is the minimum training data size to reliably identify writers in medieval manuscripts?
Introduction
The aim of palaeography is to study ancient documents in order to ascertain the place and period in which a manuscript was produced, as well as the methods and criteria by which the work was shared among the scribes [23]. In this context, one of the most important problems is the identification of the different hands that contributed to the production of a single medieval book.
Over the years, experts in palaeography have followed different approaches to the analysis of ancient manuscripts, mainly based on traditional tools for measuring quantities such as the height and width of letters, the distances between characters, the angles of inclination, and the number and types of abbreviations. In this context, computer-based techniques have become widely established as a standard tool for carrying out the measurements and comparisons required in the analysis of ancient documents in a more efficient and objective way [4]. By contrast, the debate on the use of computer vision and pattern recognition techniques as a support to traditional palaeographic analysis is still open and characterized by conflicting positions. These techniques, which have become increasingly common in recent years, apply effective pattern recognition algorithms to high-quality digital images of medieval manuscripts.
Although there has been growing scientific interest in the use of computer-based techniques in palaeographic research, there is no general consensus either on the type of techniques to be adopted or on their efficacy [5]. This situation is probably due to the fact that, however promising, these approaches have not yet produced widely accepted results, both because the use of these new technologies is still immature and because genuinely interdisciplinary research is lacking: palaeography experts are often unfamiliar with the basic concepts of pattern recognition and are not able to interact effectively with computer experts in adapting pattern recognition techniques to the analysis of ancient documents. Computer experts, on the other hand, not knowing the peculiarities of medieval writing, tend to apply the techniques commonly used for modern handwriting.
In this context, a further problem is due to the enormous heterogeneity of ancient documents, which have very different characteristics according to the different historical periods, language and styles in which they were produced. This makes the application of standard techniques extremely difficult and requires a specific study for each ancient manuscript considered. However, in the context of a highly standardized school, the selection of some basic features, directly derived from page layout analysis, can be very helpful for automatically distinguishing the different scribes who produced the text. These features can be easily extracted by using standard image processing algorithms.
Moving from these considerations, in previous papers [8], [9] we proposed two different architectures for implementing a classification system able to distinguish the different scribal hands present in a medieval Latin book. In these studies, we designed a set of features, directly derived from page layout analysis, following the suggestions of palaeographic and codicological researchers. The experimental results obtained on two giant medieval Bibles confirmed the effectiveness of our approach. These results, however, relied on an analysis performed manually by palaeographers on these Bibles: in particular, we randomly selected about half of the labelled data as the training set and used the remainder as the test set.
This implies that, to be practically applicable, the proposed approach requires the manual labelling of a significant part of the ancient manuscript to be processed, which is very time-consuming for palaeographers. The aim of the present study is therefore to verify whether the amount of data manually processed by palaeographers can be substantially reduced, by answering the following question: what is the minimum amount of training data that allows a classification system to identify the different scribal hands reliably? To this end, we considered the same classification architectures and the same data used in the previous studies, and built an experimental protocol in which the size of the training set is progressively increased and the corresponding classification results are evaluated at each step.
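The idea of progressively enlarging the training set can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 1-nearest-neighbour classifier, the synthetic two-scribe data, the fixed half-and-half split, and every function name are assumptions chosen only for illustration.

```python
import random


def nearest_neighbor_predict(train_x, train_y, sample):
    """Return the label of the training vector closest to `sample` (1-NN)."""
    best_index = min(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], sample)),
    )
    return train_y[best_index]


def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test vectors whose predicted label matches the true one."""
    hits = sum(
        nearest_neighbor_predict(train_x, train_y, x) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_y)


def learning_curve(features, labels, train_sizes, seed=0):
    """Hold out half of the data for testing, then train on growing subsets.

    Each size in `train_sizes` must be at least 1. Returns a list of
    (training size, test accuracy) pairs.
    """
    rng = random.Random(seed)
    order = list(range(len(features)))
    rng.shuffle(order)
    half = len(order) // 2
    test_idx, pool_idx = order[:half], order[half:]
    test_x = [features[i] for i in test_idx]
    test_y = [labels[i] for i in test_idx]
    curve = []
    for size in train_sizes:
        subset = pool_idx[:size]  # nested subsets: each one extends the last
        train_x = [features[i] for i in subset]
        train_y = [labels[i] for i in subset]
        curve.append((size, accuracy(train_x, train_y, test_x, test_y)))
    return curve


def make_synthetic_rows(n_per_scribe=40, seed=1):
    """Hypothetical stand-in data: two 'scribes' as two Gaussian clusters."""
    rng = random.Random(seed)
    features, labels = [], []
    for scribe, centre in (("A", 0.0), ("B", 4.0)):
        for _ in range(n_per_scribe):
            features.append([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)])
            labels.append(scribe)
    return features, labels


if __name__ == "__main__":
    X, y = make_synthetic_rows()
    for size, acc in learning_curve(X, y, train_sizes=[2, 5, 10, 20, 40]):
        print(f"training rows: {size:3d}  accuracy: {acc:.2f}")
```

Keeping the test set fixed while only the training subset grows makes the accuracies at different sizes directly comparable; the smallest size at which accuracy stops improving is one possible reading of "minimum training data size".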
The experiments, performed on two large sets of digital images extracted from two entire 12th-century Bibles, show that, using only a few pages of these Bibles as a training set, it is possible to identify the scribal hands automatically in the remaining pages with high reliability. Finally, we want to point out that, to the best of our knowledge, there are no other studies addressing this problem.
The remainder of the paper is organized as follows: Section 2 discusses related work, Section 3 presents the system architecture, and Section 4 shows the experimental results. Discussion and conclusions are left to Section 5.
Related work
The contributions of pattern recognition experts to the field of palaeography can be broadly subdivided into two categories: techniques using information derived from the “local” characterization of the handwritten trace, and those using information extracted by the observation of the entire handwritten page.
The first approach is based on the analysis of individual letters and signs as well as of their composing strokes. In this context, run-length-based features have been proposed in the
The system architecture
The proposed system receives as input RGB images of single pages of the manuscript to be processed, and performs the following steps for each page (see Fig. 1): pre-processing, segmentation, feature extraction, and writer identification. These steps are detailed in the following subsections.
Experimental results
As anticipated in the Introduction, in our experiments we considered two datasets of digital images obtained from two medieval bibles, namely the “Avila Bible” and the “Trento Bible”. For both datasets, the extracted data were normalized by using the Z-normalization method.
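For reference, Z-normalization rescales each feature to zero mean and unit standard deviation, z = (x - mean) / std. The Python sketch below illustrates this for a table of feature vectors. The function name, the use of the population standard deviation, and the choice to map constant columns to zero are assumptions, since the text does not specify these details.

```python
from statistics import fmean, pstdev


def z_normalize(rows):
    """Rescale each feature column to zero mean and unit standard deviation.

    `rows` is a list of equal-length feature vectors (one per sample).
    Assumption: the population standard deviation is used, and a column
    with zero spread is mapped to 0.0 to avoid division by zero.
    """
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical feature vectors, for illustration only.
    sample_rows = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
    for row in z_normalize(sample_rows):
        print([round(value, 3) for value in row])
```

In a train/test setting, the means and standard deviations are commonly estimated on the training rows only and then reused for the test rows; whether the study did so is not stated in this section.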
In order to investigate what is the minimum amount of data to train the considered classifiers effectively, so as to allow them to distinguish the different scribal hands reliably, we randomly selected half of the available
Discussion and conclusions
In the context of a highly standardized school, where the use of some basic page layout features can be very useful for automatically identifying the presence of different hands, we tried to verify whether it is possible to substantially reduce the amount of training data that must be manually processed by palaeographers. This aspect, in fact, constitutes one of the main drawbacks of this kind of application. In previous studies we obtained very interesting results, but using about half of the
Declaration of Competing Interests
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References (25)
- et al. A digital palaeographic approach towards writer identification in the Dead Sea Scrolls. Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods, ICPRAM (2017)
- et al. Image-based historical manuscript dating using contour and stroke fragments. Pattern Recognit. (2016)
- et al. Curvelets based feature extraction of handwritten shapes for ancient manuscripts classification. Document Recognition and Retrieval XIV, San Jose, California, USA, January 30 - February 1 (2007)
- Quantifying Scribal Behavior: A Novel Approach to Digital Paleography (2016)
- et al. Large scale style based dating of medieval manuscripts. Proc. of 3rd International Workshop on Historical Document Imaging and Processing (2015)
- et al. Noir et blanc. Premiers résultats d’une enquête sur la mise en page dans le livre médiéval. Il libro e il testo, Urbino (1982)
- Random forests. Mach. Learn. (2001)
- et al. Text-independent writer identification and verification using textural and allographic features. IEEE Trans. Pattern Anal. Mach. Intell. (2007)
- The palaeographical method under the light of a digital approach
- Nearest neighbor pattern classification. IEEE Trans. Inf. Theor.