1 Introduction

The long 19th century provides historians and fellow humanists with a wealth of retro-digitised printed sources. A significant share of these is made up of serial publications with highly structured content and complex layouts (Visually Rich Documents, VRDs) [1]. A high-profile example is the Habsburg Monarchy’s Hof- und Staatsschematismus, which was published from 1702 to 1918 [2]. It provides a high-quality set of serial data on the administrative and representative elites of Habsburg Central Europe [3]. A thorough analysis of this data set would contribute substantially to a better understanding of relevant social processes, power dynamics, social networks and careers in modern Central Europe. The Schematismus could be used to trace the genesis of state and administrative institutions, their functioning and development, and the professional biographies of tens of thousands of officials and decision-makers over more than two centuries, across political ruptures and social transformations.

Fig. 1

Example page of the Schematismus from 1910, highlighting the complexity of its structure, including multiple columns, hierarchical relationships, unique characters, and special annotations such as the curly braces, two of which appear in the middle of the page [4]

However, the complex structure of such publications has so far made a comprehensive, quantitative evaluation of this source impossible. Even though OCR quality has improved dramatically since the early days of the retro-digitisation of historical publications, structure analysis and layout detection remain a challenge. The Schematismus is known for its multi-column layouts, deeply branched hierarchies spanning several levels, and multimodal page designs that feature text alongside tables and complex lists. These characteristics exemplify the significant challenges digital historians and humanists encounter when processing large quantities of such documents. As a result, there has been no comprehensive extraction of information from the Schematismus so far, neither of entire data sets nor of structures such as hierarchies, the detailed composition of administrative authorities, or individual careers and biographies. Some studies extracted relevant information manually, which proved tedious, time-consuming and error-prone.

For years, consideration has been given to publishing parts of the Schematismus series as digital editions. Such undertakings have so far consistently failed because of the immense size of the task; the most important series of such handbooks, digitally published by the Austrian National Library (Footnote 1), comprises 145 volumes compiled between 1702 and 1918 (Table 1).

Table 1 Data sheet of the Schematismus

We calculated that this series contains between 130,000 and 150,000 printed pages. Neither manual extraction of the information contained therein nor automated processing of the pages, which feature very diverse layouts, seemed feasible to us with the solutions currently available. The off-the-shelf solutions we tried, as well as the generic layout detection integrated into well-established OCR engines, did not perform well on the Schematismus.

We assumed that powerful layout detection, capable of dividing the individual pages of the Schematismus into meaningful text blocks, could represent a comparatively economical solution. The complexity and diversity of the layouts led us to consider machine learning models, as we expected them to handle the fractal nature of the page layout better than rule-based models. The first research question of this paper therefore investigates whether high-quality layout detection as a preprocessing step can improve the performance of downstream OCR and thus provide a relatively simple way to extract the relevant information. We first identified and tested a suitable deep learning architecture, Faster R-CNN. We then rebuilt the custom font used in the Schematismus and used it to synthesize a large amount of Schematismus-styled training data. Care was taken to ensure that the synthetic training data had a similarly complex and varied layout as the original documents, and the synthetic pages were additionally distorted, dirtied and skewed. We trained the Faster R-CNN model on the synthetic training data and fine-tuned it further with a smaller number of manually annotated pages from the Schematismus.

Once our layout detection was working, we tested its potential with an off-the-shelf distribution of Tesseract.

We then addressed the second research question: to what degree a distribution of Tesseract fine-tuned on a custom font could further boost OCR results. We fine-tuned an off-the-shelf distribution of Tesseract on the custom font we had created for the Schematismus and carried out a second test run.

For this study, we invested only limited resources in comparing possible OCR solutions, and we took only a cursory look at the necessary post-processing steps. However, the layout detection solution we present should be able to operate with any downstream OCR solution that can be integrated into a Python pipeline. We expect that it might perform even better with OCR engines developed specifically for historical fonts, such as Kraken OCR or OCR4all.

We have not devoted significant resources to the post-processing step so far, but we ran some preliminary tests using GPT-4. The results were promising, and we expect that the latest generation of LLMs offers considerable potential to further improve the quality of the OCR results in a post-processing step.

2 Background & related work

Wide variation in document formats and degradation over time make automated layout detection and the development of generic tools challenging, especially for historical documents. For these reasons, a broad range of techniques and methods has been developed and applied over time to improve automated information extraction from historical documents.

In this section, we first present the state of research in history and, more broadly, the humanities; we then discuss initiatives to improve OCR that have been developed and tested in the field of historical document analysis.

2.1 Availability and limitations of OCRed primary sources

The retro-digitisation of parts of or entire historical sources has been an issue among historians and fellow humanists since at least the 1950s, particularly with regard to serial sources [5]. However, for the larger part of the past seven decades, information extraction and the production of machine-processable digital data have been carried out manually. Even though OCR software has been widely available since the 1990s at the latest, processing historical data in particular has remained a complex issue [6], as the quality of results has varied strongly [7]. We therefore observe a bifurcation in the field of digital historical and humanist research: on the one hand, large amounts of retro-digitised historical data are processed automatically by important providers of research data, such as www.archive.org [8]. Many national libraries, for instance the Austrian National Library [9], the National Library of Finland [10] or the Munich Digitization Center of the Bavarian State Library [11], as well as large transnational initiatives such as Europeana [12], also belong to this group. On the other hand, users of these data, such as historical research projects, still rely on the manual transcription or annotation of digitised primary sources, even at scale (Footnote 2). Frequently, this is due to quality issues regarding OCRed documents available online (Footnote 3). Relevant databases built predominantly on manually extracted data include the Wiener Datenbank zur Europäischen Familiengeschichte [13], The Emperor’s Desk [14], the prosopographical data processed by The Viennese Court [15], and the projects run in Social Mobility of Elites [16].

2.2 Common challenges

Historians and digital humanists working with retro-digitised text data are familiar with the phenomenon that OCRed texts provided on common platforms vary in quality. This is mostly because the texts underwent OCR at different times and with different technologies. As a result, full-text searches frequently produce inhomogeneous results. In addition, “systematic” OCR errors, i.e. distortions typical of a particular OCR software, can no longer be corrected in a targeted manner once large text corpora consist of parts that were OCRed at different times with different software. In many projects, therefore, raw digitised documents are now re-OCRed, although many historical layouts still pose a challenge for standard as well as specialised OCR software [10] (Footnote 4). Apart from specialised solutions [17], Tesseract OCR and Transkribus are currently regarded as reference standards in the field of historical OCR, but they still encounter limitations that require a high level of manual effort, given the special layouts and structures discussed here [18]. Further, OCR4all offers a toolbox of open source OCR applications that has recently proved highly performant [19], as has kraken OCR, developed at the EHESS and closely associated with the digital research environment eScriptorium (Footnote 5). Libraries have also begun to apply the potential of ML to the classification of large quantities of texts [20], especially in the context of research libraries [21].

For historical research, important primary sources such as censuses of the Habsburg Empire have mostly been digitised manually [22, 23]. Such sources usually feature a relatively limited volume and a highly homogeneous layout and structure. Because of the enormous volume, however, this method is not feasible for more complex source works such as the Schematismus. Even in the last initiative we know of, the size and complexity of the Schematismus was ultimately considered insurmountable for even partially automated data extraction, and the manual edition of a small part was envisaged as an alternative. Due to these obstacles, very little research has engaged in a deeper exploration of the information stored in the Schematismus; the work of Bavouzet [24], built entirely on manually extracted data, clearly stands out here.

Only recently have developments in new fields of research such as document intelligence begun to open up the possibilities of machine learning for this area on a larger scale [25,26,27]. Impressive progress has been made in some areas [28].

2.3 Pre-OCR steps to enhance OCR quality

The process of extracting information from historical documents can also be considered as a complete extraction pipeline rather than a set of individual tasks. Monnier and Aubry present an extraction tool for historical documents that improves robustness and extraction performance through the mutual reinforcement of text line and image segmentation [29]. The task of segmentation is also highlighted by Gruber et al. [30], who further propose preprocessing the image before conducting OCR. Such approaches have been explored in the past, for example by processing the background of the image [31]. More recently, techniques such as Generative Adversarial Networks have been explored to super-resolve the input images [32]. Augmentation has been used on several occasions lately to develop economical ways of scaling up information extraction from historical documents. Grüning et al. use the deep neural network ARU-Net to address the issue of line detection in historical documents [33]. Martínek et al. [18] realise an approach that fine-tunes OCR engines with comparatively little data after training the engine on large synthetic data sets.

Document layout analysis has received increasing attention in the past few years, though the area remains underexplored [34]. Solutions found in this field do not always tackle specifically historical problems, as for instance [35]. It has been recognised, though, that the increasing availability and usability of deep neural networks, in particular CNNs, offers entirely new opportunities for developing custom-made solutions for certain document layout analysis tasks [36]. With regard to OCR, line detection is frequently considered more relevant than layout detection [28]. Nevertheless, significant progress has been made in this area in recent years, with layout detection often being understood as part of a more complex single-stage process [1, 26, 27]. LayoutLM presents a promising and versatile solution to many common tasks in this area [37]. We also tried LayoutLM on the Schematismus, but it did not prove efficient with this particular type of document.

3 Methodology

State Manuals, and the Hof- und Staatsschematismen of the Habsburg Empire in particular, are commonly provided in PDF format. These documents are accessible via various historical document repositories such as the Austrian National Library (Footnote 6), the Munich Digitization Center (MDZ) of the Bavarian State Library (Footnote 7) or archive.org (Footnote 8). Frequently, however, plain text is not available at all or only in sub-optimal quality.

Whereas the general issue of suboptimal OCR quality is widely acknowledged in historical and broader humanist research, most approaches treat layout as a secondary line of attack when it comes to improving OCR quality; exceptions include [34, 36, 38]. Even the two probably most important and most widely used out-of-the-box solutions in the field of historical OCR, the different distributions of Tesseract and the variety of OCR and HTR models provided by Transkribus, require significant (manual) effort in data preprocessing when it comes to extracting information from digitised primary sources. Even conventional document layouts often pose a challenge [6]. Lately, OCR4all has introduced a strong layout detection component.

Table 2 Approximate number of persons mentioned in each Schematismus per year and the number of pages in the respective volume

Complex layouts such as those encountered in the Schematismus, however, are still considered a major challenge by the historical research community, and efficient information extraction from such documents is considered a difficult and complicated task. Several efforts to automatically extract structured data from the Schematismus have failed so far. Generally, and particularly for historical texts, OCR is performed page-wise: since common OCR processors work line-wise, snippets belonging to different blocks of text are read serially and thus mixed together.

Fig. 2

Number of persons (blue) and pages (green, shown at 10-fold scale) in different volumes of the Schematismus

3.1 Approach and research questions

For our approach, we identified two points of attack, which correspond to the two research questions we formulated. First, we wanted to find out whether AI-driven layout detection prior to OCR could significantly boost OCR quality. Second, we wanted to see to what degree fine-tuning the OCR engine could further improve its performance. To tackle research question 1, we first split the individual document pages into their layout elements in order to preserve the context of the different blocks of text. To this end, we used a deep convolutional neural network, a machine learning architecture originally developed for object localisation and recognition. We assumed this approach would be suitable, since we consider the identification of large coherent blocks a computer vision problem.

In the next step, we used an OCR algorithm to process the individual image snippets rather than the entire document page image. We then addressed the second research question: to what degree OCR accuracy could be improved by fine-tuning the standard OCR tool Tesseract on a custom font designed to look as similar as possible to the original font used in the Schematismus documents.

We chose a sample from the 145 volumes of State Manuals and focused on editions published from the second half of the 19th century onward, for two reasons:

  1.

    The task becomes slightly more complex for the decades prior to 1848, as the fonts used in that period are more diverse and complex. We do not consider this a major problem, yet for the proof of concept we were interested in streamlining the entire research process and eliminating additional complexities that were not in our primary line of attack.

  2.

    Even though State Manuals are available for a period of more than 200 years, the mass of data was produced from the 1850s onward; the yield of a solution capable of dealing with documents of this type is therefore expected to be very high (compare Fig. 2 and Table 2).

To deal with research question 1, we built a machine learning model that segments retro-digitised PDF documents of the Hof- und Staatshandbücher and splits them into their layout-structure elements, such as individual paragraphs and headings. Each of these image snippets was subsequently fed into Tesseract for text extraction. Figure 3 shows a simplified version of this process.
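To make the flow in Fig. 3 concrete, the following minimal sketch shows how the two stages can be chained in Python. The helper name page_to_text, the pytesseract binding, and the way the trained detector is obtained are illustrative assumptions rather than the exact implementation used in this project.

```python
import pytesseract
import torch
from PIL import Image
from torchvision.transforms.functional import to_tensor

# `model` is assumed to be the trained Faster R-CNN layout detector (see Sect. 3.3),
# already loaded and switched to evaluation mode.
def page_to_text(model, image_path):
    page = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        pred = model([to_tensor(page)])[0]

    snippets = []
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score < 0.1:                       # confidence cut-off also used in the paper
            continue
        x0, y0, x1, y1 = [int(v) for v in box.tolist()]
        crop = page.crop((max(x0 - 2, 0), max(y0 - 2, 0), x1 + 2, y1 + 2))  # small padding
        snippets.append(((x0, y0), pytesseract.image_to_string(crop, lang="deu")))

    # Naive top-left ordering; a column-aware sort is needed for multi-column pages.
    return "\n".join(text for _, text in sorted(snippets, key=lambda s: (s[0][1], s[0][0])))
```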

We used two Tesseract OCR models: for research question 2, an instance of Tesseract that had been fine-tuned on our custom font, and, for comparison and performance assessment, an instance that had not been fine-tuned. OCR accuracy was then calculated by comparing the extracted text with the manually transcribed ground truth. This process was then repeated without dividing the page into individual segments. To answer the research questions of this study, the resulting accuracies were finally compared in order to evaluate the efficiency of our approach.

Fig. 3

This flowchart illustrates how each extracted layout element is processed by OCR

The following subsections describe in detail the implementation of the different methods that we employed to process PDFs of the Hof- und Staatshandbücher. This includes training data set generation for the development of the layout detection, the layout detection itself, and optical character recognition. Each of these three steps constituted a work package of its own. All research and analysis was conducted in Jupyter notebooks; the tools used are listed in the following subsections.

3.2 Data set generation

In order to successfully train a convolutional neural network, a sufficient amount of labelled data is required. Creating a training data set by manually drawing bounding boxes on a large number of pages drawn from historical source documents is time-consuming and labour-intensive. We therefore developed an alternative approach to artificially generate labelled training data: a Python script that generates synthetic documents mimicking the style of the Hof- und Staatshandbücher. The script produces LaTeX code, which is then compiled with LuaTeX [39] to create a PDF file. In the course of this process, the coordinates of the individual text structure elements are recorded, as described in more detail in the following section.
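A minimal sketch of this generate-and-compile loop is shown below. The preamble, the helper names and the sample name and abbreviation lists are illustrative assumptions; the actual script draws on the name data set and abbreviation list described next.

```python
import random
import subprocess
from pathlib import Path

FIRST_NAMES = ["Anna", "Karl", "József"]               # in practice sampled from a name data set
ABBREVIATIONS = ["k. u. k. Hofrat", "Ritter d. FJO."]  # in practice taken from the 1910 abbreviation list

def make_paragraph():
    """Return a LaTeX paragraph loosely resembling a Schematismus entry."""
    name = f"\\textbf{{{random.choice(FIRST_NAMES)}}}"
    body = " ".join(random.choices(ABBREVIATIONS, k=random.randint(1, 4)))
    return f"{name}, {body}.\n\n"

def build_document(n_paragraphs=60, out="synthetic_page"):
    body = "".join(make_paragraph() for _ in range(n_paragraphs))
    tex = (
        "\\documentclass{article}\n"
        "\\usepackage{multicol}\n"
        "\\begin{document}\n"
        "\\begin{multicols}{3}\n" + body + "\\end{multicols}\n"
        "\\end{document}\n"
    )
    Path(f"{out}.tex").write_text(tex, encoding="utf-8")
    # Compile with LuaLaTeX; the zref-savepos coordinates end up in the .aux file.
    subprocess.run(["lualatex", "-interaction=nonstopmode", f"{out}.tex"], check=True)

build_document()
```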

In order for the generated data to be used as training data, it was imperative that the created documents appeared as visually realistic as possible compared to the original documents. Reverse engineering the original documents and paying attention to detail were therefore essential to the creation of synthetic training data.

Because each paragraph begins with a last name and a first name, names were randomly sampled from a data set containing thousands of first and last names [40]. The pool of samples was restricted to Austria, Hungary, Switzerland, and Germany.

A list of abbreviation explanations from the 1910 Schematismus was manually transcribed, and randomly selected entries from it were used for the text following the names. To vary the length of the generated paragraphs, the number of sampled text fragments was also chosen randomly.

Figure 1 displays a page from the original 1910 Schematismus document. Variable column numbers are a key characteristic of such documents and are used almost everywhere. While most sections have three columns, there can be variations in certain sections. For instance, Fig. 8 shows four columns in the name-index section. Thus, generating realistic synthetic documents required the use of the same column layout.

Another key visual element is the relatively distinct font type. Research led us to a font called “Opera-Lyrics-Smooth” that appeared very similar to the original. Even though it already represented a good match, we decided to invest additional effort: using the open source program “FontForge” [41], we further customised the font to match the original. To achieve the best possible match, screenshots of every letter in the original documents were taken manually, and the existing glyphs in the font were adapted according to these screenshots.

Another distinctive feature of the Hof- und Staatsschematismus is its extensive use of particular symbols representing orders and similar distinctions of the persons listed in this source. These symbols allowed further condensation of the information stored in the Schematismus. The nonstandard symbols were mapped to Unicode code points that were unlikely to be needed elsewhere during document generation. Figure 4 illustrates this mapping.
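In code, such a mapping is essentially a lookup table. The symbol names and code points below are hypothetical placeholders; the actual assignments follow Fig. 4.

```python
# Hypothetical mapping of Schematismus order symbols to otherwise unused
# Unicode code points in the Private Use Area; the real assignments follow Fig. 4.
SYMBOL_MAP = {
    "order_of_the_golden_fleece": "\uE000",
    "order_of_leopold":           "\uE001",
    "order_of_the_iron_crown":    "\uE002",
}

def encode_symbols(tokens):
    """Replace symbolic tokens with their placeholder code points for rendering."""
    return "".join(SYMBOL_MAP.get(t, t) for t in tokens)
```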

Fig. 4

Illustration of how Unicode characters are mapped to Schematismus symbols

This method resulted in the creation of three font types: one for general text paragraphs, one for headlines, and one for italics. An example can be seen in Figs. 5 and 6, which show an original paragraph and a corresponding paragraph reproduced using the custom font. Additionally, when reviewing some original Schematismus documents, it is apparent that font sizes and alignments vary considerably from section to section or even from page to page; the difference is particularly visible in the headlines. To emulate this feature, four headline types ranging from “H1” to “H4” were used, the first being the largest and the rest gradually decreasing in size. Table 3 lists all the classes.

Finally, it is crucial to emphasise some small but significant visual details. Every paragraph begins with one or two words in bold, the last name of the individual, followed by the first name and some additional titles and awards. If the text is too long for a single line, all lines after the first are indented. Furthermore, multiple individuals may be grouped together within a large curly bracket, as can be seen in Fig. 1. In name-index pages, multiple individuals with the same last name may be grouped by adding a horizontal line at the beginning, as can be observed in Fig. 7. Additionally, every entry within this section is accompanied by one or more numbers indicating the page number. Finally, headings are usually centred on the page or within columns, and the end of every text block is marked with a period.

Fig. 5

Original paragraph, taken from an existing state manual from the year 1910

Fig. 6

Corresponding synthetically generated paragraph reproduced with a custom font

In order to produce text for the synthetic Schematismus-style training documents, several sources were consulted. For large headlines, a simple list of historical Austrian orders and decorations was used [42]. For headlines with smaller font sizes, a combination of years (as strings) and Austrian municipality names was used. For paragraph generation, two sources were consulted, as previously mentioned. Using all of the above methods and visual keys, we were able to create a large number of realistic-looking synthetic documents. An example of such a synthetic Schematismus-style document can be seen in Fig. 7.

Fig. 7

Example of a synthetic Schematismus-style document used in the training set

In addition to generating a synthetic data set of Schematismus documents for training a machine learning model, annotations for each structure element had to be generated along with the documents in order to make the data set effective for training. For the task of detecting and localising objects, in this case the layout elements, bounding boxes were used to define the location and size of the individual structure elements within an image. The labels accompanying the bounding boxes indicate the class of the corresponding box, such as “paragraph” or “H1”, which is necessary for the classification task performed by the model.

The LaTeX package “zref-savepos” is used to save the position of characters on the current page and write these coordinates to an external file at compilation time. By parsing this file, bounding boxes could be computed. As the individual text elements had already been generated earlier in the same Python script, it was known which label had to be associated with each bounding box. To construct the data set, the generated documents, compiled to PDF by the LaTeX compiler, were converted to images. In addition to each image file, a Pascal VOC XML file containing the corresponding annotations was created and stored in a separate directory. A total of 3766 synthetic Schematismus documents were generated using this approach. Figure 9 shows a synthetic document with its corresponding annotations overlaid.
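The annotation files follow the standard Pascal VOC layout; a minimal writer is sketched below, with hypothetical file names and box values.

```python
import xml.etree.ElementTree as ET

def write_voc_annotation(image_name, width, height, boxes, out_path):
    """Write a Pascal VOC XML annotation for one synthetic page.

    boxes is a list of (label, xmin, ymin, xmax, ymax) tuples derived from the
    zref-savepos coordinates recorded during LaTeX compilation.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = image_name
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"
    for label, xmin, ymin, xmax, ymax in boxes:
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = label
        bb = ET.SubElement(obj, "bndbox")
        for tag, val in zip(("xmin", "ymin", "xmax", "ymax"), (xmin, ymin, xmax, ymax)):
            ET.SubElement(bb, tag).text = str(int(val))
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)

# Example call with illustrative values: one paragraph box on one page.
write_voc_annotation("synthetic_0001.png", 1405, 1988,
                     [("paragraph", 120, 300, 640, 420)], "synthetic_0001.xml")
```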

Fig. 8

Example of an original page from the name index section

Fig. 9

Example of a generated Schematismus-style document alongside its corresponding annotations (bounding boxes and class labels) used in the training set

Table 3 List of the class types used to generate synthetic Schematismus documents

3.3 Layout detection

For the actual layout detection model, we chose a faster region-based convolutional neural network (faster R-CNN) built on a ResNet-50 backbone. The model was created and trained using the PyTorch [43] framework, with version 2 of its faster R-CNN implementation [44]. Training was conducted on an Nvidia RTX 4090 with 24 GB of video memory.

3.3.1 Model training and settings

Even though we primarily used the default settings of the model, we found that some adjustments had a significant impact on its performance. The following paragraphs describe these adjustments in detail; a consolidated code sketch follows the list.

  1.

    By setting the pre-trained parameter to “True”, training sped up: early epochs already started with a lower training and validation loss than with randomly initialised weights.

  2.

    A further parameter that was tweaked is the anchor generation of the region proposal network in the faster R-CNN model. Anchor boxes are critical when identifying areas of interest in an image because they determine where to look; they therefore play an essential role in detecting layout elements within a document image. In order to accommodate a variety of different object types, anchor boxes with different aspect ratios and scales were selected. Since multiple anchor boxes are applied at each sliding-window position of the region proposal network, it is logical to specify these ratios and scales according to the shape of the objects. Thus, the minimum, maximum, and mean ratio and scale of all bounding boxes within the 3766 generated Schematismus documents were calculated and used as a guideline for setting the anchor-generation parameters. A general characteristic of the chosen anchor boxes is their elongated, narrow shape, which is plausible given that text lines have a similar shape. Additionally, some objects, such as single lines or headings, were quite small, so the generated anchor boxes were smaller than usual.

  3.

    Further, the maximum number of object detections per image needed to be adjusted. Since most object detection models detect only a few objects within a single image, the default value is 100 objects. Because the purpose of this analysis is to detect quite fine-grained layout elements, the number of layout elements within one Schematismus document may easily exceed 250 (some pages, for instance from the name register, feature up to 400 objects). To effectively lift this upper limit of detections per image, the parameter was set to 1000.

  4.

    Another significant adjustment concerned the resolution of the images. While the standard input resolution of the faster R-CNN model is 1333 \(\times \) 800 pixels, at this resolution the model was sometimes unable to detect smaller objects such as single lines within the Schematismus documents. The reason is that both the convolution and pooling layers of the model further downscale the input image, resulting in a loss of important information. Based on the experiments conducted, an input resolution of 1988 \(\times \) 1405 pixels was chosen.
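The adjustments described in this list can be consolidated into a short configuration sketch along the following lines. The number of classes and the concrete anchor sizes and aspect ratios shown here are illustrative stand-ins for the values derived from the bounding-box statistics, not the exact settings used in the study.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.anchor_utils import AnchorGenerator

NUM_CLASSES = 10  # background + layout classes from Table 3 (count assumed here)

# Pretrained weights (adjustment 1); the raised detection limit and the larger
# input resolution (adjustments 3 and 4) are passed straight to the detector.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn_v2(
    weights="DEFAULT",
    box_detections_per_img=1000,
    min_size=1405,
    max_size=1988,
)

# Adjustment 2: anchors tuned towards elongated, narrow text lines. The number of
# anchors per location (here 3) must match the pretrained RPN head.
model.rpn.anchor_generator = AnchorGenerator(
    sizes=((16,), (32,), (64,), (128,), (256,)),
    aspect_ratios=((0.1, 0.3, 1.0),) * 5,
)

# Replace the classification head for our own layout classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
```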

As mentioned previously, 3766 synthetic Schematismus-style documents were generated for the purpose of creating a fully annotated data set. Of these, 3126 served as the training set and 640 as the validation set. A stratified split of the full data set was used to select the two sets: the occurrences of each class within each generated page were counted, and this information was used to split the data while maintaining a similar class distribution in both sets. Even though more than 3000 annotated training documents were available, data set augmentation strategies were also applied. Adding random augmentations to existing data allowed the training set to be expanded artificially in terms of document variety without increasing the number of documents, and thus without increasing training time. Adding augmentations to the training of a faster R-CNN layout detection model can therefore improve its accuracy and robustness.

3.3.2 Training data augmentation and further steps

By applying random transformations such as rotation, scaling, cropping, optical distortion (to simulate page warping), blur, added noise and page flipping, the model could learn to recognise and locate different layout elements regardless of their angle or size. Since all of these parameters were selected randomly, it is extremely unlikely that two identical documents were ever fed into the model during training. Additionally, augmentation may help to reduce overfitting, which occurs when a model becomes too specialised in recognising only training examples and performs poorly on new, unknown data. By augmenting the training data, the model is exposed to a wider range of layout variations and becomes more adaptable to new and unseen documents.
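One way to realise such box-aware augmentations is with a library like albumentations; the transformations mirror those listed above, while the probabilities and limits are assumptions rather than the values used in the study.

```python
import albumentations as A

# Box-aware augmentation pipeline (illustrative parameters).
augment = A.Compose(
    [
        A.Rotate(limit=2, p=0.5),                 # slight page rotation
        A.RandomScale(scale_limit=0.1, p=0.5),    # mild zoom in/out
        A.OpticalDistortion(p=0.3),               # simulate page warping
        A.GaussNoise(p=0.3),                      # scanner noise
        A.Blur(blur_limit=3, p=0.3),              # slight blur
        A.HorizontalFlip(p=0.1),                  # rare page flipping
    ],
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

# Usage: augmented = augment(image=img, bboxes=boxes, labels=labels)
```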

As for the actual training process, adjustments have been made to the number of epochs, batch size, and learning rate. During each training iteration, batch size determines the number of samples, and thus images, to be processed by the machine learning model before the weights are updated. It is one of the most influential hyperparameters when training deep learning models, and it can be viewed as a trade-off between accuracy and speed. A larger batch size allows more samples to be processed at once, resulting in faster training times and better hardware utilisation. However, larger batch sizes require more memory and may hinder generalisation [45].

By contrast, a smaller batch size results in fewer samples being processed at once. Despite slower training times, this can to some extent prevent overfitting and produce a more generalisable model. Typically, smaller batch sizes are used when the data set is small or when the model requires frequent updating of a large number of parameters. Choosing the correct batch size cannot be done in a one-size-fits-all manner, since it depends heavily on the data set being used. For our study, a batch size of two was selected based on experimentation, since it fully utilises GPU memory and, along with a scaling factor of 85 percent, produces a relatively fast training process.

In order to maintain a constant variance in gradient expectations, it is recommended to multiply the learning rate by \(\sqrt{k}\) when multiplying the batch size by \(k\) [46]. Following extensive learning rate optimisation, a base learning rate of 0.005 was chosen for batch size one; the final learning rate is therefore \(0.005 \cdot \sqrt{2} \approx 0.007\). This learning rate, together with a weight decay of 0.0005 and a momentum of 0.9, was used to initialise a stochastic gradient descent optimiser. The total number of epochs was set to 100, and the model was saved after every epoch whose validation loss was lower than that of the previously saved model. In this way, the final model after 100 epochs is the one that performed best on the validation set.
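Put together, the optimiser set-up and best-model checkpointing amount to only a few lines. The sketch below reuses the model configured earlier and assumes that train_loader and val_loader yield image/target pairs in the format torchvision detection models expect; it is schematic rather than the exact training script.

```python
import math
import torch

batch_size = 2
lr = 0.005 * math.sqrt(batch_size)   # sqrt scaling of the base learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=0.0005)

best_val_loss = float("inf")
for epoch in range(100):
    model.train()
    for images, targets in train_loader:        # assumed to yield lists of tensors and target dicts
        losses = model(images, targets)          # Faster R-CNN returns a dict of losses in train mode
        loss = sum(losses.values())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Validation loss: keep the model in training mode so the loss dict is
    # returned, but disable gradient computation.
    with torch.no_grad():
        val_loss = sum(sum(model(images, targets).values()).item()
                       for images, targets in val_loader)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "layout_fasterrcnn_best.pt")
```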

The model trained purely on synthetic data gave fairly solid results when applied to original documents, as described in detail in the Evaluation section. While these results are already promising, the model can be fine-tuned further using real, original documents. As annotations could not be generated automatically this time, they had to be drawn by hand, which is a very time-consuming process.

PyTorch’s TorchServe [47] framework, however, accelerated this process significantly. It allowed us to “serve” the existing trained model over the local network, to which one could send images and receive bounding-box and label predictions. In theory, this is not much different from a simple Python script that runs the model directly on the documents fed into it, but it enables some very specific applications. With the help of the annotation software “BoundingBoxEditor” [48], the pre-trained model could be “served” and thus accelerate the manual annotation of the original documents significantly: since the pre-trained model already yielded good results, only a few adjustments and error corrections were necessary, such as fixing incorrect classifications or bounding boxes. A total of 39 original Schematismus documents were manually annotated and saved in this way. Once the original Schematismus documents had been annotated, the annotations were used to fine-tune the existing model. To that end, the newly created data set was divided into training and validation sets. Using the pre-trained model’s weights as initial weights, these sets were trained for 100 epochs with the same parameters described earlier. Figure 12 shows the training and validation loss for this training process.

3.3.3 Layout detection post-processing

When analysing the predictions in detail, it becomes apparent that some bounding boxes overlap, occasionally leading to incorrect predictions. An illustration of this phenomenon can be found in Fig. 11, in which two different bounding boxes overlap on the second line.

Fig. 10

This figure illustrates a randomly selected original Schematismus document with the detected layout elements overlaid. Element predictions with confidence levels below 0.1 have been omitted. The faster R-CNN model used to produce these predictions was fine-tuned on 39 original Schematismus documents

Fig. 11

The figure illustrates that some bounding boxes overlap and have ambiguous label classifications; two overlapping bounding boxes can be seen in the middle (magenta and teal)

To address this issue, a post-processing step for bounding boxes and label classifications was developed. It iterates over every predicted bounding box and calculates the intersection over union (IoU) with every other bounding box. The IoU is the ratio between the overlapping area and the union area; the closer it is to 1.0, the more similar the bounding boxes are. Once the IoU has been calculated for every pair, merge candidates are identified by selecting boxes with an IoU score higher than 0.3. A maximal bounding box encompassing all merge candidates (including the box currently being considered) is then calculated and used to replace the original bounding box. Since the merge candidates may belong to different classes, the merged bounding box is labelled with the class tag of the candidate with the highest confidence score. Bounding boxes of the “Curly” class are never selected as merge candidates, as this class by its nature contains enclosed boxes. Figure 11 illustrates the overlay of the predicted boxes after applying this post-processing step.
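A compact version of this merge logic, written against plain coordinate lists rather than the model's raw output, might look as follows; duplicate merged boxes are simply de-duplicated at the end.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (xmin, ymin, xmax, ymax) tuples."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def merge_overlaps(boxes, labels, scores, iou_thresh=0.3):
    """Merge boxes whose IoU exceeds the threshold, keeping the label of the
    highest-confidence member; "Curly" boxes are left untouched because they
    legitimately enclose other boxes."""
    merged = []
    for i, box in enumerate(boxes):
        if labels[i] == "Curly":
            merged.append((*box, labels[i], scores[i]))
            continue
        group = [j for j, other in enumerate(boxes)
                 if labels[j] != "Curly" and iou(box, other) > iou_thresh]
        xs0, ys0, xs1, ys1 = zip(*(boxes[j] for j in group))
        best = max(group, key=lambda j: scores[j])
        merged.append((min(xs0), min(ys0), max(xs1), max(ys1), labels[best], scores[best]))
    return list(dict.fromkeys(merged))  # drop identical duplicates produced by merged groups
```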

3.4 OCR

To extract the text within the individual elements of the predicted layout, Tesseract 5.0 was used [49]. As mentioned in Sect. 3.2, the font used in the Schematismus documents is no longer in common use. Although Tesseract has been pre-trained on a number of fonts, it therefore makes sense to use the custom fonts developed for the generation of synthetic Schematismus documents to fine-tune the Tesseract optical character recognition model. In addition, because of the symbols that are unique to the Schematismus documents (see Fig. 4), training on this font is necessary in order to recognise these symbols at all. To fine-tune Tesseract on such a font, images containing a single block of text rendered in the font must be generated. For this, Tesseract’s built-in tool “text2image” was used. This tool requires a large text file of training text as one of its parameters. Although existing German text files for fine-tuning are available, a custom text file was compiled using the same text generation method as described in Sect. 3.2.

In addition, a custom character mapping file (“unicharset”) must be provided in order to map the unique Schematismus symbols to special Unicode characters. The built-in tool was used to create 50,000 images in total. In addition to the individual images, each containing a single text block and saved as a “.tif” file, two further files are generated per image: the underlying ground truth, saved as a text file, and a file containing information about every character rendered within the image, including its bounding-box coordinates. The three files are then combined into a single “.lstmf” file, which is required for the training process once fine-tuning begins. Note that the German OCR model was used as a starting point for this training process. Section 4 presents an evaluation of the fine-tuned Tesseract model, as well as the additional experiments and preprocessing steps required to obtain the best results.
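Generating one such training image boils down to a text2image call per text block. The sketch below wraps the call in Python; the font name, directory layout and page count are assumptions, and only widely documented text2image flags are used.

```python
import subprocess

def render_training_image(text_file, index, fonts_dir="fonts", font="SchematismusCustom"):
    """Render one synthetic training page with Tesseract's text2image tool.

    Produces <outputbase>.tif and its accompanying .box file, which, together
    with the ground-truth text, are later packed into an .lstmf file.
    """
    outputbase = f"train/schematismus.{index:05d}"
    subprocess.run(
        [
            "text2image",
            f"--text={text_file}",
            f"--outputbase={outputbase}",
            f"--font={font}",          # custom Schematismus font, installed locally (assumed name)
            f"--fonts_dir={fonts_dir}",
            "--max_pages=1",
        ],
        check=True,
    )
    return outputbase
```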

4 Evaluation

We start the evaluation of our model performance with an explanation of our parameter choices for the layout detection model, discussed in the Layout detection subsection. Then both the layout detection model and the Tesseract optical character recognition algorithm are evaluated in detail. As these two elements perform quite distinct tasks, they are first evaluated separately, followed by an evaluation of the combined results in subsection 4.3. Furthermore, additional pre- or post-processing steps that could further improve the results are described.

Fig. 12

This figure illustrates the training and validation losses associated with a faster R-CNN model trained on a dataset containing 39 original documents. Initial weights were derived from the weights of the existing pre-trained model

4.1 Evaluation of the layout detection model

A set of eighteen original Schematismus document pages from 1910 was used to evaluate the layout detection model. All of these documents are completely new to the model: they were not part of the training or validation set used to fine-tune it. An example of the predicted bounding boxes and corresponding labels is shown in Fig. 10, which illustrates the model applied to one of the eighteen selected pages. Predicted elements carry a confidence level between 0 and 1, representing the model’s certainty about the accuracy of its prediction. Boxes with a confidence level below 0.1 have been omitted.

Overlapping bounding boxes were resolved using the post-processing step described in Sect. 3.3.3. After this step, the layout detection model could be evaluated. To obtain a ground truth of bounding boxes with corresponding labels, the eighteen selected original documents were hand-annotated; extensive attention was paid to detail while drawing the bounding boxes in order to produce a very accurate ground truth.

To measure the accuracy of the predicted bounding boxes, the intersection over union (IoU) is again employed. For each document image in the test set, the bounding boxes in the ground-truth set were iterated over and the best matching prediction based on the IoU score was selected. From the list of all best matching predictions, the average IoU score is calculated, which represents the bounding-box prediction accuracy for the given page.

It should be noted that, in order to apply binary metrics to a multiclass classification problem, the metrics have to be calculated for each class individually, considering only the class currently in focus as positive (1) and all other classes as negative (0). Figure 13 illustrates this with a confusion matrix. To measure the classification performance for each layout element, four different metrics are used: accuracy, precision, recall and \(\hbox {F}_1\)-score (Fig. 14).
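Given the matched ground-truth and predicted labels, these one-vs-rest metrics can be computed directly, for example with scikit-learn. The label lists below are placeholders; the per-class accuracy reported in Table 4 can analogously be read off the confusion matrix.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Class labels of the matched ground-truth and predicted boxes (placeholder values).
y_true = ["paragraph", "H1", "paragraph", "Curly"]
y_pred = ["paragraph", "H2", "paragraph", "Curly"]

# Per-class (one-vs-rest) precision, recall and F1.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average=None, labels=sorted(set(y_true)), zero_division=0
)
overall_accuracy = accuracy_score(y_true, y_pred)
```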

Fig. 13

This figure illustrates the confusion matrix for all predictions made on the test set. Note that the values have been normalized according to the ground truth, so that each row sums to 100 percent

All the metrics mentioned above were calculated for each of the eighteen documents in the test set against the ground truth. Table 4 gives a detailed overview of each metric for both the fine-tuned and the non-fine-tuned model. Compared to the other documents, the documents with indices 3 and 5 performed worst, especially in terms of accuracy; explanations and illustrations of why they performed so poorly are provided in Sect. 5.1. Apart from that, the results for both bounding-box accuracy and classification performance appear promising. Table 5 lists the final performance metrics of the fine-tuned model, averaged across all tested pages.

Table 4 Summary of classification accuracy, precision, recall, \(\hbox {F}_1\)-score, and bounding-box accuracy (based on the average IoU) for both non-fine-tuned and fine-tuned models
Table 5 Performance measures were calculated based on the average of all 18 test documents
Fig. 14

This figure illustrates the distribution of the various layout elements found in the test set, based on the ground truth

Additionally, Table 6 provides statistics on the confidence with which the faster R-CNN model predicted bounding boxes and corresponding class labels for each class. Three views on this behaviour are given: the first column in the table represents the average confidence level over all detections; column two indicates the average confidence when the predicted class was correct, and column three when it was incorrect. Whenever no incorrect predictions were made for a particular class, the value is omitted. The results in column three show that the model tends to be overconfident in its predictions.

Table 6 Confidence levels of the faster R-CNN model when predicting bounding boxes and corresponding class labels, per class

4.2 Evaluation of OCR performance

To see how Tesseract OCR performs on original Schematismus documents, a ground truth must be established. As this ground truth must be compiled manually by transcribing documents into plain text, it takes a considerable amount of time; for the evaluation, a total of 16 original Schematismus pages were transcribed this way. In a first step, all pages were fed into a distribution of Tesseract OCR that had not been fine-tuned to the custom font and without layout-detection preprocessing. Tesseract was configured to use its built-in page segmentation to partition the output of each page into an easily readable and correct format. The resulting outputs did not match the expected sequence. Consequently, in order to make a fair comparison between Tesseract’s output and the corresponding ground truth, the predicted outputs were manually split and reordered to match the original layout of the specific page, without altering any extracted characters or words. The constrained number of tested pages is also attributable to the time-intensive nature of this additional step. The same process was then repeated using a distribution of Tesseract OCR that had been fine-tuned for the custom font. After manual alignment, CER and WER were calculated for every block of text; the averages over all pages are shown in rows one and two of Table 7. According to the results, the average CER improved by 7.01 percentage points and the average WER by 10.94 percentage points when using the fine-tuned OCR model.
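Character and word error rates can be computed with a library such as jiwer; the reference and hypothesis strings below are placeholders.

```python
import jiwer

reference = "Müller Johann, k. u. k. Hofrat."   # manually transcribed ground truth (placeholder)
hypothesis = "Muller Johann, k. u. k. Hofrat."  # OCR output for the same block (placeholder)

cer = jiwer.cer(reference, hypothesis)  # character error rate
wer = jiwer.wer(reference, hypothesis)  # word error rate
print(f"CER: {cer:.4f}  WER: {wer:.4f}")
```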

Table 7 The table presents the average CER and WER resulting from various scenarios
Fig. 15

The figure shows the average character-error-rate and word-error-rate calculated with different levels of upscaling/downscaling and padding applied to individual image snippets

4.3 Evaluation of the OCR in combination with the layout detection model

In the next step, the layout detection model was used to segment every structure element of the pages into individual elements. To obtain the image snippets, the original images were cropped based on the predicted bounding boxes for each page. As the coordinates of each element within the original document are known, the elements could be sorted into the appropriate order, so no manual reordering was necessary. The Tesseract OCR model was then applied to each image snippet individually. Since the prior step showed that the fine-tuned version performs significantly better, only this version was used. For each individual image snippet, CER and WER were calculated against the corresponding ground truth. Row three of Table 7 presents the averages of all predictions across all pages.

According to these results, both the average CER and WER improved by another 7.87 and 6.38 percentage points respectively, compared to the averages computed by feeding the full pages into the fine-tuned Tesseract OCR algorithm. Even though we consider these improvements satisfying, we believe that OCR accuracy can be improved further. Although the layout detection model finds very accurate bounding boxes, characters are sometimes cropped off at the borders of the images; padding was therefore added around the predicted bounding boxes to address this issue. As described in the official Tesseract guide on improving OCR accuracy [50], Tesseract generally works better with higher-resolution images. Therefore, the individual image snippets were resized at various scales based on the height of each image in order to find the sweet spot. The aspect ratio of the original image was maintained while upscaling (or downscaling) through linear interpolation; Fig. 15 illustrates this. Interestingly, the CER and WER reach their minimum when the original image snippet is left unscaled and padded with two pixels. This observation suggests that the resolution of the scanned pages is already sufficiently high. Row four of Table 7 shows the final average CER and WER values.

These results allow us to answer the research question of how much OCR accuracy can be improved by using a layout detection model as a preprocessing step to segment Schematismus-style documents and feeding Tesseract individual images containing one layout structure rather than a full page. In comparison to the average CER and WER obtained on full pages with a fine-tuned Tesseract OCR model, an 8.67 percentage point improvement in CER and a 9.01 percentage point improvement in WER were observed. In comparison with an out-of-the-box Tesseract OCR model, even larger improvements were achieved: the average CER improved by a total of 15.68 percentage points and the average WER by a total of 19.95 percentage points.

Fig. 16

This figure illustrates a document with index 5 of the evaluation set, which had the poorest layout-detection performance

5 Discussion

The results presented in Sect. 4 show that using a custom-developed layout detection model to segment Schematismus-style documents, together with a Tesseract model fine-tuned on a custom font designed to be as close to the original as possible, significantly improved the quality of the extracted text. However, due to the relatively small sample size of 16 pages, these results might not represent the full picture. Moreover, the gain provided by combined layout detection and text extraction may be larger than the metrics alone can express, for two reasons.

First, as each layout element is segmented by a bounding box, the coordinates of each block within the document are known. This allows the extracted texts to be reordered so that they correspond to the reading flow of the document, which is essential when extracting text from a document with columns.

The second reason is that the classification of the individual bounding boxes makes it possible to tell the class of a structure element immediately. Thus, it is possible, for example, to extract only headings and paragraphs from a document. Moreover, this makes it easier to match paragraphs with individual headings, or to collect all structure elements enclosed within curly brackets.

As for the Tesseract OCR model, the custom-designed font should be improved to make text extraction even better. Because the current version of the font does not include certain characters such as “č” or “ň” in its unichar set, these letters cannot be detected, resulting in errors.

The pipeline we outlined in this article is highly adaptable and could be optimised for similar tasks with relatively little effort. In particular, the process we suggest for producing synthetic training data at scale contributes greatly to our capacity to adapt the layout detection model quickly. Not only does it allow a significant amount of training data to be produced in relatively little time, but the use of synthetic training data also appears to significantly increase the precision of the bounding-box predictions.

5.1 Error analysis

According to Table 4, the documents with indices 3 and 5 performed quite poorly compared to the others in terms of accuracy. Both documents are visually very similar; the document with index 5 is shown in Fig. 16 as an example. Clearly, this document differs from the typical three-column Schematismus-style document; aside from the general layout, a key difference is the indentation of each paragraph. Although these aspects were considered during the generation of synthetic documents, resulting in a separate class “BigParagraphs”, the generated structures do not appear to be as similar to the originals as intended, judging by the layout detection results. Due to the relatively small number of examples of this type in the training set used for fine-tuning, we could not observe any improvement: only two pages containing “BigParagraphs” were included in the fine-tuning training set, which appears to be too few for the model to learn this class effectively. To improve performance on these types of Schematismus documents, more pages similar to the one in Fig. 16 would have to be manually annotated and added to the fine-tuning training set.

5.2 Research questions

With regard to our first research question, we could show how OCR accuracy can be significantly improved by splitting individual document pages into their layout elements as a preprocessing step.

As for the second research question, it has been shown that fine-tuning Tesseract with a custom font results in performance improvement.

In comparison with the performance of an out-of-the-box Tesseract model for OCR on an entire page of the Schematismus, the results indicate that segmenting and splitting individual document pages into their layout elements with a deep convolutional neural network results in significantly better OCR accuracy.

5.3 Outlook

The procedure we developed therefore represents a crucial step toward a significantly improved analysis of printed historical documents produced in the larger context of the long 19th century, particularly as we show how each of the two steps can be further adapted to the specific needs, requirements and challenges met by fellow researchers.

However, we expect that both layout detection and optical character recognition can be further optimised for even better performance on historical documents. Layout detection may benefit from a larger training data set, not only in terms of the number of pages but also by including a wider variety of documents. In other words, a more generalised and robust model may be achieved by generating documents that are visually similar to much older Schematismus-style documents, produced in the first half of the 19th century when a different layout was used, and including these in the training set. Further, domain knowledge can be put to use to enhance text extraction: since most of the printed text in Schematismus-style documents consists of abbreviations that are listed and explained on specific pages within these documents, this information can be used to build a custom spell-checking algorithm to correct errors in the text extraction process. Additionally, it would be of interest to explore whether the methods used in this paper can be applied to other types of historical documents.

Our approach offers a viable solution to a number of common problems in dealing with retro-digitised historical texts in historical and humanities research contexts, but also in industrial applications. This work showed that breaking the OCR problem down and solving it in several sub-steps is very promising. However, careful work and precise adjustment of the training data are necessary preconditions for excellent performance. Future work based on our approach will tackle more diverse layouts and a broader scope of document types.