1 Introduction

Optical Music Recognition (OMR) refers to the field of research that studies how to enable computers to read music notation [1]. OMR promises great benefits to the digital humanities, making accessible and browsable the musical heritage that exists only as written copies scattered all over the world [6].

As in many other fields, modern Machine Learning techniques, such as Deep Neural Networks, have brought significant improvements to OMR [3, 8, 9]. However, the supervised learning framework assumes that there is an adequate training set from which to learn, and that the system will be used on data generated by the same distribution [5]. Due to the particularities of the musical cultural heritage, this assumption rarely holds: there are many small-scale musical manuscripts with different graphical characteristics. Given the artistic scope of the application domain, it is difficult to indicate exactly what differentiates manuscripts from one another, but we can roughly attribute it to a number of factors: the authors’ handwriting style, the engraving mechanism, the type and color of the paper/parchment, the color of the ink, and the degree of deterioration, among others. This leads to the need to build a training set for each manuscript in order to attain reliable results, which results in a rather inefficient workflow.

That is why, in this work, we study the use of Domain Adaptation (DA) techniques in the context of music manuscript recognition. DA refers to a scenario in which we want to classify data that come from a distribution (or domain) different from that of the training data, although the set of classification labels is the same [2]. Specifically, we assume the case of semi-supervised DA, where data from the new (target) domain are available but not labeled. This represents the typical case of musical manuscripts: it is easy to obtain the images of the manuscripts to be transcribed but costly to annotate them properly.

In this first work studying this issue for music manuscripts, we focus on DA for the classification of music-notation symbols. This stage is typically considered within the standard workflow for designing OMR systems [10]. Given a new manuscript, state-of-the-art techniques can be used to detect isolated symbols [4], or an interactive environment can be assumed in which the user locates the symbols manually through ergonomic interfaces [11].

We conduct comprehensive experiments over five different manuscripts of early music with a state-of-the-art neural architecture for semi-supervised DA. We evaluate the different parameters for the configuration and training of the neural network, and we compare the results with those obtained by a model that does not use DA. Our results yield interesting conclusions about the type of adaptation that can be achieved and the conditions under which it happens. Ultimately, the DA results report a significant improvement over conventional methods. This outcome can serve as a starting point for future models and as a comparative baseline for possible improvements of such techniques in the context of document image analysis.

In the rest of the paper, we present the considered methodology (Sect. 2), our experimental setup (Sect. 3), the obtained results and their main outcomes (Sect. 4), and our conclusions along with some avenues for future research (Sect. 5).

2 Methodology

For supervised learning classification algorithms, given X, the input space, and Y, the output space (or label space), we have a source domain \(D_S\) over \(X \times Y\) from which a labeled set \(S=\left\{ \left( x_i,y_i\right) \right\} _{i=1}^N\thicksim (D_S)^N\) is drawn i.i.d., where N is the total number of samples. The objective of these algorithms is to learn a mathematical model, or hypothesis function, \(h:X\rightarrow Y\), so that the labels of new samples are predicted with as little error as possible, thus building what is known as a label classifier. Given that the goal of this work is to study the transfer of knowledge from a label classifier to a different domain (in other words, to study the task of DA), we must build a model that meets certain requirements.

In our scenario, we have two different domains called the source domain \(D_S\) and the target domain \(D_T\), both being distributions over \(X \times Y\). The DA learning algorithm is provided with a labeled source sample S and an unlabeled target sample T drawn i.i.d. from \(D_S\) and from \(D_T\), respectively, \(S=\left\{ \left( x_i,y_i\right) \right\} _{i=1}^n\thicksim (D_S)^n;\ T=\left\{ x_i\right\} _{i=1}^{n'}\thicksim (D_T)^{n'},\) with \(N = n + n'\) being the total number of samples.

Ultimately, the goal of the DA algorithm is to build a label classifier that, just like conventional Convolutional Neural Network (CNN) classification models, is able to predict the label y of a given input x, with the difference that the input is drawn from the target domain and the hypothesis must be obtained under the requirements explained above.

2.1 Domain Adversarial Neural Network

The architecture experimented on, the Domain Adversarial Neural Network (DANN), was first proposed in [7]. It consists of three parts, two of which are common to any standard feed-forward CNN model: the feature extractor and the label classifier (or label predictor), as seen in Fig. 1.

Fig. 1. Basic CNN classification model diagram.

In order for the classification decisions of the label classifier to be made based on features that are both discriminative and invariant to the change of domains, a domain classifier, which contains a Gradient Reversal Layer, must be added to the model (see Fig. 2).

The Gradient Reversal Layer of the domain classifier multiplies the feature extractor’s gradient by a specified negative weight during back-propagation. This departs from standard training and carries out what we seek: maximizing the loss of the domain classifier, which in turn makes domain-invariant features emerge, preventing the model from learning which domain the input belongs to.
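For illustration, the following is a minimal TensorFlow/Keras sketch of such a layer (our code, not the original implementation): it acts as the identity in the forward pass and multiplies the incoming gradient by a negative factor in the backward pass.

    import tensorflow as tf

    def grad_reverse(x, lam):
        # Identity in the forward pass; gradient scaled by -lam in
        # the backward pass.
        @tf.custom_gradient
        def _inner(x):
            def grad(dy):
                return -lam * dy
            return tf.identity(x), grad
        return _inner(x)

    class GradientReversalLayer(tf.keras.layers.Layer):
        # Keras wrapper so the reversal can be placed inside a model.
        def __init__(self, lam=1.0, **kwargs):
            super().__init__(**kwargs)
            self.lam = lam

        def call(self, inputs):
            return grad_reverse(inputs, self.lam)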

These classifiers work in tandem: the label classifier is optimized to minimize the loss of the label predictions, while the domain classifier is optimized to maximize the loss of the domain predictions, ensuring that domain invariance is enforced during the course of learning. As a result, the feature extractor learns domain-invariant features, so we can use the label classifier to classify samples from both the source domain \(D_S\) and the target domain \(D_T\).

Fig. 2. The considered DANN architecture. The Gradient Reversal Layer (GR Layer), which promotes domain invariance, is situated in the Domain Classifier structure.

3 Experimental Setup

3.1 Network Configuration

The DANN architecture (see Fig. 2) receives an input image of size \(40 \times 40\) pixels and passes it to the Feature Extractor block. The Feature Extractor is composed of two blocks of 2D convolutional layers and ends with a Flatten layer that converts its output into a vector. The network then bifurcates, leading separately to the Label Classifier and the Domain Classifier. Below we explain the configuration of each of these parts of the network model.

Let \(Conv2D(f,(k_1,k_2))\) be a convolutional layer with f filters and a kernel size of \(k_1\times k_2\), \(MaxPooling((p_1,p_2))\) be a max-pooling operation layer with pool size \(p_1\times p_2\), \(Dense(u,a)\) be a dense layer with u units and activation a, and \(Dropout(d)\) be a dropout operation layer with a dropout ratio of d. The specific configuration of the DANN used in this work is then as follows (a code sketch follows the list):

  • The Feature Extractor consists of two convolutional blocks, configured as:

    \(Conv2D(32,(3,3)) \rightarrow Conv2D(64,(3,3)) \rightarrow MaxPooling((2,2)) \rightarrow Dropout(0.25) \rightarrow Conv2D(64,(3,3)) \rightarrow Conv2D(64,(3,3)) \rightarrow MaxPooling((2,2)) \rightarrow Dropout(0.3)\).

  • The Label Classifier is built as: \(Dense(128,ReLU) \rightarrow Dropout(0.5) \rightarrow Dense(8,softmax)\), with output being the symbol categories (more details will be given in the experimental section).

  • The Domain Classifier consists of: \(GRLayer()\rightarrow Dense(128,ReLU)\rightarrow Dropout(0.5)\rightarrow Dense(2,softmax)\), with output being two domains, \(d_i \in \lbrace 0, 1\rbrace \).
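Putting the pieces together, the following is a minimal Keras sketch of the configuration just described (our illustration: the convolutional activations and the single-channel input are assumptions, as the text does not specify them). It reuses the GradientReversalLayer sketched in Sect. 2.1 and returns the two heads as separate model views sharing the Feature Extractor weights:

    from tensorflow import keras
    from tensorflow.keras import layers

    def build_dann(lam, num_classes=8):
        inp = keras.Input(shape=(40, 40, 1))  # 40x40 input, grayscale assumed

        # Feature Extractor: two Conv-Conv-Pool-Dropout blocks.
        x = layers.Conv2D(32, (3, 3), activation="relu")(inp)
        x = layers.Conv2D(64, (3, 3), activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Dropout(0.25)(x)
        x = layers.Conv2D(64, (3, 3), activation="relu")(x)
        x = layers.Conv2D(64, (3, 3), activation="relu")(x)
        x = layers.MaxPooling2D((2, 2))(x)
        x = layers.Dropout(0.3)(x)
        feat = layers.Flatten()(x)

        # Label Classifier head.
        y = layers.Dense(128, activation="relu")(feat)
        y = layers.Dropout(0.5)(y)
        y = layers.Dense(num_classes, activation="softmax")(y)

        # Domain Classifier head, preceded by the GR Layer.
        d = GradientReversalLayer(lam)(feat)
        d = layers.Dense(128, activation="relu")(d)
        d = layers.Dropout(0.5)(d)
        d = layers.Dense(2, activation="softmax")(d)

        # Two views over the same graph: they share the Feature
        # Extractor weights, as required by the DANN scheme.
        return keras.Model(inp, y), keras.Model(inp, d)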

To train the Label Classifier, the standard back-propagation algorithm is applied with Adadelta as optimizer [12] for 100 epochs. The gradient needed to update the weights of the label-classification model is obtained by computing the derivative \({\partial E_y}/{\partial \theta _y}\) of the Label Classifier and propagating it to the Feature Extractor, where the derivative \({\partial E_y}/{\partial \theta _f}\) is calculated, with \(E_y\) being the label-classification loss and \(\theta _y\) and \(\theta _f\) the weights of the Label Classifier and the Feature Extractor, respectively.

Obtaining the gradient for the domain-classification model is different because of the Gradient Reversal Layer, which multiplies the Domain Classifier’s derivative, \({\partial E_d}/{\partial \theta _d}\), by a negative factor, \(-\lambda \), as seen in Fig. 2. By reversing the gradient in this way, we force the features of the two domains to be as indistinguishable as possible for inferring the domain of the input, thereby obtaining the desired domain invariance.
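In terms of the shared weights, and following the formulation of [7], the net effect of the two branches can be summarized by the update rule \(\theta _f \leftarrow \theta _f - \mu \left( {\partial E_y}/{\partial \theta _f} - \lambda \, {\partial E_d}/{\partial \theta _f}\right) \), where \(\mu \) denotes the learning rate: the Feature Extractor descends the label loss while ascending the domain loss.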

3.2 Data and Training

The experimentation was conducted using five sets of music manuscripts from different domains (see Fig. 3). To carry out the experiments, we must first take into account how the data from these five manuscripts were filtered, the amount of data used, the characteristics that describe each domain, and how training was carried out.

Fig. 3. All five manuscripts used for experimentation. One row at a time, from left to right: b-3-28, b-50-747, b-53-781, b-59-850, BNE-BDH.

Table 1 shows a summary of the datasets used, as well as the type and number of samples they contain. The initial amount of data (column “Total symbols”) contains some labels that are not common to all datasets.

To solve this, the symbol category label sets from each domain were intersected in order to obtain a list with only the categories common to all manuscripts, reducing their number to 15. Additionally, categories that add up to fewer than 15 elements across all manuscripts were removed as well, further reducing the number of categories to 8. This yields the final amount of data used for experimentation (see column “Filtered symbols”).
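As an illustration, this filtering can be expressed in a few lines of Python (a sketch under our reading that the 15-element threshold applies to the total count over all manuscripts; `label_sets` is a hypothetical name):

    from collections import Counter

    def filter_categories(label_sets, min_total=15):
        # label_sets: hypothetical dict mapping each manuscript name
        # to the list of symbol labels it contains.
        # Keep only the categories present in every manuscript...
        common = set.intersection(*(set(v) for v in label_sets.values()))
        # ...and drop those adding up to fewer than `min_total`
        # occurrences over all manuscripts (our reading of the text).
        totals = Counter(lbl for v in label_sets.values() for lbl in v)
        return {c for c in common if totals[c] >= min_total}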

Table 1. Description of the five datasets used in the experimentation.

As previously mentioned, each of the five manuscripts differs from the others due to several factors and, for the most part, these factors are present in all of them, thus creating five domains. Firstly, the data from domains b-3-28 through b-59-850 were obtained from handwritten manuscripts, while BNE-BDH is typewritten. Secondly, the manuscripts present varied characteristics: for example, b-50-747 has a lower resolution and is therefore blurrier, while b-59-850 is overexposed, with a very high intensity of light. Finally, manuscripts b-3-28, b-50-747, b-59-850, and BNE-BDH were not altered in any way, whereas a synthetic characteristic that inverts the colors, and which could also represent a different scanner mechanism, was manually applied to b-53-781. The aim of this alteration was to increase the difficulty and unpredictability of the state in which future manuscripts may be provided, since the DANN model must be robust to these types of alterations and must not require additional manual labor, e.g. image preprocessing, in order to learn and transfer its knowledge.

Each manuscript’s data are split into 80% for training and 20% for validation. However, for training the DANN, two different sets are required. The first, the training set \(\mathcal {T}\), is the conventional set created from the sample S (source domain), which contains input-label pairs \((x_i,\ y_i)\). The second, the domain set \(\mathcal {D}\), comprises the input samples from S and T, along with their domain-label pairs \((x_i,\ d_i)\), where \(d_i\) indicates the domain of origin (0 indicates the source domain and 1 the target domain).
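A minimal sketch of how these two sets can be assembled (our illustration; `xs`, `ys`, and `xt` denote source images, source labels, and target images):

    import numpy as np

    def make_sets(xs, ys, xt):
        # Training set T: labeled pairs (x_i, y_i) from the source
        # sample S.
        T = (xs, ys)
        # Domain set D: source and target images with domain labels
        # (0 = source, 1 = target).
        xd = np.concatenate([xs, xt])
        dd = np.concatenate([np.zeros(len(xs), dtype=int),
                             np.ones(len(xt), dtype=int)])
        return T, (xd, dd)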

Given that the DANN model makes use of two specific classifiers that share the weights of a part of the network, a problem arises during its training. If it is trained in a conventional manner, just like a CNN model, the knowledge acquired by fully training one classifier might be lost or become invalid after doing the same for the second one. To solve this, the model makes use of a form of pseudo-concurrent training, using small, equally-sized batches to train one classifier and then the other (e.g., if an epoch comprises M images and batches of size b are used, each classifier is trained M/b times per epoch). The label classifier makes use of the training set \(\mathcal {T}\), while the domain classifier uses the domain set \(\mathcal {D}\). A sketch of this scheme is given below.
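This sketch covers one epoch; `label_model` and `domain_model` are assumed to be the weight-sharing Keras views built in Sect. 3.1:

    import numpy as np

    def train_epoch(label_model, domain_model, T, D, batch_size=64):
        # One epoch of pseudo-concurrent training. Both models are
        # assumed to share the Feature Extractor weights and to have
        # been compiled beforehand, e.g. with
        # keras.optimizers.Adadelta(learning_rate=0.5) for the labels
        # and learning_rate=1.0 for the domains (the best values
        # reported in Sect. 4).
        (xs, ys), (xd, dd) = T, D
        ps = np.random.permutation(len(xs))
        pd = np.random.permutation(len(xd))
        n_batches = min(len(xs), len(xd)) // batch_size
        for i in range(n_batches):
            bs = ps[i * batch_size:(i + 1) * batch_size]
            bd = pd[i * batch_size:(i + 1) * batch_size]
            # Alternate a label step on T with a domain step on D, so
            # that neither classifier overwrites the shared features
            # learned by the other.
            label_model.train_on_batch(xs[bs], ys[bs])
            domain_model.train_on_batch(xd[bd], dd[bd])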

Experimentation followed a meticulous process with different iterations, where one iteration means covering all possible permutations by selecting each manuscript once as the source domain and each of the others as the target domain. Additionally, these training iterations were carried out using different values for parameters such as the pseudo-concurrency batch size, the \(\lambda \) value of the gradient reversal layer, the label classifier’s learning rate, and the domain classifier’s learning rate.

4 Results

The experimentation was carried out in two stages. The first stage studies how the general tendency varies by performing a complete search over the previously mentioned parameters, namely the batch size, the gradient reversal layer’s \(\lambda \), and the learning rates, using the values shown in Table 2. The second stage carries out tests across all possible source-target manuscript permutations, using in turn every possible permutation of the parameters. Additionally, we denote by “\(D_s\) Acc.” and “\(D_t\) Acc.” the average source and target domain accuracies, in percentage.

Table 2. Set of values used for the training parameters. Both the CNN and DANN models use the batch size and the classifier learning rates during training. The \(\lambda \) parameter only affects the DANN model.

The results obtained for the first stage are shown in separate tables where, for each possible value of the parameter at issue, the average source and target domain accuracies for the CNN and DANN models are shown, when applicable. These results are the averages obtained by carrying out experiments across all possible source-target manuscript permutations, using in turn every possible permutation of the considered parameters. This means that a total of 8640 experiments with different configurations were carried out: 20 possible manuscript combinations \(\times \) 2 different models (CNN and DANN) \(\times \) 6 batch sizes \(\times \) 4 \(\lambda \) values \(\times \) 3 label classifier LRs \(\times \) 3 domain classifier LRs. These results give an idea of how the parameters affect the general trend of the results.
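As a quick arithmetic check of this count:

    # Sanity check of the reported experiment count.
    n_pairs = 5 * 4           # ordered (source, target) manuscript pairs
    n_models = 2              # CNN and DANN
    n_batch, n_lambda = 6, 4  # grid sizes, as listed in Table 2
    n_lr_label, n_lr_domain = 3, 3
    total = n_pairs * n_models * n_batch * n_lambda * n_lr_label * n_lr_domain
    assert total == 8640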

The tendency for the batch size (Table 3) is for results to decrease substantially as the batch size increases. This confirms what was previously mentioned about the need to use fairly small sizes, as training one classifier with a large batch undermines the training of the other. A similar degradation arises if the batches are too small, since in that case the weights are updated using very few samples. The best results tend to originate from a batch size of 64 samples.

Table 3. Influence of batch sizes on the performance of the CNN and DANN architectures.

The \(\lambda \) parameter (see Table 4) is the variable by which the Gradient Reversal Layer multiplies the derivative of the domain classifier so that the label classifier is subsequently trained invariantly to domains. This parameter is equivalent to the learning rate of the back-propagation algorithm, but applied to the learning of the domain-invariant characteristics by the Feature Extractor block. Note that this parameter is particular to the DANN model, so only its results can be reported. They show a general tendency toward better results for small values of \(\lambda \), so in this case it is better to use a small factor to learn the domain-invariant characteristics. This may be due to the fact that, if a high value is used, the characteristics shared by the two networks are adjusted mostly for domain detection, spoiling the result obtained by the label classifier (and undoing what it had already learned). It is therefore better to adjust the weights little by little in each iteration, so as to reach a balance between the two networks.

Table 4. Influence of \(\lambda \) values on the performance of the DANN architecture.

Tables 5 and 6 analyze the results obtained by varying the learning rates (LR) of the two networks. The best LR values differ between the two classifiers. Table 5 shows the influence of the label classifier’s learning rate on the performance of the CNN and DANN architectures: an LR of 0.5 for the label classifier obtains the best results. Table 6 shows the influence of the domain classifier’s learning rate on the performance of the DANN architecture: in this case, an LR of 1.0 for the domain classifier obtains the best results. Both classifiers obtain poor results for high values of their respective LRs. As previously argued, this can be motivated by the fact that, since part of the weights is shared, it is better to modify them little by little in each iteration. Therefore, it seems that, on average, it is better to use a domain learning rate slightly higher than the one used for the labels.

Table 5. Influence of the label classifier’s learning rate on the performance of the CNN and DANN architectures.
Table 6. Influence of the domain classifier’s learning rate on the performance of the DANN architecture.

The results for the second stage of the experimentation are shown in Table 7, which reports the results obtained by the CNN and DANN networks for all possible combinations of the five datasets used. In addition, the “Target Diff.” column shows the difference between the target-domain accuracy obtained by the DANN and that of the CNN.

Results for the typewritten manuscript (BNE-BDH) as source and as target are promising, as they show notable increases in the DANN accuracies compared to the CNN ones. An anomaly occurs with b-3-28 and b-53-781 as source domains and BNE-BDH as target, where the results barely outperform the CNN classifier, with target differences of only 3.34% and 0.42%, respectively. This may be due to the number of samples in the source dataset, since these are the two datasets with the fewest samples.

Table 7. Best results obtained for the different combinations of the datasets used as source and target domains. The Target Diff. column shows the DANN’s \(D_t\) accuracy minus the CNN’s \(D_t\) accuracy.

It is also observed that the pair b-3-28 and b-50-747 obtains poor results in the two possible (source, target) combinations, with target accuracy differences of 5.79% (b-3-28 as source) and −7.10% (b-50-747 as source). In this case, these worse results are probably caused by the similarity of the domains. In Fig. 3, one can see that these two domains present the most similar writing, differing only in the overall color of the image. In general, it has been observed that the DANN architecture requires the source and target domains to have significant differences; if this is not the case, forcing the label classifier to be trained with domain invariance results in accuracies worse than or similar to those of the CNN architecture. Note that, if the domains are similar, the DANN is actually modifying features that are already suitable for both.

Additionally, the opposite happens when there is a large difference between domains, as can be seen whenever b-53-781 is used as the target domain (see Fig. 3). Using b-50-747 and b-59-850 as sources, the DANN architecture obtains the maximum target accuracy difference of the experiments: 84.21%.

In summary, the proposed DANN architecture obtains an average target accuracy of 67.45%, which is, on average, 37.80 points higher than that of the CNN architecture.

5 Conclusions

This work focuses on the study of the use of DA techniques in the context of music manuscript recognition. These techniques dictate that domain invariance must exist during training so that classification decisions are made based on features that are both discriminative for the labels and invariant to the change of domains. We implemented an existing DANN architecture, one of whose parts, the domain classifier, includes a gradient reversal layer that, during back-propagation, ensures that these domain-invariant features emerge.

The evaluation of the DANN architecture is carried out by first studying how different values of the training parameters affect the general tendency, and then studying the average target domain accuracies over all source-target permutations of the five manuscripts used. The parameters evaluated are the training batch size, the gradient reversal layer’s \(\lambda \), and the learning rates of both the label and the domain classifiers.

The classification performance obtained by the proposed architecture for the target domain generally outperforms a CNN approach, as our implementation results in an average accuracy of 67.45%, an increase of 37.80 points over the CNN’s 29.66%. It should be kept in mind that these classification results (of almost 70% on average) are obtained without using any label from the target domain; that is, only by adapting the knowledge learned from the source domain. It is also worth mentioning the result obtained for the adaptation between typewritten and handwritten domains, reaching in some cases an accuracy of 66%, 61 points better than with a CNN.

Future work includes experiments to improve these results with a greater number of datasets and domains, increasing the number of labels considered. We also intend to evaluate different strategies that combine this approach with semi-supervised and incremental methods. Additionally, we would like to extend this research to the direct detection of music symbols in the images [9], instead of assuming a previous segmentation.