
1 Introduction

Pattern recognition is a branch of data mining that has been studied and developed for many years. In a number of its applications satisfying results have already been achieved; however, in many fields it is still possible to obtain better results. One such research field is the issue of imbalanced data. For the purposes of this paper, data are considered imbalanced when one or more of the following characteristics occur:

  1. there are significant differences in the number of elements between classes;

  2. elements within the same class have incompatible shapes;

  3. objects belonging to different classes are very different in size; and

  4. some classes contain simple objects while others contain complex ones.

In this paper, the issue of imbalanced data is illustrated using the example of music notation symbols. The symbols on a score exhibit all four of the characteristics listed above.

Musical notation symbols appear with varied frequency. Some of them, such as quarter and eighth notes, are very common, often appearing several times within a single line of the score. Others (including rests and accidentals) occur frequently, but still much more rarely. There are also symbols (the breve note, the longa note) which appear only occasionally, in a few musical compositions.

The problem of incompatible shapes within the same class applies to some of the examined symbols. These include, among others, arcs, crescendos, and diminuendos. Different shapes of arcs are shown in Fig. 1.

Fig. 1 Incompatible shapes of the “arc” symbol

Fig. 2 Different sizes of chosen musical notation symbols. Starting from the top: dot, flat, G clef, and arc

Objects belonging to particular classes of musical notation differ strongly in size. Figure 2 illustrates this diversity, showing big, medium, and small symbols; these terms are conventional, of course. The arc, clearly visible in Fig. 2, can certainly be counted among the large objects, and the treble clef can also be included among the large characters. In terms of size, the dot is the opposite of the arc: it is the smallest character that occurs in the score. Among the small symbols, though much larger than the dot, we can also count some of the accents, the whole note, and the whole note rest. The comparison of the dot and the arc perfectly illustrates the poor balance of musical notation in terms of size. Between these extremes we find many characters of intermediate size, including quarter notes, flats, and naturals.

The general methodology of optical music recognition has already been researched and described in [10]. We would like to highlight that the studied problem of class imbalance is an original contribution to the field of music symbol classification. The aim of our study is to investigate how well-known classification tools deal with imbalanced data. Only single classifiers are presented in this paper. The research is based on actual musical compositions. The applied classification algorithms have been implemented in C\(++\). The developed program works with both high- and low-resolution images of musical symbols.

The paper is organized as follows. Section 2 gives basic information about classification and the classifiers used. In Sect. 3 the learning and testing sets are outlined, and Sect. 4 describes feature extraction. Section 5 describes the empirical tests, and Sect. 6 concludes the paper.

2 Preliminaries

2.1 Classification

Pattern recognition is usually performed on observed features which characterize objects, rather than on the objects directly. Therefore, we distinguish a mapping from the space of objects \(\mathbb {O}\) into the space of features \(\mathbb {X}\), i.e., \(\phi :\mathbb {O}\rightarrow \mathbb {X}\). This mapping is called a feature extractor. Then, we consider a mapping from the space of features into the space of classes, \(\psi :\mathbb {X}\rightarrow \mathbb {C}\). Such a mapping is called a classifier. It is worth noticing that the term classifier is used in two contexts: classification of objects and classification of features. The intended meaning can be concluded from the context, so we will not distinguish explicitly which one is meant.

The composition of the above two mappings constitutes the classifier: \(\varPsi =\phi \circ \psi \). In other words, the mapping \(\mathbb {O}\xrightarrow {\varPsi }\mathbb {C}\) is decomposed into \(\mathbb {O}\xrightarrow {\phi }\mathbb {X}\xrightarrow {\psi }\mathbb {C}\).

The space of features \(\mathbb {X}\) is usually the Cartesian product of feature domains \(X_{1},X_{2},\ldots ,X_{n}\), i.e., \(\mathbb {X}=X_{1}\times X_{2}\times \cdots \times X_{n}\). Therefore, the mapping \(\psi \) operates on vectors \((x_{1},x_{2},\ldots ,x_n)\) produced by \(\phi \), where \(x_{i}\) is a value of the feature \(X_{i}\) for \(i=1,2,\ldots ,n\). For simplicity, a vector of values of features will be simply called a vector of features.

Summarizing, we explore the pattern recognition problem by searching for an (object) classifier

$$\begin{aligned} \varPsi :\mathbb {O}\rightarrow \mathbb {C} \end{aligned}$$
(1)

which is decomposed to a feature extractor

$$\begin{aligned} \phi :\mathbb {O}\rightarrow \mathbb {X} \end{aligned}$$
(2)

and a (features) classifier or a classification algorithm

$$\begin{aligned} \psi :\mathbb {X}\rightarrow \mathbb {C} \end{aligned}$$
(3)

The classifier \(\psi \) divides the feature space into so-called decision regions (here \(M\) denotes the set of class labels):

$$\begin{aligned} D^{(i)}_X = \psi ^{-1}(i) = \{x\in \mathbb {X}:\psi (x)=i\}\quad \text {for every}\quad i\in M \end{aligned}$$
(4)

and then, of course, the feature extractor induces a split of the space of objects into classes

$$\begin{aligned} O_{i}=\phi ^{-1}\big (D^{(i)}_X\big )= \{o\in \mathbb {O}:\phi (o)\in D^{(i)}_X\}\quad \text {for every}\quad i\in M \end{aligned}$$
(5)

or equivalently

$$\begin{aligned} O_{i} = \varPsi ^{-1}(i) = (\phi \circ \psi )^{-1}(i) = \phi ^{-1}\big (\psi ^{-1}(i)\big )\quad \text {for every}\quad i\in M \end{aligned}$$
(6)

We assume that the classification algorithm splits the space of feature values, i.e., it separates the whole space \(\mathbb {X}\) into pairwise disjoint subsets which cover the whole space:

$$\begin{aligned} (\forall i,j\in M,\;i\ne j)\;\;D^{(i)}_X\cap D^{(j)}_X=\emptyset \;\;\;\text {and}\;\;\;\bigcup _{i\in M}D^{(i)}_X=\mathbb {X} \end{aligned}$$
(7)
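
To make the decomposition \(\varPsi =\phi \circ \psi \) concrete, the following C++ sketch (ours, for illustration only; the types Image, Features, and Label and the two toy features are hypothetical) composes a feature extractor with a features classifier exactly in the order used above: \(\phi \) is applied first, then \(\psi \).

```cpp
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical types: an "object" is a small binary image, a feature vector
// is a list of doubles, and a class label is an element of the set M.
struct Image { std::vector<std::vector<bool>> pixels; };
using Features = std::vector<double>;
using Label = int;

// phi : O -> X (feature extractor). Two toy features only:
// the number of black pixels and the image height.
Features phi(const Image& o) {
    double black = 0.0;
    for (const auto& row : o.pixels)
        for (bool p : row) black += p ? 1.0 : 0.0;
    return { black, static_cast<double>(o.pixels.size()) };
}

// Psi : O -> C, built as the composition used in the paper:
// apply phi first, then psi.
std::function<Label(const Image&)> compose(
        std::function<Features(const Image&)> phi_,
        std::function<Label(const Features&)> psi_) {
    return [=](const Image& o) { return psi_(phi_(o)); };
}

int main() {
    // psi : X -> C, a placeholder decision rule standing in for any of the
    // classifiers described below.
    auto psi = [](const Features& x) -> Label { return x[0] > 10.0 ? 1 : 0; };
    auto Psi = compose(phi, psi);

    Image img{ std::vector<std::vector<bool>>(8, std::vector<bool>(8, true)) };
    std::cout << "class: " << Psi(img) << "\n";   // 64 black pixels -> class 1
}
```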

In our current research, we investigate classification schemes based on classical set theory. In such a case, membership is crisp: an element either belongs to a given class or it does not. There are fuzzy generalizations of the classical approach to classification. In the next step of our research, we will take a closer look at other information representation models and investigate their suitability for optical music recognition. We would like to verify whether other approaches, especially ones involving bipolarity (cf. [8, 9, 11]), may enhance classification results for imbalanced data.

2.2 Classifiers

In this section, we briefly describe the applied classifiers: k-Nearest Neighbors, k-means, the Mahalanobis minimal distance classifier, and the decision tree.

2.2.1 k-Nearest Neighbors

The k-nearest neighbors algorithm [4] is among the simplest of all machine learning algorithms. For a given object being classified, its k nearest neighbors are extracted from the learning set, and the object is assigned to the class holding the majority among those neighbors. Here k is a positive integer, typically small; if k equals 1, the object is simply assigned to the class of its nearest neighbor. The advantage of the k-nearest neighbors algorithm is its high recognition rate; its most important disadvantage is the high time complexity, since every classification requires distance computations against the entire learning set.
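
A minimal sketch of the classification step, assuming a Euclidean metric and a plain in-memory learning set (our illustration, not the authors' C++ implementation; the Sample type is hypothetical):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <vector>

using Features = std::vector<double>;

struct Sample { Features x; int label; };

double euclidean(const Features& a, const Features& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Classify q by a majority vote among its k nearest neighbors in the learning set.
int knnClassify(const std::vector<Sample>& learningSet, const Features& q, int k) {
    std::vector<std::pair<double, int>> dist;            // (distance, class label)
    dist.reserve(learningSet.size());
    for (const auto& s : learningSet)
        dist.emplace_back(euclidean(s.x, q), s.label);

    k = std::min<int>(k, static_cast<int>(dist.size()));
    std::partial_sort(dist.begin(), dist.begin() + k, dist.end());

    std::map<int, int> votes;                            // label -> vote count
    for (int i = 0; i < k; ++i) ++votes[dist[i].second];
    return std::max_element(votes.begin(), votes.end(),
        [](const auto& a, const auto& b) { return a.second < b.second; })->first;
}
```

Each query scans the entire learning set, which is exactly the computational overhead mentioned above.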

2.2.2 K-Means

K-means is an algorithm similar to k-Nearest Neighbors which attempts to reduce the large amount of computation in the recognition stage that is typical for the k-NN classifier. In the learning stage, this classifier divides every class from the learning set into k clusters using the k-means algorithm [7]. The classifier does not store the whole training set, only the calculated centroids, so the computational effort in the recognition stage is reduced. In the recognition stage, a search for the nearest centroid is performed, and the class to which this centroid belongs is the answer of the algorithm. The distance between two elements in the recognition stage is determined by a previously chosen metric.
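
A sketch of the recognition stage under the assumptions above: the centroids are presumed to have been computed per class by a standard k-means run during learning (our illustration; the Centroid structure is hypothetical).

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Features = std::vector<double>;

// One centroid produced by running k-means inside a single class during learning.
struct Centroid { Features mean; int classLabel; };

double euclidean(const Features& a, const Features& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += (a[i] - b[i]) * (a[i] - b[i]);
    return std::sqrt(s);
}

// Recognition stage: the answer is the class of the nearest centroid.
int nearestCentroidClassify(const std::vector<Centroid>& centroids, const Features& q) {
    double best = std::numeric_limits<double>::max();
    int label = -1;
    for (const auto& c : centroids) {
        double d = euclidean(c.mean, q);
        if (d < best) { best = d; label = c.classLabel; }
    }
    return label;
}
```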

2.2.3 Mahalanobis Minimal Distance

The Mahalanobis minimal distance classifier can be interpreted as a modification of the naive Bayes algorithm. In this method we assume that the a priori probabilities of all classes are equal, i.e., \(\pi _1=\pi _2=\cdots =\pi _M\), and that all observations come from normal distributions with the same covariance matrix. Under these assumptions, the Bayes rule reduces to minimizing the expression

$$\begin{aligned} (x-m_k)^T\varSigma ^{-1}(x-m_k) \end{aligned}$$
(8)

where x is the classified object and \(m_k\) is the mean of class \(C_k\) calculated from the training set as

$$\begin{aligned} m_k=\frac{1}{n_k}\sum _{i=1}^{n_k}x_{ik} \end{aligned}$$
(9)

and \(\varSigma \) is the pooled covariance matrix defined by Eq. 10

$$\begin{aligned} \varSigma =\frac{1}{n-M}\sum _{k=1}^{M}\sum _{i=1}^{n_k}(x_{ik}-m_{k})(x_{ik}-m_{k})^T \end{aligned}$$
(10)

where \(x_{1k},\ldots ,x_{n_k k}\) are the vectors representing objects from class \(C_k\), \(m_k\) is the mean vector of this class, \(n_k\) is the number of elements in class \(C_k\), n is the number of all elements, and M is the number of classes. Expression 8 is the square of the Mahalanobis distance. In this method, an object x is assigned to the class j for which this squared distance is the smallest.
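
The recognition rule can be sketched as follows. This is our illustrative C++ code, not the authors' implementation; it assumes the inverse \(\varSigma ^{-1}\) of the pooled covariance matrix has already been computed (matrix inversion is omitted), and the container names are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<std::vector<double>>;   // row-major square matrix

// Squared Mahalanobis distance (x - m_k)^T * SigmaInv * (x - m_k), cf. Eq. 8.
double mahalanobisSq(const Vec& x, const Vec& m, const Mat& sigmaInv) {
    const std::size_t n = x.size();
    Vec d(n);
    for (std::size_t i = 0; i < n; ++i) d[i] = x[i] - m[i];
    double result = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double row = 0.0;
        for (std::size_t j = 0; j < n; ++j) row += sigmaInv[i][j] * d[j];
        result += d[i] * row;
    }
    return result;
}

// Assign x to the class whose mean gives the smallest squared distance.
int mahalanobisClassify(const std::vector<Vec>& classMeans, const Mat& sigmaInv,
                        const Vec& x) {
    double best = std::numeric_limits<double>::max();
    int label = -1;
    for (std::size_t k = 0; k < classMeans.size(); ++k) {
        double d = mahalanobisSq(x, classMeans[k], sigmaInv);
        if (d < best) { best = d; label = static_cast<int>(k); }
    }
    return label;
}
```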

2.2.4 Decision Trees

A decision tree is a decision support tool that uses a tree structure for decision making and classification [2, 14]. Popular algorithms for constructing decision trees are inductive in nature and use a top-down tree building scheme. In this scheme, building a tree starts from the root. Then, a feature is chosen for testing at this node and the training set is divided into subsets according to the values of this feature. For each value there is a corresponding branch leading to a subtree, which is created on the basis of the corresponding training subset. This process stops when a stop criterion is fulfilled and the current subtree becomes a leaf.

The stop criterion indicates when the construction process should be halted, i.e., when for some set of samples we should create a leaf rather than an internal node. Obvious stop criteria include the situations in which

  • the sample set is empty,

  • all samples are from the same class, or

  • the attribute set is empty.

In practice, the criteria given above sometimes make the model overfit the learning data, so other stop criteria or mechanisms, such as pruning, are necessary to avoid the overfitting problem.

Finally, classification of a given object consists in finding a path from the root to a leaf along the branches of the tree. The branches are chosen according to the results of the tests on the features corresponding to the nodes. The leaf ending the path gives the class label of the object.
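
The classification step can be sketched as follows, assuming binary threshold tests on numerical features; tree construction, feature selection, and pruning are omitted (our illustration, not the authors' implementation; the Node structure is hypothetical).

```cpp
#include <memory>
#include <vector>

using Features = std::vector<double>;

// A node either tests one numerical feature against a threshold
// or, as a leaf, stores a class label.
struct Node {
    bool isLeaf = false;
    int classLabel = -1;          // valid only when isLeaf == true
    int featureIndex = 0;         // which feature this node tests
    double threshold = 0.0;       // go left if x[featureIndex] <= threshold
    std::unique_ptr<Node> left, right;
};

// Classification: follow the test results from the root down to a leaf;
// the leaf ending the path gives the class label.
int treeClassify(const Node* node, const Features& x) {
    while (!node->isLeaf)
        node = (x[node->featureIndex] <= node->threshold)
                   ? node->left.get() : node->right.get();
    return node->classLabel;
}
```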

3 Data Set

The recognized set of music notation symbols contained about 27,000 objects in 20 classes. Twelve classes were defined as numerous, and each of them had about 2000 representatives. The cardinality of the other eight classes was much lower and varied between them. Part of the examined symbols was cut from chosen compositions by Fryderyk Chopin; the other part of the symbol library comes from our team's other research projects. The classes were divided into two groups: regular and irregular (rare) classes. Regular classes include flat, sharp, natural, G and F clefs, piano, forte, mezzo-forte, quarter rest, eighth rest, sixteenth rest, and flagged stem. Irregular classes consist of accent, breve note, C clef, crescendo, diminuendo, fermata, arc, and thirty-second (1/32) rest. Image sets of regular classes consisted of 2000 objects each; the sets of irregular classes are significantly smaller.

4 Feature Extraction

Classification was performed on features characterizing every symbol. Vectorized and numerical features were defined based on the experience of the authors. The following features were used in the experiment (a short extraction sketch for two of them is given after this list):

  • histograms, i.e., relations between the number of pixels with a given value of a feature and the number of all pixels. Histogram of black pixels was used in the experiment,

  • horizontal and vertical projections (also known as histograms), i.e., numbers of black pixels in rows (for horizontal projection) and in columns (for vertical projection),

  • horizontal and vertical transitions, i.e., the number of pairs of neighboring white/black pixels in every row (for horizontal transitions) and in every column (for vertical transitions). Transitions help to describe objects with complicated shapes, for example the treble clef,

  • left, right, top, and bottom margins, i.e., for every row the number of white pixels counted from the left (or right) edge of the image to the first black pixel, and for every column the number of white pixels counted from the top (or bottom) edge of the image to the first black pixel. This feature describes the symbol's position in the image and is useful, for instance, to distinguish a natural from a sharp,

  • directions, i.e., for a given black pixel, the length of the longest segment of black pixels in a given direction (usually the horizontal, vertical, and the two diagonal directions are considered) that includes this pixel,

  • moments,

  • average 3, i.e., the average value calculated over three neighboring elements of the features vector,

  • average 5, i.e., the average value calculated over five neighboring elements of the features vector,

  • difference, i.e., the differences between two consecutive values in the features vector,

  • numerical features obtained as scalar parameters of the above vectorized features:

    • max, i.e., the maximum value of all values in features’ vector,

    • min, i.e., the minimum value of all values in features’ vector,

    • ave, i.e., the average value of all values in features’ vector,

    • maxPos, minPos, avePos, i.e., positions of given max, min, and ave values in features’ vector.
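
As an illustration of the feature extractors listed above, the following sketch computes the vertical projection and the horizontal transitions of a binary symbol image (our code, not the authors' implementation; the Image alias is hypothetical).

```cpp
#include <cstddef>
#include <vector>

// A binary symbol image: true marks a black (ink) pixel.
using Image = std::vector<std::vector<bool>>;

// Vertical projection: the number of black pixels in every column.
std::vector<int> verticalProjection(const Image& img) {
    if (img.empty()) return {};
    std::vector<int> proj(img[0].size(), 0);
    for (const auto& row : img)
        for (std::size_t c = 0; c < row.size(); ++c)
            if (row[c]) ++proj[c];
    return proj;
}

// Horizontal transitions: the number of neighboring white/black pairs in every row.
std::vector<int> horizontalTransitions(const Image& img) {
    std::vector<int> trans;
    trans.reserve(img.size());
    for (const auto& row : img) {
        int t = 0;
        for (std::size_t c = 1; c < row.size(); ++c)
            if (!row[c - 1] && row[c]) ++t;
        trans.push_back(t);
    }
    return trans;
}
```

The numerical features (max, min, ave, and their positions) would then be simple reductions over such vectors.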

5 Experiment and Results

In order to determine and compare the quality of the described methods, we have used

  • accuracy calculated by the equation:

    $$\begin{aligned} acc = \frac{ \textit{number of well recognized objects} }{ \textit{number of all objects} } \end{aligned}$$
    (11)
  • classifier’s error:

    $$\begin{aligned} err = \frac{ \textit{number of objects recognized incorrectly} }{ \textit{number of all objects} } \end{aligned}$$
    (12)

In the second stage, the irregular classes were added to the previously recognized classes. At this point, attention was paid to changes in the overall recognition efficiency and to the recognition of particular irregular classes. Apart from acc and err, measures showing the influence of the classes containing fewer elements should be used for classifier assessment.

To evaluate the classifiers, two further measures were calculated: sensitivity \(TP/(TP+FN)\) and precision \(TP/(TP+FP)\). For these calculations, our multiclass problem was turned into m two-class problems (one class versus all others). Both measures were calculated for each class and then averaged.
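
A sketch of these evaluation measures, computed from vectors of true and predicted labels (our illustration; all names are hypothetical): accuracy follows Eq. 11, and the one-vs-rest counts give the per-class sensitivity and precision.

```cpp
#include <cstddef>
#include <vector>

struct ClassStats { double sensitivity; double precision; };

// Accuracy (Eq. 11): correctly recognized objects / all objects.
double accuracy(const std::vector<int>& trueLabels, const std::vector<int>& predicted) {
    int correct = 0;
    for (std::size_t i = 0; i < trueLabels.size(); ++i)
        if (trueLabels[i] == predicted[i]) ++correct;
    return static_cast<double>(correct) / trueLabels.size();
}

// One-vs-rest sensitivity TP/(TP+FN) and precision TP/(TP+FP) for class c.
ClassStats oneVsRest(const std::vector<int>& trueLabels,
                     const std::vector<int>& predicted, int c) {
    int tp = 0, fn = 0, fp = 0;
    for (std::size_t i = 0; i < trueLabels.size(); ++i) {
        if (trueLabels[i] == c)     (predicted[i] == c) ? ++tp : ++fn;
        else if (predicted[i] == c) ++fp;
    }
    ClassStats s;
    s.sensitivity = (tp + fn) ? static_cast<double>(tp) / (tp + fn) : 0.0;
    s.precision   = (tp + fp) ? static_cast<double>(tp) / (tp + fp) : 0.0;
    return s;
}
```

Averaging the per-class values returned by oneVsRest over all m class labels yields the averaged measures described above.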

5.1 Recognition of Regular Classes

The experiment was divided into two parts. In the first one, only elements belonging to the regular classes were recognized. This allowed us to determine the appropriate structure of the classifiers. For each classifier, the appropriate training set size was established by examining the dependency of the classifier efficiency on the size of the training set. The tests were performed for learning sets containing 1, 10, 20, 50, 100, 200, and 400 elements in each class. Intuition suggests that efficiency should increase with the growing number of learning symbols. To evaluate the classifiers, we used the accuracy described in Sect. 5.

The best results were obtained by kNN and the decision tree. The tests show that a learning set containing 400 elements in each class is enough to achieve good outcomes. All results are shown in Table 1.

Table 1 The effectiveness of recognition of regular classes

For some classifiers, other parameters were examined as well. In the case of kNN, the factor k was tested with a training set containing 400 elements for every recognized class. Tests were performed for \(k=1,2,3,5,10,15,20\). The worst performance, 95 %, was obtained at \(k = 1\). The highest efficiency, 98 %, was achieved with \(k = 5\) and \(k = 10\). For this reason, in further examinations this parameter takes the value that generates the lower computational cost, which is 5.

Table 2 Learning and testing sets for irregular classes

For the k-means classifier, the parameter k was determined similarly. As for kNN, the analysis was carried out for a learning set containing 400 symbols in each class. Tests were performed for \(k = 1,2,3,5,10,15,25,50\). The worst performance was obtained with \(k = 1\). With a growing number of clusters per class, the efficiency of the method rises; unfortunately, at the same time the computational complexity increases and the run time of the algorithm is prolonged. The increase in effectiveness stopped after k reached 10.

Table 3 Effectiveness of recognition with irregular classes
Fig. 3 Recognition of all classes—sensitivity

5.2 Irregular Classes

Establishing the correct structure of the classifiers was an introduction to recognizing the whole problem. The classes causing the imbalance were added to the regular classes. In this case, apart from the global effectiveness, the recognition within the irregular classes is worth noticing. Regular classes had 400 representatives in the training set; the cardinalities of the other classes are presented in Table 2. The results for all classes are shown in Table 3. Figure 3 shows the sensitivity for all classes and Fig. 4 the precision.

k-Nearest Neighbors: Tests were performed for \(k = 1\) and \(k = 5\). The global effectiveness, compared to the results described in Sect. 5.1, slightly decreased (a difference of 0.5). Unfortunately, the recognition effectiveness for the irregular classes is significantly lower than for the regular ones. The best performance among the irregular classes was obtained for the C clef, due to its relatively large training set and a shape that distinguishes it from the others. The arc was also recognized with high efficiency; this symbol also had a large training set. Another group of symbols to be considered is the accent, crescendo, and diminuendo. These symbols are very similar to each other and are mistaken for one another. The breve note is also noteworthy: this symbol hardly occurs in modern musical scores, so it is difficult to find many of its copies. At \(k = 5\) this note is not recognized at all, whereas for \(k = 1\) its recognition equals 100 %.

K-means: The study of this classifier was carried out with \(k = 5\) (\(k = 1\) for the breve note). Also in this case, increasing the number of classes resulted in a slight decrease of the global effectiveness. For the irregular classes, the effectiveness of this method was below that of the k-Nearest Neighbors classifier. Again, the best recognized classes were the C clef and the arc. Note that the fewer elements a class contains, the worse it is recognized. Classification errors occurred in more or less the same places as for k-Nearest Neighbors, but there were more of them.

Fig. 4 Recognition of all classes—precision

Mahalanobis: The global result of the classifier using the Mahalanobis distance was also reduced by the increase in the number of analyzed classes. The overall result of this method turned out to be worse than that of the k-nearest neighbors method. Interestingly, however, particular rare symbols were recognized better by this method. The algorithm was especially able to isolate the accent, which was recognized with 100 % efficiency. Unfortunately, crescendo and diminuendo were still mistaken for one another. Problems also appeared in distinguishing the 1/32 rest from the 1/16 rest. The breve note remained unrecognized.

Decision tree: Previous tests have shown that a decision tree is sensitive to the size of the training set, so, unsurprisingly, symbols from irregular classes were recognized poorly. The global efficiency of this classifier dropped by two percent after adding the new classes, and the results of the irregular symbol recognition were not satisfying. As in the previous methods, the best classified symbols were the C clef and the arc, probably due to their significant representation in the training set. The confusion between the accent and the diminuendo appeared here as well. Oddly, the tree often mistook the fermata symbol for the mezzo-forte. There were also errors in recognizing the 1/32 rest. The breve note again remained unrecognized.

6 Conclusions

The article discusses the results of a classification task performed on images of musical symbols. The difficulty of the undertaken research lies in the imbalance of classes and the high variability of objects. We have concisely illustrated the most important issues which we had to address in order to proceed with optical music recognition and classification.

In summary, 20 classes of musical notation symbols were classified, 12 of which were considered regular classes and the others irregular classes. The recognition effectiveness for the regular classes was satisfying: the k-Nearest Neighbors classifier and the decision tree recognized them with an efficiency of 98 %. Slightly worse results were obtained for the classes containing fewer elements; the fewer symbols a class contained, the worse the results were.

In order to improve the results, the feature vector could be modified or other methods of classification could be applied. It is possible to use ensemble classifiers, such as bagging and random forests, in further work.