1 Introduction

Pattern recognition is a field of computer science applied to numerous tasks. Handwritten digit recognition, fingerprint identification and face recognition, for instance, are some of the most well-known and widely investigated applications. The automated taxa identification of benthic macroinvertebrates (Joutsijoki and Juhola 2011a, b, c; Kiranyaz et al. 2010a, b, 2011; Lytle et al. 2010; Tirronen et al. 2009; Ärje et al. 2010) is an interesting and demanding machine learning and pattern recognition problem. It has received scant attention, but it has a significant impact on biomonitoring.

Benthic macroinvertebrates (Ambelu et al. 2010) are commonly used indicators of the current state of an aquatic ecosystem due to the intermediate length of their life cycle. Typically their life cycle is 1–2 years (Tirronen et al. 2009). This characteristic is one reason why benthic macroinvertebrates are commonly used in examinations of human-induced changes in aquatic ecosystems (Ärje et al. 2010). Another special feature of benthic macroinvertebrates is their sensitivity to changes in aquatic ecosystems. When benthic macroinvertebrates are used in water quality and freshwater ecosystem assessments, they need to be identified. Currently, the identification process is performed manually, which is costly and time consuming. Automated classification of benthic macroinvertebrates reduces to an image classification problem: specimens are scanned and every specimen is saved as an individual image (Kiranyaz et al. 2010a, b, 2011). In the images the benthic macroinvertebrates appear in different positions and their sizes vary from image to image. The identification problem is made harder by the fact that reliability is a key factor in the classification process. If samples collected from freshwater areas are falsely identified, the consequences can be dramatic. Purification of an aquatic ecosystem is a demanding task and the financial costs can be significant. Hence, unnecessary actions are expensive and need to be avoided.

Support Vector Machine (SVM) (Cortes and Vapnik 1995; Vapnik 2000) is a relatively new classification method that has become a very popular machine learning method among researchers and practitioners. It was originally designed only for binary classification tasks, but different multi-class extensions were quickly developed and research in this area remains active. SVM has been a huge success story and has been applied to numerous applications such as land cover classification (He et al. 2005), feature selection in microarray data (Tang et al. 2006) and recognition of human action (Schuldt et al. 2004). Directed Acyclic Graph Support Vector Machine (DAGSVM) is one of the commonly used multi-class extensions in the literature. DAGSVM was introduced in (Platt et al. 2000) and it has been used, for instance, in water quality assessment (Jian et al. 2008) and image categorization (Demirkesen and Cherifi 2008). It was first applied to benthic macroinvertebrate identification in (Joutsijoki and Juhola 2011c, 2012). Other multi-class extensions such as one-vs-one, one-vs-all and half-against-half support vector machines were examined in (Joutsijoki 2012; Joutsijoki and Juhola 2011a, b).

In Joutsijoki and Juhola (2011a, b, c) a smaller benthic macroinvertebrate dataset of eight taxonomical groups was used, whereas in this paper we use a larger dataset. The previous preliminary publications showed that SVM is suited to automated benthic macroinvertebrate identification, and our objective here was to examine, building on the knowledge from those articles, whether the results generalize to a larger dataset. Because the smaller dataset was imaged with a different resolution from that of the larger dataset, the smaller dataset cannot be used as a training or test set in this article. Moreover, in this paper we investigated classification with different feature sets extensively. We also produced one feature set with the Scatter method. The Scatter method algorithm is fully described in Juhola and Siermala (2012) and it has been applied to benthic macroinvertebrate classification in Joutsijoki and Juhola (2012).

We had three research questions in this paper. Firstly, our task was to investigate which kernel functions are the best alternatives for the dataset. Secondly, we examined how different feature sets influence classification results, with special interest in the feature set obtained by the Scatter method. Thirdly, we explored how DAGSVM manages the classification of benthic macroinvertebrate samples compared to the other classification methods tested. As baselines for DAGSVM we tested Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), Classification Tree (CT), Naïve Bayes (NB), Minimum Mahalanobis Distance Classifier (MMDC) and \(k\)-Nearest Neighbor (\(k\)-NN) (with four measures: Euclidean, cityblock, correlation and cosine). Information about the baseline methods can be found, for instance, in (Cios et al. 2007; Duda et al. 2001). In Sect. 2 we briefly describe the general theory of binary SVM based on (Debnath et al. 2004; Haykin 1999; Joutsijoki and Juhola 2011a; Vapnik 2000) and DAGSVM based on (Joutsijoki and Juhola 2011c; Platt et al. 2000). In Sect. 3 we present and analyze the results. Section 4 is left for the conclusions and further research questions.

2 Method

2.1 Support vector machine

Let the input space be \(X\subseteq \mathbb {R}^n\) and the output space \(Y=\{-1,1\}\). We are given a training set \(\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{l}\) where \(\mathbf{x}_{i}\in X, \,i=1,2,\ldots ,l\), are the training examples and \(y_{i}\in Y\) are the corresponding class labels of \(\mathbf{x}_{i}\). We can divide our examination into two cases: linearly separable and linearly non-separable. If the data is linearly separable in the feature space, we can write the separating hyperplane in the form \(f(\mathbf{x})=\langle \phi (\mathbf{x}),\mathbf{w}\rangle +b=0\) where \(\phi (\mathbf{x})\) is the vector \(\mathbf{x}\) mapped by a nonlinear transformation \(\phi \) from the input space to a higher dimensional feature space (Debnath et al. 2004). Moreover, \(\mathbf{w}\) is a weight vector and \(b\) is a bias term.

The goal of SVM is to find a hyperplane separating the classes and maximizing the margin. The margin here means the distance between the hyperplane and the closest points of both classes. Without the constraint of the maximum margin, a separating hyperplane would not be unique. Because we have a linearly separable case, we can rescale the bias term and the weight vector and, hence, assume that the closest points of the classes lie on the canonical hyperplanes satisfying \(f(\mathbf{x}_{i})=\pm 1\) for some \(i\in \{1,2,\ldots ,l\}\). Thus, the margin has the value \(\frac{1}{||\mathbf{w}||}\) and the distance between the canonical hyperplanes is

$$\begin{aligned} \frac{2}{||\mathbf{w}||}. \end{aligned}$$
(1)
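
For completeness, the value of the margin follows from the standard point-to-hyperplane distance (see, e.g., Haykin 1999): for a canonical point \(\mathbf{x}_{i}\) with \(f(\mathbf{x}_{i})=\pm 1\),

$$\begin{aligned} d(\mathbf{x}_{i})=\frac{|\langle \phi (\mathbf{x}_{i}),\mathbf{w}\rangle +b|}{||\mathbf{w}||}=\frac{1}{||\mathbf{w}||}, \nonumber \end{aligned}$$

and the distance between the two canonical hyperplanes is the sum of these distances on both sides of the separating hyperplane, which gives Eq. (1).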

Because we want to maximize the margin, this is equivalent to minimizing the denominator \(||\mathbf{w}||\) in Eq. (1). We can replace the term \(||\mathbf{w}||\) with \(\tfrac{1}{2}||\mathbf{w}||^2\) without changing the solution of the optimization problem. The replacement is made in order to obtain a Quadratic Programming form for the optimization problem. Based on Debnath et al. (2004), the desired optimization problem can now be stated as follows:

$$\begin{aligned} \min \;\frac{1}{2}||\mathbf{w}||^2 \end{aligned}$$
(2)

subject to

$$\begin{aligned} y_{i}(\langle \phi (\mathbf{x}_{i}),\mathbf{w}\rangle +b)\ge 1, \quad i=1,2,\ldots ,l. \end{aligned}$$
(3)

Now, the Lagrangian has the form

$$\begin{aligned} L(\mathbf{w},b,\varvec{\alpha })=\frac{1}{2}||\mathbf{w}||^2 -\sum _{i=1}^{l}\alpha _{i}[y_{i}(\langle \mathbf{w},\phi (\mathbf{x}_{i})\rangle +b)-1] \end{aligned}$$
(4)

where \(\alpha _{i}\ge 0\) are Lagrange multipliers. The Lagrangian \(L(\mathbf{w},b,\alpha )\) in Eq. (4) needs to be maximized with respect to \(\alpha \) and simultaneously minimized with respect to \(\mathbf{w}\) and \(b\). After differentiating \(L(\mathbf{w},b,\alpha )\) with respect to \(\mathbf{w}\) and \(b\) and setting the results equal to zero, we obtain the two conditions (see Haykin 1999)

$$\begin{aligned} \mathbf{w}=\sum _{i=1}^{l}\alpha _{i}y_{i}\phi (\mathbf{x}_{i}) \quad \hbox {and}\quad \sum _{i=1}^{l}\alpha _{i}y_{i}=0. \end{aligned}$$
(5)

We see that the optimal solution of \(\mathbf{w}\) is a linear combination of the training examples. Because the objective function in this optimization task is convex, the solution is unique. From the Karush–Kuhn–Tucker (KKT) conditions (Haykin 1999) we obtain that the support vectors have nonzero Lagrange multipliers, while the rest of the examples have Lagrange multipliers equal to zero.

We have a convex objective function and linear constraints in the primal formulation of the optimization problem. Thus, we are able to represent the problem in a dual form. This problem has the same solution as the primal form, but in this case the Lagrange multipliers provide the optimal solution. The dual form can be derived from Eq. (4) by applying Eq. (5). Hence, we obtain the following dual form

$$\begin{aligned} L(\alpha )=\sum _{i=1}^{l}\alpha _{i}-\frac{1}{2}\sum _{i=1}^{l}\sum _{j=1}^{l} \alpha _{i}\alpha _{j}y_{i}y_{j}\langle \phi (\mathbf{x}_{i}),\phi (\mathbf{x}_{j})\rangle \end{aligned}$$

subject to the constraints

$$\begin{aligned} \sum _{i=1}^{l}\alpha _{i}y_{i}=0 \quad \hbox {and}\quad \alpha _{i}\ge 0, \quad i=1,2,\ldots ,l \nonumber . \end{aligned}$$
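
As a sketch of this derivation step, substituting the conditions of Eq. (5) into Eq. (4) gives

$$\begin{aligned} L=\frac{1}{2}\sum _{i=1}^{l}\sum _{j=1}^{l}\alpha _{i}\alpha _{j}y_{i}y_{j}\langle \phi (\mathbf{x}_{i}),\phi (\mathbf{x}_{j})\rangle -\sum _{i=1}^{l}\sum _{j=1}^{l}\alpha _{i}\alpha _{j}y_{i}y_{j}\langle \phi (\mathbf{x}_{i}),\phi (\mathbf{x}_{j})\rangle -b\sum _{i=1}^{l}\alpha _{i}y_{i}+\sum _{i=1}^{l}\alpha _{i}, \nonumber \end{aligned}$$

where the bias term vanishes because \(\sum _{i=1}^{l}\alpha _{i}y_{i}=0\); collecting the remaining terms yields exactly the dual form above.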

While the primal form is a minimization problem with respect to \(\mathbf{w}\) and \(b\) and a maximization task with respect to \(\alpha \), the dual problem is a maximization problem with respect to the Lagrange multipliers \(\alpha _{i}, \,i=1,2,\ldots ,l\). After calculating the optimal solution \(\mathbf{w}=\sum _{i=1}^{l}\alpha _{i}y_{i}\phi (\mathbf{x}_{i})\), we can compute the bias term by selecting an arbitrary support vector \(\mathbf{x}_{j}\) and calculating it from the KKT conditions. Thus,

$$\begin{aligned} b=y_{j}-\sum _{i=1}^{l}y_{i}\alpha _{i}K(\mathbf{x}_{i},\mathbf{x}_{j}) \end{aligned}$$

where \(K(\mathbf{x}_{i},\mathbf{x}_{j})=\langle \phi (\mathbf{x}_{i}), \phi (\mathbf{x}_{j})\rangle \) is the kernel function (Debnath et al. 2004).

When we are dealing with the linearly non-separable case, we need to introduce non-negative slack variables \(\xi _{i}\) into Eqs. (2) and (3). Hence, we minimize

$$\begin{aligned} \frac{1}{2}||\mathbf{w}||^2+C\sum _{i=1}^{l}\xi _{i} \end{aligned}$$

with respect to

$$\begin{aligned}&y_{i}(\langle \phi (\mathbf{x}_{i}),\mathbf{w}\rangle +b)\ge 1-\xi _{i} \\&\xi _{i}\ge 0,\quad i=1,2,\ldots ,l \nonumber \end{aligned}$$
(6)

where \(C\) is a user-specified parameter (also called the box constraint), which controls the trade-off between the maximum margin and the minimum classification error. Now, the primal form of the Lagrangian is

$$\begin{aligned} L(\mathbf{w},b,\alpha ,\mu )&= \frac{1}{2}||\mathbf{w}||^2+C\sum _{i=1}^{l}\xi _{i}\nonumber \\&\qquad -\sum _{i=1}^{l}\alpha _{i}[y_{i}(\langle \phi (\mathbf{x}_{i}), \mathbf{w}\rangle +b)-1+\xi _{i}]- \sum _{i=1}^{l}\mu _{i}\xi _{i} \nonumber \end{aligned}$$

subject to the constraints in Eq. (6) and \(\alpha _{i},\mu _{i}\ge 0, \,i=1,2,\ldots ,l\). Again we need to minimize \(L(\mathbf{w},b,\alpha ,\mu )\) with respect to \(\mathbf{w}\), \(b\) and \(\xi \) and, moreover, maximize it with respect to \(\alpha \) and \(\mu \). Analogously to the linearly separable case we can convert the primal problem to a dual problem:

$$\begin{aligned} L(\alpha )=\sum _{i=1}^{l}\alpha _{i}-\tfrac{1}{2}\sum _{i=1}^{l} \sum _{j=1}^{l}\alpha _{i}\alpha _{j}y_{i}y_{j}\langle \phi (\mathbf{x}_{i}), \phi (\mathbf{x}_{j})\rangle \end{aligned}$$
(7)

subject to the constraints

$$\begin{aligned} \sum _{i=1}^{l}y_{i}\alpha _{i}=0\quad \hbox {and}\quad 0\le \alpha _{i}\le C, \end{aligned}$$

when \(i=1,2,\ldots ,l\). As we see, the dual form is almost identical to that of the linearly separable case. The only difference is that here the \(\alpha _{i}\)’s are bounded above by the constant \(C\). The optimal solution for the weight vector is

$$\begin{aligned} \mathbf{w}=\sum _{i\in SV}\alpha _{i}y_{i}\phi (\mathbf{x}_{i}) \end{aligned}$$

where the \(\alpha _{i}\)’s are the optimized values and \(SV\) is the set of indices of the support vectors. The bias can be evaluated by choosing a support vector satisfying \(0<\alpha _{i}<C\) and applying the KKT conditions as before. A new example \(\mathbf{x}\) is classified according to the sign of the decision function

$$\begin{aligned} f(\mathbf{x})=\sum _{i=1}^{l}\alpha _{i}y_{i}K(\mathbf{x},\mathbf{x}_{i})+b. \end{aligned}$$
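
To make the decision rule concrete, the following minimal sketch evaluates the decision function of a trained soft-margin SVM from its support vectors, dual coefficients \(\alpha _{i}\), labels \(y_{i}\) and bias \(b\). The function names and the RBF parameterization are our own illustrative assumptions; the actual experiments reported below were run with the Matlab toolboxes.

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """One common RBF parameterization: K(x, z) = exp(-||x - z||^2 / (2 * sigma^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * sigma ** 2))

def decision_function(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * K(x, x_i) + b over the support vectors."""
    return sum(a * y * kernel(x, sv)
               for a, y, sv in zip(alphas, labels, support_vectors)) + b

def classify(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Binary SVM prediction: the sign of the decision function."""
    return 1 if decision_function(x, support_vectors, alphas, labels, b, kernel) >= 0 else -1
```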

2.2 DAGSVM

Directed Acyclic Graph Support Vector Machine (DAGSVM) (Debnath et al. 2004; Demirkesen and Cherifi 2008; Jian et al. 2008; Joutsijoki and Juhola 2011c; Lorena and de Carvalho 2004; Platt et al. 2000) uses a rooted binary Decision Directed Acyclic Graph (DDAG) learning structure. This graph does not have any cycles. The training phase in DAGSVM is similar to that of the one-vs-one method since the graph has \(\frac{M(M-1)}{2}\) nodes, each of which contains an SVM classifier. Every node is trained only with the training examples labeled by the two classes of that node. The classification of a test example begins at the root node and continues via the left or right edge depending on the output of the SVM classifier. The classification continues in this way until a leaf of the graph is reached, where the final class of the test example is obtained. The path from the root node to a leaf is called an evaluation path, and with this graph construction only \(M-1\) comparisons are needed in order to determine the final class of a test example.

The DDAG can also be operated on a list where each node eliminates one class from the list (Platt et al. 2000). A test example is evaluated against the decision node that corresponds to the first and last elements of the list. Depending on the output of the classifier in that node, one of the two classes is removed from the list. An advantage of this method is its efficient evaluation process, but a disadvantage lies in the list (or DDAG) order. The order of the list (or DDAG) is not specified, and different orders can produce different results. Hence, it would be an interesting question to find a method for determining an optimal class order in the list (or DDAG). A traditional way is to use alphabetical (or numerical) order, which is what we applied in our experimental tests; an example of the numerical order in a four-class case can be found in Fig. 1.
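
A minimal sketch of this list-based evaluation, assuming the pairwise classifiers are already trained and exposed through a hypothetical predict(ci, cj, x) function that returns the winning class of node (ci, cj):

```python
def ddag_classify(x, classes, predict):
    """Classify x with a DDAG over the given class list.

    classes : class labels in the chosen (e.g. numerical or alphabetical) order
    predict : callable (ci, cj, x) -> ci or cj, the binary SVM of node (ci, cj)

    Each step removes one class from the list, so exactly len(classes) - 1
    binary decisions are evaluated.
    """
    remaining = list(classes)
    while len(remaining) > 1:
        first, last = remaining[0], remaining[-1]
        winner = predict(first, last, x)
        # The class voted against cannot be the final answer anymore; drop it.
        if winner == first:
            remaining.pop()       # eliminate the last class of the list
        else:
            remaining.pop(0)      # eliminate the first class of the list
    return remaining[0]
```

For the four-class DDAG of Fig. 1, for instance, ddag_classify(x, [1, 2, 3, 4], predict) would start from the 1-vs-4 node and proceed down the corresponding branch until a single class remains.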

Fig. 1 An example of a 4-class DDAG

3 Experimental tests

3.1 Data description and test arrangements

The image dataset consists of altogether 2,156 images from eight taxonomic groups of benthic macroinvertebrates: Baetis rhodani, Diura sp., Heptagenia sulphurea, Hydropsyche pellucidulla, Hydropsyche siltalai, Isoperla sp., Rhyacophila nubila and Taeniopteryx nebulosa. We refer to the classes with the capital letters A–H, and Fig. 2 shows an example image from each taxonomic group. Each of the taxonomic groups belongs to the EPT orders. More specifically, Baetis rhodani and Heptagenia sulphurea belong to the order Ephemeroptera (mayflies), Diura sp., Isoperla sp. and Taeniopteryx nebulosa belong to the order Plecoptera (stoneflies) and Hydropsyche pellucidulla, Hydropsyche siltalai and Rhyacophila nubila belong to the order Trichoptera (caddisflies). The EPT orders are dominant in clean water, and species belonging to these orders are used as indicator species in water quality assessment. A noticeable fact is that Isoperla sp. and Diura sp. were identified only at the genus level, whereas the other taxonomic groups were identified at the species level. The corresponding sizes of the taxonomic groups were 136, 255, 54, 117, 219, 521, 210 and 644. Features were extracted from the images with the ImageJ program (ImageJ) and the total number of features in the dataset was 32.

Fig. 2 Example images from the different taxonomical groups of benthic macroinvertebrates. The order from upper left to lower right is A–H

We used seven different kernel functions in our tests: the linear kernel, polynomial kernel functions with degrees 2, 3, 4 and 5, the Radial Basis Function (RBF) kernel and the Sigmoid kernel. Altogether there are four different parameters in SVM with these kernel functions: the box constraint \(C\) is always present in SVM, the parameter \(\sigma >0\) appears only in the RBF kernel, and the parameters \(\kappa >0\) and \(\delta <0\) appear only in the Sigmoid kernel. For the box constraint, \(\sigma \) and \(\kappa \) the parameter space used was \(\{0.5, 1.0,\ldots ,20.0\}\) and for \(\delta \) the corresponding parameter space was \(\{-20.0,-19.5,\ldots ,-0.5\}\). We set \(\kappa =-\delta \) for computational reasons; otherwise, the number of tested parameter combinations for the Sigmoid kernel would have increased to 64,000. Thus, the RBF and Sigmoid kernels were tested with 1,600 parameter combinations and the rest of the kernel functions were tested with 40 parameter values. All the tests were made in the Matlab environment together with the Bioinformatics Toolbox and the Statistics Toolbox. Moreover, for finding the optimal hyperplane we used the least squares method introduced in (Suykens and Vandewalle 1999).
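
For reference, the kernel functions and parameter grids described above can be written out as in the sketch below. The polynomial and RBF parameterizations shown are common textbook forms and may differ in detail from those used internally by the Matlab toolboxes, so they should be read as assumptions.

```python
import numpy as np
from itertools import product

def linear(x, z):
    return np.dot(x, z)

def polynomial(x, z, degree):              # degrees 2, 3, 4 and 5 were tested
    return (np.dot(x, z) + 1.0) ** degree

def rbf(x, z, sigma):                      # sigma > 0
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def sigmoid(x, z, kappa, delta):           # kappa > 0, delta < 0
    return np.tanh(kappa * np.dot(x, z) + delta)

# Parameter grids as described in the text (with kappa = -delta for Sigmoid).
box_constraints = np.arange(0.5, 20.5, 0.5)            # C in {0.5, 1.0, ..., 20.0}
sigmas = np.arange(0.5, 20.5, 0.5)                      # sigma (and kappa) grid
rbf_grid = list(product(box_constraints, sigmas))       # 40 * 40 = 1,600 combinations
sigmoid_grid = [(c, k, -k) for c, k in product(box_constraints, sigmas)]   # 1,600 combinations
other_grid = list(box_constraints)                      # 40 values of C for the other kernels
```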

We repeated the tests altogether with six feature sets. Five of the six feature sets (7D, 8D, R8D, 15D and 24D) were selected based on the previous articles (Joutsijoki and Juhola 2011a, b, c). The 7D feature set was referred to as the statistical feature set (it could also be characterized as an intensity-based feature set) in Joutsijoki and Juhola (2011a). Moreover, the 8D feature set was referred to as the geometrical feature set in Joutsijoki and Juhola (2011a). One feature set, 22D, was selected by the Scatter method (Juhola and Siermala 2012). More specifically, the feature sets were

1. 7D={Mean, Standard Deviation, Mode, Median, Integrated Density, Kurtosis, Skewness},

2. 8D={Area, Perimeter, Width, Height, Feret’s Diameter, Major, Minor, Circularity},

3. 15D=7D\(\cup \)8D,

4. R8D={Area, Mean, Width, Height, Minor, Integrated Density, Kurtosis, Skewness} (a random eight-feature subset of 15D),

5. 22D={Mean, Standard Deviation, Mode, Median, Integrated Density, Skewness, Area, Perimeter, Width, Height, Feret’s Diameter, Major, Minor, Min, X, Y, XM, YM, BX, BY, FeretX, MinFeret},

6. 24D=15D\(\cup \){Min, Max, X, Y, XM, YM, BX, BY, Angle}.

Of the features, Integrated Density is the sum of the pixel values in the image or selection. Perimeter is the length of the outside boundary of the selection. Width and Height are the width and height of the smallest rectangle enclosing the selection. Feret’s Diameter is the longest distance between any two points along the selection boundary. Major and Minor are the primary and secondary axes of the best fitting ellipse. Circularity is defined as \(4\pi \times \frac{Area}{Perimeter^2}\). Features X and Y are the coordinates of the center point, whereas XM and YM are the coordinates of the center of mass. Features BX and BY are the coordinates of the upper left corner of the smallest rectangle enclosing the selection. FeretX gives the length of the object’s projection in the X direction. MinFeret, in turn, is the smallest distance between any two points along the selection boundary. Angle is the angle between the primary axis and a line parallel to the X-axis of the image. More details concerning the definitions of the features can be found in (ImageJ).
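
To illustrate a few of the simpler geometrical features, the sketch below computes Area, Width, Height, BX, BY, X, Y and Circularity from a binary object mask. It is only a rough approximation of what ImageJ computes (the perimeter estimate in particular is a crude assumption); the actual feature extraction in this paper was done with ImageJ itself.

```python
import numpy as np

def simple_shape_features(mask):
    """mask : 2-D boolean array, True for pixels belonging to the object."""
    ys, xs = np.nonzero(mask)
    bx, by = xs.min(), ys.min()                  # BX, BY: upper-left corner of bounding box
    width = xs.max() - bx + 1                    # Width of the enclosing rectangle
    height = ys.max() - by + 1                   # Height of the enclosing rectangle
    area = int(mask.sum())                       # Area in pixels
    # Crude perimeter estimate: object pixels with at least one background 4-neighbour.
    padded = np.pad(mask, 1, constant_values=False)
    boundary = mask & ~(padded[:-2, 1:-1] & padded[2:, 1:-1] &
                        padded[1:-1, :-2] & padded[1:-1, 2:])
    perimeter = int(boundary.sum())
    circularity = 4.0 * np.pi * area / perimeter ** 2 if perimeter > 0 else 0.0
    x_center, y_center = xs.mean(), ys.mean()    # X, Y: coordinates of the center point
    return dict(BX=bx, BY=by, Width=width, Height=height, Area=area,
                Perimeter=perimeter, Circularity=circularity, X=x_center, Y=y_center)
```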

Table 1 The best groupwise kernel parameters

In classification we applied 10 times 10-fold cross-validation. The dataset was standardized columnwise to zero mean and unit variance before classification. Other transformations were not made since we wanted to keep the classification procedure as natural as possible. Generally speaking, all transformations of the dataset should be made with great caution because every change to the dataset complicates the analysis of the original data. Optimal parameter selection for DAGSVM was made in a similar manner as in Joutsijoki and Juhola (2012). In SVM training we used the method proposed in (Hsu and Lin 2002) where all individual SVM classifiers are trained with the same parameter values. As the final result, the mean of all results was evaluated. The parameter values obtained with the different kernel functions and feature sets are given in Tables 1 and 2. For the linear and polynomial kernel functions there is only one number in the parentheses, the box constraint. For RBF the first parameter is the box constraint and the second one is \(\sigma \). For Sigmoid the first parameter is the box constraint, the second one is \(\kappa \) and the third one is \(\delta \). In the case of \(k\)-NN we tested four different measures (Euclidean, cityblock, correlation and cosine) and each of these was tested with odd \(k\) values from 1 to 25. When using CT with Matlab, pruning of the tree was made automatically such that the splitting criterion was 10 or more examples in a node. The mean classification rate (also known as sensitivity or true positive rate) and the mean accuracy were used as the main evaluation measures in this paper. Accuracy is defined as the trace of the confusion matrix divided by the sum of all its elements.
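
A minimal sketch of this evaluation protocol is given below, under the stated assumptions (scikit-learn utilities, stratified folds, and a placeholder fit_predict function standing in for any of the classifiers). The actual experiments were run in Matlab, so this is only an outline of the procedure, not the code used.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

def repeated_cv_accuracy(X, y, fit_predict, n_repeats=10, n_folds=10, seed=0):
    """10 times 10-fold CV; fit_predict(X_tr, y_tr, X_te) returns predicted labels for X_te."""
    # Columnwise standardization to zero mean and unit variance, as in the text.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    accuracies = []
    for r in range(n_repeats):
        folds = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + r)
        for train_idx, test_idx in folds.split(X, y):
            y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
            cm = confusion_matrix(y[test_idx], y_pred, labels=np.unique(y))
            # Accuracy = trace of the confusion matrix / sum of all its elements.
            accuracies.append(np.trace(cm) / cm.sum())
            # The classwise classification rate (sensitivity) would be
            # cm.diagonal() / cm.sum(axis=1) for the classes present in the fold.
    return float(np.mean(accuracies))
```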

3.2 Results

In Tables 3, 4, 5, 6, 7 and 8 we have compressed the results so that we present only the mean classification rates for each class instead of complete mean confusion matrices. In these result tables we have collected the mean classification rates from all tested classification methods and we boldfaced the best result (or results, if there are ties) in each column. Table 9 shows the obtained mean accuracies of all classification methods with the different feature sets.

Table 2 The best groupwise kernel parameters
Table 3 Mean classification rates (%) with different classification methods and 7D feature set
Table 4 Mean classification rates (%) with different classification methods and 8D feature set
Table 5 Mean classification rates (%) with different classification methods and R8D feature set
Table 6 Mean classification rates (%) with different classification methods and 15D feature set
Table 7 Mean classification rates (%) with different classification methods and 22D feature set
Table 8 Mean classification rates (%) with different classification methods and 24D feature set
Table 9 Obtained mean accuracies (%) with different classification methods and feature sets

The 7D feature set has previously proved to be a good choice for the classification of benthic macroinvertebrate samples (Joutsijoki and Juhola 2011a, c). With the new dataset the situation was slightly different. Table 3 shows the classification rates with the 7D feature set and all classification methods tested. It can immediately be seen that class C (Heptagenia sulphurea) was identified with the poorest classification rates compared to the other classes. This is quite natural since it was the smallest class (54 examples) in the dataset and small classes can easily be suffocated by larger classes in classification. The highest classification rate for class C was obtained by the quadratic kernel function, being around 47 %. A noticeable detail was also that Naïve Bayes lost class C in classification totally. Classes D and G were also identified with below 60 % mean classification rates by every classification method. In the case of class G the classification rates ranged from around 40 to 53 % when excluding the result of the Sigmoid kernel function. The best classification rate was obtained by the cubic kernel function of DAGSVM.

The Minimum Mahalanobis distance classifier (MMDC) was the best alternative for class D, having a nearly 60 % classification rate. Otherwise, DAGSVM with the linear, quadratic and cubic kernel functions were the only ones reaching over a 50 % classification rate. Class D was the second smallest class in the dataset (117 examples), which can partly explain the relatively low results. Class B was recognized significantly better than the previous classes. DAGSVM with the quadratic and RBF kernel functions, MMDC and \(k\)-NN with all measures tested achieved above 70 % classification rates, which can be considered a good result in such a difficult classification task. In the case of class A the first classification rates above 80 % were achieved. These were obtained by Quadratic Discriminant Analysis and four DAGSVM kernels, and the highest result was gained by DAGSVM with the quadratic kernel. Compared to the \(k\)-NN results, DAGSVM was better (Sigmoid excluded). Classes E and F are interesting cases since all classification methods succeeded in classifying these classes relatively well. In the case of class E the results settled in the interval from 61.4 to 87.3 %, which is significantly better than for the previous classes. The highest classification rate was given by \(k\)-NN with the correlation measure. Class F had results similar to class E: the interval was 64.5–87.6 % and the best result was obtained by \(k\)-NN with the cosine measure. An interesting detail is that the Sigmoid kernel function reached 66.4 and 72.0 % classification rates in these classes, which is remarkably higher than in the other classes. The largest class, class H, was identified with good classification rates, which is quite natural. The best classification rate, 88.6 %, was again gained by the \(k\)-NN classifier, this time with the Euclidean metric. All other classification methods obtained above 80 % classification rates in this class except CT, MMDC and DAGSVM with the Sigmoid kernel.

In (Joutsijoki and Juhola 2011a) the 7D, 8D and R8D feature sets were tested. The experimental tests in (Joutsijoki and Juhola 2011a) showed that the 7D features were the most suitable for classification. Moreover, the 8D feature set was the worst alternative and this tendency also continued in this paper. Again, classes C, D and G were the hardest classes to identify. In the case of classes C and D many of the classification methods recognized these classes with below 20 % classification rates. The highest classification rate for class C, 56.2 %, was obtained by DAGSVM with the quadratic kernel, as in Table 3. Furthermore, MMDC gained the best classification rate for class D (50.7 %), similarly as in Table 3. A slight difference compared to Table 3 was that classes C and D changed places with each other in the sense that class C achieved a higher topmost classification rate than class D. In the case of class G the results in Tables 3 and 4 were quite similar to each other, with the exception that in Table 4 NB obtained the lowest classification rate among all classification methods instead of DAGSVM with the Sigmoid kernel. The best classification rate in class G was gained by DAGSVM with the quadratic kernel function (53.3 %).

Classes A, B, E, F and H were recognized with better classification rates than the rest of the classes. A similar tendency was also seen in Table 3. One reason behind this finding can be seen by simply looking at the example images in Fig. 2, where classes A, B, F and H have similar exterior features. An interesting detail is that the topmost classification rates in these classes were not achieved by DAGSVM. In the case of classes E, F and H, \(k\)-NN with the Euclidean metric gained the best result. Moreover, in class H \(k\)-NN with the Euclidean and cityblock metrics achieved the same classification rate, 83.9 %. Class A was identified with a very good classification rate (88.8 %) when QDA was used; compared to the other methods, it had more than a 7 % margin to the next best classification method. The difference between MMDC and the other methods in class B was huge: while MMDC achieved an 86.3 % result, the other methods reached below 74 % classification rates. Furthermore, the interval in which the classification rates settled was very wide in the case of class A. Classes E and F got similar results with many classification methods. Both of these classes were recognized with the highest classification rates by \(k\)-NN together with the Euclidean metric. The corresponding classification rates were 84.6 and 88.9 %. In the case of class E only three of all the classification methods tested reached above an 80 % classification rate, whereas the corresponding number was four in the case of class F. More information about the results for the 8D feature set can be found in Table 4.

Table 5 shows the results when a random eight-feature subset (R8D) was selected from the 15D features. The first difference compared to Tables 3 and 4 is that class D was now recognized better than in the previous result tables. In Tables 3 and 4 class D obtained its best classification rate with MMDC and now the situation was the same; the classification rate was now 70.5 %. The highest classification rate in class C was 53.9 %, which was also the poorest maximum classification rate among all classes, and it was obtained by the quadratic kernel. This is, however, a natural occurrence since class C was the smallest class in the dataset. Also, the interval in which the mean classification rates settled was wide, ranging from nearly 0 to 53.9 %. Just as in Tables 3 and 4, class G was the second poorest class to be recognized. It was identified with below 60 % classification rates by every classification method used. DAGSVM with the quadratic kernel function achieved a 59.6 % mean classification rate. DAGSVM with the 3rd degree polynomial kernel function and the RBF kernel also gained above 50 % mean classification rates while the other classification methods stayed below 50 %.

Classes B, E and F were classified quite similarly. MMDC achieved the highest classification rate in class B, 87.3 %. The other above 80 % results were obtained by \(k\)-NN with the correlation measure and DAGSVM with the RBF kernel. A major part of the classification methods reached above a 60 % result, which is relatively good. Both classes E and F obtained their highest classification rates with \(k\)-NN and the cosine measure, which is quite a surprising result. The corresponding results were 88.0 and 88.6 %. Naïve Bayes gained almost the same result in class E as \(k\)-NN with the cosine measure, losing only 0.2 %, and this class was clearly the best identified class with NB. All measure alternatives of \(k\)-NN achieved above an 83 % classification rate in this class. A surprising aspect was that DAGSVM managed to get above 80 % results only with the cubic and RBF kernel functions. In the case of class F the only classification methods which achieved above 80 % classification rates were the \(k\)-NN alternatives. The kernels from the quadratic to RBF yielded classification rates within a 1.5 % interval. The last two classes, A and H, were the most interesting cases. In the case of class A the highest mean classification rate was given by QDA, 92.7 %, and this was the first time that class A was recognized with an above 90 % classification rate. DAGSVM with the linear, quadratic and cubic kernel functions obtained above 85 % classification rates, but the other classification methods were left clearly behind these results. In the case of class H, \(k\)-NN with the cityblock and Euclidean metrics yielded above 90 % classification rates, the Euclidean metric being the better of the two. DAGSVM also achieved above 82 % classification rates excluding the Sigmoid kernel function. Other classification methods reaching a similar level were LDA and \(k\)-NN with the correlation and cosine measures.

Next we used the 15D feature set, which gave good results in (Joutsijoki and Juhola 2011a) and (Joutsijoki and Juhola 2011b). Similar trends can be seen in Table 6 as in Tables 3, 4 and 5. Classes C, D and G were again the worst recognized classes, but now the topmost classification rates of all these classes reached above 60 %. LDA performed with the highest classification rate (61.7 %) in the case of class C, being higher than all the corresponding results in Tables 3, 4 and 5. An interesting detail is that MMDC performed best in the case of class D compared to the other classification methods, and this tendency continued also in Table 6. This time the classification rate was 72.3 % and the next best was DAGSVM with the linear kernel function, having only a 57.5 % classification rate. Class G gained very much the same kind of results as in Table 5, and the best classification rate was given by the quadratic kernel function, as in Tables 4 and 5. Of the other classification methods only MMDC and DAGSVM with the RBF kernel function reached above a 60 % classification rate.

When considering classes A, B, E, F and H, Table 6 showed patterns similar to Tables 3, 4 and 5. In the case of class A, QDA gained the highest classification rate (92.2 %) as in Tables 4 and 5. Otherwise, the results of class A were largely similar to the results in Table 5. The other classification method which achieved an above 90 % classification rate for class A was DAGSVM with the quadratic kernel function. MMDC obtained the best classification rate in class B, as it did when the 8D and R8D feature sets were used. Now the classification rate was 90.3 % and it was the only one which exceeded the limit of 90 %. DAGSVM with the RBF kernel and \(k\)-NN with the correlation and cosine measures were the next ones after MMDC. Classes E and F were identified with the highest classification rates (86.9 and 87.8 %) when \(k\)-NN was used; this occurred also in the previous result tables. Of the other classification methods, Naïve Bayes gained a very good result (86.3 %) in the case of class E, as did QDA (83.6 %) and DAGSVM with the RBF kernel function (83.0 %). When considering class F, the \(k\)-NN method together with the quadratic, cubic and RBF kernel functions of DAGSVM were the classification methods which reached above 80 % classification rates. It is interesting to see that specific patterns repeat throughout the tests regardless of the feature set used. A recurrent phenomenon also occurs in the case of class H: the two highest classification rates were obtained by \(k\)-NN with the Euclidean and cityblock metrics, as happened also in Tables 3, 4 and 5. The best results were now 91.6 and 90.8 % and these were the only ones which exceeded the 90 % limit. Otherwise, LDA, the other \(k\)-NN measures and DAGSVM with the linear, quadratic, cubic and RBF kernel functions were the best alternatives for class H.

Table 7 is interesting because it shows the results after the Scatter method (Juhola and Siermala 2012) was applied to feature selection. The main point in using the Scatter method is to exclude unsatisfactory features and improve classification. With the Scatter method we reduced the number of features by ten and obtained the 22D feature set. Again classes C, D and G obtained the poorest results, and of these classes especially class C gained only 56.7 % as its best result. This was achieved by LDA and by DAGSVM with the linear kernel function. The other classification methods, except DAGSVM with the RBF kernel, were left clearly behind the highest classification rates. In the case of class D, MMDC yielded the highest classification rate, 67.8 %. This pattern was seen also in Tables 3, 4, 5 and 6. Of the other methods only DAGSVM with the quadratic kernel function reached above a 60 % classification rate. A noticeable detail was also that DAGSVM with the linear and RBF kernel functions gained relatively good results. Throughout the studies in Joutsijoki (2012) and Joutsijoki and Juhola (2011a, b, c, 2012) the linear, quadratic and RBF kernel functions have seemed to be the best alternatives for benthic macroinvertebrate classification. Class G achieved results similar to class D with many classification methods. The best classification rate (69.1 %) was given by DAGSVM with the quadratic kernel function, which also gained the highest classification rate for class G in Tables 4, 5 and 6. Other classification methods which obtained above a 60 % classification rate were LDA, MMDC, QDA and DAGSVM with the linear kernel.

The rest of the classes were recognized significantly better when considering their highest classification rates. In the case of class E the topmost classification rate was achieved by \(k\)-NN with the Euclidean metric (88.2 %). The other \(k\)-NN variants succeeded quite well, having above 80 % classification rates, and this pattern was repeated also in Tables 3, 4, 5 and 6. It seems that \(k\)-NN is a good choice for the classification of class E regardless of the feature set used. An interesting detail is that Naïve Bayes also performs well in the classification of class E regardless of the feature set selection. When considering classes F and H, the \(k\)-NN method worked well. With the cityblock metric class F was identified with a 90.2 % mean classification rate and class H with 90.8 % when using the Euclidean metric. Overall, \(k\)-NN was a good choice for the classification of classes F and H. In the case of class F the DAGSVM results were left below 85 %, but in the case of class H a small improvement was obtained compared to class F, since the highest classification rates were 88.7 % with the quadratic and RBF kernel functions. Two classification methods reached above 90 % classification rates in the case of class A. These were QDA and DAGSVM with the quadratic kernel, and of these two QDA was the better, having a 92.2 % classification rate. QDA was also the best classification method in Tables 4, 5 and 6. DAGSVM with the RBF kernel reached a nearly 87 % classification rate and with the linear kernel slightly above 87 %. Class B obtained somewhat different results. The MMDC method again achieved the best mean classification rate in this class, now 99.0 %, which was the highest mean classification rate within this paper. The other classification methods were left clearly behind, since the next one was QDA with an 84.6 % mean classification rate. It seems that MMDC is an especially suitable classification method for classes B and D.

Our sixth feature set, 24D, was tested in (Joutsijoki and Juhola 2011b) with great success. The results were quite similar to those obtained with the Scatter method. Table 8 showed that classes C, D and G were the poorest ones to be recognized. LDA identified class C with a 57.7 % mean classification rate and the difference to Table 7 was only 1 %. In the case of class D a small 0.3 % improvement was achieved compared to the 22D feature set results, with MMDC again obtaining the highest mean classification rate. The other classification methods were left below 60 % mean classification rates and the general trend compared to Table 7 was that the results were worse than with the 22D feature set. It seems that increasing the number of features did not bring any substantial improvement in the results. In the case of class G the highest mean classification rate, 70.0 %, was achieved by the quadratic kernel of DAGSVM, which was also the best one in the previous result tables excluding Table 3. The next ones were DAGSVM with the RBF kernel and MMDC, with mean classification rates of 68.4 %.

In the case of classes E and F, \(k\)-NN was the method of choice. Class E was recognized with an 88.1 % mean classification rate when \(k\)-NN together with the Euclidean metric was used. The same metric was the best choice also in the case of class F, where the corresponding mean classification rate was 90.1 %. Overall, \(k\)-NN with all the measures used gained the highest mean classification rates of all classification methods. Of the DAGSVM results for classes E and F, the RBF kernel function obtained the topmost mean classification rates. When considering class H, DAGSVM with the RBF kernel reached a 91.1 % mean classification rate, being the highest one, but \(k\)-NN with the Euclidean and cityblock metrics were within 1 % of it. These three methods were the only ones having above 90 % classification rates in the case of class H. An interesting detail, which occurred in Tables 3, 4, 5, 6, 7 and 8, was that DAGSVM with the Sigmoid kernel always obtained significantly better mean classification rates in class F than in class H. For class A a trend similar to the previous result tables can be found, since now QDA and DAGSVM with the quadratic kernel achieved the same mean classification rate, 87.8 %, which was over a 4 % decrease compared to Table 7. Naïve Bayes and DAGSVM with the linear and RBF kernels were the other ones reaching above 80 % mean classification rates in the case of class A. MMDC clearly gained the best mean classification rate in the case of class B, 97.1 %. Although this was a very good result, it was 1.9 % lower than the result in Table 7. MMDC was the only classification method which exceeded the 90 % limit. QDA and DAGSVM with the RBF kernel were the next ones after MMDC, but they were left below 87 % mean classification rates.

Table 9 shows all the mean accuracies obtained with the different classification methods and feature sets. From the first column it can be seen that the cubic kernel function of DAGSVM was the best choice, having a 75 % mean accuracy. DAGSVM with the RBF kernel also came close to the result of the cubic kernel function. A noticeable detail was that \(k\)-NN obtained good mean accuracies ranging from 72.4 to 74.0 %. The second column presents the mean accuracies of the 8D feature set. The 8D feature set, or geometrical feature set as it was referred to in Joutsijoki and Juhola (2011a), obtained generally lower mean accuracies than the statistical feature set (7D). The best mean accuracy, 70.0 %, was gained by DAGSVM with the quadratic kernel, and the cubic kernel of DAGSVM came close with 69.1 %. The R8D feature set yielded interesting results. Again the quadratic kernel was the best one, having a 77.6 % mean accuracy, and the cubic and RBF kernels came very close to it. Of the other classification methods, \(k\)-NN achieved mean accuracies of 73.0 % or above while the rest of the methods were left below 73.0 %.

Increasing the number of features had a clear effect on the accuracies. When the 15D feature set (the union of the geometrical and statistical features) was used, DAGSVM with the RBF kernel exceeded the limit of 80 %. Other noteworthy results were the 79.9 % mean accuracy of the quadratic kernel and the 76.3 % mean accuracy of DAGSVM with the linear kernel. Again, \(k\)-NN was a good alternative for classification since its results ranged from 75.8 to 76.8 %. Of special interest was the 22D feature set, which was obtained by using the Scatter method. However, the results did not bring any great surprises. DAGSVM with the quadratic (81.1 %) and RBF (79.9 %) kernels were the best alternatives. Moreover, DAGSVM with the linear kernel and \(k\)-NN with all the measures tested obtained good results ranging from 77.5 to 79.1 %. The last feature set, 24D, did not bring any considerable changes to the results compared to the 22D results. The same kernel functions obtained the topmost results as in the case of the 22D feature set. A difference to the 22D results was that now both the quadratic and RBF kernels achieved above 80 % mean accuracies. Furthermore, \(k\)-NN gained mean accuracies within a 1 % range with all the measures tested.

4 Conclusions

In this paper we examined the suitability of the Directed Acyclic Graph Support Vector Machine for benthic macroinvertebrate classification. The previous articles (Joutsijoki and Juhola 2011a, b, c) gave a good starting point for using SVM in this classification task. In this paper the dataset was larger than in the articles (Joutsijoki and Juhola 2011a, b, c; Kiranyaz et al. 2010a, b, 2011; Ärje et al. 2010; Tirronen et al. 2009). We performed extensive experimental tests with six feature sets, and every feature set was tested with seven kernel functions. The RBF and Sigmoid kernel functions were tested with 1,600 parameter combinations and the other kernel functions with 40 parameter combinations. Furthermore, we applied the Scatter method (Juhola and Siermala 2012) to feature selection. By this means we wanted to find an optimal feature set and decrease the number of features in order to make the classification procedure computationally lighter. As baselines for DAGSVM we tested six other classification methods. Furthermore, the \(k\)-NN method was tested with four different measures.

The experimental results showed clearly that the DAGSVM and \(k\)-NN methods were the best ones for benthic macroinvertebrate classification. Especially in the case of DAGSVM, the quadratic and RBF kernels showed their power in classification. Moreover, for \(k\)-NN the cityblock and Euclidean metrics were good choices for this classification problem. With the smaller feature sets mean accuracies below 78 % were achieved, but when the number of features was 15 or higher the best mean accuracies were above 80 %. An especially interesting fact was that the 22D feature set, which was obtained by the Scatter method (Juhola and Siermala 2012), gave the second highest mean accuracy of all the mean accuracies in this paper. This confirms that the Scatter method is a proper algorithm for feature selection and that it can be used in benthic macroinvertebrate classification. The highest mean accuracy, 81.8 %, was given by DAGSVM with the RBF kernel function and the 24D feature set.

DAGSVM with the Sigmoid kernel function was, generally speaking, the worst alternative, as was also seen in (Joutsijoki and Juhola 2011a, b, c). Reasons behind the low classification rates may lie in the preprocessing stage and in the agreement \(\kappa =-\delta \). A different preprocessing, such as linear scaling into the interval \([-1,1]\) or \([0,1]\), and removing the \(\kappa =-\delta \) constraint could increase the classification rates and accuracies of the Sigmoid kernel function, but these are questions to be researched more closely in the future. Moreover, the shape of the Sigmoid function can be one reason behind the poor results, since the Sigmoid is bounded above and below. If the argument values lie far to the left on the x-axis, the differences in the values of the Sigmoid function are so small that some inaccuracy arises due to the limited numerical precision of a computer. Correspondingly, if the values lie far to the right on the x-axis, the results may also have been influenced by the limited numerical precision.

In the future we also need to examine how other SVM multi-class extensions such as one-vs-all, one-vs-one, error correcting output codes and the half-against-half method work with the new dataset. Furthermore, other kernel functions than those used in this paper can be considered in the future. We can recommend the DAGSVM and \(k\)-NN methods for the automated taxa identification of benthic macroinvertebrates. In our experimental results we obtained over 80 % accuracy, which can be considered a good result when human-based accuracy in benthic macroinvertebrate identification is around 80 %. However, it needs to be remembered that this is still a preliminary paper, since the classification of benthic macroinvertebrates is a little researched area. Hence, more research is needed before final conclusions can be drawn from this application, but automated benthic macroinvertebrate classification is clearly possible and such a classification can be made with high accuracy.