1 Introduction

Many methods for the popular bag-of-features (BoF) model have been proposed in recent years [4, 7, 8]. On images with simple patterns, there is not enough information for dictionary learning to extract: far fewer visual words can be learned from such patterns, which severely limits the descriptive power of the visual words. This lack of resolution and descriptive power in the object patterns makes it very hard for a classifier to label the input images accurately. Recently, a dictionary learning method was proposed in [11] to deal with this problem by extracting more information during dictionary learning. The extra information is generated from the clustering results obtained with different values of k using the k-means++ algorithm. However, the high-dimensional feature vectors used in [11] aggravate the curse of dimensionality and degrade performance.

In this work, we show that our dictionary learning method significantly improves the accuracy of the BoF model for classifying low-resolution images with simple patterns. The improvement is achieved by bringing in extra information obtained through dictionary learning. We propose three variants of our method, based on the multi-channel kernel for the SVM, concatenation of features, and random sampling of k for clustering.

The variant of the proposed method using the multi-channel kernel is called Enriched Dictionary Learning with the Multiple-Channel Kernel (ED-M). The variant using concatenation of features is called Enriched Dictionary Learning with Concatenation (ED-C). The computational efficiency of ED-M can be improved by randomly sampling the values of k for k-means++; we call this more efficient version Enriched Dictionary Learning with the Multiple-Channel Kernel and Random Sampling (ED-MRS). In the experiment section, the BoF model using our three proposed algorithms (ED-M, ED-C, and ED-MRS) is compared with the model using previous state-of-the-art dictionary learning methods (see Table 1). The experiments show that the best results are obtained by the BoF model using our proposed dictionary learning method.

Our three methods differ from the method in [4] as follows: (a) for concatenation, our ED-C method combines the histograms of images, whereas the method in [4] combines clustering results; our experiments show that this form of concatenation performs much better than that of [4]; and (b) in ED-M and ED-MRS, extra information is added to the BoF model through the multi-channel kernel, while [4] uses concatenation only.

Table 1. Summary of the dictionary learning methods compared in Sect. 4. The SVM kernels of the corresponding BoF models are shown in the rightmost column.

The remainder of this paper is organized into four sections. Sect. 2 gives a brief introduction to related work. Our dictionary learning methods are described in detail in Sect. 3. In Sect. 4, our dictionary learning method and all its variants are evaluated with the BoF model on two datasets and compared against other dictionary learning methods. The last section concludes our work.

2 Related Work

The BoF model is one of the most popular methods for image classification. In this paper, we focus on the dictionary learning methods for this model.

The k-means algorithm became one of the most widely used dictionary learning methods after it was first used to generate a dictionary in [2]. The standard k-means algorithm has three main issues. First, its efficiency drops when the value of k or the number of feature descriptors is very large. Second, its result is easily influenced by the positions of the initial centers. Finally, noise in the image can affect the result because the algorithm treats every descriptor equally.

Several algorithms have been proposed to improve the standard k-means algorithm and the BoF model. The Simple Random Sampling K-means (SRS-K) algorithm [1] improves the efficiency of k-means by reducing the number of descriptors. To generate the initial centers more reasonably, the k-means++ algorithm, which comes with a proven statistical guarantee, is used in place of k-means; the BISecting K-means (BIS-K) algorithm [12] addresses the same issue. The Spatial Pyramid Matching (SPM) method [5] is one of the most commonly used variants of the BoF model. To improve the efficiency of SPM, the ScSPM method [10] was proposed, in which sparse coding of the SIFT descriptors replaces k-means as the dictionary learning method. Although these methods are valuable, they neglect the information contained in the clustering results of k-means with different values of k. In [11], a dictionary learning method is proposed to improve the accuracy of the BoF model for image classification; it also exploits extra information obtained from clustering. However, that method aggravates the curse of dimensionality when k for k-means++ is relatively large. We compare our proposed method with it in our experiments.

3 The Proposed Dictionary Learning Method

3.1 The Motivation

We are motivated by the question: can dictionary learning with k-means or k-means++ be improved beyond a single, empirically determined optimal k? The disadvantages of using an empirical k are that (1) rich information may be lost if the empirical k is not carefully chosen through extensive experiments, and (2) the optimal k, which should depend on the particular problem, is sometimes taken from prior work or past experience. We therefore aim to extract rich information from various histograms by clustering with a large number of possible values of k.

Our experiments demonstrate that the proposed method significantly improves dictionary learning over state-of-the-art methods for the popular bag-of-features (BoF) model in the classification of biomedical images. The method is, however, general and can improve dictionary learning for any images with simple patterns.

With enriched information from dictionary learning, we focus on classifying datasets of low-resolution images with simple patterns using the BoF model. Traditionally, the performance of the BoF model on such images may not be ideal because dictionary learning cannot obtain enough information from the simple patterns. Our goal is to extract extra information and thus obtain an enriched dictionary for the BoF model.

We propose a novel dictionary learning method that learns the dictionary using this extra information. With our method, the accuracy of the BoF model can be improved significantly, and it is unnecessary to form a very high-dimensional feature vector over the large (\(\sum _{n=2}^{K} n\)-word) dictionary, which would aggravate the curse of dimensionality. Our method has three variants: ED-M, ED-C, and ED-MRS (see Table 1).

3.2 Notations

To be more precise, the notation used in the rest of this paper is defined in this subsection. N denotes the number of images used for training. \(S_{j}\) is the set of SIFT descriptors extracted from the j-th image, and \(\mathbb {S} = S_{1} \cup \dots \cup S_{N}\) is the set of all SIFT descriptors from all N training images. To extract enriched information from various histograms with multiple values of k using k-means++, we let \(\mathcal {K} = \left\{ 2,3,\dots ,K \right\} \) denote the set of all values of k used. For clustering, \(c_{k}\) is the set of k cluster centers obtained with \(k\in \mathcal {K}\) using k-means++. \(\mathcal {H}\) is the set of all histograms obtained with k-means++, where \(h_{k,n}\in \mathcal {H}\), with \(k\in \mathcal {K}\) and \(n\in \{1,2,\dots ,N\}\), is the histogram generated for the n-th image using the clustering result with k centers.

3.3 Feature Extraction for Enriched Information

Our proposed enriched-information step is achieved using Algorithm 1. We use the k-means++ algorithm for clustering because its statistical guarantee has been shown to give better performance. In Algorithm 1, each \(k\in \mathcal {K}\) gives one iteration of clustering and the associated histograms. k-means++ is therefore performed \(\left| \mathcal {K} \right| \) times, once for each value of k in \(\mathcal {K}\). The \(\left| \mathcal {K} \right| \) sets of cluster centers are then used to obtain \(\left| \mathcal {K} \right| \) histograms for every single image.

[Algorithm 1: Feature extraction for enriched information]
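To make the feature-extraction step concrete, here is a minimal Python sketch of Algorithm 1 using scikit-learn's k-means++ initialization. The function and variable names, and the histogram normalization, are illustrative assumptions; the paper's own implementation is in MATLAB.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_enriched_histograms(descriptor_sets, K):
    """Sketch of Algorithm 1: cluster the pooled SIFT descriptors with
    k-means++ for every k in {2, ..., K} and build one k-bin histogram
    per image and per k."""
    all_desc = np.vstack(descriptor_sets)   # pooled descriptor set S
    histograms = {}                         # histograms[k][n] plays the role of h_{k,n}
    for k in range(2, K + 1):
        km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(all_desc)
        hists = []
        for S_j in descriptor_sets:         # one histogram per image
            labels = km.predict(S_j)
            h, _ = np.histogram(labels, bins=np.arange(k + 1))
            hists.append(h / max(h.sum(), 1))  # normalization is our assumption
        histograms[k] = hists
    return histograms
```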

Although the \(\sum _{n=2}^{K} n\)-word dictionary could yield a very high-dimensional feature vector, our dictionary learning method avoids the resulting curse of dimensionality, and the poor performance it leads to, by using the multi-channel kernel, which keeps the feature vectors significantly shorter. More specifically, in Algorithm 1 one histogram is generated per image using the cluster centers from k-means++ with one particular k, so that histogram has exactly k bins.

3.4 Enriched Dictionary Learning with the Multiple-Channel Kernel (ED-M)

After feature extraction, we propose the ED-M algorithm (see Table 1) for classification; it is shown in Algorithm 2. ED-M uses the multi-channel kernel of the SVM to combine the features obtained in Algorithm 1 for classification in the BoF model.

With \(\mathcal {H}\) from Algorithm 1, the multi-channel chi-square kernel is calculated from the histograms in \(\mathcal {H}\). For each channel k, we have the histograms \(h_{k,j}\in \mathcal {H}\), where j can be any image in the training set, i.e., \(1\le j\le N\). The chi-square kernel between the histograms of two images \(j_1\) and \(j_2\) in channel k is obtained by

$$\begin{aligned} \chi ^2(h_{k,j_1},h_{k,j_2}) = 1 - \sum _{z=1}^{k} \frac{\big [ (h_{k,j_1})_{z}-(h_{k,j_2})_{z} \big ] ^{2} }{\frac{1}{2}\big [(h_{k,j_1})_{z} + (h_{k,j_2})_{z}\big ]} \end{aligned}$$
(1)

where z is a particular bin in histogram \(h_{k,j}\) and \((h_{k,j})_{z}\) is the value of bin z in \(h_{k,j}\).
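For illustration, a minimal Python sketch of the chi-square kernel in Eq. 1; the small eps term guarding against empty bins is our addition, not part of the original formula.

```python
import numpy as np

def chi2_kernel(h1, h2, eps=1e-10):
    """Chi-square kernel between two k-bin histograms (Eq. 1)."""
    num = (h1 - h2) ** 2
    den = 0.5 * (h1 + h2) + eps  # eps avoids division by zero on empty bins
    return 1.0 - np.sum(num / den)
```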

After computing the multi-channel chi-square matrix, the SVM is trained with this matrix to obtain the support vectors. Each element \(M_{j_1,j_2}\) of the matrix is computed using

$$\begin{aligned} M_{j_1,j_2} = \frac{1}{|\mathcal {K}|} \sum _{k\in \mathcal {K}} \chi ^2(h_{k,j_1},h_{k,j_2}) \end{aligned}$$
(2)

where \(\chi ^2(\cdot ,\cdot )\) is the chi-square kernel in Eq. 1.

[Algorithm 2: Enriched dictionary learning with the multiple-channel kernel (ED-M)]
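Building on the chi2_kernel sketch above and the histograms structure from the Algorithm 1 sketch (all names are illustrative), the multi-channel kernel matrix of Eq. 2 could be assembled as follows.

```python
import numpy as np

def multichannel_kernel_matrix(histograms):
    """Multi-channel kernel matrix M of Eq. 2: the per-channel chi-square
    kernels are averaged over all k in K; histograms[k][j] is the k-bin
    histogram of image j."""
    ks = list(histograms)
    N = len(histograms[ks[0]])
    M = np.zeros((N, N))
    for k in ks:
        H = histograms[k]
        for i in range(N):
            for j in range(N):
                M[i, j] += chi2_kernel(H[i], H[j])
    return M / len(ks)
```

In a library such as scikit-learn, a matrix like this could then be passed to an SVM via SVC(kernel="precomputed"); the paper's own implementation is in MATLAB, so this pairing is only an analogous setup.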

3.5 Enriched Dictionary Learning with the Multiple-Channel Kernel and Random Sampling (ED-MRS)

To improve computational efficiency, we propose a randomized version of the ED-M method based on random sampling, called Enriched Dictionary Learning with the Multiple-Channel Kernel and Random Sampling (ED-MRS). Essentially, a random subset \(\mathcal {K}_{MRS}\subset \mathcal {K}\) is drawn before clustering. Our experiments demonstrate that ED-MRS achieves a significant speedup at the cost of a modest decrease in performance.

In the ED-MRS algorithm, \(\mathcal {K}_{MRS}\subset \mathcal {K}\) is obtained by random sampling and is then used, instead of \(\mathcal {K}\), as the set of k values for k-means++. The only difference between ED-MRS and ED-M (Algorithm 2) is therefore the use of the randomly sampled values \(k\in \mathcal {K}_{MRS}\); in particular, \(|\mathcal {K}_{MRS}|\) replaces \(|\mathcal {K}|\) in Eq. 2.
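The paper does not specify the sampling scheme for \(\mathcal {K}_{MRS}\), so the sketch below assumes uniform sampling without replacement, with the subset size m as a free parameter.

```python
import random

def sample_k_values(K, m, seed=None):
    """ED-MRS sketch (assumed scheme): draw a subset K_MRS of size m
    uniformly without replacement from {2, ..., K}; the rest of the
    pipeline is identical to ED-M, with |K_MRS| channels in Eq. 2."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(2, K + 1), m))
```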

3.6 Enriched Dictionary Learning with Concatenation (ED-C)

Instead of using the multi-channel kernel over different values of k, it is natural to also consider simple concatenation of features combined with the chi-square kernel. We therefore propose enriched dictionary learning with concatenation (ED-C), which combines the enriched information from clustering to learn the dictionary (see Algorithm 3 and Table 1).

As in ED-M, Algorithm 3 passes \(\mathbb {S}\) and \(\mathcal {K}\) to Algorithm 1, which computes \(\mathcal {H}\). With the histograms \(\mathcal {H}\), for each image j the set \(\left\{ h_{k,j}: k \in \mathcal {K} \right\} \) is concatenated to form one feature vector. With a single feature vector per image, the SVM is then trained using the chi-square kernel; a sketch of the concatenation step follows Algorithm 3.

[Algorithm 3: Enriched dictionary learning with concatenation (ED-C)]
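A minimal sketch of the ED-C concatenation step, reusing the histograms structure from the Algorithm 1 sketch (names are illustrative):

```python
import numpy as np

def concatenate_histograms(histograms):
    """ED-C sketch: for each image j, concatenate {h_{k,j} : k in K}
    into a single feature vector of length 2 + 3 + ... + K."""
    ks = sorted(histograms)        # all values of k used
    N = len(histograms[ks[0]])     # number of images
    return np.array([
        np.concatenate([histograms[k][j] for k in ks])
        for j in range(N)
    ])
```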

4 Experiments

4.1 Datasets and Performance Metrics

Our experiments are conducted on two medical datasets: the human epithelial type 2 cells dataset (SNPHEp-2 Cell Dataset) [9] and the pap-smear cells dataset [3]. One motivation for using medical datasets is that medical image classification is a very important problem with challenging domain-specific sub-problems. With fast-growing volumes of medical data and images collected by various systems and projects, medical data analysis has become a big data problem, and automated medical image classification is one way to address it. Cell image classification is an important type of medical image classification; recognizing cells is a challenging classification task because the resolution of cell images is usually very low and the patterns on the cells are simple, so few features can be extracted from them. We show that our methods perform well on these medical images. Performance is evaluated by accuracy, computed from the true negatives (tn), false negatives (fn), true positives (tp), false positives (fp), and the number of classes (l): \(accuracy = \left( \sum _{i=1}^{l} \left( tp_{i} + tn_{i}\right) / \left( tp_{i} + tn_{i} + fp_{i} + fn_{i} \right) \right) / l .\)
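For concreteness, a small Python sketch of this per-class accuracy computed from an l-by-l confusion matrix; the matrix-based formulation is our restatement of the formula above.

```python
import numpy as np

def per_class_accuracy(conf):
    """Average over classes of (tp_i + tn_i) / (tp_i + tn_i + fp_i + fn_i),
    where conf is an l-by-l confusion matrix (rows: true, cols: predicted)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    accs = []
    for i in range(conf.shape[0]):
        tp = conf[i, i]
        fp = conf[:, i].sum() - tp
        fn = conf[i, :].sum() - tp
        tn = total - tp - fp - fn
        accs.append((tp + tn) / total)  # per-class denominator equals the total count
    return float(np.mean(accs))
```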

4.2 The SNPHEp-2 Dataset

The anti-nuclear antibody test is a useful diagnostic method for autoimmune diseases, and the Indirect Immunofluorescence protocol using human epithelial type 2 (HEp-2) cells is the standard protocol for this test [6]. The SNPHEp-2 Cell Dataset contains 1,884 cell images of roughly 80-by-80 pixels, divided into 5 classes: Homogeneous, Coarse speckled, Fine speckled, Nucleolar, and Centromere. All 1,884 cell images were extracted from 40 specimen images, and these 40 specimens are equally split into 20 for training and 20 for testing. Because different specimens yield different numbers of cell images, the numbers of cell images for training (905) and testing (979) differ. Five validation folds for training and testing were created by randomly selecting images from the training (905 cell images) and test (979 cell images) sets; "Split-50-1" denotes the first of these five folds. As shown in Table 2, when the value of k is large, a single experiment takes almost 10 days, and since we run each method 20 times for each k for a fair comparison, a full 5-fold cross validation would take weeks to finish. Because of limited computational resources, only the first split (Split-50-1), which contains 450 images for training and 493 images for testing, is used in our experiments.

4.3 The PAP-Smear Dataset

The term pap-smear refers to human cell samples stained with the Papanicolaou method for ease of observation under a microscope [3]. The classification of pap-smear cells helps to detect cancer cells. The pap-smear cell dataset was collected by Jan Jantzen et al. in 2005 [3]. It consists of 917 cell images divided into 7 classes: superficial squamous epithelial, intermediate squamous epithelial, columnar epithelial, mild squamous non-keratinizing dysplasia, moderate squamous non-keratinizing dysplasia, severe squamous non-keratinizing dysplasia, and squamous cell carcinoma in situ intermediate. The whole dataset [3] is used in our experiments and is randomly divided into a training set (50\(\%\) of the images) and a test set (50\(\%\) of the images). The methods are compared in terms of classification accuracy and computational efficiency.

4.4 Comparison with State-of-the-Art Algorithms

In the experiments, BoF models using different dictionary learning methods (see Table 1) are compared on the two datasets. All experiments are conducted on an Intel Xeon E5-2690 CPU. We implement six BoF model variants in MATLAB: three using our methods (see Table 1) and three using k-means, k-means++, and the dictionary method in [11]. A seventh variant is the publicly available ScSPM implementation from [10], which is based on sparse coding. For a fair comparison, all variants use the SIFT descriptor as the feature descriptor. To obtain reliable results, we repeat the experiments for each variant 20 times on the same datasets and report the average accuracy over the 20 runs.

To clearly compare the accuracy at different values of k, Fig. 1 plots the six variants (three using our methods and three based on k-means, k-means++, or the dictionary method in [11]). In Fig. 1, as the value of k increases from 20 to 350, the accuracy of the BoF model using our ED-M is higher than that of the model using any other method. On the SNPHEp-2 Cell Dataset and the pap-smear dataset, the BoF model using our ED-M method outperforms the one using the k-means++ algorithm by nearly 12.4\(\%\) and 8.6\(\%\), respectively.

In Table 2, the highest accuracies of the six BoF model variants in Fig. 1 are compared with the accuracy of ScSPM. With our ED-M method for dictionary learning, the BoF model obtains a much higher accuracy than with any of the alternatives. The BoF model using our ED-MRS method, the faster version of ED-M, outperforms the models using all other methods except ED-M and ED-C.

Fig. 1. Comparison of accuracy among the six k-means-based BoF model variants; each variant is denoted by its dictionary learning method.

Table 2. Comparison on the two datasets. The highest accuracies of the six BoF model variants in Fig. 1, obtained with their respective optimal values of k, are compared with the accuracy of ScSPM.

The improvements of the proposed ED-M and ED-C over sparse coding and k-means come at a computational cost that makes our methods slower than traditional k-means dictionary learning. However, the proposed methods can be made more efficient with random sampling techniques such as our ED-MRS, whose lower computational cost can be seen in Table 2.

5 Conclusions

In this paper, a novel dictionary learning method is proposed to improve the accuracy of the BoF model for classifying images with simple patterns, such as biomedical images. The improvement is achieved by adding extra information from clustering with all reasonable values of k at the same time. Our experiments on medical imaging datasets demonstrate that the proposed dictionary learning method outperforms state-of-the-art methods such as k-means/k-means++ clustering and sparse coding.