1 Introduction

In recent years, sparse coding and representation have been widely studied to solve various computer vision and machine learning problems [21]. In [20], Wright et al. proposed a sparse representation based classification (SRC) method for face recognition under varying illumination. In this method, an input test image is represented as a sparse linear combination of the training images, and classification is performed by checking which class yields the least coding error. The SRC scheme has achieved great success in face recognition and has been widely studied in the community. It is widely believed that the l1-norm sparsity constraint on the coding coefficients plays a key role in the success of SRC [16]. However, Zhang et al. [25] argued that the success of SRC should be largely attributed to the collaborative representation of a test sample by the training samples across all classes. To address the shortage of training samples, they further proposed an effective collaborative representation based classifier (CRC) using l2-norm regularization. Moreover, the working mechanism of SRC was further analyzed in [19], and a very simple yet much more efficient face classification scheme, namely collaborative representation based classification with regularized least squares (CRC_RLS), was proposed [3]. Probability-based classifiers are a popular family of classifiers widely used in various visual recognition tasks, e.g., the Probabilistic Support Vector Machine (PSVM) [14], Probabilistic Principal Component Analysis (PPCA) [8, 17] and Probabilistic Linear Discriminant Analysis (PLDA) [15]. Motivated by the work on probabilistic subspace methods [9, 12, 13], Cai et al. analyzed the classification mechanism of CRC from a probabilistic viewpoint and proposed a Probabilistic Collaborative Representation based approach for pattern Classification (ProCRC) in [2], which jointly maximizes the likelihood that a test sample belongs to each of the multiple classes. The final classification is performed by checking which class has the maximum likelihood.

Traditional classification algorithms [4, 10], including those mentioned above, are designed to achieve the lowest recognition error and assume the same loss for different types of misclassifications. However, this assumption may not be suitable for many real-world applications. For example, misclassifying a gallery subject as an impostor and denying entrance to a room controlled by a face recognition system merely causes inconvenience, whereas misrecognizing an impostor as a legitimate user and allowing entrance may result in a serious loss.

Cost-sensitive learning usually co-exists with class imbalance in most applications, with the goal of minimizing the total misclassification cost [22]. Class imbalance has been considered one of the most challenging problems in machine learning and data mining. The imbalance ratio (the size of the majority class relative to the minority class) can be as large as 100, or even up to 10,000. Much work has been done to address the class imbalance problem. Because the cost of misclassifying the positive class as negative is higher than the opposite, cost-sensitive learning is an effective way to deal with imbalanced data classification. Published solutions to the class imbalance problem can be categorized into data-level and algorithm-level approaches. At the algorithm level, solutions try to adapt existing classifier learning algorithms to bias them towards the small class [7]. Lu et al. constructed a filtering deep convolutional network and obtained better results on marine organism classification than other methods [11].

In recent years, cost-sensitive learning has been studied widely and has become one of the most important approaches to the class imbalance problem. In [26], Zhou et al. empirically studied the effect of sampling and threshold-moving in training cost-sensitive neural networks, and revealed that threshold-moving and soft-ensemble are relatively good choices. In [18], Sun et al. proposed cost-sensitive boosting algorithms developed by introducing cost items into the learning framework of AdaBoost. In [6], Jiang et al. proposed a novel Minority Cloning Technique (MCT) for class-imbalanced cost-sensitive learning; MCT alters the class distribution of the training data by cloning each minority-class instance according to the similarity between it and the mode of the minority class. In [5], George proposed a new cost-sensitive metric to find the optimal trade-off between the two most critical performance measures of a classification task: accuracy and cost. Generally, users focus more on the minority class and consider the cost of misclassifying a minority-class sample to be higher. In our study, we adopt the same strategy to address this problem.

Motivated by the probabilistic collaborative representation based approach for pattern classification [2], we propose a new method, called Cost-Sensitive Collaborative Representation based Classification (CSCRC) via probability estimation, to handle the misclassification cost and the class-imbalance problem. In Zhang’s cost-sensitive learning framework, the posterior probabilities of a testing sample are estimated by the KLR or KNN method. ProCRC [2] is designed to achieve the lowest recognition error and assumes the same loss for different types of misclassifications, so it can hardly resolve the class imbalance problem. We therefore introduce the cost-sensitive learning framework into ProCRC, which not only derives the relationship between the Gaussian function and collaborative representation but also resolves the cost-sensitive problem. First, we use the probabilistic collaborative representation framework to estimate the posterior probabilities: the probabilities are generated directly from the coding coefficients by using a Gaussian function and applying the logarithmic operator to the probabilistic collaborative representation framework, which clearly explains the l2-norm regularized representation scheme used in CRC. Second, all misclassification losses are calculated using Zhang’s cost-sensitive learning framework. Finally, the test sample is assigned to the class whose loss is minimal. Experimental results on UCI databases validate the effectiveness and efficiency of our method.

The rest of this paper is organized as follows. Section 2 outlines the details of the relevant methods. Section 3 presents the details of the proposed algorithm. Section 4 reports the experiments. Finally, Section 5 concludes the paper and offers suggestions for future research.

2 Related work

2.1 Cost-sensitive learning

In multiclass cost-sensitive learning, consider c gallery subjects with class labels G = {G i }, i = 1, 2, ⋯, c, and let I denote the label of the impostor class. In [23], Zhou et al. categorized the costs into three types: the cost of accepting an impostor who should be rejected is C IG ; the cost of rejecting a gallery subject who should be accepted is C GI ; the cost of misidentifying one gallery subject as another is C GG . Cost-sensitive learning usually takes the misclassification cost as the objective function and identifies the label by minimizing the loss function. Given a test sample y and its predicted class label ϕ(y), the label is obtained by minimizing the objective function:

$$ L(y)=\underset{\phi (y)\in \left\{{G}_1,\cdots, {G}_c,I\right\}}{\arg\ \min }\ loss\left(y,\phi (y)\right) $$
(1)

where

$$ loss\left(y,\phi (y)\right)=\left\{\begin{array}{ll}\sum \limits_{i=1}^{c}P\left({G}_i|y\right){C}_{GI}, & \mathrm{if}\ \phi (y)=I\\ {}\sum \limits_{i=1,\,i\ne \tau}^{c}P\left({G}_i|y\right){C}_{GG}+P\left(I|y\right){C}_{IG}, & \mathrm{if}\ \phi (y)={G}_{\tau}\end{array}\right. $$
(2)

where L(y) is the optimal (minimum-cost) prediction of y, and c is the number of gallery subjects in the classification problem.
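As a concrete illustration (the code below is not from [23]; the function and variable names are ours), the following Python sketch evaluates the expected losses of Eq. (2) for every candidate label and returns the minimizer of Eq. (1), assuming the posterior probabilities P(G i |y) and P(I|y) have already been estimated:

```python
import numpy as np

def cost_sensitive_label(post_gallery, post_impostor, C_GI, C_IG, C_GG):
    """Minimize the expected misclassification cost of Eqs. (1)-(2).

    post_gallery  -- array of shape (c,), posteriors P(G_i | y)
    post_impostor -- scalar posterior P(I | y)
    Returns -1 for the impostor decision, otherwise the gallery class index.
    """
    c = len(post_gallery)
    # Loss of predicting "impostor": every gallery posterior is falsely rejected.
    losses = [np.sum(post_gallery) * C_GI]
    # Loss of predicting gallery class tau: the other gallery classes are
    # misidentified and the impostor posterior is falsely accepted.
    for tau in range(c):
        losses.append((np.sum(post_gallery) - post_gallery[tau]) * C_GG
                      + post_impostor * C_IG)
    best = int(np.argmin(losses))
    return -1 if best == 0 else best - 1

# Toy usage with three gallery classes and a high false-acceptance cost C_IG.
print(cost_sensitive_label(np.array([0.05, 0.90, 0.03]), 0.02,
                           C_GI=1.0, C_IG=10.0, C_GG=1.0))   # prints 1
```

Note how a large C_IG pushes the decision towards rejection unless one gallery posterior clearly dominates.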

2.2 Collaborative representation based classification (CRC_RLS)

Suppose that we have K classes of subjects, X = [X 1, X 2, ⋯, X K ], and each class has enough training samples. For a query sample y, CRC codes it collaboratively over the dictionary of all training samples X with an l2-norm regularization on the coding vector. We can write y = x + e, where x = Xα is the component we want to recover from y for classification and e is the residual (e.g., noise, occlusion or corruption) we want to remove from y; that is, y = Xα + e. To recover a stable coding coefficient vector \( \widehat{\alpha} \) from y and X, regularization is the natural choice. If we assume that the model error e follows a Gaussian distribution, the optimization problem can be written as follows:

$$ \widehat{\alpha}=\arg {\min}_{\alpha}\left\{{\left\Vert y-X\alpha \right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_2^2\right\} $$
(3)

where λ is the regularization parameter. The solution of collaborative representation with regularized least square in Eq. (3) can be easily and analytically derived as:

$$ \widehat{\alpha}={\left({X}^TX+\lambda \cdot I\right)}^{-1}{X}^Ty $$
(4)

Let P = (X T X + λ ⋅ I)−1 X T; then \( \widehat{\alpha}= Py \). Clearly, P is independent of y, so it can be pre-computed as a projection matrix. For a query sample y, we simply project it onto P via Py. In addition to the class-specific representation residual \( {\left\Vert y-{X}_i{\widehat{\alpha}}_i\right\Vert}_2 \), where \( {\widehat{\alpha}}_i \) is the coding vector associated with class i, the l 2-norm “sparsity” \( {\left\Vert {\widehat{\alpha}}_i\right\Vert}_2 \) also carries some discriminative information, so CRC_RLS uses both of them in the decision making. The residual r i (y) is computed as:

$$ {r}_i(y)={\left\Vert y-{X}_i\cdot {\widehat{\alpha}}_i\right\Vert}_2/{\left\Vert {\widehat{\alpha}}_i\right\Vert}_2 $$
(5)

The query sample y is identified by minimizing r i (y) as follows:

$$ L(y)=\arg \underset{i}{\min }\ \left\{{r}_i(y)\right\} $$
(6)
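For concreteness, a minimal NumPy sketch of the CRC_RLS pipeline is given below (the function names are ours, not from [3]): the projection matrix P of Eq. (4) is precomputed offline, and a query is classified by the regularized residual of Eqs. (5)–(6).

```python
import numpy as np

def crc_rls_train(X, lam):
    """Precompute the projection matrix P = (X^T X + lam*I)^(-1) X^T of Eq. (4).

    X   -- (m, n) dictionary whose columns are the training samples
    lam -- regularization parameter lambda
    """
    n = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n), X.T)      # shape (n, m)

def crc_rls_classify(y, X, labels, P):
    """Return the label that minimizes the regularized residual of Eqs. (5)-(6)."""
    alpha = P @ y                                               # coding vector, Eq. (4)
    best_label, best_r = None, np.inf
    for c in np.unique(labels):
        idx = (labels == c)
        alpha_c = alpha[idx]
        r = np.linalg.norm(y - X[:, idx] @ alpha_c) / np.linalg.norm(alpha_c)
        if r < best_r:
            best_label, best_r = c, r
    return best_label

# Example usage: X of shape (m, n), labels of shape (n,), query y of shape (m,).
# P = crc_rls_train(X, lam=0.01); print(crc_rls_classify(y, X, labels, P))
```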

3 Proposed approach

Different data points x have different probabilities that l(x) ∈ l X , where l(x) denotes the label of x and l X denotes the label set of all candidate classes in X; P(l(x) ∈ l X ) should be higher if the l2-norm of α is smaller, and vice versa. One intuitive choice is to use a Gaussian function to define such a probability:

$$ P\left(l(x)\in {l}_X\right)\propto \exp \left(-c{\left\Vert \alpha \right\Vert}_2^2\right) $$
(7)

where c is a constant and data points are assigned different probabilities based on α; all such data points lie inside the subspace spanned by the samples in X. For a sample y outside the subspace, the probability is defined as:

$$ P\left(l(y)\in {l}_X\right)=P\left(l(y)=l(x)\left|l(x)\in {l}_X\right.\right)P\left(l(x)\in {l}_X\right) $$
(8)

P(l(x) ∈ l X ) has been defined in Eq. (7), and P(l(y) = l(x)|l(x) ∈ l X ) can be measured by the similarity between x and y. Here we adopt a Gaussian kernel to define it:

$$ P\left(l(y)=l(x)\left|l(x)\in {l}_X\right.\right)\propto \exp \left(-k{\left\Vert y-x\right\Vert}_2^2\right) $$
(9)

where k is a constant. Combining Eqs. (7)–(9), we have

$$ P\left(l(y)\in {l}_X\right)\propto \exp \left(-\left(k{\left\Vert y- X\alpha \right\Vert}_2^2+c{\left\Vert \alpha \right\Vert}_2^2\right)\right) $$
(10)

To maximize this probability, we apply the logarithmic operator to Eq. (10):

$$ {\displaystyle \begin{array}{ll}\max_{\alpha } P\left(l(y)\in {l}_X\right)&=\max_{\alpha } \ln P\left(l(y)\in {l}_X\right)\\ {}&={\min}_{\alpha }\ k{\left\Vert y-X\alpha \right\Vert}_2^2+c{\left\Vert \alpha \right\Vert}_2^2\\ {}&={\min}_{\alpha }\ {\left\Vert y-X\alpha \right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_2^2\end{array}} $$
(11)

where λ = c/k. Interestingly, Eq. (11) shares the same formulation as the representation model of CRC [19], but it now has a clear probabilistic interpretation.

A sample x inside the subspace can be collaboratively represented as \( x= X\alpha ={\sum}_{k=1}^K{X}_k{\alpha}_k \), where α = [α 1; α 2; ⋯; α K ] and α k is the coding sub-vector associated with X k . Note that x k  = X k α k is a data point falling into the subspace of class k. Then, we have

$$ P\left(l(x)=k\left|l(x)\in {l}_X\right.\right)\propto \exp \left(-\delta {\left\Vert x-{X}_k{\alpha}_k\right\Vert}_2^2\right) $$
(12)

where δ is a constant. For a query sample y, we can compute the probability that l(y) = k as:

$$ {\displaystyle \begin{array}{l}P\left(l(y)=k\right)\\ {}=P\left(l(y)=l(x)\left|l(x)=k\right.\right)\cdot P\left(l(x)=k\right)\\ {}=P\left(l(y)=l(x)\left|l(x)=k\right.\right)\cdot P\left(l(x)=k\left|l(x)\in {l}_X\right.\right)\cdot P\left(l(x)\in {l}_X\right)\end{array}} $$
(13)

Since the probability definition in Eq. (9) is independent of k as long as k ∈ l X , we have P(l(y) = l(x)|l(x) = k) = P(l(y) = l(x)|l(x) ∈ l X ). Combining this with Eqs. (10) and (12), we have

$$ {\displaystyle \begin{array}{ll}P\left(l(y)=k\right)&=P\left(l(y)\in {l}_X\right)\cdot P\left(l(x)=k\left|l(x)\in {l}_X\right.\right)\\ {}&\propto \exp \left(-\left({\left\Vert y-X\alpha \right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_2^2+\gamma {\left\Vert X\alpha -{X}_k{\alpha}_k\right\Vert}_2^2\right)\right)\end{array}} $$
(14)

where γ = δ/k. Applying the logarithmic operator to Eq. (14) and ignoring the constant term, we have:

$$ \widehat{\alpha}=\arg {\min}_{\alpha}\left\{{\left\Vert y-X\alpha \right\Vert}_2^2+\lambda {\left\Vert \alpha \right\Vert}_2^2+\gamma {\left\Vert X\alpha -{X}_k{\alpha}_k\right\Vert}_2^2\right\} $$
(15)

Referring to Eq. (15), let \( {X}_k^{\prime } \) be a matrix of the same size as X in which only the samples of X k are kept at their corresponding locations, i.e., \( {X}_k^{\prime }=\left[0,\cdots, {X}_k,\cdots, 0\right] \). Let \( {\overline{X}}_k^{\prime }=X-{X}_k^{\prime } \), so that \( {\overline{X}}_k^{\prime}\alpha =X\alpha -{X}_k{\alpha}_k \). We can then compute the following projection matrix offline:

$$ T={\left({X}^TX+\gamma {\left({\overline{X}}_k^{\prime}\right)}^T{\overline{X}}_k^{\prime }+\lambda I\right)}^{-1}{X}^T $$
(16)

where I denotes the identity matrix. Then, \( \widehat{\alpha}= Ty \).

With the model in Eq. (15), a solution vector \( \widehat{\alpha} \) is obtained. The probability P(l(y) = k) can be computed by:

$$ P\left(l(y)=k\right)\propto \exp \left(-\left({\left\Vert y-X\widehat{\alpha}\right\Vert}_2^2+\lambda {\left\Vert \widehat{\alpha}\right\Vert}_2^2+\gamma {\left\Vert X\widehat{\alpha}-{X}_k{\widehat{\alpha}}_k\right\Vert}_2^2\right)\right) $$
(17)

Note that \( \left({\left\Vert y-X\widehat{\alpha}\right\Vert}_2^2+\lambda {\left\Vert \widehat{\alpha}\right\Vert}_2^2\right) \) is the same for all classes, and thus we can omit it in computing P(l(y) = k). Then we have:

$$ {P}_k=\exp \left(-\gamma {\left\Vert X\widehat{\alpha}-{X}_k{\widehat{\alpha}}_k\right\Vert}_2^2\right) $$
(18)
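A sketch of this probability computation is given below, assuming (as Eqs. (15)–(17) are written) that the projection matrix T of Eq. (16) is built separately for each class k; the function name and the default value of gamma are illustrative only.

```python
import numpy as np

def class_probabilities(y, X, labels, lam, gamma=1.0):
    """Unnormalized class probabilities P_k of Eq. (18).

    X      -- (m, n) dictionary whose columns are the training samples
    labels -- (n,) class label of each column of X
    lam    -- lambda of Eq. (15); gamma -- gamma of Eq. (15)
    """
    n = X.shape[1]
    probs = {}
    for k in np.unique(labels):
        mask = (labels == k)
        X_bar = X.copy()
        X_bar[:, mask] = 0.0                           # bar{X}'_k = X - X'_k
        A = X.T @ X + gamma * (X_bar.T @ X_bar) + lam * np.eye(n)
        alpha = np.linalg.solve(A, X.T @ y)            # alpha_hat = T y, Eq. (16)
        alpha_k = np.where(mask, alpha, 0.0)           # keep only the class-k entries
        # Eq. (18): distance between the full reconstruction and the class-k part.
        probs[k] = np.exp(-gamma * np.linalg.norm(X @ alpha - X @ alpha_k) ** 2)
    return probs
```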

In cost-sensitive learning, the loss function (Eq. (2)) serves as the objective function for identifying the label of a test sample. In a binary classification problem there are two misclassification costs: we denote the cost of misclassifying the positive class as negative by C 10, and the converse by C 01. A cost matrix can then be constructed as shown in Table 1, where G 1 and G 0 represent the labels of the minority class and the majority class, respectively.

Table 1 The cost matrix

It is well known that the loss function can be related to the posterior probability, P(ϕ(y)|y) ≈ P(l(y) = k). The loss function can then be rewritten as follows:

$$ loss\left(y,\phi (y)\right)=\left\{\begin{array}{ll}{P}_{G_1}{C}_{10}, & \mathrm{if}\ \phi (y)={G}_0\\ {}{P}_{G_0}{C}_{01}, & \mathrm{if}\ \phi (y)={G}_1\end{array}\right. $$
(19)

The test sample y is assigned to the class with the smaller loss; that is, we obtain the label of the test sample y by minimizing Eq. (19):

$$ L(y)=\arg\ \underset{\phi (y)\in \left\{{G}_0,{G}_1\right\}}{\min }\ loss\left(y,\phi (y)\right) $$
(20)

The whole process of CSCRC is described in Algorithm 1.

Algorithm 1 Cost-Sensitive Collaborative Representation based Classification (CSCRC)
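The following sketch, built on the class_probabilities() function shown earlier (names are ours, not from the paper), implements the binary decision rule of Eqs. (19)–(20) used in Algorithm 1:

```python
def cscrc_predict(y, X, labels, lam, C10, C01, positive=1, negative=0, gamma=1.0):
    """Binary CSCRC decision of Eqs. (19)-(20).

    C10 -- cost of misclassifying the positive (minority) class as negative
    C01 -- cost of misclassifying the negative (majority) class as positive
    """
    P = class_probabilities(y, X, labels, lam, gamma)
    loss_pred_negative = P[positive] * C10   # loss of predicting G_0
    loss_pred_positive = P[negative] * C01   # loss of predicting G_1
    return negative if loss_pred_negative < loss_pred_positive else positive
```

With C10 much larger than C01, the rule only predicts the majority class when the evidence for the minority class is weak, which is exactly the cost-sensitive behaviour sought here.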

4 Experiments

4.1 Data sets and experimental setting

We tested the proposed method on 10 UCI data sets [1]. Detailed information about these data sets is summarized in Table 2.

Table 2 Description of data sets

In cost-sensitive learning, false positives (actual negative but predicted positive, denoted FP), false negatives (actual positive but predicted negative, FN), true positives (actual positive and predicted positive, TP) and true negatives (actual negative and predicted negative, TN) can be arranged in a confusion matrix, as shown in Table 3:

Table 3 Confusion matrix

For binary classification problems, four kinds of misclassification cost are needed, referred to as CTP, CFP, CTN and CFN. CTP and CTN are the costs of true positives (TP) and true negatives (TN); to simplify the cost matrix, we set CTP = 0 and CTN = 0. CFN and CFP are the costs of false negatives (FN) and false positives (FP). We always assume that the cost of misclassifying positive-class instances is much higher than that of misclassifying negative-class instances, so we set CFN >> CFP. In this paper, CFP is set to a unit cost of 1 and CFN is set to 10. For the class imbalance experiment, the imbalance ratio is set to 1, 2, …, 10, respectively. Each experiment was repeated 50 times and the average results are reported. Four evaluation criteria are adopted to evaluate the classification performance: Average Cost (AC), F-measure, G-mean and classification Accuracy. They are defined as follows [24]:

$$ {\displaystyle \begin{array}{c} Recall={Acc}_{+}=\frac{TP}{TP+ FN}\\ {}{Acc}_{-}=\frac{TN}{TN+ FP}\\ {} Accuracy=\frac{TP+ TN}{TP+ FN+ TN+ FP}\\ {} Precision=\frac{TP}{TP+ FP}\\ {}G\hbox{-} mean=\sqrt{{Acc}_{+}\times {Acc}_{-}}\\ {}F\hbox{-} measure=\frac{2\times Precision\times Recall}{Precision+ Recall}\\ {} AC=\frac{C_{10}\, FN+{C}_{01}\, FP}{N}\end{array}} $$

where N is the total number of test samples.
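For clarity, a short helper computing these criteria from a binary confusion matrix is sketched below (CFN and CFP correspond to C 10 and C 01 of Table 1; the function name is ours):

```python
def evaluation_metrics(TP, FN, TN, FP, CFN=10.0, CFP=1.0):
    """Compute Accuracy, F-measure, G-mean and Average Cost from a confusion matrix."""
    recall = acc_pos = TP / (TP + FN)          # Recall = Acc_+
    acc_neg = TN / (TN + FP)                   # Acc_-
    precision = TP / (TP + FP)
    n = TP + FN + TN + FP
    accuracy = (TP + TN) / n
    g_mean = (acc_pos * acc_neg) ** 0.5
    f_measure = 2 * precision * recall / (precision + recall)
    average_cost = (CFN * FN + CFP * FP) / n   # cost-weighted errors per test sample
    return {"Accuracy": accuracy, "F-measure": f_measure,
            "G-mean": g_mean, "AC": average_cost}

# Example usage:
# print(evaluation_metrics(TP=50, FN=11, TN=58, FP=3))
```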

The experiments were performed in Matlab 2014a on a computer with a 2.6 GHz Intel Xeon CPU.

4.2 Experimental results not considering the imbalance

The main idea of collaborative representation is to represent the query sample with the training samples; here we consider a binary classification problem. Figure 1a and b show the coding coefficients of a positive query sample and a negative query sample. It is easy to see that each query sample is mainly related to the training samples from its own class, from which the class label of the query sample can be observed.

Fig. 1
figure 1

The coding coefficients of a positive query sample and a negative query sample

We compared the performance of five methods (SRC, CRC, SVM, ProCRC and CSCRC) on 10 UCI data sets, and the results are summarized in Tables 4, 5 and 6. The last row of Tables 4 and 5 reports the average Accuracy and F-measure of each method over the ten data sets. From Haberman, Housing, Ionosphere and Balance we randomly selected 31 positive and 31 negative samples as test samples and 41 positive and 41 negative samples as training samples; from the other six data sets we selected 61 positive and 61 negative samples as test samples and 101 positive and 101 negative samples as training samples. The cost ratio (the cost of false acceptance relative to false rejection) was set to 10. The process was repeated 50 times and the average results are reported.

Table 4 The Classification Accuracy for the five methods on 10 data sets (where bold entries are the methods with highest classification accuracy on each data set)
Table 5 The F-measure for the five methods on 10 data sets (where bold entries are the methods with highest F-measure on each data set)
Table 6 The Average Cost for the five methods on 10 data sets  (where bold entries are the methods with lowest  average cost on each data set)

On Letter, Balance, Abalone, Car, Nursery, Cmc and Haberman, our method achieved very high Accuracy and F-measure compared with the other four methods. On the remaining three data sets our method does not always obtain the highest Accuracy and F-measure, but it attains the highest average Accuracy and F-measure, both above 0.93. In other words, our method performs better than SRC, CRC, SVM and ProCRC.

We calculated the misclassification cost of the five methods on the 10 UCI data sets and summarize the results in Table 6. On Letter, Balance, Abalone, Car, Pima, Nursery, Cmc and Haberman, our method achieves a very low average misclassification cost. In Tables 4 and 5, SRC obtains the highest Accuracy and F-measure on Pima, yet CSCRC obtains the lowest Average Cost on Pima, which indicates that CSCRC classifies the positive samples more reliably. Furthermore, although the Accuracy and F-measure of CSCRC are lower than those of CRC on Housing and Ionosphere, the comparison of Average Cost is the reverse.

4.3 Experimental results considering the imbalance

Similarly, we compared the performance of the five methods (SRC, CRC, SVM, ProCRC and CSCRC) on Letter, and evaluated them via F-measure, G-mean and Average Cost for the class-imbalance problem. In this experiment, we set the imbalance ratio to 1, 2, …, 10, respectively. In the training set, the size of the minority class is 30 and the size of the majority class is 30 multiplied by the imbalance ratio. We selected 61 positive and 61 negative samples as the test set. The costs were set as described in Section 4.1.

Note that there are also situations in which CSCRC is preferred. From the results in Figs. 2, 3 and 4, we can see that CSCRC achieves a higher F-measure and G-mean than the other four methods except when the imbalance ratio is 1. Meanwhile, CSCRC achieves the lowest Average Cost among all methods, which suggests that CSCRC can focus on the more useful data. As the imbalance ratio increases, more training samples become available, and the proposed method still classifies the samples correctly when the imbalance ratio is up to 4. Generally speaking, the proposed method is robust to class imbalance; concretely, CSCRC is not strongly influenced by the distribution of samples, and it still obtains good classification results when the imbalance ratio is high.

Fig. 2
figure 2

The result of F-measure on Letter with different imbalance ratios

Fig. 3
figure 3

The result of G-mean on Letter with different imbalance ratios

Fig. 4
figure 4

The result of Average Cost on Letter with different imbalance ratios

4.4 Experimental results on face recognition

In this section, we selected two subjects from the YaleB dataset. We compared the performance of four methods (SRC, CRC, ProCRC and CSCRC) and evaluated them via the average misclassification cost and classification accuracy for the cost-sensitive problem. In this experiment, the training set consists of 10, 20, …, 100 images, and the remaining images of the two subjects are used as test samples. The cost of misclassifying a negative-class sample as a positive-class sample is set to 10, and the opposite cost to 1. Each subset was run 100 times, and the results are as follows:

The results are summarized in Figs. 5 and 6. As the training set grows, the average misclassification cost decreases and the classification accuracy increases for all four methods. Although the classification accuracy of CSCRC is lower than that of some of the other methods, it obtains the lowest misclassification cost. Our method pursues the lowest misclassification cost and takes it as the objective function, whereas traditional methods pursue the highest classification accuracy, which is unsuitable for the cost-sensitive problem. By combining CRC with cost-sensitive learning, CSCRC can deal well with the cost-sensitive and class-imbalance problems.

Fig. 5
figure 5

The average misclassification cost on YaleB

Fig. 6
figure 6

The classification accuracy on YaleB

5 Conclusions

In this paper, we proposed a novel method, Cost-Sensitive Collaborative Representation based Classification via probability estimation, to handle the misclassification cost and class imbalance problems simultaneously. The proposed approach adopts a probabilistic model and the representation coefficients to estimate the posterior probabilities, and then obtains the label of a testing sample by minimizing the misclassification loss. The experimental results show that the proposed CSCRC achieves a comparable or even lower average cost with higher accuracy compared with the other four classification algorithms.

6 Acknowledgements

The authors want to thank the anonymous reviewers and the associate editor for helpful comments and suggestions. This work is supported by the National Natural Science Foundation of China (Grant Nos. 61562013, 61320106008), Guangxi Colleges and Universities Key Laboratory of Intelligent Processing of Computer Images and Graphics (Grant No. LD16096x), the Center for Collaborative Innovation in the Technology of IOT and the Industrialization (Grant No. WLW20060610), Innovation Project of GUET Graduate Education, the study abroad program for graduate student of Guilin University of Electronic Technology. The authors declare that they have no conflict of interest.