Abstract
There has been growing interest in reducing the test-time complexity of multi-class classification problems with large numbers of classes. The key idea is to reduce the number of classifier evaluations used to predict labels. State-of-the-art methods usually employ the label-tree approach, which suffers from the well-known error propagation problem and is difficult to parallelize for further speedup. We propose another practical approach with the same goal of using a small number of classifiers to achieve a good trade-off between testing efficiency and classification accuracy. The proposed method analyzes the correlation among classes, suppresses redundancy, and generates a small number of classifiers that best approximate the prediction scores of the original large number of classes. Unlike label-tree methods, in which each test example follows a different traversal path from the root to a leaf node and thus uses a different set of classifiers each time, the proposed method applies the same set of classifiers to all test examples. As a result, it is much more efficient in practice, even when using the same number of classifier evaluations as the label-tree methods. Experiments on several large datasets, including ILSVRC2010-1K, SUN-397, and Caltech-256, show the efficiency of our method.
1 Introduction
Multi-class classification, the problem of classifying one example into one of a predefined set of classes, is one of the fundamental problems of computer vision. The availability of large-scale datasets such as ImageNet [9], SUN [37], and Caltech-256 [17], which have many training and testing examples and many classes, has posed significant computational challenges.
One of the challenges that has attracted growing attention is how to discriminate among a large number of classes. The test-time complexity grows linearly with the number of classes when using the standard one-versus-all (OvA) approach [1, 31], which is prohibitive for large-scale datasets used in practical applications. The key idea to solve this problem is to reduce the number of classifiers evaluated for each testing example.
Error Correcting Output Codes (ECOC) based approaches [2, 8, 12–14, 30, 39, 40] combine multiple binary classifiers to solve the multi-class classification problem. Given a testing example, a set of bit predictors is applied to obtain a code, and the second stage assigns the class whose codeword is closest to that code. The computational complexity of ECOC is linear in the number of binary classifier evaluations (i.e., the code length). In the case of a large number of classes, learning an efficient coding matrix is challenging and problem-dependent. Furthermore, a good coding matrix does not ensure good classification [8].
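To make the ECOC prediction pipeline concrete, the following is a minimal sketch of Hamming-style decoding. The function and variable names (ecoc_predict, bit_predictors, codewords) are illustrative assumptions and do not come from any of the cited libraries.

```python
import numpy as np

def ecoc_predict(x, bit_predictors, codewords):
    """Apply the L bit predictors to example x, then assign the class
    whose codeword (row of the C x L matrix `codewords`, entries +/-1)
    is closest to the resulting code in Hamming distance."""
    code = np.array([h(x) for h in bit_predictors])   # one bit per predictor
    distances = np.sum(codewords != code, axis=1)     # Hamming distance per class
    return int(np.argmin(distances))
```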
The tree-based approaches [4–7, 10, 16, 18, 24, 25, 27, 33, 36, 38] use a hierarchical label tree to organize the predefined set of classes. During testing, an example is classified by traversing the tree from the root node to a leaf node. Since the number of classifiers at each node is much smaller than the number of original classifiers in OvA methods, in the ideal case, label-tree methods achieve sub-linear complexity. To achieve high accuracy, methods have been proposed that optimize the overall tree loss by building the tree structure and learning the classifiers at the nodes. Although these methods are considered the state of the art in large-scale image classification [4, 10, 16, 24, 33], they still have drawbacks: (i) the error propagation problem, in which errors made at an internal node are propagated through the tree and yield misclassification, and (ii) difficulty in parallelization for further speedup, because the set of classifiers used to evaluate a testing example is not known in advance.
In this paper, we propose a novel method for solving multi-class classification problems with large numbers of classes that does not use a tree structure. To achieve a good trade-off between testing efficiency and classification accuracy, in the first stage, our method analyzes the correlation among classes, suppresses redundancy, and generates a small number of fixed classifiers (pseudo-classifiers) that best approximate the prediction scores of the original large number of classes. Because the approximated scores contain errors, using them directly for prediction does not guarantee good classification accuracy. In the second stage, a verification process handles this situation: a set of candidate labels is selected using the top k scores, and the k corresponding OvA classifiers are applied to recompute the scores for the final decision.
Our contribution is two-fold:
-
We propose a novel framework for the multi-class classification problem in which accuracy and speed are easy to balance. Approximated scores can be computed extremely fast using a small number of pseudo-classifiers and can be further sped up with parallel computing. The verification stage requires only a few OvA classifier evaluations to significantly improve classification accuracy. Our method is practical because it is very fast, requires little memory, is easy to implement, and has only one parameter (the number of pseudo-classifiers) to tune for the balance.
-
We conducted comprehensive experiments showing that the proposed method achieves state-of-the-art performance and yet is much more efficient in terms of actual testing time than existing methods.
2 Related Work
Computational efficiency is one of the most important considerations in large-scale image classification, in which the number of classes is also large. Standard methods such as one-versus-all [31] and the DAG (directed acyclic graph) approach [29] treat the label space as flat; therefore, their test-time complexity is linearly proportional to the number of classes, which is prohibitive in practical applications.
2.1 Label Tree Approach
The label-tree approach is a popular approach to reduce the time complexity of testing to a sub-linear value. It works by creating a hierarchical structure in the label space.
Label-tree methods [4, 10, 16, 24, 33] involve two issues: (i) learning the tree structure and (ii) learning the tree parameters, i.e., the classifier weights for each internal node and the labels of the leaf nodes. The label embedding tree proposed by S. Bengio et al. [4] learns the tree structure by applying spectral clustering to a confusion matrix generated by OvA classifiers in order to split the classes into disjoint subsets. The node classifiers are then learned jointly by optimizing the overall tree loss. However, for large numbers of classes, this method suffers from several drawbacks: (i) learning OvA classifiers in order to generate the confusion matrix is costly, (ii) splitting the classes into disjoint subsets is difficult because the assumption of separability of classes usually fails, and (iii) it might generate an unbalanced tree that leads to a sub-optimal testing time.
In the fast and balanced tree proposed by J. Deng et al. [10], these drawbacks are avoided by performing the splitting and learning processes jointly and by allowing overlaps among the subsets of child nodes. The relaxed hierarchy proposed by T. Gao and D. Koller [16] is an alternative solution based on max-margin optimization, in which a subset of confusing classes is allowed to be ignored at each node. This method shares the same idea as that of M. Marszałek and C. Schmid [27] but improves on it significantly. Recently, Liu et al. [24] proposed the probabilistic label tree, which outperforms existing methods. The key idea is to define the label tree as a probabilistic model and use maximum likelihood optimization to learn the parameters.
2.2 ECOC Approach
ECOC-based methods [2, 8, 12, 14, 30] mainly involve designing an optimal coding matrix, which requires a small number of bits for efficiency, good row and column separation for robustness, and highly accurate bit predictors. The sparse random codes and random codes described in [2, 12] require a large number of bit predictors (\(15\log C\) and \(10\log C\), respectively, where C is the number of classes) to achieve reasonable accuracy. However, as shown in [31], the accuracy of these methods is worse than that of the OvA approach. Spectral ECOC [39] is based on a spectral decomposition of the normalized Laplacian of the similarity graph of the classes; the resulting eigenvectors are used to define partitions. Because it uses one-versus-one (OvO) classifiers to generate the similarity matrix, it is not scalable to classification problems with a large number of classes. Sparse Output Coding (SpOC) [40] is a more recent encoding and decoding scheme that learns the coding matrix and bit predictors separately while still maintaining a good balance between error-correcting ability and bit prediction accuracy. However, it uses a predefined class taxonomy to build a semantic relatedness matrix for both stages, and it is unclear how the method performs when this prior knowledge is unavailable.
2.3 Other Complementary Approaches
Other approaches have been proposed for large-scale image classification [3, 19, 23, 26, 32]. Most of them adopt the OvA approach because of its competitive performance and its easy parallelization across multiple cores or machines. For example, a method for fast feature extraction and SVM learning for OvA classifiers is described in [23]. The studies described in [1, 3, 19, 26] aim to achieve better accuracy than OvA methods by simultaneously learning characteristics shared among the classes and minimizing classification loss using trace-norm regularization. Sparselets, introduced in [32], is another approach that learns a shared intermediate representation for multi-class object detection with deformable part models using sparse reconstruction of object models. The main contributions of the studies described above are scalable learning methods for large datasets with good generalization ability; hence, they are complementary to our proposed method.
Another approach [21] is based on attribute-based learning. The idea is to analyze visual correlations between classes instead of relying on manual human effort to design attribute classifiers and the relations between classes and attributes for classification.
Recently, methods for speeding up the evaluation of Convolutional Neural Networks (CNNs) [11, 20] using low-rank approximation have been proposed. However, they focus mainly on the convolutional layers, i.e., the lower layers used for feature extraction, rather than on the fully connected layers, and the approximations are performed after the network has been fully trained.
3 Preliminaries
Suppose that N images whose feature vectors are \(v_i, i=1,\cdots ,N\), are to be classified into C classes \(c_j, j=1,\cdots ,C\). We will use \(v_i\) to denote a feature vector or an image interchangeably. We want to generate an N by C response matrix R (see Footnote 1), where \(r_{(i,j)}\) corresponds to the “response” of the i-th image for the j-th class. The response can be a binary value, \(r_{(i,j)}=+1\) if \(v_i\) belongs to \(c_j\) (\(-1\) otherwise), or it can be a score, \(r_{(i,j)}\ge 0\) if \(v_i\) belongs to \(c_j\) (\(< 0\) otherwise).
There are a number of ways to train multi-class classifiers; however, one standard way to obtain the responses for a given set of images is to train C classifiers, \(f_j(\cdot )\), \(j=1,\cdots ,C\), based on the OvA strategy [31], and then to obtain the responses, \(r_{(i,j)} = f_j(v_i)\). For multi-class classification (i.e., one-out-of-C classes classification), \(v_i\) can be classified into the class \(c_j\) whose \(f_j(v_i)\) is maximum. For multiple binary classification results, \(v_i\) can be classified into classes whose responses are positive.
Now let us consider a smaller number of classifiers, namely, L classifiers \(g_k(v)\), \(k=1,\cdots ,L\), where \(L \ll C\). Let us assume that f can be sufficiently well approximated as \(f(v) \approx f'(g_1(v), g_2(v), \cdots , g_L(v))\). If the cost of evaluating \(f'\) is significantly lower than that of \(g_k\), and the cost of \(f_j\) is almost the same as that of \(g_k\), we can expect the above approximation to yield significant cost reductions.
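For illustration, a minimal NumPy sketch of this cost comparison, assuming linear classifiers; the matrices F, G, and B are illustrative names, and the linear combiner anticipates the form of \(f'\) derived in the next section.

```python
import numpy as np

# F (C x d) stacks the OvA classifiers f_j; G (L x d) stacks the cheaper
# classifiers g_k; B (L x C) is an inexpensive linear combiner playing the
# role of f'.
def ova_predict(F, v):
    return int(np.argmax(F @ v))        # C dot products per example

def approx_predict(G, B, v):
    g = G @ v                           # only L << C dot products
    return int(np.argmax(g @ B))        # cheap combination of the g_k(v)
```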
4 Proposed Method
4.1 Overview of the Method
We propose a two-stage method for solving the multi-class classification problem. The key idea is to use an extremely fast filtering process to find a set of candidate labels and then use robust OvA classifiers in a verification process to obtain the correct label. First, a matrix-decomposition-based technique is used to find a small number of classifiers that best approximate the prediction scores of the original large number of classes. Because there is no guarantee that using the approximated scores directly minimizes the classification loss, only the top k scores are used to select a set of k candidate labels. A verification process is then carried out by applying the k OvA classifiers corresponding to the candidate labels to recompute the scores for the final decision.
The first stage applies the same set of fixed classifiers (pseudo-classifiers) to all test examples, making the testing process extremely fast while delivering reasonable performance. Meanwhile, the second stage further improves accuracy by applying a small number of OvA classifiers. In this way, a good trade-off between testing efficiency and classification accuracy is easily achieved.
4.2 Fast Classification via Prediction Score Decomposition

Let us factorize the N by C response matrix R into an N by L matrix A and an L by C matrix B, namely, \( R = A B\). By letting \(a_{(i,k)}=g_k(v_i)\), A can be regarded as the classification results of N images for L classifiers, and the final response R can be obtained by multiplying by B. One way to perform this decomposition is to use singular value decomposition (SVD for short), \(R \approx U \varSigma V^T\), where U and V are composed of the left and right singular vectors corresponding to the L largest singular values, and we can set \(A=U\) and \(B=\varSigma V^T\). In doing so, R is approximated optimally in the MSE sense. This implies that \(f'\) is a linear combination of the \(g_k\). Given this singular value decomposition, we can use \(U = [u_1\,u_2\, \cdots \, u_L]\) to train functions \(g_k\) that well approximate \(u_k\), namely, \(g_k(v_i) \approx u_{(i,k)}\), via regression. However, this process results in a two-stage approximation: in the first stage, R is approximated by singular value decomposition, and following that, each \(g_k(\cdot )\) is fit to \(u_{(\cdot ,k)}\) by regression. The pseudo-classifiers \(g_k(\cdot )\) obtained in this way may not be optimal. Instead, we jointly optimize the decomposition of R and the regressors for \(g_k(\cdot )\) in a single step. As a result, the cost of training the regressors is reduced, and improved accuracy is expected.
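A minimal sketch of this two-stage baseline (truncated SVD followed by regression) is given below, assuming linear regressors and a ridge term for numerical stability; the function name, the ridge parameter, and the variable layout are illustrative assumptions, not part of the paper's algorithm.

```python
import numpy as np

def two_stage_baseline(R, S, L, reg=1e-3):
    """R: N x C response matrix; S: d x N matrix of feature vectors (columns);
    L: number of pseudo-classifiers.  Returns regressor weights W (d x L)
    with g_k(v) = W[:, k] @ v, and the combiner B (L x C)."""
    # Stage 1: rank-L truncated SVD, R ~ U Sigma V^T.
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    U, s, Vt = U[:, :L], s[:L], Vt[:L, :]
    # Stage 2: ridge regression fitting g_k(v_i) ~ U[i, k] for each k.
    W = np.linalg.solve(S @ S.T + reg * np.eye(S.shape[0]), S @ U)
    B = np.diag(s) @ Vt
    return W, B
```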
Let’s revisit SVD.
Instead of obtaining U directly by singular value decomposition, we take into account that U is the result of performing regression on the feature vectors of the images. To do so, we pose the original problem as an eigenvalue problem, \(R R^T u = \lambda u\), where u is an eigenvector and \(\lambda \) is the corresponding eigenvalue. The first eigenvector, corresponding to the largest eigenvalue, can be obtained by maximizing (6), and the subsequent eigenvalues and eigenvectors are obtained iteratively using the above process together with Gram-Schmidt orthonormalization.
Now we consider u as a regression result, namely, \(u_{(i,k)} \approx g_k(v_i)\), and we further assume linear regression. By defining \(\tilde{S}\) as the matrix whose columns are the feature vectors of the images, we want \(u \approx \tilde{S}^T \tilde{w}\). Substituting this into (6), the problem becomes maximizing \(\tilde{w}^T \tilde{S} R R^T \tilde{S}^T \tilde{w}\) subject to \(\tilde{w}^T \tilde{S} \tilde{S}^T \tilde{w} = 1\), which can be solved using the method of Lagrange multipliers.
The above can be regarded as a generalized eigenvalue problem, \(P \tilde{w} = \lambda Q \tilde{w}\), where \(P = \tilde{S} R R^T \tilde{S}^T\) and \(Q = \tilde{S} \tilde{S}^T\). Clearly, P and Q are Hermitian and Q is positive semi-definite; therefore, the eigenvalues \(\lambda \) are real. Let us assume that we have obtained the eigenvalues \(\lambda _i\) and eigenvectors \(\tilde{w_i}\), where \(\lambda _1 \ge \lambda _2 \ge \cdots \). From the properties of the generalized eigenvalue problem, the \(\tilde{w_i}\) are Q-orthogonal, namely, \(\tilde{w_i}^T Q \tilde{w_j} = 0\) for \(i\ne j\). Since \(u_i^T u_j = \tilde{w_i}^T \tilde{S} \tilde{S}^T \tilde{w_j} = \tilde{w_i}^T Q \tilde{w_j}\), this ensures the orthogonality of the \(u_i\). In addition, due to (11), the \(u_i\) are orthonormal. Therefore, by selecting the \(\tilde{w_i}\) corresponding to the L largest \(\lambda _i\), we obtain \(u_i\) that optimally approximate R, while at the same time the \(\tilde{w_i}\) define the optimal regressors.
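The joint optimization can be sketched in a few lines, assuming \(\tilde{S}\) holds the feature vectors as columns and using a small ridge on Q to make it strictly positive definite for the solver; the ridge term and the function and variable names are assumptions added for illustration.

```python
import numpy as np
from scipy.linalg import eigh

def learn_pseudo_classifiers(S_tilde, R, L, ridge=1e-6):
    """S_tilde: d x N matrix of feature vectors; R: N x C response matrix;
    L: number of pseudo-classifiers.  Returns the regressor weights W,
    the L largest eigenvalues, and U = S_tilde^T W."""
    P = S_tilde @ R @ R.T @ S_tilde.T                         # d x d
    Q = S_tilde @ S_tilde.T + ridge * np.eye(S_tilde.shape[0])
    # Generalized symmetric eigenproblem P w = lambda Q w (ascending order).
    lam, W = eigh(P, Q)
    idx = np.argsort(lam)[::-1][:L]                           # L largest eigenvalues
    lam, W = lam[idx], W[:, idx]
    U = S_tilde.T @ W                                         # N x L regression outputs
    return W, lam, U
```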
Let us consider such \(\tilde{w_i}, i=1,\cdots ,L\), and form \(\tilde{W}=[\tilde{w_1}\,\cdots \,\tilde{w_L}]\). Accordingly, U can be obtained as \(U = \tilde{S}^T \tilde{W}\).
In theory, V can be obtained through an eigenvalue decomposition of \(R^T R\). However, due to the estimation error of the linear regression in U, the V obtained in this way may not correspond correctly to U. We therefore obtain V from the relationship \(R \approx U \varSigma V^T\), where \(\varSigma = diag(\lambda _1, \cdots , \lambda _L)^{1/2}\), namely, \(V^T = (U \varSigma )^{+} R\), where \(X^+\) denotes the Moore-Penrose pseudoinverse, \(X^+ = (X^TX)^{-1}X^T\).
The algorithms for training and classification are summarized in Algorithms 1 and 2.
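As an illustration of the training wrap-up and the test-time scoring step (the role played by Algorithm 2), a minimal sketch under the same assumptions as the sketch above; the function names are illustrative, and \(\varSigma \) and \(V^T\) are built exactly as in the derivation.

```python
import numpy as np

def finish_training(R, U, lam):
    """Recover Sigma = diag(lambda)^(1/2) and V^T from R ~ U Sigma V^T."""
    Sigma = np.diag(np.sqrt(np.maximum(lam, 0.0)))
    Vt = np.linalg.pinv(U @ Sigma) @ R                 # L x C
    return Sigma, Vt

def estimate_scores(S_test, W, Sigma, Vt):
    """Apply the same L pseudo-classifiers to every test example at once;
    S_test is a d x N_test matrix of test feature vectors (columns)."""
    A = S_test.T @ W                                   # N_test x L pseudo-scores
    return A @ Sigma @ Vt                              # N_test x C estimated scores
```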
4.3 Verification by OvA Classifiers
Given the estimated score matrix \(R_{tg} \in \mathbb {R}^{N\times C}\) returned by Algorithm 2, each row \(r_{i}\) contains the scores of the C classifiers for test example \(v_{i}\), i.e., \(r_{i,j} = f_{j}(v_{i}), (j=1,..,C, i=1,..,N)\). The simplest way to predict the label is to use these scores directly: \( y_{i} = \underset{j=1,..,C}{\text {argmax}} \text { }{r_{i,j}}\). However, errors in the estimated scores may affect classification accuracy, so a verification process is needed.
Specifically, a set of candidate labels \(C^{*}=\{c_{1}, c_{2}, ..., c_{k}\}\) corresponding to the top k scores is selected. The scores are then recomputed as \(r^{*}_{i,j} = f_{j}(v_{i}), j \in C^{*}\), and the final decision is \( y_{i} = \underset{j \in C^{*}}{\text {argmax}} \text { }{r^{*}_{i,j}}\).
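A minimal sketch of this verification step, assuming the OvA classifiers are linear with weights stacked in a matrix F_ova over the same features; the names are illustrative.

```python
import numpy as np

def verify_and_predict(R_est, S_test, F_ova, k=5):
    """R_est: N x C estimated scores; S_test: d x N test features (columns);
    F_ova: C x d OvA classifier weights.  Returns predicted labels."""
    labels = np.empty(R_est.shape[0], dtype=int)
    for i in range(R_est.shape[0]):
        cand = np.argsort(R_est[i])[-k:]           # indices of the top-k scores
        exact = F_ova[cand] @ S_test[:, i]         # only k OvA evaluations
        labels[i] = cand[np.argmax(exact)]
    return labels
```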
5 Experiments
5.1 Datasets
We evaluated our algorithms on several large datasets that are widely used in experiments on large-scale image classification [4, 10, 16, 24], including ILSVRC2010-1K [9], SUN-397 [37], and Caltech-256 [17]. ILSVRC2010-1K has 1.2M images in 1,000 classes for training, 50K images for validation, and 150K images for testing. SUN-397 has 108,754 images in 397 classes. Caltech-256 has 29,780 images in 256 classes. For the SUN-397 and Caltech-256 datasets, we used 50% of the images for training, 25% for validation, and the remaining 25% for testing. We used the same feature settings as in [10, 24] for a fair comparison. Specifically, for each image, we used the VLFeat toolbox [34] to extract dense SIFT features. The features were encoded using the LLC coding strategy described in [35] with a codebook of 10,000 visual words, and the image was encoded using a two-level spatial pyramid [22] with \(1 \times 1\) and \(2 \times 2\) grids. This resulted in a feature vector with approximately 50,000 dimensions. Experiments on CNN features and on the larger ImageNet-10K dataset are reported in the Supplementary Material.
5.2 Results
Accuracy Comparison. Table 1 compares the classification accuracy (top-1 average per-class accuracy) on the ILSVRC2010-1K dataset (C = 1,000 classes) of our system with those of two state-of-the-art label-tree methods, the Fast-Balanced Tree [10] and the Probabilistic Tree [24]. We follow the same protocol as described in [24] for comparison. Specifically, the columns of Table 1 represent different tree structures. The tree denoted by \(T_{m,n}\) has m children per node when branching and n levels, not including the root node. The test speedup \(S_{te}\) is the OvA test cost divided by the label-tree test cost (measured as the average number of dot products used to classify each example), as defined in [10].
Similarly to [24], to make a fair comparison, we adjusted the number of pseudo-classifiers to achieve a similar test time (i.e., the average number of dot products to classify each example). For example, for the tree configuration \(T_{32,2}\), to achieve the test speedup \(S_{te}=10.42\), the average number of classifiers applied to each example is \(L_{T_{32,2}}=\frac{C}{S_{te}}=\frac{1,000}{10.42}=96\). The accuracy of our method is reported for two configurations. The first (Ours-[\(L_{1}\)]) does not use the verification step, and the second (Ours-[\(L_{2}+k\)]) uses the verification step with \(k=5\) OvA classifiers. Because the number of classifier evaluations is fixed in advance in this comparison, for example \(L_{T_{32,2}}=96\), the first configuration uses \(L_{1}=L_{T_{32,2}}=96\) pseudo-classifiers, while the second uses only \(L_{2}=L_{T_{32,2}}-k=91\) pseudo-classifiers.
In addition, the accuracy of ECOC-based methods is also reported. We used the ECOC library provided by [13] to generate the coding matrices for Random Dense Output Coding (RDOC) and Random Sparse Output Coding (RSOC) [2], with the number of bit predictors equal to the number of pseudo-classifiers in the second configuration (\(L_2\)). For Spectral ECOC [39], we used the confusion matrix for computing the eigenvectors, because the original method uses OvO classifiers that do not scale to a large number of classes. The verification step (\(k=5\)) is used for these ECOC methods. The accuracy of the method based on attribute learning [21] and the accuracy of multi-class classification using OvA classifiers trained with LIBLINEAR [15] (1K-OvA(LIBLINEAR)) are also reported for reference. The results in Table 1 show that:
-
The classification accuracy of our method (Ours-[\(L_{2}+k\)]) is significantly better than that of the other state-of-the-art methods. Ours-[\(L_{2}+k\)] improves the accuracy over Ours-[\(L_{1}\)] by 25% (\(T_{32,2}\)) to 53% (\(T_{6,4}\)), showing that the verification process is helpful. Furthermore, its accuracy is very close to that of 1K-OvA(LIBLINEAR), while the number of classifiers is more than ten times smaller.
-
The classification accuracy of our method (Ours-[\(L_{1}\)]) without the verification step is significantly better than that of the Fast-Balanced Tree method [10] and quite comparable to that of the Probabilistic Tree method [24] when a sufficiently large number of pseudo-classifiers is used, as shown for \(T_{32,2}\). As shown in Fig. 2, our method needs at least 100 pseudo-classifiers to achieve reasonable classification accuracy.
We implemented a variant of the label tree proposed by Bengio et al. [4] for comparison. The tree structure was learned by applying spectral clustering [28] to the confusion matrix, similarly to [4]. However, for each node, given the label set associated with that node, we trained multi-class classifiers using the OvA strategy. The resulting tree, which we call Label Tree-R1, is similar to the tree using Relaxation(1) described in [4], in which node classifiers are optimized independently. Methods such as the Fast-Balanced Tree [10] and the Probabilistic Tree [24] do not report classification accuracy on the SUN-397 and Caltech-256 datasets; therefore, only the accuracy of Label Tree-R1 is reported for comparison. The observations and conclusions from the results on these two datasets (shown in Tables 2 and 3) are the same as those for the ILSVRC2010-1K dataset.
Effect of the Number of OvA Classifiers in the Verification Stage. Figure 1 shows the effect of the number of OvA classifiers k used in the verification stage. The accuracy improves as more OvA classifiers are used. The configuration Ours-[\(L_{2}+k\)] with \(L_{2}=200\) and \(k=5\) outperforms the 1K-OvA classifiers. It should be noted that the verification stage is necessary only if top-1 accuracy is required (for fair comparison with label-tree methods); the top-5 accuracy of Ours-[\(L_{1}\)] is equal to that of Ours-[\(L_{2} + k\)] (\(k = 5\)).
Effect of the Number of Pseudo-classifiers. As described in the training algorithm in Sect. 4.2, our method needs to train OvA classifiers on the training images and apply these classifiers to the validation images to obtain the matrix R from which the pseudo-classifiers are generated. Figure 2 shows the relationship between the classification accuracy and the number of pseudo-classifiers used to approximate the classification scores on the ILSVRC2010-1K dataset. It also shows the effect of the number of images used to train the OvA classifiers and of the number of validation images used to obtain the matrix R. We tested different settings TrainnT-ValnV, where \(nT=100, 300\) is the number of training images per class and \(nV=30, 50\) is the number of validation images per class. The results indicate that the number of training images influences the final performance, while the number of validation images has no influence. Reasonable classification accuracy can be achieved with 100 pseudo-classifiers.
Real Processing Time. Prior studies measure testing efficiency only by the average number of dot products M needed to classify each example (used to estimate \(S_{te}\)). We argue that M does not reflect the true test time and that real processing time is therefore more appropriate for practical evaluation. Tree-based methods rely on a hierarchical structure, meaning that the selection of the classifiers used at each level of the tree depends on the decisions of the classifiers used at the previous level. Therefore, the total cost includes not only the cost of the dot-product operations when applying linear classifiers to the test example, but also the cost of switching classifiers when traversing down to child nodes.
One advantage of our method in the case of Ours-[\(L_{1}\)] is that the same set of pseudo-classifiers is applied to all test examples, so the computation reduces to a single matrix multiplication (see formula 7). Therefore, it is extremely fast to select the candidate classes for the verification step. Figure 3 compares the processing time (measured as wall-clock time in seconds) of our method and the Label Tree-R1 method for \(T_{32,2}\) (note that the processing time of the Label Tree-R1 method is also representative of other label-tree methods). In our implementation, we assume all classifiers can be loaded into memory at once. We measure the processing time of the methods for different values of numTest, the number of test examples that can be loaded into memory at a time. For the Label Tree-R1 method, we only count the processing time of the dot-product operations and ignore other costs such as loading classifiers at each level. The results indicate that the testing speed of our method is significantly better than that of the Label Tree-R1 method. For example, for \(numTest=50,000\), with a similar \(S_{te} = 10.4\) (i.e., a similar average number of dot products to classify each example), our method requires 3.4 s to return the classification result, while the Label Tree-R1 method using \(T_{32,2}\) requires 267.5 s.
Given a target accuracy, as shown in Fig. 1, there are two ways to achieve it: (i) increasing the number of pseudo-classifiers \(L_{1}\) when using Ours-[\(L_{1}\)], and (ii) increasing the number of OvA classifiers k when using Ours-[\(L_{2}+k\)].
The former yields an extremely fast classification process for a set of test images because the same set of classifiers is applied via a single matrix multiplication. For example, given the target accuracy of 21.38% of the Probabilistic Tree [24] with \(T_{32,2}\), Ours-[\(L_{1}\)] needs \(L_{1} = 200\) pseudo-classifiers to achieve 22.85%, while being 15 times faster. Similarly, for \(T_{6,4}\), it needs \(L_{1} = 100\) pseudo-classifiers to achieve 20.48% (the Probabilistic Tree achieves 17.2%), while being 25 times faster.
The latter is more appropriate when memory is constrained, which limits the number of classifiers that can be loaded into memory. Given a target number of classifier evaluations, as shown in Table 1, Ours-[\(L_{2}+k\)] achieves the best classification accuracy. Its speed is lower than that of Ours-[\(L_{1}\)], but much higher than that of Label Tree-R1, as shown in Fig. 3.
Fig. 2. The relationship between the classification accuracy and the number of pseudo-classifiers used to approximate the classification scores on the ILSVRC2010-1K dataset. We tested different settings TrainnT-ValnV, where \(nT=100, 300\) is the number of training images per class and \(nV=30, 50\) is the number of validation images per class. The classification accuracy increases significantly when the number of training images per class changes from 100 to 300, but does not change much when the number of validation images per class changes from 30 to 50 with the number of training images per class fixed.
Fig. 3. Real processing time of our method and the Label Tree-R1 method for \(T_{32,2}\) (representative of label-tree methods). Our method uses the same set of pseudo-classifiers for all test examples, so it only needs a matrix multiplication (as shown in formula 7) to calculate the classification scores for a set of test examples. Meanwhile, the Label Tree-R1 method uses a different set of classifiers for each example, so it is much slower than our method. For example, for \(N=50,000\), with a similar \(S_{te} = 10.4\) (i.e., a similar average number of dot products to classify each example), our method requires 3.4 s to return the classification result, while the Label Tree-R1 method using \(T_{32,2}\) requires 267.5 s (78 times slower).
6 Conclusion
We presented a novel method for multi-class classification with a large number of classes. Our method finds a small set of pseudo-classifiers that best approximate the scores of the original classes. Furthermore, it is easy to implement, and one can simply adjust the trade-off between accuracy and efficiency by specifying the number of pseudo-classifiers. Comprehensive experiments on large datasets, including ILSVRC2010-1K, SUN-397, and Caltech-256, showed that our method achieves state-of-the-art classification accuracy and is more efficient in terms of testing time than other methods.
Notes
1. We use score matrix and response matrix interchangeably.
References
Akata, Z., Perronnin, F., Harchaoui, Z., Schmid, C.: Good practice in large-scale learning for image classification. PAMI 36(3), 507–520 (2013)
Allwein, E.L., Schapire, R.E., Singer, Y.: Reducing multi-class to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141 (2001)
Amit, Y., Fink, M., Srebro, N., Ullman, S.: Uncovering shared structures in multiclass classification. In: ICML (2007)
Bengio, S., Weston, J., Grangier, D.: Label embedding trees for large multi-class task. In: NIPS (2010)
Beygelzimer, A., Langford, J., Lifshits, Y., Sorkin, G., Strehl, A.: Conditional probability tree estimation analysis and algorithms. In: UAI (2009)
Beygelzimer, A., Langford, J., Ravikumar, P.: Error-correcting tournaments. In: Gavaldà, R., Lugosi, G., Zeugmann, T., Zilles, S. (eds.) ALT 2009. LNCS (LNAI), vol. 5809, pp. 247–262. Springer, Heidelberg (2009). doi:10.1007/978-3-642-04414-4_22
Chen, Y., Crawford, M., Ghosh, J.: Integrating support vector machines in a hierarchical output space decomposition framework. In: IGARSS (2004)
Crammer, K., Singer, Y.: On the learnability and design of output codes for multiclass problems. Mach. Learn. 47(2–3), 201–233 (2002)
Deng, J., Dong, W., Socher, R., Li, L., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: CVPR (2009)
Deng, J., Satheesh, S., Berg, A., Fei-Fei, L.: Fast and balanced: efficient label tree learning for large scale object recognition. In: NIPS (2011)
Denton, E.L., Zaremba, W., Bruna, J., Lecun, Y., Fergus, R.: Exploiting linear structure within convolutional networks for efficient evaluation. In: NIPS (2014)
Dietterich, T.G., Bakiri, G.: Solving multi-class learning problems via error-correcting output codes. J. Artif. Intell. Res. 2, 263–286 (1995)
Escalera, S., Pujol, O., Radeva, P.: Error-correcting output codes library. J. Mach. Learn. Res. 11, 661–664 (2010)
Escalera, S., Tax, M., Pujol, O., Radeva, P.: Subclass problem-dependent design for error-correcting output codes. PAMI (2008)
Fan, R.E., Chang, K.W., Hsieh, C.J., Wang, X.R., Lin, C.J.: LIBLINEAR: a library for large linear classification. J. Mach. Learn. Res. 8, 1871–1874 (2008)
Gao, T., Koller, D.: Discriminative learning of relaxed hierarchy for large-scale visual recognition. In: ICCV (2011)
Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical report, California Institute of Technology (2007)
Griffin, G., Perona, P.: Learning and using taxonomies for fast visual categorization. In: CVPR (2008)
Harchaoui, Z., Douze, M., Paulin, M., Dudik, M., Malick, J.: Large-scale image classification with trace-norm regularization. In: CVPR (2012)
Jaderberg, M., Vedaldi, A., Zisserman, A.: Speeding up convolutional neural networks with low rank expansions. In: BMVC (2014)
Kusakunniran, W., Satoh, S., Zhang, J., Wu, Q.: Attribute-based learning for large scale object classification. In: ICME (2013)
Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR (2006)
Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., Cao, L., Huang, T.: Large-scale image classification: fast feature extraction and SVM training. In: CVPR (2011)
Liu, B., Sadeghi, F., Tappen, M., Shamir, O., Liu, C.: Probabilistic label trees for efficient large scale image classification. In: CVPR (2013)
Liu, S., Yi, H., Chia, L.T., Rajan, D.: Adaptive hierarchical multi-class SVM classifier for texture-based image classification. In: ICME (2005)
Loeff, N., Farhadi, A.: Scene discovery by matrix factorization. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5305, pp. 451–464. Springer, Heidelberg (2008). doi:10.1007/978-3-540-88693-8_33
Marszałek, M., Schmid, C.: Constructing category hierarchies for visual recognition. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5305, pp. 479–491. Springer, Heidelberg (2008). doi:10.1007/978-3-540-88693-8_35
Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: NIPS (2002)
Platt, J.C., Cristianini, N., Shawe-taylor, J.: Large margin dags for multi-class classification. In: NIPS (2000)
Pujol, O., Radeva, P., Vitrià, J.: Discriminant ECOC: A heuristic method for application dependent design of error correcting output codes. PAMI (2006)
Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)
Song, H.O., Girshick, R., Zickler, S., Geyer, C., Felzenszwalb, P., Darrell, T.: Generalized sparselet models for real-time multiclass object recognition. PAMI (2013)
Sun, M., Huang, W., Savarese, S.: Find the best path: an efficient and accurate classifier for image hierarchies. In: ICCV (2013)
Vedaldi, A., Fulkerson, B.: VLFeat - an open and portable library of computer vision algorithms. In: ACM International Conference on Multimedia (2010)
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., Gong, Y.: Locality-constrained linear coding for image classification. In: CVPR (2010)
Xia, S., Li, J., Xia, L., Ju, C.: Tree-structured support vector machines for multi-class classification. In: Liu, D., Fei, S., Hou, Z., Zhang, H., Sun, C. (eds.) ISNN 2007. LNCS, vol. 4493, pp. 392–398. Springer, Heidelberg (2007). doi:10.1007/978-3-540-72395-0_50
Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: large-scale scene recognition from abbey to zoo. In: CVPR (2010)
Yuan, X., Lai, W., Mei, T., Hua, X., Wu, X., Li, S.: Automatic video genre categorization using hierarchical SVM. In: ICIP (2006)
Zhang, X., Liang, L., Shum, H.: Spectral error correcting output codes for efficient multiclass recognition. In: ICCV (2009)
Zhao, B., Xing, E.P.: Sparse output coding for large-scale visual recognition. In: CVPR (2013)
Acknowledgment
This research is funded by Vietnam National University Ho Chi Minh City (VNU-HCM) under grant number B2015-26-01.