1 Introduction

Recognition and detection of real-world objects are challenging because it is difficult to model objects that exhibit significant variations in color, shape, and texture. In addition, the backgrounds against which objects appear are often complex and cluttered, and we must account for changes in illumination, pose, size, and the number of objects in all but the most contrived situations. Currently, local image representations [1–13] prevail in state-of-the-art object recognition and detection algorithms. These representations follow the bag-of-features framework [5, 6]: low-level patch descriptors are first extracted over a dense grid or at salient points, then encoded into mid-level features in an unsupervised manner using Gaussian mixtures, K-means, or sparse coding, and finally aggregated into an image-level representation using spatial pooling schemes [5–7]. Usually, carefully designed descriptors such as SIFT [8], SURF [9], LBP [10], and HOG [11] serve as the low-level descriptors that gather statistics of pixel attributes within local patches. However, designing hand-crafted descriptors is non-trivial: sufficient prior knowledge is required, and parameters must be well tuned to achieve good performance. Moreover, we still lack a deep understanding of the design rules behind them. Recently, Bo et al. [1, 2] examined how SIFT and HOG measure the similarity between image patches and interpreted the design philosophy behind them from a kernel view. They showed that the inner product of orientation histograms used in SIFT and HOG is a particular match kernel over image patches. This insight provides a general way to turn pixel-level attributes into patch-level features via match kernels that compare similarities between image patches. Based on it, they designed a set of low-level descriptors called kernel descriptors (KDES), using kernel principal component analysis (KPCA) [14, 15] to reduce the dimensionality of KDES. However, KPCA captures only second-order statistics of KDES and cannot preserve its higher-order statistics. This inevitably degrades the distinctiveness of KDES for nonlinear clustering and recognition, where higher-order statistics are needed. Wang et al. [4] incorporated image labels into the design of patch-level KDES and derived a variant called supervised kernel descriptors (SKDES). By guiding KDES under a supervised framework with a large-margin nearest neighbor criterion and low-rank regularization, SKDES achieved improved performance on object recognition.

In this work, we focus on improving the original KDES by embedding context cues into the descriptors and by learning a compact and discriminative Context Kernel Descriptor (CKD) codebook for object recognition and detection using information theoretic learning techniques. For feature extraction, we develop a set of CKD that enhance KDES with embedded spatial context. Context cues enforce a degree of spatial consistency, which improves the robustness of the resulting descriptors. For feature learning, we adopt the Rényi entropy based Cauchy-Schwarz Quadratic Mutual Information (CSQMI) [28] as an information theoretic measure to learn a compact and discriminative CKD codebook from a rich and redundant CKD dictionary. In our method, codebook learning involves two steps: codebook selection and codebook refinement. In the first step, a group of compact and discriminative basis vectors is selected from all available basis vectors to construct the codebook. Maximizing the CSQMI between the selected basis vectors in the codebook and the remaining basis vectors in the dictionary yields a compact CKD codebook; maximizing the CSQMI between the low-dimensional CKD generated from the codebook and their class labels boosts the codebook's discriminability. In the second step, we further refine the codebook for improved discriminability and low approximation error using a gradient ascent method that maximizes the CSQMI between the low-dimensional CKD and their class labels, subject to a constraint on approximation accuracy. Projecting the full-dimensional CKD onto the learned codebook yields the final low-dimensional discriminative CKD for feature representation. Evaluation results on standard recognition benchmarks and on a challenging chicken feet dataset show that the proposed CKD model outperforms the original KDES as well as carefully tuned descriptors such as SIFT and several sophisticated deep learning methods.

The low-level patch features used in our work are built upon KDES. Conceptually, our work is related to [1], but it departs from [1] in two distinct ways that improve the robustness and discriminability of the feature representation. First, we propose an enhanced match kernel, the context match kernel (CMK). CMK strengthens the spatial consistency of the original match kernel by embedding extra neighboring information into it. The spatial co-occurrence constraints implicit in the CMK significantly improve the robustness of similarity matching between feature sets, even for ambiguous or impaired features generated from partially occluded objects. Second, rather than using KPCA to reduce feature dimensionality, we reduce dimensionality by projecting the original high-dimensional CKD onto a compact and discriminative CKD codebook. The codebook is learned with a novel information theoretic feature selection algorithm based on CSQMI. Because CSQMI is derived from the Rényi quadratic entropy, we can efficiently approximate it using a Parzen window [28]. In addition, the geometric interpretation of CSQMI [28] allows us to learn a discriminative codebook that captures the cluster structure of the input samples as well as the information conveyed by their class labels. Hence, the low-dimensional CKD derived from our model is more discriminative than the original KDES derived from KPCA.

2 Feature Extraction Using CKD

We enhance the original match kernel in [1] by embedding neighborhood constraints into it. Since a neighborhood is an adjacent set of pixels surrounding a center pixel, neighborhood information can be regarded as the spatial context of that pixel. We therefore refer to the enhanced match kernel as the Context Match Kernel (CMK) and to the resulting descriptors as Context Kernel Descriptors (CKD). The intuition behind CMK is that pixels with similar attributes in two patches should, with high probability, also have neighboring pixels whose attributes are similar. By exploiting this spatial co-occurrence constraint, our CMK significantly improves matching accuracy. CMK can easily be used to develop local descriptors from any pixel attribute, such as gradient, color, texture, or shape. We first derive the general CMK and then introduce the specific CMKs used in this work.

2.1 Formulation of CMK

An image patch can be modeled as a set of pixels \( X = \{x_i\}_{i=1}^{n} \), where \( x_i \) is the coordinate of the \(i\)-th pixel. Let \( a_i \) be the attribute vector of pixel \( x_i \). The \(k\)-neighborhood \( N_k^i \) of pixel \( x_i \) in \( X \) is defined as the group of pixels (including \( x_i \) itself) that are closest to it: \( N_k^i = \{ x_j \in X \mid \|x_i - x_j\| \le k,\ k \ge 1 \} \). To suppress image noise, we smooth the image with a Haar wavelet filter before computing the local gradient in each \(k\)-neighborhood. For the \(k\)-neighborhood centered at \( x_p \), we first normalize the neighborhood's attribute by voting each pixel's attribute in \( N_k^p \) with its gradient magnitude, weighted by a Gaussian function centered at \( x_p \). The width of the Gaussian, which scales the contribution of off-center pixels, is controlled by the neighborhood size \(k\). The attribute of the \(k\)-neighborhood centered at \( x_q \) is normalized in the same way. With the normalized attributes of \( N_k^p \) and \( N_k^q \), we define the context kernel of attribute \(a\) between \( x_p \) and \( x_q \) as

$$ \begin{aligned} \kappa_{con}[(x_p, a_p), (x_q, a_q)] &= \kappa_a(\bar{a}_p, \bar{a}_q) \\ \bar{a}_p &= \frac{1}{|N_k^p|} \sum_{x_u \in N_k^p} a_u\, m_u \exp\!\left(-\frac{8\|x_u - x_p\|^2}{k^2}\right), \qquad \bar{a}_q = \frac{1}{|N_k^q|} \sum_{x_v \in N_k^q} a_v\, m_v \exp\!\left(-\frac{8\|x_v - x_q\|^2}{k^2}\right) \end{aligned} $$
(1)

where \( m_u \) and \( m_v \) are the gradient magnitudes at pixels \( x_u \) and \( x_v \), respectively; \( \bar{a}_p \) and \( \bar{a}_q \) are the normalized image attributes of the \(k\)-neighborhoods centered at \( x_p \) and \( x_q \), respectively; and \( \kappa_a(\bar{a}_p, \bar{a}_q) = \exp(-\gamma_a \|\bar{a}_p - \bar{a}_q\|^2) = \varphi_a(\bar{a}_p)^{\mathrm{T}} \varphi_a(\bar{a}_q) \) is the Gaussian kernel measuring the similarity of the normalized attributes \( \bar{a}_p \) and \( \bar{a}_q \). The context kernel \( \kappa_{con} \) thus provides a normalized measure of attribute similarity between the two \(k\)-neighborhoods centered at \( x_p \) and \( x_q \). Merging \( \kappa_{con} \) into the match kernels of [1] and replacing the attribute \(a\) in Eq. (1) with specific attributes, we can derive a set of attribute-specific CMKs.
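
To make Eq. (1) concrete, the following minimal numpy sketch computes the context-normalized attribute \( \bar{a}_p \) and the context kernel \( \kappa_{con} \) for one patch; the array layout and function names are illustrative only, not our released implementation.

```python
import numpy as np

def context_attribute(coords, attrs, mags, center, k):
    """Gaussian- and magnitude-weighted attribute average over the
    k-neighborhood of `center` (Eq. 1). `coords` is (n, 2) pixel
    coordinates, `attrs` is (n, d) per-pixel attribute vectors and
    `mags` is the (n,) gradient magnitudes of one patch."""
    d2 = np.sum((coords - center) ** 2, axis=1)
    mask = d2 <= k ** 2                                 # pixels inside N_k
    w = mags[mask] * np.exp(-8.0 * d2[mask] / k ** 2)   # Gaussian voting weights
    return (w[:, None] * attrs[mask]).sum(axis=0) / mask.sum()

def kappa_con(a_bar_p, a_bar_q, gamma_a):
    """Context kernel: Gaussian kernel between normalized attributes."""
    return np.exp(-gamma_a * np.sum((a_bar_p - a_bar_q) ** 2))
```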

For example, let \( \theta'_p \) and \( m'_p \) be the normalized orientation and normalized magnitude of the image gradient at pixel \( x_p \), with \( \theta'_p = (\sin\theta_p, \cos\theta_p) \) and \( m'_p = m_p / \sqrt{\sum_{p \in P} m_p^2 + \tau} \), where \( \tau \) is a small positive number. To compare the similarity of gradients between patches \(P\) and \(Q\) from two different images, the gradient CMK \( K_{gck} \) can be defined as

$$ K_{gck}(P, Q) = \sum_{p \in P} \sum_{q \in Q} m'_p\, m'_q\, \kappa_o(\theta'_p, \theta'_q)\, \kappa_s(x_p, x_q)\, \kappa_{con}[(x_p, \theta'_p), (x_q, \theta'_q)] $$
(2)

where \( \kappa_o(\theta'_p, \theta'_q) = \exp(-\gamma_o \|\theta'_p - \theta'_q\|^2) = \varphi_o(\theta'_p)^{\mathrm{T}} \varphi_o(\theta'_q) \) is the orientation kernel measuring the similarity of the normalized orientations at pixels \( x_p \) and \( x_q \); \( \kappa_s(x_p, x_q) = \exp(-\gamma_s \|x_p - x_q\|^2) = \varphi_s(x_p)^{\mathrm{T}} \varphi_s(x_q) \) is the spatial kernel measuring how close two pixels are spatially; and \( \kappa_{con}[(x_p, \theta'_p), (x_q, \theta'_q)] \) is given by Eq. (1).
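
A naive evaluation of Eq. (2) then combines the four kernels over all pixel pairs, reusing the helpers from the sketch above; the patch layout is again an assumed illustration rather than our optimized implementation.

```python
def gradient_cmk(P, Q, k, gammas):
    """O(|P||Q|) evaluation of the gradient CMK in Eq. (2). Each patch is a
    dict with 'xy' (n, 2) coordinates, 'theta' (n, 2) normalized orientations
    (sin, cos), 'm' (n,) normalized magnitudes and 'mag' (n,) raw magnitudes."""
    # Precompute the context-normalized orientation of every pixel (Eq. 1).
    a_bar_P = [context_attribute(P['xy'], P['theta'], P['mag'], c, k) for c in P['xy']]
    a_bar_Q = [context_attribute(Q['xy'], Q['theta'], Q['mag'], c, k) for c in Q['xy']]
    total = 0.0
    for p in range(len(P['xy'])):
        for q in range(len(Q['xy'])):
            k_o = np.exp(-gammas['o'] * np.sum((P['theta'][p] - Q['theta'][q]) ** 2))
            k_s = np.exp(-gammas['s'] * np.sum((P['xy'][p] - Q['xy'][q]) ** 2))
            k_c = kappa_con(a_bar_P[p], a_bar_Q[q], gammas['a'])
            total += P['m'][p] * Q['m'][q] * k_o * k_s * k_c
    return total
```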

Similarly, to measure the similarity of color attributes between \(P\) and \(Q\), the color CMK \( K_{cck} \) can be defined as

$$ K_{cck}(P, Q) = \sum_{p \in P} \sum_{q \in Q} \kappa_c(c_p, c_q)\, \kappa_s(x_p, x_q)\, \kappa_{con}[(x_p, c_p), (x_q, c_q)] $$
(3)

where \( \kappa_c(c_p, c_q) = \exp(-\gamma_c \|c_p - c_q\|^2) = \varphi_c(c_p)^{\mathrm{T}} \varphi_c(c_q) \) is the color kernel measuring the similarity of the color values \( c_p \) and \( c_q \). For color images we use the normalized rgb vector as the color value, whereas for grayscale images we use the intensity value.

For the texture attribute, we derive the texture CMK \( K_{lbpck} \) based on Local Binary Patterns (LBP) [10]:

$$ K_{lbpck}(P, Q) = \sum_{p \in P} \sum_{q \in Q} \sigma'_p\, \sigma'_q\, \kappa_{lbp}(lbp_p, lbp_q)\, \kappa_s(x_p, x_q)\, \kappa_{con}[(x_p, lbp_p), (x_q, lbp_q)] $$
(4)

where \( \sigma'_p = \sigma_p / \sqrt{\sum_{p \in N_3} \sigma_p^2 + \tau} \) is the normalized standard deviation of pixel values within the 3 × 3 window around \( x_p \), and \( \kappa_{lbp}(lbp_p, lbp_q) = \exp(-\gamma_{lbp} \|lbp_p - lbp_q\|^2) \) is a Gaussian match kernel over LBP codes.
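
A numpy sketch of the per-pixel LBP attributes entering Eq. (4); border handling is ignored for brevity (np.roll wraps around the image edges), and the vectorized layout is our own illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def lbp_features(gray, tau=1e-4):
    """Per-pixel LBP attributes for Eq. (4): the binarized 3x3 pattern
    (kept as an 8-vector so the Gaussian kernel applies) and the locally
    normalized standard deviation sigma'."""
    mean = uniform_filter(gray, 3)
    sq_mean = uniform_filter(gray ** 2, 3)
    sigma = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))  # std of each 3x3 window
    norm = np.sqrt(uniform_filter(sigma ** 2, 3) * 9 + tau)  # sum of sigma^2 over N_3
    sigma_prime = sigma / norm                               # sigma'_p of Eq. (4)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    lbp = np.stack([(np.roll(np.roll(gray, dy, 0), dx, 1) >= gray)
                    for dy, dx in offsets], axis=-1).astype(float)
    return lbp, sigma_prime
```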

As shown in Eqs. (2)-(4), each attribute-based CMK consists of four terms: (1) a normalized linear kernel (\( m'_p m'_q \) for \( K_{gck} \), 1 for \( K_{cck} \), and \( \sigma'_p \sigma'_q \) for \( K_{lbpck} \)) weighting the contribution of each pixel pair to the final CMK; (2) an attribute kernel evaluating the similarity of pixel attributes; (3) a spatial kernel \( \kappa_s \) measuring the relative distance between two pixels; and (4) a context kernel \( \kappa_{con} \) comparing the spatial co-occurrence of pixel attributes. We can therefore formulate the attribute CMKs defined in Eqs. (2)-(4) in a unified way as

$$ K(P, Q) = \sum_{p \in P} \sum_{q \in Q} w_p\, w_q\, \kappa_a(a_p, a_q)\, \kappa_s(x_p, x_q)\, \kappa_{con}[(x_p, a_p), (x_q, a_q)] $$
(5)

where \( w_p w_q \) and \( \kappa_a \) denote the normalized linear weighting kernel and the attribute kernel, respectively.

2.2 Approximation of CMK

Using the inner-product representation, we rewrite the match kernel \( K(P, Q) \) as

$$ \begin{aligned} K(P, Q) &= \langle \psi(P), \psi(Q) \rangle = \psi(P)^{\mathrm{T}} \psi(Q) \\ \psi(P) &= \sum_{p \in P} w_p\, \varphi_a(a_p) \otimes \varphi_s(x_p) \otimes \varphi_{con}(x_p, a_p), \qquad \psi(Q) = \sum_{q \in Q} w_q\, \varphi_a(a_q) \otimes \varphi_s(x_q) \otimes \varphi_{con}(x_q, a_q) \end{aligned} $$
(6)

where ⊗ is the tensor product and ψ(·) maps a patch to its feature in kernel space, namely the CKD. Note that \( \varphi_a \), \( \varphi_s \), and \( \varphi_{con} \) are all infinite-dimensional, since Gaussian kernels are used. To obtain an accurate approximation of \(K\), we uniformly sample \( \varphi_a \), \( \varphi_s \), and \( \varphi_{con} \) over a dense grid of sufficiently many basis vectors. In particular, for \( \varphi_a \) and \( \varphi_{con} \) we discretize the attribute \(a\) into \(G\) bins and approximate them by their projections onto the subspaces spanned by the \(G\) basis vectors \( \{\varphi_a(a^g)\}_{g=1}^{G} \). Similarly, for the spatial vector \(x\) we discretize the image plane into \(L\) bins and sample \(L\) spatial basis vectors. Finally, we approximate ψ(·) by its projections onto the \( G \times L \times G \) joint basis vectors \( \{\phi_l\}_{l=1}^{G \times L \times G} = \{\varphi_a(a^1) \otimes \varphi_s(x^1) \otimes \varphi_{con}(a^1), \ldots, \varphi_a(a^G) \otimes \varphi_s(x^L) \otimes \varphi_{con}(a^G)\} \):

$$ \psi(\cdot) \simeq \sum_{l=1}^{G \times L \times G} f_l\, \phi_l $$
(7)

where \( f_l \) is the projection coefficient onto the \(l\)-th joint basis vector \( \phi_l \). The dimensionality of the resulting CKD ψ is therefore \( G \times L \times G \). Uniform sampling provides a set of representative joint basis vectors but does not guarantee their compactness: projecting onto these basis vectors usually yields a group of redundant CKD. Next, we show how to learn a CKD codebook by selecting and refining a subset of compact and discriminative joint basis vectors with a CSQMI based information theoretic feature learning scheme. Projecting the original CKD ψ onto this codebook reduces the redundancy of ψ and gives a low-dimensional discriminative CKD representation.
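
As an illustration of this finite-dimensional approximation, the sketch below evaluates each kernel against its sampled basis vectors, whitens the responses with the inverse square root of the basis Gram matrix (the usual way to project onto a non-orthogonal sampled basis in KDES-style approximations; we assume that detail here), and sums the Kronecker products over pixels as in Eq. (7). It reuses `context_attribute` from Sect. 2.1; the helper names and patch layout are illustrative.

```python
import numpy as np

def gaussian_gram(B, gamma):
    """Gram matrix of the Gaussian kernel over basis vectors B (G, d)."""
    d2 = np.sum((B[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * d2)

def whitener(B, gamma, eps=1e-8):
    """K^{-1/2} of the basis Gram matrix, so projections onto the sampled
    (non-orthogonal) basis behave like orthonormal coordinates."""
    w, V = np.linalg.eigh(gaussian_gram(B, gamma))
    return V @ np.diag(1.0 / np.sqrt(np.maximum(w, eps))) @ V.T

def ckd_feature(patch, bases, gammas, k):
    """Finite-dimensional CKD of one patch (Eq. 7): sum over pixels of the
    Kronecker product of per-kernel basis responses. `bases` holds the
    sampled attribute ('a'), spatial ('s') and context ('con') bases."""
    Wa = whitener(bases['a'], gammas['a'])
    Ws = whitener(bases['s'], gammas['s'])
    Wc = whitener(bases['con'], gammas['a'])   # context shares the attribute bases
    f = 0.0
    for p in range(len(patch['xy'])):
        a_bar = context_attribute(patch['xy'], patch['attr'], patch['mag'],
                                  patch['xy'][p], k)
        ka = Wa @ np.exp(-gammas['a'] * np.sum((bases['a'] - patch['attr'][p]) ** 2, axis=1))
        ks = Ws @ np.exp(-gammas['s'] * np.sum((bases['s'] - patch['xy'][p]) ** 2, axis=1))
        kc = Wc @ np.exp(-gammas['a'] * np.sum((bases['con'] - a_bar) ** 2, axis=1))
        f = f + patch['w'][p] * np.kron(np.kron(ka, ks), kc)
    return f   # dimensionality G x L x G
```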

3 Feature Learning Using CSQMI

Shannon entropy and its related measures, such as mutual information and the Kullback-Leibler divergence (KLD), are widely used in feature learning [16–26]. However, Shannon entropy based feature learning methods share a common weakness: estimating the probability density function (pdf) inside the Shannon entropy is computationally expensive [16]. Recently, Rényi entropy [27, 28] has attracted growing attention in information theoretic learning. Its most appealing advantage is moderate computational complexity, because Rényi entropy can be estimated efficiently by kernel density estimation [29] (e.g., Parzen windowing). Several novel information theoretic metrics derived from Rényi entropy have been introduced for feature learning [30–33].

3.1 Rényi Entropy and CSQMI

Let \( S \in \mathcal{R}^d \) be a discrete random variable with pdf \( p(s) \); its Rényi entropy of order α is defined as [27]

$$ H_{\alpha}(S) = \frac{1}{1 - \alpha} \log_2 \sum_{s \in S} p^{\alpha}(s) $$
(8)

Rényi entropy defines a family of functionals that quantify the diversity of a data distribution. The standard Shannon entropy is recovered as the special case α → 1. The Rényi entropy of order α = 2, given in Eq. (9), is called the Rényi quadratic entropy \( H_2(S) \):

$$ H_2(S) = -\log_2 \sum_{s \in S} p^2(s) $$
(9)

Analogous to the KLD defined with Shannon entropy, the Cauchy-Schwarz divergence (CSD) based on Rényi quadratic entropy defines a measure of divergence between pdfs. Given two discrete random variables \( S_1 \) and \( S_2 \) with pdfs \( p_1(s_1) \) and \( p_2(s_2) \), the CSD [28, 31] of \( p_1 \) and \( p_2 \) is given by

$$ CSD(p_1; p_2) = -\log_2 \frac{\left( \sum_{s_1 \in S_1,\, s_2 \in S_2} p_1(s_1)\, p_2(s_2) \right)^2}{\sum_{s_1 \in S_1} p_1^2(s_1) \sum_{s_2 \in S_2} p_2^2(s_2)} = 2 H_2(S_1, S_2) - H_2(S_1) - H_2(S_2) $$
(10)

where \( H_2(S_1, S_2) = -\log_2 \sum_{s_1 \in S_1,\, s_2 \in S_2} p_1(s_1)\, p_2(s_2) \) measures the similarity (distance) between the two pdfs and can be regarded as the Rényi quadratic cross entropy. We can interpret \( H_2(S_1, S_2) \) as the information gained from observing \( p_2 \) with respect to the "true" density \( p_1 \), and vice versa. Hence, the CSD derived from Rényi quadratic entropy is semantically similar to Shannon's mutual information. Note that \( CSD(p_1; p_2) \ge 0 \) is a symmetric measure that equals zero if and only if \( p_1(s) = p_2(s) \), and grows toward infinity as the two pdfs move further apart. Based on the CSD, the Cauchy-Schwarz Quadratic Mutual Information (CSQMI) between two discrete random variables \( S_1 \) and \( S_2 \) is defined as [28]

$$ \begin{aligned} I_{CSD}(S_1; S_2) &= CSD\big(p_{12}(s_1, s_2);\, p_1(s_1)\, p_2(s_2)\big) \\ &= \log_2 \sum_{s_1 \in S_1} \sum_{s_2 \in S_2} p_{12}^2(s_1, s_2) + \log_2 \sum_{s_1 \in S_1} \sum_{s_2 \in S_2} p_1^2(s_1)\, p_2^2(s_2) - 2 \log_2 \sum_{s_1 \in S_1} \sum_{s_2 \in S_2} p_{12}(s_1, s_2)\, p_1(s_1)\, p_2(s_2) \end{aligned} $$
(11)

where \( p_{12}(s_1, s_2) \) is the joint pdf of \( (S_1, S_2) \), and \( p_1(s_1) \) and \( p_2(s_2) \) are the marginal pdfs of \( S_1 \) and \( S_2 \), respectively. \( I_{CSD}(S_1; S_2) \ge 0 \), with equality if and only if \( S_1 \) and \( S_2 \) are independent. \( I_{CSD}(S_1; S_2) \) is therefore a measure of dependence that reflects the information shared between \( S_1 \) and \( S_2 \): it measures how much knowing \( S_1 \) reduces the uncertainty about \( S_2 \), and vice versa.

To calculate the CSD and \( I_{CSD} \), we must estimate the marginal pdfs \( p(\cdot) \) and the joint pdf \( p_{12}(\cdot,\cdot) \). Fortunately, Principe [28] showed that, for the Rényi quadratic entropy and its induced measures such as CSD and \( I_{CSD} \), these marginal and joint pdfs can be estimated efficiently with a Parzen window density estimator [29], even in a high-dimensional feature space like that of the CKD, whereas this is not feasible for Shannon entropy [28]. This is why we choose the Rényi quadratic entropy based \( I_{CSD} \), instead of Shannon mutual information, as the information theoretic measure in our codebook learning algorithm.

In addition, recent findings by Jenssen et al. [30, 31] uncovered latent connections between Rényi quadratic entropy and mapped features in kernel space. When a Gaussian Parzen window estimator is applied, the Rényi quadratic entropy estimate can be written in terms of \( \|m\|^2 \), where \( m = \frac{1}{M} \sum_{s_t \in S} \varphi(s_t) \) is the mean vector of the mapped samples \( \varphi(s_t) \) (t = 1, ..., M) in the kernel feature space. Likewise, the CSD estimator is directly associated with the angle between the mean vectors \( m_1 \) and \( m_2 \) of the two clusters of mapped samples in kernel feature space, the clusters corresponding to samples drawn from \( p_1(s) \) and \( p_2(s) \), respectively. Consequently, CSQMI, which measures the CSD between a joint pdf and the product of two marginal pdfs, also relates to the cluster structure in kernel feature space. These relationships between Rényi quadratic entropy, CSD/CSQMI, and the mean vectors of mapped features provide a geometric interpretation of \( H_2(S) \) and CSD/CSQMI. They make Rényi quadratic entropy based measures well suited to the analysis of nonlinear data, even in high-dimensional spaces, because such measures capture the geometric structure of the data. In contrast, Shannon entropy and the KLD lack these properties.
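
For concreteness, the standard Parzen-window estimators of \( H_2 \) and \( I_{CSD} \) from paired samples [28] can be sketched as follows; for a discrete label variable \(L\), the label Gram matrix reduces to an indicator of matching labels. The helper names are our own.

```python
import numpy as np

def _gauss_gram(S, sigma):
    """Pairwise Gaussian (Parzen) kernel matrix over samples S (M, d)."""
    d2 = np.sum((S[:, None, :] - S[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def renyi_quadratic_entropy(S, sigma):
    """Parzen estimate of H2(S) (Eq. 9): -log2 of the mean pairwise kernel
    (the 'information potential')."""
    return -np.log2(_gauss_gram(S, sigma).mean())

def csqmi(S1, S2, sigma1, sigma2):
    """Parzen estimate of I_CSD(S1; S2) (Eq. 11) from paired samples."""
    G1, G2 = _gauss_gram(S1, sigma1), _gauss_gram(S2, sigma2)
    v_joint = (G1 * G2).mean()                            # joint-pdf term
    v_marg = G1.mean() * G2.mean()                        # product-of-marginals term
    v_cross = (G1.mean(axis=1) * G2.mean(axis=1)).mean()  # cross term
    return np.log2(v_joint) + np.log2(v_marg) - 2.0 * np.log2(v_cross)
```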

3.2 Codebook Selection and Refinement Using CSQMI

As mentioned in Sect. 2.2, we approximate the original CKD ψ with a group of redundant joint basis vectors \( \{\phi_l\}_{l=1}^{G \times L \times G} \). We call this set of joint basis vectors the dictionary and denote it by \( \Phi \) (with cardinality \( G \times L \times G \)). Assume we are given the CKD \( \psi^1, \ldots, \psi^M \) of \(M\) samples from \(C\) classes, where each class \(c\) (\(c = 1, \ldots, C\)) has \( M_c \) samples whose CKD are denoted \( \Psi_c = [\psi_c^1, \ldots, \psi_c^{M_c}] \). The CKD of all samples are then \( \Psi = \{\Psi_c\}_{c=1}^{C} \). Similarly, we denote the projection coefficients by \( F = \{F_c\}_{c=1}^{C} \), where \( F_c = [F_c^1, \ldots, F_c^{M_c}] = \left[ (f_{c1}^{1}, \ldots, f_{c\,G \times L \times G}^{1})^{\mathrm{T}}, \ldots, (f_{c1}^{M_c}, \ldots, f_{c\,G \times L \times G}^{M_c})^{\mathrm{T}} \right] \). Eq. (7) can then be written as \( \Psi = \Phi F \), where \( \Phi = [\phi_1, \ldots, \phi_{G \times L \times G}] \) and \( F = \left[ \begin{array}{ccc} f_{11}^{1} & \cdots & f_{C1}^{M_C} \\ \vdots & & \vdots \\ f_{1\,G \times L \times G}^{1} & \cdots & f_{C\,G \times L \times G}^{M_C} \end{array} \right] \) is the projection coefficient matrix. Given the CKD ψ of a random sample, the uncertainty of its class label \(L\) in terms of the class prior probabilities is measured by \( H_2(L) \) of Eq. (9), whereas the CSQMI \( I_{CSD}(\psi; L) \) of Eq. (11) measures the decrease in uncertainty about the pattern ψ due to knowledge of the underlying class label \(L\).

Given \( \Psi \) and the initial dictionary \( \Phi \), we aim to learn a compact and discriminative subset of joint basis vectors \( \Phi^* \subset \Phi \), with cardinality(\( \Phi^* \)) < cardinality(\( \Phi \)). We refer to \( \Phi^* \) as the codebook. Projecting the original CKD \( \Psi \) onto the codebook \( \Phi^* \) gives the low-dimensional CKD \( \Psi^* = \Phi^* F^* \), which should be both compact and discriminative. To learn a compact codebook, we maximize the CSQMI between \( \Phi^* \) and the unselected basis vectors \( \Phi - \Phi^* \), i.e., \( I_{CSD}(\Phi^*; \Phi - \Phi^*) \). Since this quantity signifies how compact the codebook \( \Phi^* \) is, a higher value means a more compact codebook. Such a codebook, however, may not be discriminative, because it carries no information about the relationship between the new CKD \( \Psi^* \) and their class labels \(L\). Therefore, we also maximize the CSQMI between \( \Psi^* \) and \(L\), i.e., \( I_{CSD}(\Psi^*; L) \), which quantifies the discriminability of the new CKD generated from the codebook \( \Phi^* \). The codebook learning problem can thus be formulated as

$$ \mathop{\arg\max}\limits_{\Phi^*} \left[ I_{CSD}(\Phi^*; \Phi - \Phi^*) + \lambda\, I_{CSD}(\Psi^*; L) \right] $$
(12)

where λ is a weight parameter trading off the compactness and discriminability terms. We use a two-step strategy to optimize both properties of the codebook. In the first step (codebook selection), the codebook that maximizes Eq. (12) is selected from the initial dictionary by greedy search. In the second step (codebook refinement), the selected codebook is refined via gradient ascent to further maximize the discriminability term \( I_{CSD}(\Psi^*; L) \) while keeping the approximation error as low as possible.

3.2.1 Codebook Selection

The first term in Eq. (12), \( I_{CSD}(\Phi^*; \Phi - \Phi^*) \), measures the compactness of the codebook \( \Phi^* \); the second term, \( I_{CSD}(\Psi^*; L) \), measures its discriminability. Following [34], the probability of Bayes classification error for the final CKD \( \Psi^* \), \( P(e^{\Psi^*}) \), is upper-bounded by \( P(e^{\Psi^*}) \le \frac{1}{2}\left( H_2(L) - I_{CSD}(\Psi^*; L) \right) \). Hence, the discriminative codebook \( \Phi^* \) with the minimal Bayes classification error bound is the one that maximizes \( I_{CSD}(\Psi^*; L) \).

During codebook selection, we start with an empty set \( \Phi^* \) and iteratively select the next best basis vector \( \phi^* \) from the remaining set \( \Phi - \Phi^* \), such that the mutual information gain between the new codebook \( \Phi^* \cup \phi^* \) and the remaining set, together with the mutual information gain between the CKD derived from the new codebook and the class labels, is maximized:

$$ \mathop{\arg\max}\limits_{\phi^* \in \Phi - \Phi^*} \left\{ \left[ I_{CSD}\big(\Phi^* \cup \phi^*;\, \Phi - (\Phi^* \cup \phi^*)\big) - I_{CSD}(\Phi^*; \Phi - \Phi^*) \right] + \left[ I_{CSD}(\Psi^{\Phi^* \cup \phi^*}; L) - I_{CSD}(\Psi^{\Phi^*}; L) \right] \right\} $$
(13)
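
The greedy search of Eq. (13) can be sketched as follows, with `csqmi_fn` standing for a CSQMI estimator in the spirit of Sect. 3.1 (the label term needs a label-kernel variant) and `psi_fn(idx)` returning the low-dimensional CKD for codebook columns `idx`; both helpers, and the brute-force candidate scan, are illustrative assumptions.

```python
import numpy as np

def select_codebook(Phi, psi_fn, labels, n_select, csqmi_fn):
    """Greedy codebook selection (Eq. 13): grow the codebook one basis
    vector at a time, scoring candidates by the compactness gain plus the
    discriminability gain."""
    selected = []
    remaining = list(range(Phi.shape[1]))
    base_compact = base_disc = 0.0
    for _ in range(n_select):
        best, best_gain, best_vals = None, -np.inf, (0.0, 0.0)
        for cand in remaining:
            trial = selected + [cand]
            rest = [i for i in remaining if i != cand]
            compact = csqmi_fn(Phi[:, trial].T, Phi[:, rest].T)  # I(Phi*; Phi - Phi*)
            disc = csqmi_fn(psi_fn(trial), labels)               # I(Psi*; L)
            gain = (compact - base_compact) + (disc - base_disc)
            if gain > best_gain:
                best, best_gain, best_vals = cand, gain, (compact, disc)
        selected.append(best)
        remaining.remove(best)
        base_compact, base_disc = best_vals
    return selected
```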

3.2.2 Codebook Refinement

Once the initial codebook \( \Phi^* \) has been obtained, we refine it to further enhance its discriminability by maximizing the discriminability term in Eq. (12), i.e., \( \max_{\Phi^*} \lambda I_{CSD}(\Psi^*; L) \). To guarantee a compact codebook, we assume cardinality(\( \Phi^* \)) ≪ cardinality(\( \Phi \)). Under this assumption, the projection coefficients that minimize the approximation error \( e = \|\Psi - \Phi^* F^*\|^2 \) are given by \( F^* = \Phi^{\dagger} \Psi \), where \( \Phi^{\dagger} = \mathrm{pinv}(\Phi^*) = (\Phi^{*\mathrm{T}} \Phi^*)^{-1} \Phi^{*\mathrm{T}} \) is the pseudo-inverse of \( \Phi^* \). The problem of refining \( \Phi^* \) to improve the discriminability of the codebook while retaining its approximation accuracy thus becomes the constrained optimization problem

$$ \mathop{\max}\limits_{\Phi^*} I_{CSD}(\Psi^*; L), \quad \text{subject to } F^* = \Phi^{\dagger} \Psi $$
(14)

Since \( I_{CSD}(\cdot;\cdot) \) is a quadratic symmetric measure, the objective \( I_{CSD}(\Psi^*; L) \) is differentiable. We use gradient ascent to iteratively refine \( \Phi^* \) so that \( I_{CSD}(\Psi^*; L) \) is maximized. In each iteration, \( \Phi^* \) is updated with step size υ; after the \(k\)-th iteration, \( \Phi^*_k \) becomes

$$ \begin{aligned} \Phi^*_k &= \Phi^*_{k-1} + \upsilon \left. \frac{\partial I_{CSD}(\Psi^*; L)}{\partial \Phi^*} \right|_{\Phi^* = \Phi^*_{k-1}} \\ \frac{\partial I_{CSD}(\Psi^*; L)}{\partial \Phi^*} &= \sum_{c=1}^{C} \sum_{i=1}^{M_c} \frac{\partial I_{CSD}(\Psi^*; L)}{\partial \psi_c^{*i}}\, \frac{\partial \psi_c^{*i}}{\partial \Phi^*} = \sum_{c=1}^{C} \sum_{i=1}^{M_c} \big(F_c^i\big)^{\mathrm{T}}\, \frac{\partial I_{CSD}(\Psi^*; L)}{\partial \psi_c^{*i}} \end{aligned} $$
(15)

Once \( \Phi^* \) has been refined, we update the projection coefficients \( F^* \) and the low-dimensional discriminative CKD \( \Psi^* \) according to \( F^* = \Phi^{\dagger} \Psi \) and \( \Psi^* = \Phi^* F^* \), respectively. The boundedness of \( I_{CSD}(\Psi^*; L) \) guarantees the convergence of the codebook refinement.
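
A sketch of the refinement loop of Eqs. (14)-(15); `grad_icsd` stands for the (not shown) CSQMI gradient with respect to each low-dimensional CKD, and the fixed step size is an illustrative simplification of the update with step υ.

```python
import numpy as np

def refine_codebook(Phi_star, Psi, labels, grad_icsd, step=1e-3, n_iter=100):
    """Gradient-ascent codebook refinement (Eq. 15), re-solving the
    coefficients with the pseudo-inverse constraint of Eq. (14) each step."""
    for _ in range(n_iter):
        F_star = np.linalg.pinv(Phi_star) @ Psi          # F* = pinv(Phi*) Psi
        Psi_star = Phi_star @ F_star                     # current low-dim CKD
        dPsi = grad_icsd(Psi_star, labels)               # dI_CSD/dpsi, one column per sample
        Phi_star = Phi_star + step * (dPsi @ F_star.T)   # chain rule of Eq. (15)
    F_star = np.linalg.pinv(Phi_star) @ Psi
    return Phi_star, F_star
```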

4 Experiments

To verify the effectiveness of our method for object recognition and detection, we first investigate the performance of CSQMI based codebook learning on the extended YaleB face dataset [35]; we then test our model on Caltech-101 [36] and CIFAR-10 [37] for recognition, and on our own chicken feet dataset for detection. We also compare our results with other state-of-the-art work, including the original KDES [1], supervised kernel descriptors [4], hand-crafted dense SIFT features [7, 8], and popular deep feature learning approaches [44, 51–53].

4.1 Parameter Configuration

We use the code provided at www.cs.washington.edu/robotics/projects/kdes/ to implement the original KDES. For a fair comparison, in all experiments we follow the settings of [1] for the common parameters of our method, except for the final feature dimensionality. Specifically, basis vectors for \( \kappa_o \), \( \kappa_c \), and \( \kappa_s \) are sampled on 25, 5 × 5 × 5, and 5 × 5 uniform grids, respectively. For \( \kappa_{lbp} \), we use all 256 basis vectors. For all CKD, \( \kappa_{con} \) shares the same basis vectors as its attribute kernel \( \kappa_a \). We use a 3-level spatial pyramid (1 × 1, 2 × 2, and 4 × 4) for pooling CKD at different levels. Gaussian Parzen windows are used to estimate the CSQMI; the width parameter σ is tuned by grid search over the range \([0.01\sigma_d, 100\sigma_d]\), where \( \sigma_d \) is the median pairwise distance of the training samples, and the best width is selected by cross-validation. The optimal neighborhood distance parameter \(k\) is determined by grid search between 1 and 8. The linear SVM classifiers used in all experiments are implemented with LIBLINEAR (www.csie.ntu.edu.tw/~cjlin/liblinear/).
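
For illustration, the width grid can be generated as follows; the 9-point logarithmic grid and the helper name are our own choices.

```python
import numpy as np
from scipy.spatial.distance import pdist

def parzen_width_grid(train_ckd, num=9):
    """Candidate Parzen widths spanning [0.01 * sigma_d, 100 * sigma_d],
    where sigma_d is the median pairwise distance of the training samples."""
    sigma_d = np.median(pdist(train_ckd))
    return sigma_d * np.logspace(-2.0, 2.0, num=num)
```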

4.2 Evaluation of Codebook Learning

We first evaluate the discriminability of our CSQMI based codebook learning method by comparing it with other popular kernel based dimensionality reduction methods on the extended YaleB face dataset [35], which contains 16128 face images of 28 individuals. The dataset is challenging due to varying illumination conditions and expressions.

For each individual, half of the frontal face images are used to train the codebook and feature subset; the remaining frontal images are used to test the distinctiveness of the learned codebook. LBP_CKD is used to extract face features. We compare our codebook learning method against KPCA [14, 15], Kernel Fisher Discriminant Analysis (KFDA) [38], and Kernel Locality Preserving Projections (KLPP) [39]. For each method, as suggested in [1], a reduced 200-dimensional feature subset is learned. To visualize the results, we randomly select five subjects and, in Fig. 1, plot the distributions of the samples projected onto the three most significant feature dimensions produced by each method. As Fig. 1 shows, the clusters produced by our codebook exhibit markedly better class separation than those obtained from the alternative kernel based dimensionality reduction methods. This is because the feature subset derived from CSQMI captures the angular pattern of the cluster distribution of the face patterns being analyzed; it is consequently more discriminative than feature subsets selected from principal component vectors based only on eigenvalue magnitude, such as KPCA.

Fig. 1. Visualization of the leading 3-dimensional LBP_CKD features from different methods.

4.3 Evaluation of Object Recognition

Caltech-101: This dataset is one of the most popular benchmarks for multiclass image recognition. It contains 9144 images in 101 object categories plus a background category; each category has 31 to 800 images with significant color, pose, and lighting variations. We use this dataset for a comprehensive comparison of the recognition performance of the original KDES, supervised kernel descriptors (SKDES) [4], and our CKD. A 4-neighborhood, which achieves the best performance, is used to compute the context information for CKD. Following the experimental setup of the original KDES [1], for each category we train one-vs-all linear SVM classifiers on 30 images and test on no more than 80 images for KDES and our method. We run five rounds of testing for a confident evaluation. Results for SKDES are quoted from the original paper. Table 1 lists the average recognition accuracy and standard deviation of the different kernel descriptor variants, together with some recently reported results for comparison.

Table 1. Comparison of mean recognition accuracy (%) and standard deviation of KDES, SKDES and CKD on Caltech-101.

From Table 1, we observe that our CKD consistently outperforms KDES and SKDES, in both the individual and the combined versions. Except for the gradient CKD (G_CKD), both the color CKD (C_CKD) and the texture CKD (LBP_CKD) are significantly better than their original KDES counterparts: compared with the original color and texture KDES, the recognition accuracy of C_CKD and LBP_CKD increases by 62.97% and 5.69%, respectively. For the combined version, the accuracy of the combined CKD is 83.3%, which is 6.9% higher than the original KDES combination and 4.1% higher than the SKDES combination. We also note the smaller standard deviation of recognition accuracy in our results compared with SKDES, which indicates that CKD is more robust than SKDES thanks to the spatial co-occurrence constraints embedded in the CKD. We attribute the performance improvement of CKD to two factors: (1) compared with KDES and SKDES, the additional spatial co-occurrence constraint in CKD improves its robustness to semantic ambiguity caused by missing features under partial occlusion; (2) KDES applies KPCA to reduce feature dimensionality, whereas we use CSQMI to learn low-dimensional CKD. KPCA keeps only the KDES components that contribute most to image reconstruction; in contrast, our CSQMI criterion selects the CKD that minimize information redundancy and approximation error while maximizing the mutual information between the CKD and their class labels in terms of the 'angle distance'. The resulting low-dimensional CKD are therefore more discriminative than KDES, as they reveal the cluster structure of the density distribution of pixel attributes and relate to the angular manifold of the object category.

To investigate the impact of codebook size on recognition performance, we train classifiers with different codebook sizes and compare the recognition accuracy of the combined CKD (COM_CKD) with that of the combined KDES (COM_KDES) in Fig. 2(a). As expected, COM_CKD outperforms COM_KDES consistently over all codebook sizes. We also note a relatively small performance drop (14%) for COM_CKD when the codebook size decreases from 500 to 50, whereas for COM_KDES the drop is 26%. This verifies the effectiveness of our codebook learning model, which can select a discriminative CKD codebook even at low dimensionality. We also compare the recognition performance of CKD under different neighborhood distances. As shown in Fig. 2(b), neighborhoods of moderate distance perform better than small ones, and accuracy tends to decrease for large neighborhood distances. This is explained by the fact that the discriminability of CKD tends to be smoothed out as more noise and outliers are included when the neighborhood distance grows.

Fig. 2. Performance comparison at different codebook sizes and neighborhood distances on Caltech-101.

CIFAR-10: This dataset consists of 60000 tiny images of 32 × 32 pixels in 10 categories, with 5000 training images and 1000 test images per category. We choose this dataset to test the performance of our method on tiny-object recognition. Similar to [1], we compute CKD on 8 × 8 image patches over a dense grid with a spacing of 2 pixels. A 3-neighborhood, which gives the best performance, is used to compute the CKD. The training images are split into a 10,000/40,000 training/validation set, and the validation set is used to optimize the kernel parameters \( \gamma_s \), \( \gamma_o \), \( \gamma_c \), and \( \gamma_{lbp} \) by grid search. Finally, a linear SVM classifier is trained on the whole training set with the optimized kernel parameters.

We compare the performance of COM_CKD with several recent feature learning approaches based on deep learning (stochastic-pooling Deep Convolutional Neural Network, spDCNN [52]; tiled Convolutional Neural Networks, tCNN [53]; Multi-column Deep Neural Networks, MDNN [51]), sparse coding (improved Local Coordinate Coding, iLCC [54]; spike-and-slab Sparse Coding, ssSC [55]), the hierarchical kernel descriptor (HKDES) [2], and spatial pyramid dense SIFT (SPM_SIFT) [7]. For SPM_SIFT, we use a 3-layer spatial pyramid and compute dense SIFT features on 8 × 8 patches over a regular grid with a spacing of 2 pixels. Table 2 reports the recognition accuracy of the various methods. COM_CKD and MDNN beat the other methods by a large margin. Compared with MDNN, COM_CKD achieves comparable performance, with only a 0.37% deficit in classification rate; however, our method is much simpler and more efficient than the MDNN model. For example, for a 32 × 32 image, our method takes on average 224.63 ms to compute the full-dimensional 3-neighborhood COM_CKD and 320.21 s to learn a 200-dimensional discriminative codebook using CSQMI, on a platform with an Intel Core i7 2.7 GHz CPU and 16 GB RAM. By merging different pixel attributes in kernel space, CKD turns low-level complementary cues into image-level discriminative descriptors. Even coupled with a simple linear SVM classifier, our method achieves performance superior to other, more sophisticated models.

Table 2. Comparison of recognition accuracy (%) of various methods on CIFAR-10.

To further analyze the classification performance of our method, we visualize the confusion matrix in Fig. 3. It shows that COM_CKD clearly distinguishes animals from rigid artifacts, except for planes and birds. This is understandable, since flying birds look very similar to planes (as shown in Fig. 4), especially in low-resolution images. Owing to the non-rigid, deformable nature of articulated objects, we also observe many confusions among the animal classes. Among them, the frog class receives the highest false positive rate (18.07%) from other animal classes, but it has very few false negatives. As expected, car and truck are the most frequently confused artifact classes, collectively causing a classification error rate of 8.78%, while cat and dog are the most frequently confused animal classes, collectively causing an error rate of 11.24%.

Fig. 3. Confusion matrix for CIFAR-10 using COM_CKD. The vertical axis shows the ground-truth labels and the horizontal axis the predicted labels.

Fig. 4. Some wrongly classified samples between plane and bird.

4.4 Evaluation of Object Detection

To adapt our method for object detection, we train a two-class linear SVM classifier on COM_CKD features as the detector. Given a test image, we decompose it into several scales and detect candidate object locations with a sliding window at each scale; we then merge the detections across scales and remove duplicate detections at the same location. We test the detector on a chicken feet dataset collected in a chicken slaughterhouse, where the goal is to find and localize chicken feet. As illustrated in Fig. 6, this dataset is very challenging: chicken feet are very small compared with the rest of the body; usually more than forty chickens are squeezed into one box; multiple feet may appear in one image; feet are often severely occluded (largely hidden under feathers); the appearance of feet changes drastically with pose; and the color of the feet is very similar to that of the feathers and chest.
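
A sketch of this detection pipeline; `com_ckd` and `nms` stand for the feature extractor and duplicate-removal step, and the window size, stride, and scale set shown here are illustrative rather than the values used in our experiments.

```python
import numpy as np
from skimage.transform import rescale  # any image-pyramid routine works

def detect(image, clf, com_ckd, nms, win=32, stride=8, scales=(1.0, 0.75, 0.5)):
    """Multi-scale sliding-window detection: score every window with the
    two-class linear SVM on COM_CKD features, then merge duplicates across
    scales. `com_ckd` and `nms` are caller-supplied (hypothetical) helpers."""
    boxes = []
    for s in scales:
        img = rescale(image, s)
        for y in range(0, img.shape[0] - win + 1, stride):
            for x in range(0, img.shape[1] - win + 1, stride):
                score = clf.decision_function([com_ckd(img[y:y+win, x:x+win])])[0]
                if score > 0:                      # positive SVM margin
                    boxes.append((x / s, y / s, (x + win) / s, (y + win) / s, score))
    return nms(boxes)                              # remove duplicate detections
```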

We crop a total of 717 image patches containing chicken feet as positive training examples and 2000 patches without chicken feet as negative training examples. Another set of 318 images, containing chicken feet patches that never occur in the training set, is used as the test set. Since chicken feet are also tiny, we use the same patch size and sampling grid as for the CIFAR-10 dataset to compute CKD. The parameters of CKD and the SVM are tuned by 10-fold cross-validation on the training set. To judge the correctness of detections, we adopt the PASCAL Challenge criterion [56]: a detection is considered correct only if the predicted bounding box overlaps the ground-truth bounding box by at least 50%; all other detections of the same object are counted as false positives. We compare the detection performance of our model with that of the HKDES model [2] and a 3-level SPM_SIFT [7] in terms of the Equal Error Rate (EER) on the Precision-Recall (PR) curves, i.e., the PR-EER: the point on the PR curve where the recall rate equals the precision rate.
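
In code, this criterion amounts to an intersection-over-union test; the box format and threshold handling below are illustrative.

```python
def is_correct_detection(pred, gt, thresh=0.5):
    """PASCAL-style criterion [56]: intersection-over-union of predicted and
    ground-truth boxes (x1, y1, x2, y2) must reach `thresh`."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1]) +
             (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    return inter / union >= thresh
```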

Figure 5 plots the Precision-Recall curves of all methods. Among all tested models, COM_CKD achieves the best overall performance (EER = 78.53%), followed by the HKDES model (EER = 75.61%), which combines gradient, color, and shape cues in KDES. This further confirms that merging different visual cues into the object representation can significantly boost classifier performance. One interesting observation is that, except for C_CKD, our single-CKD models outperform the sophisticated SIFT method: the EERs of the LBP_CKD and G_CKD models are 71.23% and 69.55%, respectively, whereas the EER of SPM_SIFT is only 59.41%. Among the individual CKD, C_CKD gives the worst result (EER = 44.10%), while both LBP_CKD and G_CKD perform well, with LBP_CKD achieving a slightly better average accuracy. This is not surprising: the color difference between chicken feet and the other parts (feathers and chest) is marginal (see Fig. 6), and their color distributions overlap heavily; in particular, the color distributions of feet and chest can hardly support an acceptable separation based on the color cue alone. In contrast, feet show a moderate difference in texture structure from feathers and chest, so the texture based LBP_CKD outperforms the other single features on this dataset. Figure 6 shows some detection examples produced by the best feature, COM_CKD. Owing to the shadows cast by the box boundary and to severe occlusion, some small chicken feet under the box shadow (left images) or hidden by feathers (right images) are missed by the detector, giving false negative detections; however, no false positive detections appear in these images.

Fig. 5. Precision-Recall curves of all methods tested on the chicken feet dataset.

Fig. 6. Detection examples resulting from the COM_CKD feature.

5 Conclusion

Based on context cues and the Rényi quadratic entropy based CSQMI, we have proposed a set of novel kernel descriptors called context kernel descriptors, together with an information theoretic measure for selecting a compact and discriminative codebook for object representation in kernel feature space. We evaluated the performance of our algorithm on object recognition and detection. The highlights of our work are: (1) the new CKD enhances the original KDES by adding extra spatial co-occurrence constraints that reduce the mismatching of image attributes (features) in kernel space; (2) instead of the traditional KPCA for feature dimensionality reduction, our method employs a CSQMI criterion to learn a subset of low-dimensional discriminative CKD that reflect the cluster structure of the density distribution of the CKD. Evaluation results on popular benchmarks and on our own dataset show the effectiveness of our method for generic (especially tiny) object recognition and detection.