1 Introduction

This paper deals with structured learning problems, in which we learn a function \(f:\mathcal{X} \rightarrow \mathcal{Y}\) whose inputs and outputs are structured objects such as sequences, trees, bounding boxes, or strings. Structured learning arises in many real-world applications, including multi-label classification, natural language parsing, and object detection. Conditional random fields [5, 6], maximum margin Markov networks [9], and structured output support vector machines (SSVM) [10] have been developed as powerful tools for predicting structured data. These methods share a common approach: they define a linear scoring function based on a joint feature map over inputs and outputs. However, they have some drawbacks. On the one hand, applying them requires accurately labeled training sets, and experiments show that incorrect or incomplete labels can degrade their performance. On the other hand, training these models is computationally costly, so solving large-scale problems is difficult or infeasible except for some special output structures.

To overcome these drawbacks, a method called Joint Kernel Support Estimation (JKSE) was proposed in [7]. JKSE is a generative method: it learns the support of the joint probability density of inputs and outputs, which makes it robust to mislabeled data. At the same time, its optimization problem reduces to a one-class SVM and is therefore convex and efficiently solvable. However, JKSE is not as powerful as SSVM [2]. We therefore focus on the following question: how can the performance of JKSE be improved? To answer it, we introduce privileged information into JKSE.

Privileged information [11] is useful high-level knowledge that is available only at training time. For example, in object detection, such information includes an object's parts, attributes, and segmentations. More reliable models can be learned by incorporating this high-level information into SVM, SSVM, and one-class SVM [3, 4, 8, 11].

In this paper, we propose a new method called JKSE+, which extends JKSE with privileged information, and apply it to the problem of object detection. Experiments show that JKSE+ performs better than JKSE.

The rest of this paper is organized as follows. We review JKSE in Sect. 2, introduce our new method JKSE+ in Sect. 3, present experimental results in Sect. 4, and conclude in Sect. 5.

2 Related Work

This section considers the following structured learning problem: given a training set \(\left\{ {\left( {{x_1},\ {y_1}}\right) \!,\ ...,\ \left( {{x_l},\ {y_l}} \right) } \right\} \), where \({x_i}\in \mathcal{X}\), \({y_i} \in \mathcal{Y}\), and \(\mathcal{X}\) and \(\mathcal{Y}\) are the structured input and output spaces respectively, assume that the input-output pairs \(\left( {x,y} \right) \) follow a joint probability distribution \(p\left( {x,y} \right) \). Our goal is to learn a mapping \(g:\mathcal{X} \rightarrow \mathcal{Y}\) such that for a new input \({x} \in \mathcal{X}\), the corresponding label \({y} \in \mathcal{Y}\) is determined by maximizing the posterior probability \(p\left( {y|x} \right) \).

A discriminative method directly models the conditional distribution \(p\left( {y|x} \right) \), whereas a generative method models the joint distribution \(p\left( {x,y} \right) \). For prediction the two are equivalent, i.e. \(\mathop {\arg \max }\limits _{y \in \mathcal{Y}} p\left( {y|x} \right) = \mathop {\arg \max }\limits _{y \in \mathcal{Y}} p\left( {x,y} \right) \) for any \(x \in \mathcal{X}\). JKSE is a generative method. Suppose that \(p\left( {x,y} \right) = \frac{1}{Z}\exp \left( {\left\langle {w,\varPhi \left( {x,y} \right) } \right\rangle } \right) \), where \(Z \equiv \sum \nolimits _{x,y} {\exp \left( {\left\langle {w,\varPhi \left( {x,y} \right) } \right\rangle } \right) }\) is a normalization constant that can be ignored during both training and testing. JKSE then turns the task of estimating the joint distribution \(p\left( {x,y} \right) \) into a one-class SVM problem.

In the training phase, JKSE solves the following problem:

$$\begin{aligned} \begin{array}{*{20}{l}} {\mathop {\min }\limits _{w,\xi ,\rho } \frac{1}{2}\parallel w{\parallel ^2} + \frac{1}{{vl}}\sum \limits _{i = 1}^l {{\xi _i} - \rho } }\\ \begin{array}{l} s.t. \; \left\langle {w,\varPhi \left( {{x_i},{y_i}} \right) } \right\rangle \ge \rho - {\xi _i},\quad i = 1,2,...,l,\quad \\ \qquad {\xi _i} \ge 0, \quad i = 1,2,...,l. \end{array} \end{array} \end{aligned}$$
(1)

To obtain its solution, JKSE solves the dual problem:

$$\begin{aligned} \begin{array}{*{20}{l}} {\mathop {\min }\limits _\alpha \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {{\alpha _i}{\alpha _j}K\left( {\left( {{x_i},{y_i}} \right) ,\left( {{x_j},{y_j}} \right) } \right) } } }\\ \begin{array}{l} s.t.\quad \mathrm{{0}} \le {\alpha _i} \le \frac{1}{{vl}}, \quad i = 1,...,l,\\ \qquad \sum \limits _{i = 1}^l {{\alpha _i} = 1.} \end{array} \end{array} \end{aligned}$$
(2)

where \(K\left( {\left( {x,y} \right) ,\left( {x',y'} \right) } \right) \equiv \left\langle {\varPhi \left( {x,y} \right) ,\varPhi \left( {x',y'} \right) } \right\rangle \) is a joint feature kernel function. If \({\alpha ^*}\) is the solution to the above problem (2), then the solution to the primal problem (1) for w is given as follows:

$$\begin{aligned} {w^*} = \sum \limits _{i = 1}^l {{\alpha _i ^*}\varPhi \left( {{x_i},{y_i}} \right) }. \end{aligned}$$
(3)

Furthermore, in the inference step, for a new input \({x} \in \mathcal{X}\), the corresponding label y is given by:

$$\begin{aligned} y = \mathop {\arg \max }\limits _{y \in \mathcal {Y}} \sum \limits _{i = 1}^l {{\alpha _i ^*}K\left( {\left( {{x_i},{y_i}}\right) \!,\left( {x,y} \right) } \right) }. \end{aligned}$$
(4)
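For concreteness, the following sketch shows how the training problem (2) and the inference rule (4) could be implemented. It assumes a user-supplied joint kernel over input-output pairs and a finite list of candidate outputs for the arg max; the use of cvxpy and the helper names are illustrative assumptions, not part of the original method.

```python
# A minimal sketch of JKSE training via the dual (2) and inference via (4).
# Assumptions: `joint_kernel((x, y), (x2, y2)) -> float` is supplied by the
# user, and inference searches a finite list of candidate outputs.
import numpy as np
import cvxpy as cp

def train_jkse(pairs, joint_kernel, nu):
    """pairs: list of training pairs (x_i, y_i); returns the dual variables alpha."""
    l = len(pairs)
    K = np.array([[joint_kernel(p, q) for q in pairs] for p in pairs])
    alpha = cp.Variable(l)
    # psd_wrap tells cvxpy that the Gram matrix is positive semidefinite
    objective = cp.Minimize(cp.quad_form(alpha, cp.psd_wrap(K)))
    constraints = [alpha >= 0, alpha <= 1.0 / (nu * l), cp.sum(alpha) == 1]
    cp.Problem(objective, constraints).solve()
    return alpha.value

def predict_jkse(x, candidates, pairs, alpha, joint_kernel):
    """Return the candidate y maximizing sum_i alpha_i K((x_i, y_i), (x, y))."""
    scores = [sum(a * joint_kernel(p, (x, y)) for a, p in zip(alpha, pairs))
              for y in candidates]
    return candidates[int(np.argmax(scores))]
```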

3 JKSE+

Assume that we have privileged information \(\left( {x_1^*,x_2^*,...,x_l^*} \right) \in \mathcal{X^*}\) that is available only in the training phase and not in the test phase. We now consider the following privileged structured learning problem:

Given a training set \(T = \left\{ {\left( {{x_1},x_1^*,{y_1}} \right) ,...,\left( {{x_l},x_l^*,{y_l}} \right) } \right\} \) where \({x_i} \in \mathcal {X}\), \({x_i^*} \in \mathcal{X^*}\), \({y_i} \in \mathcal {Y}\), \(i = 1,...,l\), our goal is to find a mapping \(g:\mathcal{X} \rightarrow \mathcal{Y}\) such that the label y for any x can be predicted by \(y = g\left( x \right) \).

We now discuss how privileged information can be incorporated into the JKSE framework. Suppose that there exists a best but unknown decision rule \(y = \mathop {\arg \max }\limits _{y \in \mathcal Y} \left\langle {{w_0},\varPhi \left( {x,y} \right) } \right\rangle \). The function \(\xi \left( x \right) \) of the input x is defined as follows:

$$\begin{aligned} {\xi ^0} = \xi \left( x \right) = {\left[ {\rho - \left\langle {{w_0},\varPhi \left( {x,y} \right) } \right\rangle } \right] _ + } \end{aligned}$$

where \({\left[ \eta \right] _ + } = \left\{ {\begin{array}{*{20}{c}} {\eta ,\quad if \quad \eta \ge 0,}\\ {0,\quad otherwise.} \end{array}} \right. \) If we knew the value of \(\xi \left( x \right) \) on each input \({x_i}\left( {i = 1,...,l} \right) \), i.e. if we knew the triplets \(\left( {{x_i},\xi _i^0,{y_i}} \right) \) with \(\xi _i^0 = \xi \left( {{x_i}} \right) ,i = 1,...,l\), we could obtain an improved predictor. In reality, however, these values are unavailable, so we approximate \(\xi \left( x \right) \) by a correcting function. Similarly to the one-class SVM with privileged information in [3], we replace \({\xi _i}\) by a mixture of the values of the correcting function \(\psi \left( {x_i^*} \right) = \left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,y_i} \right) } \right\rangle + {b^*}\) and slack values \({\zeta _i}\), which yields the primal problem of JKSE+:

$$\begin{aligned} \begin{array}{l} \mathop {\min }\limits _{w,\mathrm{{ }}{w^*}\mathrm{{, }}{b^*}\mathrm{{, }}\rho \mathrm{{,}}\zeta } \frac{{vl}}{2}\parallel w{\parallel ^2} + \frac{\gamma }{2}\parallel {w^*}{\parallel ^2} - vl\rho + \sum \limits _{i = 1}^l {\left[ {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*} + {\zeta _i}} \right] } \\ s.t. \quad \left\langle {w,\varPhi \left( {{x_i},{y_i}} \right) } \right\rangle \ge \rho - \left( {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*}} \right) , \quad i=1,...,l,\\ \qquad \;\,\mathrm{{ }}\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*} + {\zeta _i} \ge 0,\mathrm{{ }}{\zeta _i} \ge 0,\quad i=1,...,l. \end{array} \end{aligned}$$
(5)

The Lagrange function for this problem is:

$$\begin{aligned}&L\left( {w,{w^*},{b^*},\rho ,\zeta ,\mu ,\alpha ,\beta } \right) = \frac{{vl}}{2}\parallel w{\parallel ^2} + \frac{\gamma }{2}\parallel {w^*}{\parallel ^2} - vl\rho \nonumber \\&+ \sum \limits _{i = 1}^l {\left[ {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*} + {\zeta _i}} \right] } \nonumber \\&{ - \sum \limits _{i = 1}^l {{\mu _i}{\zeta _i}} - \sum \limits _{i = 1}^l {{\alpha _i}\left[ {\left\langle {w,\varPhi \left( {{x_i},{y_i}} \right) } \right\rangle - \rho + \left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*}} \right] } }\nonumber \\&{ - \sum \limits _{i = 1}^l {{\beta _i}\left[ {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*} + {\zeta _i}} \right] } } \end{aligned}$$
(6)

The KKT conditions are as follows:

$$\begin{aligned} {\nabla _w}L = vlw - \sum \limits _{i = 1}^l {{\alpha _i}\varPhi \left( {{x_i},{y_i}} \right) = 0},\qquad \qquad \qquad \quad \,\,\end{aligned}$$
(7)
$$\begin{aligned} {{\nabla _{{w^*}}}L = \gamma {w^*} + \sum \limits _{i = 1}^l {{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } - \sum \limits _{i = 1}^l {{\alpha _i}{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } - \sum \limits _{i = 1}^l {{\beta _i}{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } = 0},\end{aligned}$$
(8)
$$\begin{aligned} \frac{{\partial L}}{{\partial {b^*}}} = l - \sum \limits _{i = 1}^l {{\alpha _i} - \sum \limits _{i = 1}^l {{\beta _i} = 0} },\qquad \qquad \qquad \quad \quad \end{aligned}$$
(9)
$$\begin{aligned} \frac{{\partial L}}{{\partial \rho }} = - vl + \sum \limits _{i = 1}^l {{\alpha _i}} = 0,\qquad \qquad \qquad \qquad \quad \,\,\end{aligned}$$
(10)
$$\begin{aligned} \frac{{\partial L}}{{\partial {\zeta _i}}} = 1 - {\beta _i} - {\mu _i} = 0, i=1,...,l,\qquad \qquad \qquad \quad \,\,\,\end{aligned}$$
(11)
$$\begin{aligned} \rho - \left( {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*}} \right) - \left\langle {w,\varPhi \left( {{x_i},{y_i}} \right) } \right\rangle \le 0, i=1,...,l,\,\,\qquad \end{aligned}$$
(12)
$$\begin{aligned} - \left( {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*} + {\zeta _i}} \right) \le 0, i = 1,...,l,\qquad \qquad \quad \,\,\,\end{aligned}$$
(13)
$$\begin{aligned} - {\zeta _i} \le 0,i = 1,...,l,\qquad \qquad \qquad \qquad \quad \quad \quad \end{aligned}$$
(14)
$$\begin{aligned} {\alpha _i}\left[ {\rho - \left( {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*}} \right) - \left\langle {w,\varPhi \left( {{x_i},{y_i}} \right) } \right\rangle } \right] = 0, i = 1,...,l,\quad \,\,\,\,\,\end{aligned}$$
(15)
$$\begin{aligned} {\beta _i}\left[ {\left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle + {b^*} + {\zeta _i}} \right] = 0, i = 1,...,l,\qquad \qquad \quad \end{aligned}$$
(16)
$$\begin{aligned} {\mu _i}{\zeta _i} = 0, i = 1,...,l,\qquad \qquad \qquad \qquad \qquad \,\,\,\end{aligned}$$
(17)
$$\begin{aligned} {\alpha _i} \ge 0,{\beta _i} \ge 0,{\mu _i} \ge 0, i = 1,...,l.\qquad \qquad \qquad \quad \,\,\,\, \end{aligned}$$
(18)

From the above KKT conditions, after setting \({\delta _i} = 1 - {\beta _i}\), we obtain

$$\begin{aligned} w = \frac{1}{{vl}}\sum \limits _{i = 1}^l {{\alpha _i}\varPhi \left( {{x_i},{y_i}} \right) } , \qquad \; \end{aligned}$$
(19)
$$\begin{aligned} {w^*} = \frac{1}{\gamma }\sum \limits _{i = 1}^l {\left( {{\alpha _i} - {\delta _i}} \right) {\varPhi ^*}\left( {x_i^*,{y_i}} \right) },\,\,\end{aligned}$$
(20)
$$\begin{aligned} \sum \limits _{i = 1}^l {{\delta _i} = \sum \limits _{i = 1}^l {{\alpha _i} = vl} }, \qquad \,\,\, \end{aligned}$$
(21)
$$\begin{aligned} \mathrm{{0}} \le {\delta _i} \le 1, i = 1,...,l. \qquad \,\; \end{aligned}$$
(22)
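For completeness, the step from the KKT conditions to the dual is the standard substitution: plugging (19) and (20) into the Lagrangian (6) collapses the quadratic terms to kernel expressions,

$$\begin{aligned} \frac{{vl}}{2}\parallel w{\parallel ^2} - \sum \limits _{i = 1}^l {\alpha _i}\left\langle {w,\varPhi \left( {{x_i},{y_i}} \right) } \right\rangle&= - \frac{1}{{2vl}}\sum \limits _{i = 1}^l \sum \limits _{j = 1}^l {\alpha _i}{\alpha _j}K\left( {\left( {{x_i},{y_i}} \right) ,\left( {{x_j},{y_j}} \right) } \right) ,\\ \frac{\gamma }{2}\parallel {w^*}{\parallel ^2} + \sum \limits _{i = 1}^l \left( {1 - {\alpha _i} - {\beta _i}} \right) \left\langle {{w^*},{\varPhi ^*}\left( {x_i^*,{y_i}} \right) } \right\rangle&= - \frac{1}{{2\gamma }}\sum \limits _{i = 1}^l \sum \limits _{j = 1}^l \left( {{\alpha _i} - {\delta _i}} \right) \left( {{\alpha _j} - {\delta _j}} \right) {K^*}\left( {\left( {x_i^*,{y_i}} \right) ,\left( {x_j^*,{y_j}} \right) } \right) , \end{aligned}$$

while the terms involving \(\rho \), \({b^*}\), and \(\zeta \) vanish by (9), (10), and (11).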

Therefore, the dual problem is as follows:

$$\begin{aligned} \begin{array}{l} \mathop {\max }\limits _{\alpha ,\delta } - \frac{1}{{2vl}}\sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {{\alpha _i}{\alpha _j}} K\left( {\left( {{x_i},{y_i}} \right) ,\left( {{x_j},{y_j}} \right) } \right) } \\ \qquad - \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {\frac{1}{{2\gamma }}\left( {{\alpha _i} - {\delta _i}} \right) {K^*}\left( {\left( {x_i^*,{y_i}} \right) ,\left( {x_j^*,{y_j}} \right) } \right) \left( {{\alpha _j} - {\delta _j}} \right) } } \\ s.t.\quad \mathrm{{ }}\sum \limits _{i = 1}^l {{\alpha _i} = vl,\quad \mathrm{{ }}{\alpha _i} \ge 0},\\ \qquad \; \mathrm{{ }}\sum \limits _{i = 1}^l {{\delta _i}} = vl, \quad \mathrm{{ 0}} \le {\delta _i} \le 1. \end{array} \end{aligned}$$
(23)

Here \({K\left( {\left( {{x_i},{y_i}} \right) \!,\left( {{x_j},{y_j}} \right) } \right) }\) and \({{K^*}\left( {\left( {x_i^*,{y_i}} \right) \!,\left( {x_j^*,{y_j}} \right) } \right) }\) replace the inner products \(\left\langle {\varPhi \left( {{x_i},{y_i}} \right) \!,\varPhi \left( {{x_j},{y_j}} \right) } \right\rangle \) and \(\left\langle {{\varPhi ^*}\left( {x_i^*,{y_i}} \right) \!,{\varPhi ^*}\left( {x_j^*,{y_j}} \right) } \right\rangle \). The model's decision function is \(f\left( x,y \right) = \sum \limits _{i = 1}^l {{\alpha _i}K\left( {\left( {{x_i},{y_i}} \right) ,\left( {x,y} \right) } \right) }\).
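The dual (23) is a convex quadratic program in \(\left( \alpha ,\delta \right) \) and could be solved with an off-the-shelf solver; the following sketch uses cvxpy and assumes the two kernel matrices have already been precomputed (the names are illustrative, not from the original paper).

```python
# A minimal sketch of solving the JKSE+ dual (23), assuming precomputed
# kernel matrices K (over (x_i, y_i)) and K_star (over (x_i^*, y_i)).
import cvxpy as cp

def train_jkse_plus(K, K_star, nu, gamma):
    l = K.shape[0]
    alpha, delta = cp.Variable(l), cp.Variable(l)
    objective = cp.Maximize(
        -cp.quad_form(alpha, cp.psd_wrap(K)) / (2 * nu * l)
        - cp.quad_form(alpha - delta, cp.psd_wrap(K_star)) / (2 * gamma))
    constraints = [cp.sum(alpha) == nu * l, alpha >= 0,
                   cp.sum(delta) == nu * l, delta >= 0, delta <= 1]
    cp.Problem(objective, constraints).solve()
    return alpha.value, delta.value
```

Note that \(\delta \) and the privileged kernel \(K^*\) influence the solution only during training; only \(\alpha \) appears in the decision function, which is consistent with privileged information being unavailable at test time.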

The label of a new input x can then be predicted within the JKSE framework as

$$\begin{aligned} y=g\left( x \right) = \mathop {\arg \max }\limits _{y \in \mathcal{Y}} f\left( {x,y} \right) = \mathop {\arg \max }\limits _{y \in \mathcal{Y}} \sum \limits _{i = 1}^l {{\alpha _i}K\left( {\left( {{x_i},{y_i}} \right) ,\left( {x,y} \right) } \right) }. \end{aligned}$$
(24)

Here, the function \(f\left( {x,y} \right) \) acts as a matching score. In object detection, for example, the larger the overlap between the object and a candidate bounding box, the larger the value of the function. We therefore output the y that maximizes \(f\left( {x,y} \right) \).
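In object detection this arg max is typically carried out over a finite pool of candidate boxes. The sketch below scores each candidate with \(f\left( {x,y} \right) \); the helpers candidate_boxes, box_histogram, and chi2_kernel are assumptions introduced only for illustration.

```python
# Score candidate bounding boxes with f(x, y) from (24) and keep the best one.
import numpy as np

def detect(image, candidate_boxes, train_feats, alpha, box_histogram, chi2_kernel):
    """train_feats[i] is the joint feature of the i-th training pair (x_i, y_i)."""
    best_box, best_score = None, -np.inf
    for box in candidate_boxes:
        h = box_histogram(image, box)  # plays the role of Phi(x, y)
        score = float(np.dot(alpha, [chi2_kernel(t, h) for t in train_feats]))
        if score > best_score:
            best_box, best_score = box, score
    return best_box
```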

Our new algorithm JKSE+ is given as follows:

Algorithm 1

  1. Given a training set \(T = \left\{ {\left( {{x_1},x_1^*,{y_1}} \right) ,...,\left( {{x_l},x_l^*,{y_l}} \right) } \right\} \) where \({x_i} \in \mathcal {X}\), \({x_i^*} \in \mathcal{X^*}\), \({y_i} \in \mathcal {Y}\), \(i = 1,...,l\);

  2. Choose appropriate kernel functions \(K\left( {u,v} \right) \) and \({K^*}\left( {u',v'} \right) \), and penalty parameters \({v> 0,\gamma > 0}\);

  3. Construct and solve the convex quadratic programming problem:

    $$\begin{aligned} \begin{array}{l} \mathop {\max }\limits _{\alpha ,\delta } - \frac{1}{{2vl}}\sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {{\alpha _i}{\alpha _j}} K\left( {\left( {{x_i},{y_i}} \right) \!,\left( {{x_j},{y_j}} \right) } \right) } \\ \qquad - \sum \limits _{i = 1}^l {\sum \limits _{j = 1}^l {\frac{1}{{2\gamma }}\left( {{\alpha _i} - {\delta _i}} \right) {K^*}\left( {\left( {x_i^*,{y_i}} \right) \!,\left( {x_j^*,{y_j}} \right) } \right) \left( {{\alpha _j} - {\delta _j}} \right) } } \\ s.t.\quad \mathrm{{ }}\sum \limits _{i = 1}^l {{\alpha _i} = vl,\quad \mathrm{{ }}{\alpha _i} \ge 0}, \\ \qquad \; \mathrm{{ }}\sum \limits _{i = 1}^l {{\delta _i}} = vl, \quad \mathrm{{ 0}} \le {\delta _i} \le 1. \end{array} \end{aligned}$$

    and obtain the solution \({\left( {{\alpha ^*},{\delta ^*}} \right) = \left( {\alpha _1^*,...,\alpha _l^*,\delta _1^*,...,\delta _l^*} \right) }\);

  4. Construct the decision function:

    $$\begin{aligned} y = g\left( x \right) = \mathop {\arg \max }\limits _{y \in \mathcal Y} f\left( {x,y} \right) = \mathop {\arg \max }\limits _{y \in \mathcal Y} \sum \limits _{i = 1}^l {\alpha _i^*K\left( {\left( {{x_i},{y_i}} \right) \!,\left( {x,y} \right) } \right) }. \end{aligned}$$

4 Experiments

In this section, we apply our new method to the problem of object detection. Given a set of pictures, we want to learn a mapping \(g:\mathcal X \rightarrow \mathcal Y\) that, for an input picture, returns the object's position in it. This is a typical structured learning problem and can be solved by our new method.

4.1 Dataset

We use the Caltech-UCSD Birds 2011 (CUB-2011) dataset [12] to evaluate our algorithm. This dataset contains two hundred species of birds, each with sixty pictures. Each picture contains only one bird, whose position is indicated by a bounding box. In addition, the dataset provides privileged information: for each image, the bird's attributes described as a 312-dimensional vector, and a segmentation mask.

4.2 Features and Privileged Information

Our feature descriptor adopts the bag-of-visual-words model based on SURF descriptors [1]. We use the attribute information and the segmentation masks as privileged information. For the segmentation masks we apply the same feature extraction strategy as for the original images, i.e. a SURF-based bag-of-visual-words descriptor. The feature space of the privileged information carries more information than that of the original image, so the object's location in the image can be detected more accurately.

We select 50 pictures as the training set and 10 pictures as the test set. The original visual feature descriptor has 200 dimensions. The attribute information is a 312-dimensional vector whose entries are binary. From the segmentation masks we extract 500-dimensional feature descriptors with the same bag-of-visual-words model as for the original pictures, so the privileged information is an 812-dimensional vector.
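A rough sketch of this feature pipeline is given below. It assumes OpenCV's contrib SURF implementation and scikit-learn's KMeans are available; the vocabulary sizes (200 words for images, 500 for masks) follow the numbers above.

```python
# Bag-of-visual-words features from SURF descriptors (illustrative sketch).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def surf_descriptors(gray_image):
    surf = cv2.xfeatures2d.SURF_create()  # requires opencv-contrib-python
    _, desc = surf.detectAndCompute(gray_image, None)
    return desc if desc is not None else np.empty((0, 64))

def build_vocabulary(descriptor_list, n_words):
    # cluster all training descriptors into n_words visual words
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptor_list))

def bovw_histogram(descriptors, vocabulary):
    words = vocabulary.predict(descriptors)
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalized histogram
```

Under these assumptions, the 812-dimensional privileged vector is the concatenation of the 312-dimensional attribute vector and the 500-dimensional mask histogram.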

As Fig. 1 shows, more feature descriptors can be extracted from the segmentation masks, which helps to improve the overlap ratio of object detection.

Fig. 1. The picture on the left shows the feature descriptors of the original picture; the picture on the right shows the feature descriptors of the segmentation mask, which is used as privileged information during training.

Table 1. Dataset
Table 2. Overlap ratio of Object Detection

4.3 Kernel Function

We use the following version of the chi-square kernel function \(\left( {{\chi ^2} - \mathrm{{kernel}}} \right) \):

$$\begin{aligned} K\left( {u,v} \right) = {K^*}\left( {u,v} \right) = {e^{ - \theta \sum \limits _{i = 1}^n {\frac{{{{\left( {{u_i} - {v_i}} \right) }^2}}}{{{u_i} + {v_i}}}} }},u \in {R^n},v \in {R^n}. \end{aligned}$$

This kernel is most commonly applied to histograms generated by the bag-of-visual-words model in computer vision [13].
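A direct implementation of this kernel is straightforward; the small epsilon below, which guards against empty histogram bins, is an added assumption rather than part of the formula.

```python
import numpy as np

def chi2_kernel(u, v, theta=1.0, eps=1e-10):
    """Exponential chi-square kernel exp(-theta * sum((u_i - v_i)^2 / (u_i + v_i)))."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.exp(-theta * np.sum((u - v) ** 2 / (u + v + eps))))
```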

4.4 Experimental Results

To evaluate JKSE+, we compare it with JKSE. During training, we tune the parameters v, \(\gamma \), and \(\theta \) of JKSE+ on an 8 \(\times \) 8 \(\times \) 8 grid spanning the values \(\left[ {{{10}^{ - 4}},{{10}^{ - 3}},...,{{10}^3}} \right] \). For JKSE, we tune v and \(\theta \) on an 8 \(\times \) 8 grid over the same values.
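The grid search could be organized as in the sketch below, where the evaluate callback (e.g. mean overlap on held-out data) is an assumed helper not specified in the paper.

```python
# Exhaustive search over the 8 x 8 x 8 parameter grid described above.
import itertools
import numpy as np

GRID = np.logspace(-4, 3, num=8)  # 10^-4, 10^-3, ..., 10^3

def tune(evaluate):
    """evaluate(nu, gamma, theta) -> validation score; supplied by the caller."""
    return max(itertools.product(GRID, GRID, GRID),
               key=lambda p: evaluate(nu=p[0], gamma=p[1], theta=p[2]))
```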

We chose ten different birds to compare the detection results of JKSE and JKSE+ (Tables 1 and 2).

The overlap ratio of JKSE+ is higher than that of JKSE on eight of the ten datasets.

5 Conclusion

We propose a new method for structured learning with privileged information based on JKSE. Firstly, compared with traditional structured learning methods such as SSVM and CRFs, the optimization problem in our new model JKSE+ is convex and can be solved easily. Secondly, compared with JKSE, prediction performance is improved by using privileged information. Lastly, we apply JKSE+ to object detection, and experimental results show that JKSE+ performs better than JKSE in most cases.

For future work, we will consider some extensions of JKSE+, for example settings where privileged information is provided only for a fraction of the training inputs, or where privileged information is described in several different spaces.