
1 Introduction

Developing techniques for the efficient extraction of usable and meaningful information has become increasingly important with the explosive growth of digital technologies. Low-level features such as color, texture, and shape can be used to classify images into different categories. However, in many cases a single class label is not suitable because an image contains more than one semantic concept. One way to handle this is to assign multiple relevant keywords to a given image, reflecting its semantic content. This is often referred to as image annotation.

Learning techniques such as Binary Relevance [2] and Classifier Chains [21] transform an annotation task into a set of binary classification tasks. Another approach to the annotation problem is to adapt popular learning techniques to deal with multiple labels directly [23, 27]. Multi-Label k-Nearest Neighbors (ML-kNN) [26], Multi-Label Decision Tree (ML-DT) [24] and Rank-SVM [6] are some of the commonly used methods in this category. Rank-SVM is a ranking-based approach, coupled with a set-size predictor, that uses Support Vector Machines to minimize the ranking loss while maintaining a large margin. Among other models, the semantic space auto-annotation model [7] constructs a special form of vector space, called a semantic space, from the labels associated with the images. Images are projected into this space in order to be retrieved or annotated. Latent semantic analysis [11] is used to build this space. The success of these techniques depends largely on the effectiveness of the features used.

Learning representations of the data that make it easier to extract useful information is highly desirable [1] for developing a good classification or annotation framework. Deep learning models are commonly used techniques for learning representations from raw data. These models aim at learning feature hierarchies, with features at higher levels of the hierarchy formed by composition of lower-level features, as illustrated in Fig. 1.

Fig. 1. Scheme of learning representations in a multilayered network. Raw pixel values or extracted features are given as input. The input layer is followed by multiple hidden layers that learn increasingly abstract representations.

Deep learning models such as Deep Belief Networks (DBN) [10] and Deep Boltzmann Machines (DBM) [19, 20, 22] have performed well in classification and recognition tasks [15]. These models are formed by pre-training individual layers [8], stacking them together, and then training the whole network using error back-propagation. Each layer of a DBN consists of an energy-based model known as a Restricted Boltzmann Machine (RBM). An RBM is trained using contrastive divergence to obtain a good reconstruction of the input data [9]. Contrastive divergence and error back-propagation are computationally complex methods. In Deep Convolutional Networks [14, 16, 17], the convolution operation is used to extract features from different sub-regions of an image in order to learn a better representation. Although Deep Convolutional Networks are trained entirely using error back-propagation, they use sub-sampling layers to reduce the number of inputs to each layer. To address the issue of complexity, a model known as the Deep Stacking Network (DSN) [5], consisting of many stacked modules, was recently proposed. Each module is a specialized neural network with a single non-linear hidden layer and linear input and output layers. Since convex optimization is used to speed up the learning in each module, this model is also called the Deep Convex Network (DCN) [4]. The Tensor Deep Stacking Network (T-DSN) [13], introduced as an extension of the DSN architecture, captures better representations by using two sets of nonlinear nodes in the hidden layer. The T-DSN model has been shown to perform better than the DSN model on image classification and phone recognition tasks. The Kernel Deep Convex Network (K-DCN) [3], on the other hand, uses the kernel trick so that the number of hidden nodes in each module is effectively unbounded.

In this paper, we propose a framework that uses convex deep learning models (T-DSN and K-DCN) for the task of image annotation. We also propose using the features extracted from a Deep Convolutional Network as input to the convex models. The remainder of this paper is organized as follows: Sect. 2 gives a brief discussion of T-DSN and K-DCN. In Sect. 3 we describe the details of our experiments and compare the results with those of existing methods. Section 4 presents a summary and conclusions.

2 Convex Deep Learning Models

2.1 Tensor Deep Stacking Networks

A tensor deep stacking network is a generalized form of a deep stacking network. The input data is presented to the nodes in the input layer of the first module. The input to each higher module is obtained by appending the output of the module just below it to the original input data. Unlike the DSN, each module of a T-DSN has two sets of hidden-layer nodes and thus two sets of connections between the input layer and the hidden layer, as shown in Fig. 2. The output-layer nodes are bilinearly dependent on the hidden-layer nodes.

Fig. 2. Architecture of tensor deep stacking network.

Let the target vectors t be arranged to form the columns of the matrix T, and the input data vectors v be arranged to form the columns of the matrix V. Let \(H_1\) and \(H_2\) denote the matrices of outputs of the two sets of hidden units. There are two sets of lower-layer weight parameters, \(W_1\) and \(W_2\), associated with the connections from the input layer to the two hidden layers containing \(\textit{L}_1\) and \(\textit{L}_2\) sigmoidal nodes respectively.

Since the hidden layers contain sigmoidal nodes, the outputs of the two hidden layers can be expressed as:

$$\begin{aligned} \begin{aligned} {H_1 = logistic(W_1^T V)} \\ {H_2 = logistic(W_2^T V)} \\ \end{aligned} \end{aligned}$$
(1)

Let \(\mathbf h _1\) be the vector of outputs from the first set of hidden nodes and \(\mathbf h _2\) be the vector of outputs from the second set of hidden nodes. Let \({h_1}_i\) be the \(i^{th}\) entry in \(\mathbf h _1\) and \({h_2}_j\) be the \(j^{th}\) entry in \(\mathbf h _2\).

If C is the number of nodes in the output layer, the weights of the connections from the hidden layers to the output layer are represented as a tensor \(\mathbf U \in \mathbf R ^{\textit{L}_1 \times \textit{L}_2 \times \textit{C}}\). The tensor U can be viewed as a 3-dimensional array.

Let \(y_k\) denote the output of \(k^{th}\) node in output layer. The output vector can be obtained by computing \((\mathbf U \times _1 \mathbf h _1) \times _2 \mathbf h _2\) where \(\times _i\) stands for multiplication along the \(i^{th}\) dimension. In a simplified notation

$$\begin{aligned} y_k = \sum \limits _{i = 1}^{L_1} \sum \limits _{j = 1}^{L_2} U_{ijk} {h_1}_{i} {h_2}_{j} \end{aligned}$$
(2)

Let

$$\begin{aligned} \tilde{\mathbf{h }} = \mathbf h _1 \otimes \mathbf h _2 \end{aligned}$$

where \(\otimes \) is the Kronecker product. Let \(\tilde{\mathbf{u}}_k\) be the vectorized version of the matrix \(U_k\), obtained by appending all of its columns to form a single vector. The matrix \(U_k\) is obtained by fixing the third index of the tensor U at k. Hence, the length of \(\tilde{\mathbf{u}}_k\) is \(\textit{L}_1 \textit{L}_2\). Now, we can rewrite Eq. (2) as

$$\begin{aligned} y_k = \tilde{\mathbf{u _k}}^T \tilde{\mathbf{h }} \end{aligned}$$
(3)

Arranging all the \(\tilde{\mathbf{u}}_k\)'s for \(k = 1,2,\ldots ,C\) into a matrix \(\tilde{U} = [\tilde{\mathbf{u}}_1\ \tilde{\mathbf{u}}_2\ \ldots \ \tilde{\mathbf{u}}_C]\), the overall prediction becomes

$$\begin{aligned} \mathbf y = \tilde{U}^T \tilde{\mathbf{h }} \end{aligned}$$
(4)

where y is the estimate of target vector t.

Thus, the bilinear mapping from the two hidden layers can be seen as a linear mapping from an implicit hidden representation \(\tilde{\mathbf{h }}\). Aggregating the implicit hidden-layer representations of the N instances into the columns of an \(L_1 L_2 \times N\) matrix \(\tilde{H}\), we obtain

$$\begin{aligned} Y = \tilde{U}^T \tilde{H} \end{aligned}$$
(5)

where \(\tilde{H}\) contains \(\tilde{\mathbf{h}}_k\) in its \(k^{th}\) column.
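To make the bilinear computation concrete, the following NumPy sketch (with hypothetical sizes and random weights, purely for illustration) evaluates a single T-DSN module both through the explicit double sum of Eq. (2) and through the implicit Kronecker representation of Eqs. (3)-(5), and checks that the two agree.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: D input features, L1/L2 hidden nodes, C outputs, N instances.
rng = np.random.default_rng(0)
D, L1, L2, C, N = 8, 5, 4, 3, 10
V = rng.standard_normal((D, N))        # input data, one instance per column
W1 = rng.standard_normal((D, L1))      # lower-layer weights, first hidden set
W2 = rng.standard_normal((D, L2))      # lower-layer weights, second hidden set
U = rng.standard_normal((L1, L2, C))   # upper-layer weight tensor

H1 = logistic(W1.T @ V)                # Eq. (1), L1 x N
H2 = logistic(W2.T @ V)                # Eq. (1), L2 x N

# Eq. (2): explicit double sum for one instance n and one output node k.
n, k = 0, 1
y_nk = sum(U[i, j, k] * H1[i, n] * H2[j, n]
           for i in range(L1) for j in range(L2))

# Eqs. (3)-(5): the same output through the implicit representation h~ = h1 (x) h2.
# Column n of H_tilde equals np.kron(H1[:, n], H2[:, n]).
H_tilde = np.einsum('in,jn->ijn', H1, H2).reshape(L1 * L2, N)
U_tilde = U.reshape(L1 * L2, C)        # column k is the vectorized slice U_k
Y = U_tilde.T @ H_tilde                # Eq. (5), C x N

assert np.allclose(y_nk, Y[k, n])      # bilinear map == linear map on h~
```

The only requirement is that \(U_k\) and \(\tilde{\mathbf{h }}\) are vectorized in the same order; here both use the Kronecker ordering.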

The convex formulation for \(\tilde{U}\) in this case is,

$$\begin{aligned} \min _{\tilde{U}} ||\tilde{U}^T \tilde{H} - T||_F^2 \end{aligned}$$
(6)

where \(||\cdot ||_F\) denotes the Frobenius norm.

Solving the optimization problem (6), we get:

$$\begin{aligned} \tilde{U}^T = T\tilde{H}^T(\tilde{H}\tilde{H}^T)^{-1} \end{aligned}$$
(7)
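A minimal sketch of this closed-form solution, assuming \(\tilde{H}\) and T are available as NumPy arrays; the small ridge term added before inversion is our own numerical-stability assumption and is not part of Eq. (7).

```python
import numpy as np

def upper_layer_weights(H_tilde, T, eps=1e-6):
    """Eq. (7): U~^T = T H~^T (H~ H~^T)^{-1}.

    H_tilde : (L1*L2) x N matrix of implicit hidden representations.
    T       : C x N matrix of targets.
    eps     : small ridge term for numerical stability (an assumption).
    """
    G = H_tilde @ H_tilde.T                       # (L1*L2) x (L1*L2)
    U_tilde_T = T @ H_tilde.T @ np.linalg.inv(G + eps * np.eye(G.shape[0]))
    return U_tilde_T.T                            # U~, shape (L1*L2) x C
```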

We see that the output of each hidden node in the first hidden layer appears \(L_2\) times in \(\tilde{\mathbf{h }}\), so the errors due to all those terms must be added in order to obtain the error caused by that particular node. Hence, the weight update equations need to be modified to account for this, and the modified equations are:

$$\begin{aligned} \varDelta W_1 = \eta V[H_1^T \circ (\varGamma -H_1^T) \circ \varPsi _1 ] \end{aligned}$$
(8)
$$\begin{aligned} \varDelta W_2 = \eta V[H_2^T \circ (\varGamma -H_2^T) \circ \varPsi _2 ] \end{aligned}$$
(9)

Here \(\circ \) is the element-wise multiplication of two matrices, \(\varGamma \) is a matrix of all ones, \(\eta \) is the learning rate and

$$\begin{aligned} \begin{aligned} {\varPsi _1}_{ni} = \sum \limits _{k=1}^{L_2} {H_2}_{nk} \tilde{\varTheta }_{((i-1)L_2+k),n} \\ {\varPsi _2}_{nj} = \sum \limits _{k=1}^{L_1} {H_1}_{nk} \tilde{\varTheta }_{((j-1)L_1+k),n} \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} \tilde{\varTheta } = 2\tilde{H}^{+}(\tilde{H}T^T)(T\tilde{H}^{+}) - 2T^T(T\tilde{H}^{+}) \end{aligned}$$
(11)

Here \(H_{1}\) is the matrix of outputs of the nodes in the first hidden layer and \(H_{2}\) is the matrix of outputs of the nodes in the second hidden layer. The dimensions of the matrices \(\varPsi _{1}\) and \(\varPsi _{2}\) are \(N\times L_1\) and \(N\times L_2\) respectively. Each of these two matrices acts as a bridge between the high-dimensional implicit representation \(\tilde{\mathbf{h }}\) and the low-dimensional representations \(\mathbf h _1\) and \(\mathbf h _2\).

Since T-DSN uses convex optimization techniques to directly determine the upper-layer weights, the training time is greatly reduced. However, computing the lower-layer weights is still an iterative process.

2.2 Kernel Deep Convex Networks

A kernel deep convex network (K-DCN), like a T-DSN, is built by stacking shallow neural network modules. This model completely eliminates the non-convex learning of the lower-layer weights by using the kernel trick. In a K-DCN, a regularization parameter C is included in the expression for computing the upper-layer weights U. This modification helps bound the values of the elements of U and prevents the model from over-fitting the training data.

Fig. 3. Architecture of kernel deep convex network with two modules.

The formulation for U takes the form

$$\begin{aligned} \min _{U} \Big [\frac{1}{2} Tr\{(Y-T)^T (Y-T)\} + \frac{C}{2}Tr\{U^TU\}\Big ] \end{aligned}$$

where Y is the matrix of predicted outputs at the output nodes and T is the matrix of target outputs. The closed-form expression for U is obtained by solving this minimization as follows:

$$\begin{aligned} U = (CI +HH^T )^{-1} HT^T \end{aligned}$$
(12)

The output of a given module of the K-DCN is then given by

$$\begin{aligned} \mathbf y _i = TH^T (CI +HH^T )^{-1} \mathbf h _i \end{aligned}$$
(13)
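A minimal sketch of Eqs. (12) and (13) for one module, assuming the hidden-layer output matrix H and the target matrix T are given as NumPy arrays; the regularization value is a placeholder.

```python
import numpy as np

def ridge_upper_weights(H, T, C_reg=1.0):
    """Eq. (12): U = (C I + H H^T)^{-1} H T^T.

    H : L x N hidden-layer outputs, T : C x N targets, C_reg : regularization C.
    """
    L = H.shape[0]
    return np.linalg.solve(C_reg * np.eye(L) + H @ H.T, H @ T.T)  # L x C

def module_output(U, h):
    """Eq. (13): prediction for one hidden representation h (length L)."""
    return U.T @ h                                                # length C
```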

The sigmoidal activation function of the hidden units is now replaced with a generic nonlinear mapping function \(\varPhi (\mathbf v )\) of the raw input features \(\mathbf v \). The mapping \(\varPhi (\mathbf v )\) may be high-dimensional (possibly infinite-dimensional) and is determined implicitly by a chosen kernel function. The unconstrained optimization problem can be reformulated as follows:

$$\begin{aligned} \min _{U} \Big [\frac{1}{2} Tr\{(Y-T)^T (Y-T)\} + \frac{C}{2}Tr\{U^TU\}\Big ] \end{aligned}$$

subject to

$$\begin{aligned} Y = U^T G(V) \end{aligned}$$

where the columns of G(V) are formed by applying the transformation \(\varPhi (\cdot )\) to each input \(\mathbf v \). Solving this problem gives

$$\begin{aligned} U= G(V) (CI + K)^{-1} T^T \end{aligned}$$
(14)

where \(K= G^T(V)G(V) \) is the kernel gram matrix of V.

Finally, for each new input vector v in the test set, the prediction of KDCN module is given by

$$\begin{aligned} \mathbf y (\mathbf v ) = U^T \varPhi (\mathbf v ) = T (CI + K)^{-1} \mathbf k (\mathbf v ) \end{aligned}$$
(15)

Here \(\mathbf k (\mathbf v )\) is the kernel vector with entries \(k_n(\mathbf v ) = k(\mathbf v _n,\mathbf v )\), where \(\mathbf v _n\) is the \(n^{th}\) vector from the training set.

For the subsequent modules, the output of the nodes in the output layer is appended to the raw input. For the \(l^{th}\) module (\(l \ge 2\)), Eq. (14) remains valid with a slight modification of the kernel function to account for this extra input, as follows:

$$\begin{aligned} K= G^T(Z) G(Z) \end{aligned}$$
(16)

where \(Z= V\,|\,Y^{(l-1)}\,|\,Y^{(l-2)}\,|\,\cdots \,|\,Y^{(1)}\), \(Y^{(m)}\) is the prediction of module m, and \(A|B\) denotes the concatenation of A and B.

Using Eqs. (15) and (16), we eliminate the need for back-propagation and obtain a convex formulation for training the model. The K-DCN model thus combines the power of deep learning and kernel learning in a principled way, and it is fast to train because no back-propagation is required.
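The following sketch illustrates one possible implementation of a K-DCN module with a Gaussian kernel, following Eqs. (14)-(16); the helper names, the kernel width and the regularization value are our own placeholders, not taken from the experiments reported here.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the columns of A (d x n) and B (d x m)."""
    sq = (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-sq / (2.0 * sigma**2))

def train_kdcn_module(Z_train, T, C_reg=1.0, sigma=1.0):
    """Dual weights (C I + K)^{-1} T^T implied by Eqs. (14)-(15)."""
    K = rbf_kernel(Z_train, Z_train, sigma)                      # kernel Gram matrix
    return np.linalg.solve(C_reg * np.eye(K.shape[0]) + K, T.T)  # N x C

def predict_kdcn_module(Z_train, A, Z_test, sigma=1.0):
    """Eq. (15): y(v) = T (C I + K)^{-1} k(v), batched over test columns."""
    k = rbf_kernel(Z_train, Z_test, sigma)                       # N_train x N_test
    return (k.T @ A).T                                           # C x N_test predictions

def stack_module_input(V, prev_outputs):
    """Eq. (16): append the predictions of the previous modules to the raw input."""
    return np.vstack([V] + list(prev_outputs))
```

For example, the second module would be trained on stack_module_input(V, [Y1]), where Y1 is the prediction of the first module on the training data.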

2.3 Framework for Image Annotation

If a concept is present in an image, the corresponding bit in a binary target output vector t is set to one. Each module of a T-DSN is trained to predict t. Once a module is trained and the weights \(W_1\), \(W_2\), and U are learned, Eq. (4) is used to compute the estimated output. For the higher modules, the input data is concatenated with the output of the module just below (or with the outputs of all the modules below) and used as an augmented input. This process is repeated for all the modules, and the output obtained at the last module is retained. Similarly, in the case of a K-DCN, Eq. (15) is used to find the predictions of each module.

One of the following two methods is used to obtain the annotation labels from the output vector of a model (a minimal sketch of both decision rules is given after the list).

1. A threshold value is decided empirically using a held-out validation set. In the estimated output vector, if the posterior probability value for a particular concept exceeds the threshold, it is considered as an annotation label for the image.

2. Based on the average number of labels present in the images, a value k is selected. An image is annotated with those concepts that correspond to the top k values in the estimated output vector.
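A minimal sketch of the two decision rules, assuming the model output y is a vector of per-concept scores aligned with a list of concept names; the threshold and k values shown are placeholders.

```python
import numpy as np

def labels_by_threshold(y, concepts, threshold=0.5):
    """Rule 1: keep every concept whose score exceeds a validated threshold."""
    return [c for c, score in zip(concepts, y) if score > threshold]

def labels_by_top_k(y, concepts, k=5):
    """Rule 2: keep the k concepts with the largest scores."""
    top = np.argsort(np.asarray(y))[::-1][:k]
    return [concepts[i] for i in top]
```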

3 Experiments and Results

In this section, we present the details of image annotation datasets used and the experimental results for T-DSN and K-DCN. We compare the performance of these models with the state-of-the-art performance.

3.1 Experimental Setup

We used MATLAB on an Intel i7 8-core CPU with 16 GB of RAM for running Rank-SVM. For T-DSN and K-DCN, we used an NVIDIA Tesla K20C GPU with CUDA.

In order to reduce the number of multiplications in the computation of \(\tilde{\varTheta }\), Eq. (11) is rewritten using the fact that \(T\tilde{H}^{+} = \tilde{U}^T\) (from Eq. (7)):

$$\begin{aligned} \begin{aligned} {\tilde{\varTheta } = 2(\tilde{H}^{+}\tilde{H}T^T - T^T) (T\tilde{H}^{+})} \\ { =2(\tilde{H}^{+}\tilde{H}T^T - T^T) \tilde{U}^{T}} \end{aligned} \end{aligned}$$
(17)

In order to reduce the memory requirements for the computation of \(\tilde{\varTheta }\), Eq. (17) is parenthesized as follows:

$$\begin{aligned} \tilde{\varTheta } = 2(\tilde{H}^{+}(\tilde{H}T^T) - T^T) \tilde{U}^{T} \end{aligned}$$
(18)

With this order of multiplication, we avoid computing \(\tilde{H}^{+}\tilde{H}\), which is an \(N \times N\) matrix. In general, the value of N is large (20,000 to 50,000), and accommodating such a large matrix in the GPU memory is problematic. Many matrices are reused in the process of training. Matrices are allocated memory only when required and are freed immediately after use in order to make the best use of the available memory.
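The following sketch shows this evaluation order in NumPy, with shapes as defined above (\(\tilde{H}\) is \(L_1L_2 \times N\), T is \(C \times N\), \(\tilde{U}\) is \(L_1L_2 \times C\)); it illustrates only the parenthesization, not the full training loop.

```python
import numpy as np

def theta_tilde(H_tilde, T, U_tilde):
    """Eq. (18), evaluated so that the N x N product H~^+ H~ is never formed."""
    H_pinv = np.linalg.pinv(H_tilde)               # N x (L1*L2)
    HT = H_tilde @ T.T                             # (L1*L2) x C, small
    return 2.0 * (H_pinv @ HT - T.T) @ U_tilde.T   # N x (L1*L2)
```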

For K-DCN, we used three different types of kernel functions, namely the Gaussian kernel, the polynomial kernel, and the Histogram Intersection Kernel (HIK). The kernel parameters and the regularization parameter were tuned to obtain a range of suitable values for the first module. For the later modules, the tuning was done with respect to the range of parameters obtained for the previous module, and a set of globally optimum parameters was obtained.

3.2 Feature Extraction

We used a deep convolutional network to obtain a useful representation of an image. A deep convolutional network consists of several layers. A convolutional layer consists of a rectangular grid of neurons, each of which takes inputs from a rectangular section of the previous layer. The weights for this rectangular section are constrained to be the same for every neuron in the convolutional layer. Constraining the weights makes the layer act like many copies of the same feature detector applied at different positions, and it also restricts the number of parameters. The output of a neuron in convolutional layer \(l\), for a filter of size \(m \times n\), is given by

$$\begin{aligned} s_{ij}^l = f\Big (\sum \limits _{x=0}^{m-1} \sum \limits _{y=0}^{n-1} w_{xy} s_{(x+i)(y+j)}^{(l-1)}\Big ) \end{aligned}$$
(19)

where \(f(x)=\log (1+e^x)\). This nonlinearity is approximated by a simpler function, \(f(x)= \max (0,x)\), known as the rectifier function. The nodes that use the rectifier function are referred to as Rectified Linear Units (ReLU). The use of ReLUs reduced the training time significantly.
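A minimal sketch of Eq. (19) with the rectifier nonlinearity, for a single feature map and a single shared filter; the loop-based form is purely illustrative and ignores biases and multiple channels.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv2d_valid(prev, w):
    """Eq. (19) with f = ReLU: one output feature map of a convolutional layer.

    prev : 2-D array of outputs from the previous layer.
    w    : m x n filter, shared across all positions (weight sharing).
    """
    m, n = w.shape
    H, W = prev.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * prev[i:i + m, j:j + n])
    return relu(out)
```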

Table 1. List of 45 concepts selected for our study on University of Washington annotation benchmark dataset.
Fig. 4. Illustration of images with their annotation labels from the University of Washington annotation benchmark dataset.

The pooling layer takes the outputs of small rectangular blocks in the convolutional layer and subsamples each block to produce a single output. The pooling layer can take the average, the maximum, or a learned linear combination of the outputs of the neurons in the block. In all our experiments, we used max-pooling. Pooling helps the network achieve a small amount of translational invariance at each level, and it also reduces the number of inputs to the next layer. Finally, after two convolutional and max-pooling layers, we added two fully connected layers. The activations of the nodes in the last fully connected layer were used as input to the T-DSN and K-DCN models.
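A matching sketch of non-overlapping max-pooling over 2 x 2 blocks of a single feature map; the block size and the handling of ragged borders are our own assumptions.

```python
import numpy as np

def max_pool(x, size=2):
    """Max-pooling over non-overlapping size x size blocks of a feature map."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size]                  # drop any ragged border
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))
```

Stacking conv2d_valid, relu and max_pool twice, followed by two fully connected layers, mirrors the structure described above; the activations of the last fully connected layer are the features passed to the T-DSN and K-DCN models.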

Apart from this, we also used the SIFT features [18] as input to the deep learning models.

3.3 Datasets Used

We test our models with two real-world datasets that contain color images with their annotations: University of Washington annotation benchmark dataset [25] and the MIRFLICKR-25000 collection [12].

The Washington dataset had 1109 color images corresponding to 22 different categories with an average annotation length of 6. Out of all the concepts available, we selected only 45 concepts that had more than 25 images associated with each of them. The list of these 45 concepts is given in Table 1.

Some of the images from this dataset with their annotation labels are shown in Fig. 4. Because of the small number of images, we do not use convolutional features for this dataset.

MIRFLICKR-25000 is a database of 25,000 color images belonging to various categories. The average number of tags per image is 9. Some of the images from this dataset with their annotation labels are shown in Fig. 5. For our studies, we consider the 30 most frequently occurring tags. These tags have at least 150 images associated with each of them. We randomly selected 30 % of the images for testing, and repeated our studies over 5 folds.

3.4 Results

A T-DSN consisting of 3 modules with 100 nodes in each of the two hidden layers was used on the University of Washington dataset. In our experiments, we observed that having the same number of nodes in both sets of hidden nodes generally gives a better performance.

Fig. 5. Illustration of images with their annotation labels from the MIRFLICKR dataset.

The precision, recall and F-measure for different thresholds in the threshold based decision logic are reported in Table 2.

Table 2. Precision, recall, and F-measure for different thresholds in the threshold based decision logic for annotation of images in the University of Washington data with T-DSN.

We repeated the previous experiment with different values of k in the top-k based decision logic, and the precision, recall, and F-measure values are reported in Table 3.

Table 3. Precision, recall, and F-measure for different values of k in the top-k based decision logic for annotation of images in the University of Washington data with T-DSN.
Table 4. Precision, recall, and F-measure for different thresholds in the threshold based decision logic for annotation of images in the University of Washington data with K-DCN.
Table 5. Precision, recall, and F-measure for different values of k in the top-k based decision logic for annotation of images in the University of Washington data with K-DCN.

We repeated these experiments with K-DCN. The best performance was observed with the Gaussian kernel. The results of these experiments are reported in Tables 4 and 5. It is observed that the F-measure values for K-DCN are slightly lower than those for T-DSN. One possible reason could be that the kernel parameters used might not be the best. The state-of-the-art methods for image annotation, namely Rank-SVM and the semantic space model, give F-measure values of 0.61 and 0.63 respectively. Figure 6 compares the actual annotation labels for some randomly selected images in the University of Washington dataset with the annotations generated by the T-DSN model.

Fig. 6. Illustration of images with actual annotation labels and predicted annotation labels in the University of Washington dataset with T-DSN.

It is observed that the number of annotation labels generated by the models is slightly higher than the number in the ground truth. In many cases, however, the extra labels are related to the content of the image.

Fig. 7. Precision-recall curves for different models on MIRFLICKR dataset.

For the MIRFLICKR dataset, the study is carried out using the SIFT features and convolutional features. Figure 7 shows the precision-recall curves for different models. The best F-measure values for different models are presented in Table 6.

Table 6. Performance comparison of models for image annotation task on MIRFLICKR dataset.

It is observed that K-DCN and T-DSN perform better with convolutional features. It is also noted that convex deep learning methods perform better than the semantic space annotation method.

4 Summary and Conclusions

In this paper, we used convex deep learning models, namely T-DSN and K-DCN, for image annotation tasks. We also used features extracted from a deep convolutional network for this task. Through the experimental studies, it is observed that the T-DSN and K-DCN models with convolutional features as input give improved performance. Once the convolutional network is trained on a large set of images, it is easy to extract features. The convex networks take less time to train, making them useful for image annotation tasks in practice.

For the K-DCN model, we used only a single kernel function per module. This can be extended by using multiple types of kernel functions. Finding a set of globally optimal parameters for K-DCN is difficult. Similarly, for T-DSN we observed that having different numbers of nodes in the two hidden layers is not beneficial. However, we did not find any criterion for selecting a suitable number of hidden-layer nodes. A recipe for selecting the number of nodes in T-DSN and the globally optimum parameters for K-DCN would be useful.