
1 Introduction

Developing techniques for the efficient extraction of usable and meaningful information has become increasingly important with the explosive growth of digital technologies. Low-level features such as color, texture, and shape can be used to classify images into different categories. However, in many cases a single class label is not suitable because an image contains more than one semantic concept. One way to handle this is to assign multiple relevant keywords to a given image, reflecting its semantic content. This is often referred to as image annotation.

Learning techniques such as Binary Relevance [2] and Classifier Chains [21] transform an annotation task into a set of binary classification tasks. Another approach to the annotation problem is to adapt popular learning techniques to deal with multiple labels directly [23, 27]. Multi-Label k-Nearest Neighbors (ML-kNN) [26], Multi-Label Decision Tree (ML-DT) [24] and Rank-SVM [6] are some of the commonly used methods in this category. Rank-SVM is a ranking-based approach, coupled with a set-size predictor, that uses Support Vector Machines to minimize the ranking loss while maintaining a large margin. Among other models, the semantic space auto-annotation model [7] constructs a special form of vector space, called a semantic space, from the labels associated with the images. Images are projected into this space in order to be retrieved or annotated. Latent semantic analysis [11] is used to build this space. The success of these techniques depends largely on the effectiveness of the features used.

Learning representations of the data that make it easier to extract useful information is highly desirable [1] for developing a good classification or annotation framework. Deep learning models are commonly used techniques for learning representations from raw data. These models aim at learning feature hierarchies, with features at higher levels of the hierarchy formed by composition of lower-level features, as illustrated in Fig. 1.

Fig. 1. Scheme of learning representations in a multilayered network. Raw pixel values or extracted features are given as input. The input layer is followed by multiple hidden layers that learn increasingly abstract representations.

Deep learning models such as Deep Belief Networks (DBN) [10] and Deep Boltzmann Machines (DBM) [19, 20, 22] have performed well in classification and recognition tasks [15]. These models are formed by pre-training individual layers [8], stacking them together, and then training the whole network using error back-propagation. Each layer of a DBN consists of an energy-based model known as a Restricted Boltzmann Machine (RBM). An RBM is trained using contrastive divergence to obtain a good reconstruction of the input data [9]. Contrastive divergence and error back-propagation are computationally complex methods. In Deep Convolutional Networks [14, 16, 17], the convolution operation is used to extract features from different sub-regions of an image in order to learn a better representation. Although Deep Convolutional Networks are trained entirely using error back-propagation, they use sub-sampling layers to reduce the number of inputs to each layer. To address the issue of complexity, a model known as the Deep Stacking Network (DSN) [5], consisting of many stacked modules, was recently proposed. Each module is a specialized neural network with a single non-linear hidden layer and linear input and output layers. Since convex optimization is used to speed up the learning in each module, this model is also called the Deep Convex Network (DCN) [4]. The Tensor Deep Stacking Network (T-DSN) [13], introduced as an extension of the DSN architecture, captures better representations by using two sets of nonlinear nodes in the hidden layer. The T-DSN model has been shown to perform better than the DSN model on image classification and phone recognition tasks. The Kernel Deep Convex Network (K-DCN) [3], on the other hand, uses the kernel trick so that the number of hidden nodes in each module is effectively unbounded.

In this paper, we propose a framework that uses convex deep learning models (T-DSN and K-DCN) for the task of image annotation. We also propose using the features extracted from a Deep Convolutional Network as input to the convex models. The remainder of this paper is organized as follows: Sect. 2 gives a brief discussion of T-DSN and K-DCN. In Sect. 3 we describe the details of our experiments and compare the results with those of existing methods. Section 4 presents a summary and conclusions.

2 Convex Deep Learning Models

2.1 Tensor Deep Stacking Networks

A tensor deep stacking network is a generalized form of a deep stacking network. The input data is presented to the nodes in the input layer of the first module. The input to each higher module is obtained by appending the output of the module just below it to the original input data. Unlike the DSN, each module of a T-DSN has two sets of hidden-layer nodes and thus two sets of connections between the input layer and the hidden layer, as shown in Fig. 2. The output-layer nodes are bilinearly dependent on the hidden-layer nodes.

Fig. 2. Architecture of tensor deep stacking network.

Let the target vectors t be arranged to form the columns of the matrix T, and the input data vectors v be arranged to form the columns of the matrix V. Let \(H_1\) and \(H_2\) denote the matrices of outputs of the two sets of hidden units. There are two sets of lower-layer weight parameters, \(W_1\) and \(W_2\), associated with the connections from the input layer to the two hidden layers containing \(\textit{L}_1\) and \(\textit{L}_2\) sigmoidal nodes respectively.

Since the hidden layers contain sigmoidal nodes, the outputs of the two hidden layers can be expressed as:

$$\begin{aligned} \begin{aligned} {H_1 = logistic(W_1^T V)} \\ {H_2 = logistic(W_2^T V)} \\ \end{aligned} \end{aligned}$$
(1)

Let \(\mathbf h _1\) be the vector of outputs from the first set of hidden nodes and \(\mathbf h _2\) be the vector of outputs from the second set of hidden nodes. Let \({h_1}_i\) be the \(i^{th}\) entry in \(\mathbf h _1\) and \({h_2}_j\) be the \(j^{th}\) entry in \(\mathbf h _2\).

If C is the number of nodes in the output layer, the weights of the connections from the hidden layers to the output layer are represented as a tensor \(\mathbf U \in \mathbf R ^{\textit{L}_1 \times \textit{L}_2 \times \textit{C}}\). The tensor U can be viewed as a 3-dimensional array.

Let \(y_k\) denote the output of \(k^{th}\) node in output layer. The output vector can be obtained by computing \((\mathbf U \times _1 \mathbf h _1) \times _2 \mathbf h _2\) where \(\times _i\) stands for multiplication along the \(i^{th}\) dimension. In a simplified notation

$$\begin{aligned} y_k = \sum \limits _{i = 1}^{L_1} \sum \limits _{j = 1}^{L_2} U_{ijk} {h_1}_{i} {h_2}_{j} \end{aligned}$$
(2)

Let

$$\begin{aligned} \tilde{\mathbf{h }} = \mathbf h _1 \otimes \mathbf h _2 \end{aligned}$$

where \(\otimes \) is the Kronecker product. Let \(\tilde{\mathbf{u}}_k\) be the vectorized version of the matrix \(U_k\), obtained by appending all of its columns to form a single vector. The matrix \(U_k\) is obtained by fixing the third index of the tensor U at k. Hence, the length of \(\tilde{\mathbf{u}}_k\) is \(\textit{L}_1 \textit{L}_2\). Now, we can rewrite Eq. (2) as

$$\begin{aligned} y_k = \tilde{\mathbf{u _k}}^T \tilde{\mathbf{h }} \end{aligned}$$
(3)

Arranging all the \(\tilde{\mathbf{u}}_k\)'s for \(k = 1,2,\ldots ,C\) into a matrix \(\tilde{U} = [\tilde{\mathbf{u}}_1\ \tilde{\mathbf{u}}_2\ \ldots \ \tilde{\mathbf{u}}_C]\), the overall prediction becomes

$$\begin{aligned} \mathbf y = \tilde{U}^T \tilde{\mathbf{h }} \end{aligned}$$
(4)

where y is the estimate of target vector t.

Thus, the bilinear mapping from the two hidden layers can be seen as a linear mapping from an implicit hidden representation \(\tilde{\mathbf{h }}\). Aggregating the implicit hidden-layer representations of the N instances into the columns of an \(L_1 L_2 \times N\) matrix \(\tilde{H}\), we obtain

$$\begin{aligned} Y = \tilde{U}^T \tilde{H} \end{aligned}$$
(5)

where \(\tilde{H}\) contains \(\tilde{\mathbf{h}}_k\) in its \(k^{th}\) column.
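To make the bilinear computation concrete, the following NumPy sketch (with hypothetical sizes and random weights, purely for illustration) evaluates a single T-DSN module both through the explicit double sum of Eq. (2) and through the implicit Kronecker representation of Eqs. (3)-(5), and checks that the two agree.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical sizes: D input features, L1/L2 hidden nodes, C outputs, N instances.
rng = np.random.default_rng(0)
D, L1, L2, C, N = 8, 5, 4, 3, 10
V = rng.standard_normal((D, N))        # input data, one instance per column
W1 = rng.standard_normal((D, L1))      # lower-layer weights, first hidden set
W2 = rng.standard_normal((D, L2))      # lower-layer weights, second hidden set
U = rng.standard_normal((L1, L2, C))   # upper-layer weight tensor

H1 = logistic(W1.T @ V)                # Eq. (1), L1 x N
H2 = logistic(W2.T @ V)                # Eq. (1), L2 x N

# Eq. (2): explicit double sum for one instance n and one output node k.
n, k = 0, 1
y_nk = sum(U[i, j, k] * H1[i, n] * H2[j, n]
           for i in range(L1) for j in range(L2))

# Eqs. (3)-(5): the same output through the implicit representation h~ = h1 (x) h2.
# Column n of H_tilde equals np.kron(H1[:, n], H2[:, n]).
H_tilde = np.einsum('in,jn->ijn', H1, H2).reshape(L1 * L2, N)
U_tilde = U.reshape(L1 * L2, C)        # column k is the vectorized slice U_k
Y = U_tilde.T @ H_tilde                # Eq. (5), C x N

assert np.allclose(y_nk, Y[k, n])      # bilinear map == linear map on h~
```

The only requirement is that \(U_k\) and \(\tilde{\mathbf{h }}\) are vectorized in the same order; here both use the Kronecker ordering.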

The convex formulation for \(\tilde{U}\) in this case is,

$$\begin{aligned} \min _{\tilde{U}} ||\tilde{U}^T \tilde{H} - T||_F^2 \end{aligned}$$
(6)

where \(||\cdot ||_F\) denotes the Frobenius norm.

Solving the optimization problem (6), we get:

$$\begin{aligned} \tilde{U}^T = T\tilde{H}^T(\tilde{H}\tilde{H}^T)^{-1} \end{aligned}$$
(7)
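A minimal sketch of this closed-form solution, assuming \(\tilde{H}\) and T are available as NumPy arrays; the small ridge term added before inversion is our own numerical-stability assumption and is not part of Eq. (7).

```python
import numpy as np

def upper_layer_weights(H_tilde, T, eps=1e-6):
    """Eq. (7): U~^T = T H~^T (H~ H~^T)^{-1}.

    H_tilde : (L1*L2) x N matrix of implicit hidden representations.
    T       : C x N matrix of targets.
    eps     : small ridge term for numerical stability (an assumption).
    """
    G = H_tilde @ H_tilde.T                       # (L1*L2) x (L1*L2)
    U_tilde_T = T @ H_tilde.T @ np.linalg.inv(G + eps * np.eye(G.shape[0]))
    return U_tilde_T.T                            # U~, shape (L1*L2) x C
```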

We see that the output of each hidden node in the first hidden layer appears \(L_2\) times in \(\tilde{\mathbf{h }}\), so the errors due to all those terms must be added in order to obtain the error caused by that particular node. Hence, the weight update equations need to be modified to account for this, and the modified equations are:

$$\begin{aligned} \varDelta W_1 = \eta V[H_1^T \circ (\varGamma -H_1^T) \circ \varPsi _1 ] \end{aligned}$$
(8)
$$\begin{aligned} \varDelta W_2 = \eta V[H_2^T \circ (\varGamma -H_2^T) \circ \varPsi _2 ] \end{aligned}$$
(9)

Here \(\circ \) is the element-wise multiplication of two matrices, \(\varGamma \) is a matrix of all ones, \(\eta \) is the learning rate and

$$\begin{aligned} \begin{aligned} {\varPsi _1}_{ni} = \sum \limits _{k=1}^{L_2} {H_2}_{nk} \tilde{\varTheta }_{((i-1)L_2+k),n} \\ {\varPsi _2}_{nj} = \sum \limits _{k=1}^{L_1} {H_1}_{nk} \tilde{\varTheta }_{((j-1)L_1+k),n} \end{aligned} \end{aligned}$$
(10)
$$\begin{aligned} \tilde{\varTheta } = 2\tilde{H}^{+}(\tilde{H}T^T)(T\tilde{H}^{+}) - 2T^T(T\tilde{H}^{+}) \end{aligned}$$
(11)

Here \(H_{1}\) is the matrix of outputs of the nodes in the first hidden layer and \(H_{2}\) is the matrix of outputs of the nodes in the second hidden layer. The dimensions of the matrices \(\varPsi _{1}\) and \(\varPsi _{2}\) are \(N\times L_1\) and \(N\times L_2\) respectively. Each of these two matrices acts as a bridge between the high-dimensional implicit representation \(\tilde{\mathbf{h }}\) and the low-dimensional representations \(\mathbf h _1\) and \(\mathbf h _2\).

Since T-DSN uses convex optimization techniques to directly determine the upper-layer weights, the training time is greatly reduced. However, computing the lower-layer weights is still an iterative process.

2.2 Kernel Deep Convex Networks

A kernel deep convex network (K-DCN), like a T-DSN, is built by stacking shallow neural network modules. This model completely eliminates the non-convex learning of the lower-layer weights by using the kernel trick. In a K-DCN, a regularization parameter C is included in the expression for computing the upper-layer weights U. This modification helps bound the values of the elements of U and prevents the model from over-fitting the training data.

Fig. 3. Architecture of kernel deep convex network with two modules.

The formulation for U takes the form

$$\begin{aligned} \min _{U} \Big [\frac{1}{2} Tr\{(Y-T)^T (Y-T)\} + \frac{C}{2}Tr\{U^TU\}\Big ] \end{aligned}$$

where Y is the matrix of predicted outputs at the output nodes and T is the matrix of target outputs. The closed-form expression for U is obtained by solving this minimization as follows:

$$\begin{aligned} U = (CI +HH^T )^{-1} HT^T \end{aligned}$$
(12)

The output of a given module of the K-DCN is then given by

$$\begin{aligned} \mathbf y _i = TH^T (CI +HH^T )^{-1} \mathbf h _i \end{aligned}$$
(13)
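A minimal sketch of Eqs. (12) and (13) for one module, assuming the hidden-layer output matrix H and the target matrix T are given as NumPy arrays; the regularization value is a placeholder.

```python
import numpy as np

def ridge_upper_weights(H, T, C_reg=1.0):
    """Eq. (12): U = (C I + H H^T)^{-1} H T^T.

    H : L x N hidden-layer outputs, T : C x N targets, C_reg : regularization C.
    """
    L = H.shape[0]
    return np.linalg.solve(C_reg * np.eye(L) + H @ H.T, H @ T.T)  # L x C

def module_output(U, h):
    """Eq. (13): prediction for one hidden representation h (length L)."""
    return U.T @ h                                                # length C
```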

The sigmoidal activation function of the hidden units is now replaced with a generic nonlinear mapping function \(\varPhi (\mathbf v )\) of the raw input features \(\mathbf v \). The mapping \(\varPhi (\mathbf v )\) may be high-dimensional (possibly infinite-dimensional) and is determined implicitly by a chosen kernel function. The unconstrained optimization problem can be reformulated as follows:

$$\begin{aligned} \min _{U} \Big [\frac{1}{2} Tr\{(Y-T)^T (Y-T)\} + \frac{C}{2}Tr\{U^TU\}\Big ] \end{aligned}$$

subject to

$$\begin{aligned} Y = U^T G(V) \end{aligned}$$

where the columns of G(V) are formed by applying the transformation \(\varPhi (\cdot )\) to each input \(\mathbf v \). Solving this problem gives

$$\begin{aligned} U= G(V) (CI + K)^{-1} T^T \end{aligned}$$
(14)

where \(K= G^T(V)G(V) \) is the kernel gram matrix of V.

Finally, for each new input vector v in the test set, the prediction of KDCN module is given by

$$\begin{aligned} \mathbf y (\mathbf v ) = U^T \varPhi (\mathbf v ) = T (CI + K)^{-1} \mathbf k (\mathbf v ) \end{aligned}$$
(15)

Here \(\mathbf k (\mathbf v )\) is the kernel vector with entries \(k_n(\mathbf v ) = k(\mathbf v _n,\mathbf v )\), where \(\mathbf v _n\) is the \(n^{th}\) vector from the training set.

For the subsequent modules, the output of the nodes in the output layer is appended to the raw input. For the \(l^{th}\) module (\(l \ge 2\)), Eq. (14) remains valid with a slight modification of the kernel function to account for this extra input, as follows:

$$\begin{aligned} K= G^T(Z) G(Z) \end{aligned}$$
(16)

where \(Z= V\,|\,Y^{(l-1)}\,|\,Y^{(l-2)}\,|\,\cdots \,|\,Y^{(1)}\), \(Y^{(m)}\) is the prediction of module m, and \(A|B\) denotes the concatenation of A and B.

Using Eqs. (15) and (16), we eliminate the need for back-propagation and obtain a convex formulation for training the model. The K-DCN model thus combines the power of deep learning and kernel learning in a principled way, and it is fast to train because no back-propagation is required.
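The following sketch illustrates one possible implementation of a K-DCN module with a Gaussian kernel, following Eqs. (14)-(16); the helper names, the kernel width and the regularization value are our own placeholders, not taken from the experiments reported here.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix between the columns of A (d x n) and B (d x m)."""
    sq = (np.sum(A**2, axis=0)[:, None] + np.sum(B**2, axis=0)[None, :]
          - 2.0 * A.T @ B)
    return np.exp(-sq / (2.0 * sigma**2))

def train_kdcn_module(Z_train, T, C_reg=1.0, sigma=1.0):
    """Dual weights (C I + K)^{-1} T^T implied by Eqs. (14)-(15)."""
    K = rbf_kernel(Z_train, Z_train, sigma)                      # kernel Gram matrix
    return np.linalg.solve(C_reg * np.eye(K.shape[0]) + K, T.T)  # N x C

def predict_kdcn_module(Z_train, A, Z_test, sigma=1.0):
    """Eq. (15): y(v) = T (C I + K)^{-1} k(v), batched over test columns."""
    k = rbf_kernel(Z_train, Z_test, sigma)                       # N_train x N_test
    return (k.T @ A).T                                           # C x N_test predictions

def stack_module_input(V, prev_outputs):
    """Eq. (16): append the predictions of the previous modules to the raw input."""
    return np.vstack([V] + list(prev_outputs))
```

For example, the second module would be trained on stack_module_input(V, [Y1]), where Y1 is the prediction of the first module on the training data.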

2.3 Framework for Image Annotation

If a concept is present in an image, the corresponding bit in a binary target output vector t is set to one. Each module of a T-DSN is trained to predict t. Once a module is trained and the weights \(W_1\), \(W_2\), and U are learned, Eq. (4) is used to compute the estimated output. For the higher modules, the input data is concatenated with the output of the module just below (or with the outputs of all the modules below) and used as an augmented input. This process is repeated for all the modules, and the output obtained at the last module is retained. Similarly, in the case of a K-DCN, Eq. (15) is used to find the predictions of each module.

One of the following two methods is used to obtain the annotation labels from the output vector of a model (a minimal sketch of both decision rules is given after the list).

1. A threshold value is decided empirically using a held-out validation set. In the estimated output vector, if the posterior probability value for a particular concept exceeds the threshold, it is considered as an annotation label for the image.

2. Based on the average number of labels present in the images, a value k is selected. An image is annotated with those concepts that correspond to the top k values in the estimated output vector.
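A minimal sketch of the two decision rules, assuming the model output y is a vector of per-concept scores aligned with a list of concept names; the threshold and k values shown are placeholders.

```python
import numpy as np

def labels_by_threshold(y, concepts, threshold=0.5):
    """Rule 1: keep every concept whose score exceeds a validated threshold."""
    return [c for c, score in zip(concepts, y) if score > threshold]

def labels_by_top_k(y, concepts, k=5):
    """Rule 2: keep the k concepts with the largest scores."""
    top = np.argsort(np.asarray(y))[::-1][:k]
    return [concepts[i] for i in top]
```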

3 Experiments and Results

In this section, we present the details of image annotation datasets used and the experimental results for T-DSN and K-DCN. We compare the performance of these models with the state-of-the-art performance.

3.1 Experimental Setup

We used MATLAB on an Intel i7 8-core CPU with 16 GB of RAM for running Rank-SVM. For T-DSN and K-DCN, we used an NVIDIA Tesla K20C GPU with CUDA.

In order to reduce the number of multiplications in the computation of \(\tilde{\varTheta }\), Eq. (11) is rewritten using the fact that \(T\tilde{H}^{+} = \tilde{U}^T\) (from Eq. (7)):

$$\begin{aligned} \begin{aligned} {\tilde{\varTheta } = 2(\tilde{H}^{+}\tilde{H}T^T - T^T) (T\tilde{H}^{+})} \\ { =2(\tilde{H}^{+}\tilde{H}T^T - T^T) \tilde{U}^{T}} \end{aligned} \end{aligned}$$
(17)

In order to reduce the memory requirements for the computation of \(\tilde{\varTheta }\), Eq. (17) is parenthesized as follows:

$$\begin{aligned} \tilde{\varTheta } = 2(\tilde{H}^{+}(\tilde{H}T^T) - T^T) \tilde{U}^{T} \end{aligned}$$
(18)

With this order of multiplication, we avoid computing \(\tilde{H}^{+}\tilde{H}\), which is an \(N \times N\) matrix. In general, the value of N is large (20,000 to 50,000), and accommodating such a large matrix in the GPU memory is problematic. Many matrices are reused in the process of training. Matrices are allocated memory only when required and are freed immediately after use in order to make the best use of the available memory.
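The following sketch shows this evaluation order in NumPy, with shapes as defined above (\(\tilde{H}\) is \(L_1L_2 \times N\), T is \(C \times N\), \(\tilde{U}\) is \(L_1L_2 \times C\)); it illustrates only the parenthesization, not the full training loop.

```python
import numpy as np

def theta_tilde(H_tilde, T, U_tilde):
    """Eq. (18), evaluated so that the N x N product H~^+ H~ is never formed."""
    H_pinv = np.linalg.pinv(H_tilde)               # N x (L1*L2)
    HT = H_tilde @ T.T                             # (L1*L2) x C, small
    return 2.0 * (H_pinv @ HT - T.T) @ U_tilde.T   # N x (L1*L2)
```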

For K-DCN, we used three different types of kernel functions, namely the Gaussian kernel, the polynomial kernel, and the Histogram Intersection Kernel (HIK). The kernel parameters and the regularization parameter were tuned to obtain a range of suitable values for the first module. For the later modules, the tuning was done with respect to the range of parameters obtained for the previous module, and a set of globally optimum parameters was obtained.

3.2 Feature Extraction

We used a deep convolutional network to obtain a useful representation of an image. A deep convolutional network consists of several layers. A convolutional layer consists of a rectangular grid of neurons, each of which takes inputs from a rectangular section of the previous layer. The weights for this rectangular section are constrained to be the same for every neuron in the convolutional layer. Constraining the weights makes the layer act like many copies of the same feature detector applied at different positions, and it also restricts the number of parameters. The output of a neuron in convolutional layer \(l\), for a filter of size \(m \times n\), is given by

$$\begin{aligned} s_{ij}^l = f\Big (\sum \limits _{x=0}^{m-1} \sum \limits _{y=0}^{n-1} w_{xy} s_{(x+i)(y+j)}^{(l-1)}\Big ) \end{aligned}$$
(19)

where \(f(x)=\log (1+e^x)\). This nonlinearity is approximated by a simpler function, \(f(x)= \max (0,x)\), known as the rectifier function. The nodes that use the rectifier function are referred to as Rectified Linear Units (ReLU). The use of ReLUs reduced the training time significantly.
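A minimal sketch of Eq. (19) with the rectifier nonlinearity, for a single feature map and a single shared filter; the loop-based form is purely illustrative and ignores biases and multiple channels.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def conv2d_valid(prev, w):
    """Eq. (19) with f = ReLU: one output feature map of a convolutional layer.

    prev : 2-D array of outputs from the previous layer.
    w    : m x n filter, shared across all positions (weight sharing).
    """
    m, n = w.shape
    H, W = prev.shape
    out = np.zeros((H - m + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * prev[i:i + m, j:j + n])
    return relu(out)
```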

Table 1. List of 45 concepts selected for our study on University of Washington annotation benchmark dataset.
Fig. 4. Illustration of images with their annotation labels from the University of Washington annotation benchmark dataset.

The pooling layer takes the outputs of small rectangular blocks in the convolutional layer and subsamples each block to produce a single output. The pooling layer can take the average, the maximum, or a learned linear combination of the outputs of the neurons in the block. In all our experiments, we used max-pooling. Pooling helps the network achieve a small amount of translational invariance at each level, and it also reduces the number of inputs to the next layer. Finally, after two convolutional and max-pooling layers, we added two fully connected layers. The activations of the nodes in the last fully connected layer were used as input to the T-DSN and K-DCN models.
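A matching sketch of non-overlapping max-pooling over 2 x 2 blocks of a single feature map; the block size and the handling of ragged borders are our own assumptions.

```python
import numpy as np

def max_pool(x, size=2):
    """Max-pooling over non-overlapping size x size blocks of a feature map."""
    H, W = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size]                  # drop any ragged border
    return x.reshape(H2, size, W2, size).max(axis=(1, 3))
```

Stacking conv2d_valid, relu and max_pool twice, followed by two fully connected layers, mirrors the structure described above; the activations of the last fully connected layer are the features passed to the T-DSN and K-DCN models.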

Apart from this, we also used the SIFT features [18] as input to the deep learning models.

3.3 Datasets Used

We test our models with two real-world datasets that contain color images with their annotations: University of Washington annotation benchmark dataset [25] and the MIRFLICKR-25000 collection [12].

The Washington dataset had 1109 color images corresponding to 22 different categories with an average annotation length of 6. Out of all the concepts available, we selected only 45 concepts that had more than 25 images associated with each of them. The list of these 45 concepts is given in Table 1.

Some of the images from this dataset with their annotation labels are shown in Fig. 4. Because of the small number of images, we do not use convolutional features for this dataset.

MIRFLICKR-25000 is a database of 25,000 color images belonging to various categories. The average number of tags per image is 9. Some of the images from this dataset with their annotation labels are shown in Fig. 5. For our studies, we consider the 30 most frequently occurring tags. These tags have at least 150 images associated with each of them. We randomly selected 30 % of the images for testing, and repeated our studies over 5 folds.

3.4 Results

A T-DSN consisting of 3 modules with 100 nodes in each of the two hidden layers was used on the University of Washington dataset. In our experiments, we observed that having the same number of nodes in both sets of hidden nodes generally gives a better performance.

Fig. 5. Illustration of images with their annotation labels from the MIRFLICKR dataset.

The precision, recall and F-measure for different thresholds in the threshold based decision logic are reported in Table 2.

Table 2. Precision, recall, and F-measure for different thresholds in the threshold based decision logic for annotation of images in the University of Washington data with T-DSN.

We repeated the previous experiment with different values of k in the top-k based decision logic, and the precision, recall, and F-measure values are reported in Table 3.

Table 3. Precision, recall, and F-measure for different values of k in the top-k based decision logic for annotation of images in the University of Washington data with T-DSN.
Table 4. Precision, recall, and F-measure for different thresholds in the threshold based decision logic for annotation of images in the University of Washington data with K-DCN.
Table 5. Precision, recall, and F-measure for different values of k in the top-k based decision logic for annotation of images in the University of Washington data with K-DCN.

We repeated these experiments with K-DCN. The best performance was observed with the Gaussian kernel. The results of these experiments are reported in Tables 4 and 5. It is observed that the F-measure values for K-DCN are slightly lower than those for T-DSN. One possible reason could be that the kernel parameters used might not be the best. The state-of-the-art methods for image annotation, namely Rank-SVM and the semantic space model, give F-measure values of 0.61 and 0.63 respectively. Figure 6 compares the actual annotation labels for some randomly selected images in the University of Washington dataset with the annotations generated by the T-DSN model.

Fig. 6. Illustration of images with actual annotation labels and predicted annotation labels in the University of Washington dataset with T-DSN.

It is observed that the number of annotation labels generated by the models is slightly higher than the number in the ground truth. In many cases, however, the extra labels are related to the content of the image.

Fig. 7. Precision-recall curves for different models on MIRFLICKR dataset.

For the MIRFLICKR dataset, the study is carried out using the SIFT features and convolutional features. Figure 7 shows the precision-recall curves for different models. The best F-measure values for different models are presented in Table 6.

Table 6. Performance comparison of models for image annotation task on MIRFLICKR dataset.

It is observed that K-DCN and T-DSN perform better with convolutional features. It is also noted that convex deep learning methods perform better than the semantic space annotation method.

4 Summary and Conclusions

In this paper, we used convex deep learning models, namely T-DSN and K-DCN, for image annotation tasks. We also used features extracted from a deep convolutional network for this task. Through the experimental studies, it is observed that the T-DSN and K-DCN models with convolutional features as input give improved performance. Once the convolutional network is trained on a large set of images, it is easy to extract features. The convex networks take less time to train, making them useful for image annotation tasks in practice.

For the K-DCN model, we used only a single kernel function per module. This can be extended by using multiple types of kernel functions. Finding a set of globally optimal parameters for K-DCN is difficult. Similarly, for T-DSN we observed that having different numbers of nodes in the two hidden layers is not beneficial. However, we did not find any criterion for selecting a suitable number of hidden-layer nodes. A recipe for selecting the number of nodes in T-DSN and the globally optimum parameters for K-DCN would be useful.