1 Introduction

Scene classification and object detection are two challenging problems in computer vision due to high intra-class variance, illumination changes, background clutter and occlusion. Most existing methods assume that labeled data are available beforehand in order to train the classification models. With the huge corpus of visual data generated daily, it is infeasible and unrealistic to know all the labels beforehand. Moreover, adaptability of the models to incoming data is also crucial for long-term performance guarantees. Currently, large datasets (e.g., ImageNet [1], SUN [2]) are prepared with intensive human labeling, which is difficult to scale up as more and more new images are generated. So, we pose the question: ‘Are all the samples equally important to manually label and learn a model from?’. We address this question in the context of joint scene and object classification.

Active learning [3] has been widely used to choose a subset of the most informative samples that can achieve similar or better performance than manually labeling all the data. In order to identify the informative samples, most active learning techniques choose the samples about which the classifier is most uncertain. Expected change in gradients [3], information gain [4] and expected prediction loss [5] are some of the approaches used in the literature to select samples for query. These approaches consider the individual samples to be independent. However, there are various tasks, such as document classification [6] and activity recognition [7], where interrelationships between samples exist. In such cases, it is advantageous to exploit these relationships to reduce the number of samples to be manually labeled. Some active learning frameworks exploit such contextual relations, e.g., link information [8], social relationships [9], spatial information [10], feature similarity [11] and spatio-temporal relationships [12].

Fig. 1.

This figure presents the motivation of incorporating relationship among scene and object samples within an image. Here, scene (S) and objects (\(O^1,O^2,\dots ,O^6\)) are predicted by our initial classifier and detectors with some uncertainty. We formulate a graph exploiting scene-object (S-O) and object-object (O-O) relationships. As shown in the figure, even though \( \{S,O^2,O^3,O^4,O^5,O^6\} \) nodes have high uncertainty, manually labeling only 3 of them is good enough to reduce the uncertainty of all the nodes if S-O and O-O relationships are considered. So, the manual labeling cost can be significantly reduced by our proposed approach.

We leverage active learning to identify the samples to label in the problem of joint scene and object recognition. Similar to the applications mentioned above, exploiting mutual relationships between scene and objects can yield better performance [13] than if no relationships are considered. For example, it is unlikely to find a ‘cow’ in a ‘bedroom’, but the probability of finding a ‘bed’ and a ‘lamp’ in the same scene may be high. Thus, gaining information about a scene can improve predictions on objects and vice versa. Previous research [14–17] has shown how to exploit scene-object relationships to yield better classification performance. However, these methods require data to be manually labeled and available before learning. Although there exist some works involving active learning in scene and object classification [4, 5, 18], they do not exploit the scene-object (S-O) and object-object (O-O) inter-relationships. This is critical because of the hierarchical nature of the relationships between objects and scenes. These relationships can be represented as a graphical model, with the samples to be labeled by a human chosen from the graph using a suitable criterion. The labeling effort can be significantly reduced in this process: labeling a scene node in the graph can possibly resolve ambiguities for multiple object classes. This motivation is portrayed in Fig. 1.

Motivated by the above, we propose a novel active learning framework which exploits the S-O and O-O relationships to jointly learn scene and object classification models. Using the mutual relationships between scene and objects, we can leverage the fact that manual labeling of one reduces the uncertainty of the other, and thus reduces labeling cost. This is achieved using an information-theoretic approach that reduces the joint entropy of a graph. As presented in the figure, exploiting relationships between scene and objects can lead to less human labeling effort compared to when no relationships are considered.

Fig. 2.

This figure presents a pictorial representation of the proposed framework. At first, initial classification models and relationship model are learned from a small set of labeled images. Thereafter, as images are available in batches, scene & object classification models provide prediction scores of scene and objects. With these scores and the relationship model, the images are represented as graphs with scene and object nodes. Then, the active learning module is invoked which efficiently chooses the most informative scene or object nodes to query the human. Finally, the labels provided by the human are used to update the classification & relationship models.

Framework Overview. The flow of the proposed algorithm is presented in Fig. 2. We perform two tasks simultaneously:

  1. Selection of an image that contains the most informative samples (scene, objects).

  2. Given an image, selection of a sample (i.e., a node in the graph representing that image) such that labeling it reduces the uncertainty of the other samples.

Our framework is divided into two phases. In the first phase, we learn the initial classification models as well as the S-O and O-O relationship model with a small amount of labeled data. In the second phase, with incoming unlabeled data, we first classify the unlabeled scene and object samples using the current models. Then, we represent each incoming image as a graph, where the scene classification probabilities and object detection scores are utilized to represent the scene and object node potentials. The S-O and O-O relations delineate the edge potentials. We compute the marginal probabilities of the node variables from inference on the graphs.

Thereafter, we formulate an information-theoretic approach for selecting the most informative samples. The joint entropy of a graph is computed from the joint distribution of scene and objects and represents the total uncertainty of an image. For a batch of data, our framework chooses the most informative samples based on uncertainty measures (discussed in Sect. 3) that lead to the maximum decrease in the joint entropy of the graph after labeling. After receiving the label of a node from the human, we infer on the graph conditioned upon the known label. Through this inference, the other unlabeled nodes gain information from the node labeled by the human, which leads to a significant reduction in their uncertainties. The labels obtained in this process are used to update the scene and object classification models as well as the S-O and O-O relationships.

Main Contributions. Our main contributions are as follows.

  • In computer vision, most of the existing active learning methods involve learning a classification model for one type of variable, e.g., scene, objects, activity or text. On the other hand, the proposed active learning framework learns scene and object classification models simultaneously.

  • In the proposed active learning framework, both the scene and object classification models take advantage of the interdependence between them in order to select the most informative samples with the least manual labeling cost. To the best of our knowledge, no previous work uses active learning to classify scenes and objects together.

  • Leveraging upon the inter-relationships between scene and objects, we propose a new information-theoretic sample selection strategy along with inference on a graph based on the intuition that learning a sample reduces the uncertainties of other samples. Moreover, our framework facilitates continuous and incremental learning of the classification models as well as the S-O and O-O relationship models, thus dynamically adapting to the changes in incoming data.

1.1 Related Work

Scene and Object Recognition. Many scene classification methods use features such as color and texture [19], GIST [20], SIFT descriptors [21] and deep features [22]. In object detection, the current state-of-the-art methods are R-CNN [23], SPP-net [24] and Fast R-CNN [25]. Another promising approach in recognition tasks has been to exploit the relationships between objects in a scene using a graphical model [13, 26, 27]. A Conditional Random Field (CRF) integrating scene and object classification for video sequences was proposed in [14]. A model for joint image segmentation, object and scene class inference was proposed in [13]. In [15], the spatial relationships between the objects within an image were exploited to compute a scene similarity score, based on which the indoor scene categories were predicted. In [16], a CRF model was constructed based on scene, object and the textual data associated with images on the web, to label the scenes and localize objects within the image. In [28], a projection was formulated from images to a space spanned by object banks, based on which the image was classified into different categories. In [17], a framework was developed for multiple object classification within an image, where a conditional tree model was learned based on the co-occurrences of objects.

Active Learning. Although the above-mentioned works exploit contextual relationships, they assume that all the data are labeled and available beforehand, which is not feasible and involves a huge labeling cost. Active learning has been widely used to reduce the effort of manual labeling in different computer vision tasks including scene classification [4], video segmentation [29], object detection [30], activity recognition [12] and tracking [31]. A generalized active learning framework for computer vision problems such as person detection, face recognition and scene classification was proposed in [32]. It used the two concepts of uncertainty and sample diversity to choose the samples for manual labeling. Some of the common techniques to measure uncertainty for selecting informative data points are presented in [33]. Active learning has been used separately for scene or object classification [4, 18, 30, 34], but not for their joint classification.

In [18], a framework for actively learning scene classification model was proposed, where the authors incorporated two strategies - Best vs. Second Best (BvSB) and K-centroid to select the informative subset of images. A framework based on information density measure and uncertainty measure to obtain the best subset of images for querying the human was proposed in [5]. Although their algorithm can be applied separately for both scene and object classification, they do not exploit the relationships between scene and objects. An active learning framework for object categories was proposed in [35] which considers the case where the labeler itself is uncertain about labeling an image.

In [4], the authors present an active learning framework for scene classification. In their hierarchical model, they focus on querying at the scene level, and whenever unexpected class labels are returned by the human, queries are made at the object level. Thus in their method, there exists a flow of information from the object level to the scene level. However, in our method, there is a flow of information from scene to object level and vice versa, in a collaborative manner, which paves the path for a joint scene-object classification framework.

2 Joint Scene and Object Model

In this section, we discuss how we represent an image in a graphical model with scene and object as hidden variables.

A. Scene Classification Method. In order to represent scenes, we extract features using Convolutional Neural Networks (CNNs). Given an image, we obtain a feature vector f from the fc7 layer of a CNN architecture, where \(f \in \mathfrak {R}^{4096\times 1}\). We train a linear multi-class Support Vector Machine (SVM) [36] to compute the probability of the \( n^{th} \) class, \(p(S=s_n|f^j)\), where \(f^j\) denotes the feature vector corresponding to sample j. We denote the learned model for scene classification as \(\mathcal {P}_s\). Given an image, \(\varPhi _S \in \mathfrak {R}^N\) represents the vector of classification scores, where N is the total number of scene categories considered in the experiment.
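As a rough illustration of this stage, the sketch below trains the multi-class SVM on precomputed fc7 features using scikit-learn; the CNN forward pass is omitted and the helper names are ours, not the paper's.

```python
import numpy as np
from sklearn.svm import SVC

def train_scene_classifier(fc7_feats, scene_labels):
    """Train the linear multi-class SVM (P_s) on fc7 CNN features.

    fc7_feats   : (num_images, 4096) array of fc7 activations.
    scene_labels: (num_images,) array of integer scene labels in [0, N).
    """
    # probability=True enables Platt scaling, so the classifier can
    # report p(S = s_n | f^j), which later defines the scene node potential.
    clf = SVC(kernel='linear', probability=True)
    clf.fit(fc7_feats, scene_labels)
    return clf

def scene_score_vector(clf, fc7_feat):
    """Phi_S: the N-dimensional vector of scene class probabilities."""
    return clf.predict_proba(fc7_feat.reshape(1, -1)).ravel()
```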

B. Object Detection Method. We use the R-CNN presented in [23] to detect the objects in an image. In R-CNN, we extract features from a deep network for each object proposal. Then, we train a binary SVM classifier for each object category to get the probability of appearance of that object. After classifying each region, we form a vector of the confidence scores of the binary classifiers for all categories. Thus, for each \(p^{th}\) region we get \(\varPhi _{O^p}\), the detection score vector. Finally, we use the bounding-box regression method [37] for better object localization. We denote the learned model for object detection as \(\mathcal {P}_o\).
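Similarly, a minimal sketch of how the per-region score vector \(\varPhi _{O^p}\) might be assembled from M per-category binary SVMs, assuming region features are already extracted; the SVM decision scores stand in for the R-CNN confidence scores.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_object_detectors(region_feats, region_labels, num_classes):
    """Train one binary SVM per object category (one-vs-rest)."""
    detectors = []
    for m in range(num_classes):
        svm = LinearSVC(C=1.0)
        svm.fit(region_feats, (region_labels == m).astype(int))
        detectors.append(svm)
    return detectors

def region_score_vector(detectors, region_feat):
    """Phi_{O^p}: the confidence of every category detector for one region."""
    feat = region_feat.reshape(1, -1)
    return np.array([d.decision_function(feat)[0] for d in detectors])
```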

C. Graphical Model Representation. In this model, two levels of nodes are used - one represents the scene \(\upsilon _{s}\) and the other set of nodes represents the detected objects \(\upsilon _o\). Here, \(\upsilon _o=\{\upsilon _{o^1},\upsilon _{o^2},\dots ,\upsilon _{o^D}\}\), where D is the number of bounding boxes appearing in an image. The links between them are depicted by edges. The joint distribution of \(\upsilon _s\) and \(\upsilon _o\) over the CRF can be written as

$$\begin{aligned} P(\upsilon _s,\upsilon _o)=\frac{1}{Z} \ \varPsi _{\xi }(\upsilon _s,\upsilon _o) \prod _{i,j \in D \atop i \ne j}\varPsi _{\xi }(\upsilon _{o^i},\upsilon _{o^j})\prod _{w \in \{\upsilon _s,\upsilon _o\}} \varPsi _{v}(w) \end{aligned}$$
(1)

where Z is the normalizing constant, and \(\varPsi _{v}(.)\) and \(\varPsi _{\xi }(.)\) denote the node and edge potentials respectively.

Node Potentials. Given an image, the scene classifier (\(\mathcal {P}_s\)) produces a vector that contains the probabilities of all the scene labels. From these probabilities we compute scene node potential \(\varPsi _{v}(\upsilon _s)\) as presented in Eq. 2. Similarly, given an image, the object detection scores are used to model the object node potentials \( \varPsi _{v}(\upsilon _o) \) as shown in Eq. 3.

$$\begin{aligned} \varPsi _{v}(\upsilon _s)= & {} \sum _{n \in N} \mathcal {I}(S_n) \beta _{n}^T \ \varPhi _S\end{aligned}$$
(2)
$$\begin{aligned} \varPsi _{v}(\upsilon _o)= & {} \sum _{p \in D} \sum _{m \in M} \ \mathcal {I}(O_m^p) \varOmega _{m}^T \ \varPhi _{O^p} \end{aligned}$$
(3)

Here, \(\varPhi _S\) is the vector of scene label probabilities obtained from the multi-class SVM classifier. \(\beta _{n}\) is the feature weight vector corresponding to scene label \(S_n\) and \(\mathcal {I}(.)\) is the indicator function, i.e., \(\mathcal {I}(S_n)=1\) when \(S=S_n\), otherwise 0. \(\varOmega _{m}\) is the weight vector corresponding to the detection score of object \(O_m\). \(\varPhi _{O^p}\) is the score vector of detecting all the objects in the \(p^{th}\) bounding box. M is the number of object classes.
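For a fixed assignment, the node potentials of Eqs. 2 and 3 reduce to an inner product between the weight vector of the selected state and the corresponding score vector. A minimal numpy sketch, assuming the weights \(\beta \) and \(\varOmega \) have already been learned:

```python
import numpy as np

def scene_node_potential(beta, phi_s):
    """psi_s[n] = beta_n^T Phi_S, one entry per scene label (Eq. 2).

    beta : (N, N) array whose n-th row is the weight vector for scene label n.
    phi_s: (N,) vector of scene class probabilities.
    """
    return beta @ phi_s

def object_node_potential(omega, phi_op):
    """psi_o[m] = Omega_m^T Phi_{O^p}, one entry per object class (Eq. 3).

    omega : (M, M) array whose m-th row is the weight vector for object class m.
    phi_op: (M,) detection score vector for region p.
    """
    return omega @ phi_op
```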

Edge Potentials. We use two types of relationships, S-O and O-O, and use co-occurrence frequencies to represent the edge potentials. The probability of the presence of an object in a particular scene is determined by the co-occurrence statistics. For instance, in the context of a ‘highway’ scene, the probability of the appearance of a ‘car’ will be higher than that of a ‘table’ or ‘chair’. In Eq. 4, \(\varPsi _{\xi }(\upsilon _s,\upsilon _o)\) represents the relationship between S and O. Similarly, \(\varPsi _{\xi }(\upsilon _{o^i},\upsilon _{o^j})\) models the O-O relations.

$$\begin{aligned} \varPsi _{\xi }(\upsilon _s,\upsilon _o)= & {} \sum _{p \in D}\sum _{n \in N}\sum _{m \in M}\mathcal {I}{(S_n)}\mathcal {I}{(O^p_m)}\varPhi _{\xi }(S_n,O_m) \end{aligned}$$
(4)
$$\begin{aligned} \varPsi _{\xi }(\upsilon _{o^i},\upsilon _{o^j})= & {} \sum _{m^\prime \in M} \sum _{m \in M} \mathcal {I}{(O^i_{m^\prime })} \mathcal {I}{(O^j_m)} \ \varPhi _{\xi }(O_{m^\prime },O_m) \end{aligned}$$
(5)

\(\varPhi _{\xi }(S_n,O_m)\) represents the co-occurrence statistics between scene and objects; a larger value implies a higher probability of co-occurrence of \(S_n\) and \(O_m\). Here, \(\varPhi _{\xi }(O^i,O^j)\) is the co-occurrence [38] between the detected objects \(O^i\) and \(O^j\); it encodes how often two objects co-occur in a scene.

Parameter Learning. The initial parameters of the CRF model are learned from a set of annotated images, the object detectors and the scene classifier. Given the ground truth object bounding boxes, we use the object detectors to obtain detection scores for the corresponding bounding box regions. Similarly, we get the classification score from the annotated scene label. Thus, we can apply a maximum likelihood estimation approach to learn all the parameters \(\{\beta , \varOmega ,\varPhi _{\xi }(S_n,O_m),\varPhi _{\xi }(O_{m^\prime },O_m)\}\) of the model.
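The co-occurrence statistics that define the edge potentials can be estimated by simple counting over the annotated images. A sketch, assuming each annotation is a (scene label, list of object labels) pair:

```python
import numpy as np

def learn_cooccurrence(annotations, num_scenes, num_objects):
    """Count S-O and O-O co-occurrences over a set of annotated images.

    annotations: iterable of (scene_label, object_labels) pairs, where
                 object_labels lists the objects present in that image.
    Returns Phi_so (N x M) and Phi_oo (M x M) co-occurrence count matrices.
    """
    phi_so = np.zeros((num_scenes, num_objects))
    phi_oo = np.zeros((num_objects, num_objects))
    for scene, objects in annotations:
        present = set(objects)
        for m in present:
            phi_so[scene, m] += 1
        for m1 in present:
            for m2 in present:
                if m1 != m2:
                    phi_oo[m1, m2] += 1
    return phi_so, phi_oo
```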

Inference of Scene and Object Labels. To compute the marginal distributions of the nodes and edges, we use the Loopy Belief Propagation (LBP) algorithm [39], as our graph contains cycles. LBP is not guaranteed to converge to the true marginals, but provides a good approximation of the marginal distributions.
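The paper performs this step with the UGM toolbox [44]; purely as an illustration of the underlying computation, a self-contained sum-product loopy BP for a pairwise graph could look like the sketch below, where the unary and pairwise potentials are assumed to be given as nonnegative tables.

```python
import numpy as np

def loopy_bp(unary, pairwise, edges, n_iters=50):
    """Sum-product loopy belief propagation on a pairwise graph.

    unary   : dict node -> (K_node,) nonnegative potential table.
    pairwise: dict (i, j) -> (K_i, K_j) edge potential table.
    edges   : list of (i, j) node pairs defining the graph.
    Returns approximate node marginals as a dict node -> (K_node,) array.
    """
    neighbors = {v: [] for v in unary}
    msgs = {}
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)
        msgs[(i, j)] = np.ones(len(unary[j]))  # message i -> j
        msgs[(j, i)] = np.ones(len(unary[i]))  # message j -> i

    for _ in range(n_iters):
        new_msgs = {}
        for (src, dst) in msgs:
            # Product of unary potential and incoming messages, excluding dst.
            prod = unary[src].copy()
            for k in neighbors[src]:
                if k != dst:
                    prod *= msgs[(k, src)]
            psi = (pairwise[(src, dst)] if (src, dst) in pairwise
                   else pairwise[(dst, src)].T)
            msg = prod @ psi                    # marginalize over src states
            new_msgs[(src, dst)] = msg / msg.sum()
        msgs = new_msgs

    beliefs = {}
    for v in unary:
        b = unary[v].copy()
        for k in neighbors[v]:
            b *= msgs[(k, v)]
        beliefs[v] = b / b.sum()
    return beliefs
```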

3 Active Learning Framework

In the previous section, we represented an image as a graph containing \( \upsilon _{s} \) and \( \upsilon _{o} \) nodes. If we select the node whose labeling reduces the joint entropy of the graph the most, then the classifier gains the maximum amount of information from that single query.

Formulation of Joint Entropy. Consider a fully connected graph \( G = (V,E)\), where V and E are the set of nodes and edges respectively. It may be noted that \(V = \{S,O^1,O^2, \dots , O^D\}\). Let \(\mu _i(\upsilon _i)\) and \(\mu _{ij}(\upsilon _i,\upsilon _j)\) be the marginal probabilities of the node and edge of the graph. Let \(\upsilon _i\) and \(\upsilon _j\) represent the random variables for nodes \( i,j \in V\). In our joint scene and object classification, \(i\in \{S, O^1, O^2, \dots , O^D\}\) as discussed in Sect. 2. The node entropy \(H(\upsilon _i)\) and mutual information \(I(\upsilon _i,\upsilon _j)\) between a pair of nodes are defined as,

$$\begin{aligned} H(\upsilon _i) = \mathbb {E} [- \log _2\mu _i(\upsilon _i)] \ \ \ \ \ \ \ \ \ \ I(\upsilon _i,\upsilon _j)=\mathbb {E} [\log _2 \frac{\mu _{ij}(\upsilon _i,\upsilon _j)}{\mu _i(\upsilon _i)\mu _j(\upsilon _j)}] \end{aligned}$$
(6)

Considering Q nodes in the graph, its joint entropy can be expressed as,

$$\begin{aligned} H(V)&= H(\upsilon _1) + \sum _{i=2}^Q H(\upsilon _i|\upsilon _1,\dots ,\upsilon _{i-1}) \nonumber \\&= H(\upsilon _1) + \sum _{i=2}^Q \Big [H(\upsilon _i) - I(\upsilon _1,\dots ,\upsilon _{i-1};\upsilon _i) \Big ] \end{aligned}$$
(7)

using \(I(\upsilon _1,\dots ,\upsilon _{i-1};\upsilon _i)=H(\upsilon _i)-H(\upsilon _i|\upsilon _1,\dots ,\upsilon _{i-1})\). Again, using the chain rule, \(I(\upsilon _1,\dots ,\upsilon _{i-1};\upsilon _i)=\sum _{j=1}^{i-1} I(\upsilon _j;\upsilon _i|\upsilon _1,\dots ,\upsilon _{j-1})\), Eq. 7 becomes

$$\begin{aligned} H(V) = \sum _{i=1}^Q H(\upsilon _i) - \sum _{i=2}^Q \sum _{j=1}^{i-1} I(\upsilon _j;\upsilon _i|\upsilon _1,\dots ,\upsilon _{j-1}) \end{aligned}$$
(8)

It becomes computationally expensive to compute the conditional mutual information as the number of nodes increases [40]. As we consider only pair-wise interactions between S-O and O-O, we approximate the conditional mutual information \(I(\upsilon _j;\upsilon _i|\upsilon _1,\dots ,\upsilon _{j-1}) \approx I(\upsilon _j;\upsilon _i)\). Thus, the joint entropy of the graph can be approximated as,

$$\begin{aligned} H(V) \approx \sum _{i=1}^Q H(\upsilon _i) - \sum _{i=2}^Q \sum _{j=1}^{i-1} I(\upsilon _j;\upsilon _i) = \sum _{i \in V}^{} H(\upsilon _i) - \sum \limits _{(i,j) \in E}^{} I(\upsilon _i;\upsilon _j) \end{aligned}$$
(9)

This expression is exact for a tree, but approximate for a graph having cycles. The approximation leads to the expression of joint entropy in Eq. 9, which is similar to the joint entropy expression in the Bethe method [40].
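A minimal numpy sketch of this approximation, taking the node and edge marginals produced by inference as plain arrays:

```python
import numpy as np

def node_entropy(mu_i, eps=1e-12):
    """H(v_i) = -sum_x mu_i(x) log2 mu_i(x)."""
    return float(-(mu_i * np.log2(np.clip(mu_i, eps, 1.0))).sum())

def mutual_information(mu_ij, eps=1e-12):
    """I(v_i; v_j) computed from the pairwise marginal mu_ij (Eq. 6)."""
    mu_i = mu_ij.sum(axis=1, keepdims=True)
    mu_j = mu_ij.sum(axis=0, keepdims=True)
    ratio = np.clip(mu_ij, eps, 1.0) / np.clip(mu_i * mu_j, eps, None)
    return float((mu_ij * np.log2(ratio)).sum())

def joint_entropy(node_marginals, edge_marginals):
    """H(V) ~ sum_i H(v_i) - sum_(i,j) I(v_i; v_j)   (Eq. 9).

    node_marginals: dict node -> (K,) marginal.
    edge_marginals: dict (i, j) -> (K_i, K_j) pairwise marginal.
    """
    h = sum(node_entropy(mu) for mu in node_marginals.values())
    mi = sum(mutual_information(mu) for mu in edge_marginals.values())
    return h - mi
```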

Informative Node Selection. In our problem, an image is represented by a graph having several nodes with two types of hidden variables, \( \upsilon _{s} \) and \( \upsilon _{o} \). So, we need not only to find the most informative image but also to choose the node within it to be manually labeled. If we manually label a node, then we assume that there is no uncertainty left in that node. Thus, after labeling a node \(\upsilon _i\) with the label l, the node entropy becomes zero, i.e., \(H(\upsilon _i=l)=0\).

Let \( H^p(V) \) be the joint entropy of image p, which can be computed using Eq. 9. We query the node such that \( H^p(V) \) is maximally reduced after labeling the node and inferring on the graph conditioned on the new label. Then, after labeling \( \upsilon _i \), we find the optimal node q of image p to be queried as

$$\begin{aligned} q^*=\arg \underset{q}{\max }\ \ \Big [H^p(\upsilon _q)-\frac{1}{2}\sum _{j \in \mathcal {N}(q)}I^p(\upsilon _q,\upsilon _j) \Big ] \end{aligned}$$
(10)

where \( \mathcal {N}(q) \) represents the neighbor nodes of q. For simplicity, let us define the uncertainty associated with node q of image p as \(J^p_q=H^p(\upsilon _q)-\frac{1}{2}\sum _{j \in \mathcal {N}(q)}I^p(\upsilon _q,\upsilon _j)\), so that the joint entropy of image p is \(H^p(V)=\sum _{q=1}^Q J^p_q\) from Eqs. 9 and 10. From Eq. 10, we choose the node to query which has the maximum uncertainty, considering not only the node entropy but also the mutual information between the nodes. Next, we explain how to choose a set of nodes from a batch of images.
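Before moving to batch selection, the per-node uncertainty \(J^p_q\) and the query choice of Eq. 10 can be sketched using the entropy and mutual-information helpers above:

```python
# node_entropy and mutual_information are defined in the previous sketch (Eq. 9).

def node_uncertainty(q, node_marginals, edge_marginals, neighbors):
    """J^p_q = H(v_q) - 0.5 * sum_{j in N(q)} I(v_q; v_j)   (Eq. 10)."""
    h = node_entropy(node_marginals[q])
    mi = 0.0
    for j in neighbors[q]:
        key = (q, j) if (q, j) in edge_marginals else (j, q)
        mi += mutual_information(edge_marginals[key])
    return h - 0.5 * mi

def best_node_to_query(node_marginals, edge_marginals, neighbors):
    """q* = argmax_q J^p_q over the (unlabeled) nodes of one image."""
    return max(node_marginals,
               key=lambda q: node_uncertainty(
                   q, node_marginals, edge_marginals, neighbors))
```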

Simultaneous Image and Node Selection. We query the nodes of image p only if its joint entropy \( H^p(V) \ge \delta \), where \(\delta \) is a threshold. Since we have information about the node uncertainties of all images, we can perform multiple queries across different images so that the learner can learn faster and more efficiently. In this paper, we assume that there is no relation between the images; thus the conditional inference on one image is independent of the other images, and the graphs of different images can be conditionally inferred in parallel.

Let the vector \( J^p=[J^p_1, J^p_2,\dots ,J^p_Q]^T\) contain the uncertainties associated with the Q (image-dependent) nodes of image p. Consider another vector, \( \hat{J} =[J^1 \ J^2 \dots J^P]^T\), obtained by concatenating the vectors \(J^p\) of the P images whose joint entropy is higher than the threshold \(\delta \). We sort the vector \(\hat{J}\) in descending order to obtain a new vector \(\hat{J}_s\). Then, we perform multiple queries based on \(\hat{J}_s\), which contains the uncertainties of nodes from multiple images of a batch. For each image, we choose the node appearing first in \(\hat{J}_s\) for labeling. We perform conditional inference with the new labels in parallel over all the images. The \(\hat{J}_s\) vector is then recomputed using the updated uncertainties of the nodes and the process is repeated until \(H^p(V) \le \delta , \forall p\). It may be noted that P decreases or at least remains the same in succeeding iterations, because nodes belonging to images whose joint entropy falls below \(\delta \) are not queried and thus not included in \(\hat{J}_s\). Inference reduces the uncertainty of the other nodes of the same image.

As uncertainty of nodes decreases, joint entropy is also reduced. Consider a matrix S having dimension \(N_n \times 2\), where \(N_n\) is the total number of nodes of all images in the batch. The first and second columns of S contain the node index of a graph (image) and the image index respectively. The order in which the elements of S are populated is the same as that of \(\hat{J}_s\). We refrain from choosing more than one node per image in each iteration because labeling one node can help the other nodes attain a better decision after inference. The set of nodes \(\mathcal {M}\), chosen for labeling in each iteration can be expressed as,

$$\begin{aligned} \mathcal {M} = \Big \{ S^{k,1} \ \Big | \ \big [ \hat{J}_s \big ]_k \ge \big [ \hat{J}_s \big ]_i \ \ \forall \, i \ \text { s.t. } \ S^{i,2} = S^{k,2} \Big \} \end{aligned}$$
(11)

where \(\big [ \hat{J}_s \big ]_k\) denotes the \(k^{th}\) element of \(\hat{J}_s\) and \(S^{i,m}\) denotes the element in the \(i^{th}\) row and \(m^{th}\) column of S, where \(m \in \{1,2\}\). All the steps of active learning are shown in Algorithm 1. The first column of S is used to identify which node of an image should be labeled. To summarize Eq. 11, the optimal set \( \mathcal {M} \) is obtained by choosing, from each image, the one node with the highest uncertainty.
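A high-level sketch of one round of this batch procedure is given below; the per-image graph state, the human oracle and the conditional-inference routine (`joint_entropy`, `node_uncertainties`, `query_oracle`, `condition_and_infer`) are hypothetical helpers standing in for the components described above. Picking the highest-uncertainty node of each qualifying image is equivalent to taking, for every image, its first node in the sorted vector \(\hat{J}_s\).

```python
def active_batch_labeling(images, delta, query_oracle, condition_and_infer):
    """One batch of simultaneous image and node selection (Sect. 3).

    images: dict image_id -> graph state exposing joint_entropy() and
            node_uncertainties() (a dict node -> J^p_q) after inference.
    query_oracle(image_id, node) -> label supplied by the human.
    condition_and_infer(image_id, node, label) re-runs inference on that
            image's graph conditioned on the newly provided label.
    """
    while True:
        # Keep only images whose joint entropy is still above the threshold.
        active = {p: g for p, g in images.items()
                  if g.joint_entropy() >= delta}
        if not active:
            break
        # One query per image: the node with the highest uncertainty J^p_q.
        for p, g in active.items():
            uncertainties = g.node_uncertainties()
            q = max(uncertainties, key=uncertainties.get)
            label = query_oracle(p, q)
            condition_and_infer(p, q, label)
```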

Classifier Update. To classify scenes and objects, we use linear support vector machine (SVM) classifiers. The decision score of the predicted label is \( \hat{y}=w^Tf(x)+b \), where f(x) is the feature of the scene or object sample and w, b are the parameters that determine the hyperplane between two classes. We use the soft-margin formulation presented in [36] to find w, b. The solution is found by minimizing \( \frac{1}{2}\Vert w\Vert ^2+C\sum _{i=1}^n \epsilon _i \) subject to \( y_i(w^Tf(x_i)+b)\ge (1-\epsilon _i) \) and \( \epsilon _i\ge 0\) for all samples i, where \( \epsilon _i \) is a slack variable.
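A minimal sketch of this update, under the simple assumption that the newly labeled samples are appended to the previously labeled pool and the soft-margin SVM is refit from scratch (the paper does not prescribe a particular solver, so an online or warm-start variant could be substituted):

```python
import numpy as np
from sklearn.svm import SVC

def update_classifier(old_feats, old_labels, new_feats, new_labels, C=1.0):
    """Refit the soft-margin linear SVM after new labels arrive."""
    X = np.vstack([old_feats, new_feats])
    y = np.concatenate([old_labels, new_labels])
    clf = SVC(kernel='linear', C=C, probability=True)
    clf.fit(X, y)
    return clf, X, y
```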

Edge Weight Update. We update the co-occurrence statistics of Eqs. 4 and 5 with the newly labeled data; let us denote the statistics computed on the new data by \(\varPhi '_{\xi }(S_n,O_m)\) and \(\varPhi '_{\xi }(O_{m'},O_m)\). The updated co-occurrence matrices are \([\varPhi _{\xi }(S_n,O_m)]_{t+1}\leftarrow [\varPhi _{\xi }(S_n,O_m)]_t+\varPhi '_{\xi }(S_n,O_m)\) and \([\varPhi _{\xi }(O_{m'},O_m)]_{t+1}\leftarrow [\varPhi _{\xi }(O_{m'},O_m)]_t+\varPhi '_{\xi }(O_{m'},O_m)\), where the subscript \(t+1\) indicates the edge potentials after t updates.
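A sketch of this additive update, reusing the hypothetical `learn_cooccurrence` helper from the parameter-learning sketch in Sect. 2:

```python
def update_cooccurrence(phi_so, phi_oo, new_annotations,
                        num_scenes, num_objects):
    """[Phi]_{t+1} <- [Phi]_t + Phi', with Phi' counted on the new labels."""
    phi_so_new, phi_oo_new = learn_cooccurrence(
        new_annotations, num_scenes, num_objects)
    return phi_so + phi_so_new, phi_oo + phi_oo_new
```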

Algorithm 1. Steps of the proposed active learning framework (Sect. 3).

4 Experiments

In this section, we provide an experimental analysis of our active learning framework for joint scene and object recognition on three challenging datasets. For convenience, we use the terms ‘inter-relationship’ and ‘contextual relationship’ to denote the scene-object and object-object relationships.

Datasets. In our experiments, we use the SUN [41], MIT-67 Indoor [42] and MSRC [43] datasets to analyze scene classification and object recognition performance and to compare our results. These datasets are appropriate as they provide a rich source of contextual information between scene and objects. In the SUN dataset, we choose 125 scene classes and 80 object categories to evaluate scene classification and object detection performance, as these contain annotations for both scenes and objects. The MIT-67 Indoor dataset [42] consists of 67 indoor scene categories with a large variety of object categories. For the MSRC dataset [43], we evaluate our results against the ground truth available in [13].

Experimental Setup. We use the publicly available ‘UGM Toolbox’ [44] to infer the node and edge beliefs in the image graphs. We use the pre-trained ‘VGG net’ model [22], trained on the ‘Places-205’ dataset, to extract the CNN scene features. For object recognition, we use the model presented in [25].

In our online learning process, we perform 5-fold cross-validation, where one fold is used as the test set and the rest are used as the training set. We divide the training set into 6 batches. We assume that human-labeled samples are available in the first batch and use them to obtain the initial S and O classification models and the S-O and O-O relations. The first batch may not contain all the scene and object classes, so new classes are learned incrementally as batches of data come in. With each incoming batch, we apply our framework to choose the most informative samples to label and then update the classification and relationship models with the newly labeled data. Finally, we compute recognition results on the test set with each updated model.

Evaluation Criterion. In order to train the object detectors, we first choose positive and negative examples. We apply the standard hard negative mining method [37] to train the binary SVMs. We calculate the average precision (AP) of each category by comparing with the ground truth. Precision depends on both correct labeling and localization (overlap between the detected box and the ground truth box). Let the computed bounding box of an object be \(O_b\) and the ground truth box be \(G_b\); then the overlap ratio is \(OR=\frac{O_b \cap G_b}{O_b \cup G_b}\), and \(OR \ge 0.5\) is considered correct localization of an object (a short sketch of this overlap computation is given after the abbreviation list below). Before presenting our results, we define all the abbreviations that will be used hereafter.

  • \( \diamond \) SOAL: proposed scene-object active learning (SOAL) as discussed in Sect. 3.

  • \( \diamond \) Bv2B: Best vs Second Best active learning strategy proposed in [18].

  • \( \diamond \) IL-SO: Incremental learning (IL) approach presented in [45] is implemented for scene and object (SO) classification.

  • \(\diamond \) No Rel: No relation is considered between scene and objects.

  • \(\diamond \) S-O Rel: Only S-O relations are considered but not O-O relations.

  • \( \diamond \) S-O-O Rel: Both S-O and O-O relationships are considered.

  • \( \diamond \) All+S-O: All samples with S-O relations are considered.

  • \( \diamond \) All+S-O-O: All samples with both S-O and O-O relations are considered.

  • \( \diamond \) All+No Rel: All samples without any relation are considered.

  • \( \diamond \) SO+All: All samples in batch are considered for scene and object classification with S-O-O relationship.

  • \( \diamond \) NL, AL: NL implies no human in the loop, i.e., we do not invoke any human to learn labels. AL denotes active learning. For example, S-NL+O-AL means scene nodes are not queried but object nodes are queried.
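As referenced in the evaluation criterion above, the overlap ratio between a detected box and a ground truth box can be computed as follows; boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` rectangles.

```python
def overlap_ratio(box_a, box_b):
    """OR = area(O_b intersect G_b) / area(O_b union G_b).

    A detection is counted as correctly localized when OR >= 0.5.
    """
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```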

Experimental Analysis. We perform the following set of experiments - 1. Comparison with other active learning methods, 2. Comparison of the baselines with different S-O and O-O relations, 3. Comparison against other scene and object recognition methods, and 4. Recognition performance of scene and object models while labeling either scene or object.

Comparison with Other Active Learning Methods. In Figs. 3(a–c) and 4(a–c), we compare our active learning framework with some existing active learning approaches: Bv2B [18], Random Selection, Entropy [46] and IL-SO [45]. In the case of random selection, we pick the samples with uniform probability. For Bv2B, Entropy and IL-SO, we implement the methods to select the informative samples for scene and objects; the feature extraction stages are the same as ours. We observe that our approach outperforms the other methods by a large margin in selecting the most informative samples in both scene and object recognition.

Is Contextual Information Useful in Selecting the Most Informative Samples? We conduct an experiment that applies our proposed active learning strategy while exploiting different sets of relations between scene (S) and objects (O). Figures 3(d–f) and 4(d–f) show the plots for S and O respectively on the three datasets. We notice that the highest accuracy is yielded by S-O-O Rel (proposed), followed by S-O Rel and No Rel, in scene classification as well as in object recognition. This brings out the advantage of exploiting both S-O and O-O relations in actively choosing the samples for manual labeling. Moreover, the manual labeling cost is significantly reduced when we consider more relations. It may also be noted that our proposed framework achieves similar or even better performance by choosing only a small subset of the training data, compared with building a model on the full training set, for both scenes and objects. For scenes, this subset is \(\mathbf 35\,\%\), \(\mathbf 30\,\% \) and \(\mathbf 42\,\%\) of the whole training set on the MSRC, SUN and MIT datasets respectively. Similarly, for objects, we require only \(\mathbf 39\,\% \), \(\mathbf 61\,\% \) and \(\mathbf 60\,\% \) of the whole training set to be manually labeled on these three datasets.

Fig. 3.

In this figure, we present the scene classification performance for three datasets- MSRC [43], SUN [2] and MIT-67 Indoor [42] (left to right). Plots (a, b, c) present the comparison of SOAL (proposed) against other state-of-the-art active learning methods. Plots (d, e, f) demonstrate comparison with different contextual relations. Plots (g, h, i) demonstrate the comparison of other scene classification methods. Plots (j, k, l) show the classification performance by utilizing our active learning framework either on scene or objects and both. Please see the experimental section for details. Best viewable in color. (Color figure online)

Comparison Against Other Scene and Object Classification Methods. We also compare our S and O classification performance with other state-of-the-art S and O recognition methods. For scenes, we choose Holistic [13], CNN [22], DSIFT [21], MLRep [47], \(S^2\)ICA [48] and MOP-CNN [49]. Similarly, we compare against Holistic [13], R-CNN [23] and DPM [37] for object detection performance. The holistic approach exploits the interrelationship between S and O using a graphical model. We also compare with SO+All. From Figs. 3(g–i) and 4(g–i), we can see that our proposed framework outperforms the other state-of-the-art methods.

How Does Scene and Object Sample Selection Affect the Classification Score of Each Other? We perform an experiment to observe how S and O recognition performs when we apply active sample selection to either scene or object nodes and exploit the S-O and O-O relationships to improve the decisions of the other type of nodes. The results are shown in Figs. 3(j–l) and 4(j–l). Let us consider the first scenario (S-NL+O-AL), where we perform AL on the O nodes but use the relationships to update the classification probabilities of the S node. We use the first batch to learn the S and O models, but thereafter query labels only for object nodes and not scene nodes.

Fig. 4.

In this figure, we show the object detection performances on MSRC [43], SUN [2] and MIT-67 Indoor [42] (left to right). Plots (a, b, c) present the comparison of SOAL with other state-of-the-art active learning methods. Plots (d, e, f) demonstrate comparison with different graphical relations. Plots (g, h, i) present the comparison of other object detection methods. Plots (j, k, l) show the detection performance by implementing our active learning framework either on scene or objects and both. Please see the experimental section for details. Best viewable in color. (Color figure online)

The relationship models are updated based on the confidence of the scene classifier and the object labels obtained from the human annotator. With each update of the context model, scene classification accuracy goes up even though the scene classification model itself is not updated. Similarly, the second scenario involves manual labeling of only S nodes but not O nodes. In this scenario, we do not consider O-O relationships, as we cannot rely on the confidence of the object detectors to model O-O relations - they might predict wrong object labels. However, involving the human for both scenes and objects makes the sample selection even more efficient and outperforms all the scenarios mentioned above. As shown in Figs. 3(j–l) and 4(j–l), S-AL+O-AL achieves better performance than S-AL+O-NL by approximately \( \mathbf 4\)–\(\mathbf 5\,\% \) and \(\mathbf 4.5\)–\(\mathbf 5.5\,\% \) in scene and object recognition respectively on the three datasets.

Fig. 5.

Scene prediction and object detection performance on test image with updated model learned from the data of \(1^{st}\), \(4^{th}\) and \(6^{th}\) batch.

Some Examples of Active Learning (AL) Performance. We provide some examples of scene prediction and object detection in Fig. 5. Here, the scene prediction and object detections change as the models are updated with samples from each batch. The scene and object models are updated continuously with each upcoming batch of data using our AL approach. With each improved model, the classifiers become more confident in predicting scene and object labels on the test image. More such examples are provided in the supplementary material.

5 Conclusions

In this paper, we propose a novel active learning framework for joint scene and object classification exploiting the interrelationship between them. We exploit the scene-object and object-object interdependencies in order to select the most informative samples and develop better classification models for scenes and objects. Our approach significantly reduces the human effort in labeling samples. We show in the experimental section that with only a small subset of the full training set we achieve better or similar performance compared with using the full training set.