
1 Introduction

Computer vision poses several challenges, among them place recognition and scene recognition, which are often confused. Over the past few years, several scene recognition publications [1, 2, 3] have highlighted the need to better address scene classification and scene representation. One distinction between the two is that scene classification aims to build effective classifiers, whereas scene representation aims to extract discriminative features. Existing approaches can be divided into two main categories: learning-based features [4, 5] and hand-crafted features [6]. Hand-crafted methods include the census transform histogram (CENTRIST), GIST, and oriented texture curves (OTC) [7]; they capture low-level visual information such as textural and structural cues in scene images. Despite their quality, such features are not sufficient for more complex scene images. Moreover, some discriminating objects occur in one scene with high probability yet occasionally appear in other scenes, while other objects may appear in separate scenes with a similar chance. Our goal is to feed our fine-tuned TrainDetector model image patches containing relevant features. This research therefore proposes a semantic descriptor for scene recognition, which we name TrainDetector, based on the Single Shot Detector (SSD), a Multi-modal Local Receptive Field (MM-LRF), and the Extreme Learning Machine (ELM). The remainder of this paper is organized as follows: Sect. 2 introduces related studies; Sect. 3 describes the proposed model; Sect. 4 presents and discusses the experimental results; and Sect. 5 concludes our work.

2 Related Studies

In this section, we briefly review the three main topics relevant to our research: ELM, scene classification, and scene representation.

Extreme-Learning-Machine.

The Extreme Learning Machine (ELM), shown in Fig. 1, was first introduced in [8] as a single-hidden-layer feedforward neural network and achieves strong performance in image processing.

Fig. 1. The basic model of an ELM [8]

Suppose we want to learn N distinct samples \( \left\{ {X,\,T} \right\} = \left\{ {X_{j} ,t_{j} } \right\}_{j = 1}^{N} \), where \( X_{j} \in \,R^{n} \) and \( t_{j} \in \,R^{m} \), and to train a single-hidden-layer feedforward neural network (SLFN) with L hidden neurons and activation function g(x). Instead of being assigned as inputs, the hidden biases and weights are randomly generated in ELM. As a result, the nonlinear system can be converted into a linear one:

$$ Y_{j} = \mathop \sum \limits_{i = 1}^{L} \beta_{i} g_{i} \left( {X_{j} } \right) = \mathop \sum \limits_{i = 1}^{L} \beta_{i} g\left( {X_{j} W_{i}^{T} + b_{i} } \right) = t_{j} ,\quad j = 1,2, \ldots ,N $$
(1)

Where \( W_{i} \in \,R^{n} \) is the input weight vector connecting the input nodes to the \( i^{th} \) hidden neuron, \( b_{i} \) is the bias of the \( i^{th} \) hidden neuron, and \( Y_{j} \in \,R^{m} \) is the output vector of the \( j^{th} \) training sample. Furthermore, \( {\text{g}}\left( . \right) \) denotes the nonlinear activation function, and \( \beta_{i} = \left( {\beta_{i1} ,\beta_{i2} , \ldots , \beta_{im} } \right)^{T} \) is the weight vector connecting the \( i^{th} \) hidden neuron to the output neurons. The previous equations can then be rewritten as:

$$ H\beta = T $$
(2)

Where T is the target matrix, and H can be explicitly defined as:

$$ H = \left[ {\begin{array}{*{20}c} {g\left( {X_{1} W_{1}^{T} + b_{1} } \right)} & \ldots & {g\left( {X_{1} W_{L}^{T} + b_{L} } \right)} \\ \vdots & \ddots & \vdots \\ {g\left( {X_{N} W_{1}^{T} + b_{1} } \right)} & \ldots & {g\left( {X_{N} W_{L}^{T} + b_{L} } \right)} \\ \end{array} } \right] $$
(3)
$$ \beta = \, \,\left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ : \\ {\beta_{L}^{T} } \\ \end{array} } \right],\,\,T = \,\, \left[ {\begin{array}{*{20}c} {t_{1}^{T} } \\ : \\ {t_{N}^{T} } \\ \end{array} } \right] $$
(4)

Hence, computing the output weights \( \beta \) amounts to finding the least-squares (LS) solution of the linear system in (2):

$$ \hat{\beta } = H^{\dagger } T $$
(5)

Where \( H^{\dagger } \) is the Moore–Penrose (MP) generalized inverse of the matrix H. As mentioned by Huang et al., ELM achieves good generalization performance and a considerable increase in learning speed when such MP inverse methods are used.
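For readers who want to see Eqs. (1)–(5) in action, the following minimal NumPy sketch (our own illustration, not the authors' implementation) trains a basic ELM: the input weights and biases are drawn at random, the hidden-layer output matrix H is formed as in Eq. (3) with a sigmoid activation, and the output weights are recovered through the Moore–Penrose pseudoinverse as in Eq. (5).

```python
import numpy as np

def train_elm(X, T, L, rng=np.random.default_rng(0)):
    """Minimal ELM training sketch (Eqs. 1-5).

    X: (N, n) input samples, T: (N, m) targets, L: number of hidden neurons.
    Returns the random input weights W, biases b, and the output weights beta.
    """
    n = X.shape[1]
    W = rng.standard_normal((L, n))            # random input weights (never trained)
    b = rng.standard_normal(L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # hidden-layer output matrix, Eq. (3)
    beta = np.linalg.pinv(H) @ T               # Moore-Penrose solution, Eq. (5)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Apply the trained ELM: Y = H beta, Eq. (2)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

Because only the linear output layer is solved for, training reduces to a single pseudoinverse, which is the source of the speed advantage noted above.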

3 Local Features and Discriminative Objects

We design TrainDetector so that the part responsible for extracting local features can be clearly distinguished from the part responsible for extracting global features. As shown in Fig. 2, the multi-modal training architecture presented in this study goes through three main procedures: unsupervised feature representation, which handles each modality separately; feature fusion, which outputs a feature matrix H obtained by combining the per-modality feature matrices \( \varvec{H}_{\varvec{i}} \), \( i \in \left[ {1,2} \right] \); and supervised feature classification, performed by the Single Shot Detector.

Fig. 2. The pipeline of the proposed TrainDetector approach for scene representation. We proceed as follows: first, we compute the feature maps by extracting relevant parts of the images; second, we pass the feature maps through the SSD to extract object score vectors, discriminative objects, and the local representation; third, through a Places-CNN, we retrieve general features and the global description; finally, we classify the scene.

Object Detection.

As mentioned before, each modality (RGB and depth) is handled separately. We feed both simultaneously to a single LRF-ELM layer, which tolerates some deformation of object parts and captures low-level features such as edges. The output of each LRF-ELM layer \( \text{(}\varvec{H}_{1}^{\varvec{c}} \,\,and\,\,\varvec{H}_{1}^{\varvec{d}} \text{)} \) has size \( K \times N \cdot \left( {1 - r + d} \right)^{2} \), where K is the number of feature maps, N is the number of input samples, r is the size of the receptive field, and d is the input size. \( \varvec{H}_{1}^{\varvec{c}} \,\,and\,\,\varvec{H}_{1}^{\varvec{d}} \) are both feature matrices, which are combined into a single matrix H as follows:

$$ H = \,\,\left[ {H_{1}^{c} ;\,\,H_{1}^{d} } \right]^{T} $$
(6)

This combined feature matrix H is then submitted to our last component, the SSD, which outputs a precise set of objects detected with high confidence.
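As a simple illustration of this fusion step (a sketch with placeholder shapes and random values, not actual model outputs), the per-modality output size \( K \times N \cdot \left( {1 - r + d} \right)^{2} \) and the stacking of Eq. (6) can be reproduced as follows:

```python
import numpy as np

K, N, d, r = 32, 4, 64, 5                 # feature maps, samples, input size, receptive field
feat_dim = K * (d - r + 1) ** 2           # per-sample feature length for one modality

H1_c = np.random.rand(feat_dim, N)        # RGB modality features (placeholder values)
H1_d = np.random.rand(feat_dim, N)        # depth modality features (placeholder values)

H = np.concatenate([H1_c, H1_d], axis=0).T   # combined matrix, Eq. (6)
print(H.shape)                             # (N, 2 * feat_dim)
```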

Unsupervised Feature Representations.

The local receptive fields in this framework are based on ELM and allow us to extract important features. As illustrated (see Fig. 3), the learning of the representation is obtained by processing the features of each modality. Our LRF-ELM can be divided into two main operations. First, we randomly generate the initial weight matrices \( \left( {\hat{W}_{init}^{c} \,\,and\,\,\hat{W}_{init}^{d} } \right) \) with receptive field size \( r^{2} \) and input size \( d^{2} \); hence, we obtain feature maps of size \( \left( {1 - r + d} \right) \times \left( {1 - r + d} \right) . \)

Fig. 3. A similar global layout across three different scenes containing common objects (e.g., shelves and people) and some discriminative objects (e.g., books in the bookstore and shoes in the shoe store).

$$ \hat{w}_{init}^{c} ,\hat{w}_{init}^{d} \in R^{{r^{2} }} ,\quad \hat{W}_{init}^{c} ,\hat{W}_{init}^{d} \in R^{{r^{2} \times T}} ,\quad t = 1,2,3, \ldots ,T $$
(7)

Thus, through the Singular Value Decomposition (SVD), we orthogonalize \( \hat{W}_{init}^{c} \,\,and\,\,\hat{W}_{init}^{d} \). Second, we generate the combinatorial nodes as follows: we assume the size of the feature map equals that of the pooling map, with e the pooling size, i.e., the distance between the edge of the pooling area and its center. Furthermore, \( w_{p,q,t}^{c} ,\,w_{p,q,t}^{d} \) denote the combinatorial node (p, q) in the \( t^{th} \) pooling map, and \( C_{i,j,t}^{c} \,\,and\,\,C_{i,j,t}^{d} \) denote the node (i, j) in the \( t^{th} \) feature map, as shown below:

$$ \left\{ {\begin{array}{*{20}c} {w_{p,q,t}^{c} = \sqrt {\mathop \sum \limits_{i = p - e}^{p + e} \mathop \sum \limits_{j = q - e}^{q + e} \left( {C_{i,j,t}^{c} } \right)^{2} } } \\ {w_{p,q,t}^{d} = \sqrt {\mathop \sum \limits_{i = p - e}^{p + e} \mathop \sum \limits_{j = q - e}^{q + e} \left( {C_{i,j,t}^{d} } \right)^{2} } } \\ \end{array} } \right. $$
(8)

Where \( {\text{p}},\,{\text{q}}\, = \, 1, \ldots ,\left( {1 - r + d} \right) \) and \( C_{i,j,t}^{c} = C_{i,j,t}^{d} = 0 \) when (i, j) lies outside the feature map.
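A compact sketch of these two operations for a single modality is given below; it is our own illustrative reconstruction under the stated definitions (random orthogonalized filters of Eq. (7), valid convolution, and square-root pooling with zero padding as in Eq. (8)), not the authors' code.

```python
import numpy as np

def lrf_elm_features(img, T=8, r=5, e=2, rng=np.random.default_rng(0)):
    """LRF-ELM feature extraction sketch for one modality (Eqs. 7-8).

    img: (d, d) single-channel image, T: number of feature maps,
    r: receptive-field size, e: pooling radius.
    """
    d = img.shape[0]
    m = d - r + 1                                   # feature-map side length (1 - r + d)

    W_init = rng.standard_normal((r * r, T))        # random initial weights, Eq. (7)
    U, _, _ = np.linalg.svd(W_init, full_matrices=False)
    W = U                                           # SVD-orthogonalized filters

    # valid convolution: every r x r patch against every orthogonal filter
    patches = np.stack([img[i:i + r, j:j + r].ravel()
                        for i in range(m) for j in range(m)])   # (m*m, r*r)
    C = (patches @ W).reshape(m, m, T)              # feature maps C_{i,j,t}

    # square-root pooling over an e-neighbourhood, zeros outside the map, Eq. (8)
    Cp = np.pad(C, ((e, e), (e, e), (0, 0)))
    pooled = np.empty_like(C)
    for p in range(m):
        for q in range(m):
            window = Cp[p:p + 2 * e + 1, q:q + 2 * e + 1, :]
            pooled[p, q, :] = np.sqrt((window ** 2).sum(axis=(0, 1)))
    return pooled.reshape(-1)                       # flattened feature vector
```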

Supervised Feature Classification.

Taking as input the combined features obtained in the previous step, we have sufficient parameters for the third step: the SSD receives the data directly at its convolutional classifier layers. This step performs the feature classification, so that we obtain a set of score vectors at the SoftMax layer.

Discriminative Objects.

Obtaining the objects of an image is just one step in the process. The next step is to select, among those objects, the ones with a high discriminative factor; but before elaborating on that, the reader needs to understand what discriminative objects means.

In this study, discriminative objects are therefore defined as objects that appear with a high probability of occurrence in one class but have a low probability of occurrence in the other classes of the dataset. An example is the objects marked with (+) in Fig. 3.

The multinomial object distribution for each category is derived from the object score vectors at the softmax layer of our network, and it gives the probability statistics of all object classes in a scene category. More precisely, at training time we supply to an ImageNet-CNN (e.g., the well-known VGGNet) a set of patches \( P = \left[ {p_{1} ,p_{2} , \ldots , p_{i} , \ldots , p_{N} } \right] \) coming from several images of the same category (e.g., kitchen). At the softmax layer, we obtain for each patch a 1000-dimensional score vector representing the occurrence probabilities of the object classes. Furthermore, to detect the occurrence of an object \( O_{i} \) in an image patch, we first compute the score vector \( S_{i} \) of each patch \( P_{i} \), where \( S = \left[ {S_{1} ,\, \ldots ,\,S_{i} ,\, \ldots S_{N} } \right] \), and we set a confidence level \( \sigma \) for S. We then perform the detection by applying the equation below:

$$ \delta \left( {x|\sigma } \right) = h\left[ {S_{i} \left( x \right) - \sigma } \right] $$
(9)

Where \( h\left( x \right) = 1 \) for \( x\, \ge \,0 \) and \( h\left( x \right) = 0 \) for \( x\, < \,0 \). To avoid missing infrequent classes, we apply the function \( f_{O} \) to a batch of images I to detect the occurrence of object \( O_{i} \) without the need for a confidence level, as follows:

$$ f_{O} \left( x \right) = \sum\nolimits_{{P_{i} \in I}} {S_{i} } $$
(10)

Where \( P_{i} \) is a patch of the image \( I \) and \( S_{i} \) is the score vector of the patch \( P_{i} \).
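The following snippet is a hypothetical illustration of Eqs. (9) and (10); in practice the score vectors \( S_{i} \) would come from the CNN softmax layer rather than the random placeholders used here.

```python
import numpy as np

def detect_with_threshold(S_i, sigma):
    """Eq. (9): 1 where the patch score exceeds the confidence level sigma, else 0."""
    return (S_i >= sigma).astype(float)

def accumulate_scores(patch_scores):
    """Eq. (10): sum the softmax score vectors of all patches of one image."""
    return np.sum(patch_scores, axis=0)

# toy example: 3 patches, 1000 object classes (placeholder random scores)
patch_scores = np.random.dirichlet(np.ones(1000), size=3)
detected = detect_with_threshold(patch_scores[0], sigma=0.01)   # thresholded detection
f_O = accumulate_scores(patch_scores)                           # threshold-free scores
```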

Considering a set of images \( I_{C} \) belonging to class C, we can compute the probability of occurrence of an object O in class C as:

$$ p\left( {O|C} \right) = \frac{1}{{N_{{I_{c} }} }} \sum\nolimits_{{X_{i} \in I_{C} }} {f_{O} \left( {X_{i} } \right)} $$
(11)

In this paper, we take \( p\left( {O|C} \right) \) as the object multinomial distribution of class C. Figure 4 shows various object distributions resulting from the computation of \( p\left( {O|C} \right) \) on different classes. Furthermore, we can now compute the posterior probability of the scene classes by taking into account the observation of all objects; applying Bayes' rule, we obtain:

$$ P\left( {C_{j} |O_{i} } \right) = \frac{{p\left( {O|C_{j} } \right)p\left( {C_{j} } \right)}}{{\mathop \sum \nolimits_{j} p\left( {O|C_{j} } \right)p\left( {C_{j} } \right)}} $$
(12)
Fig. 4. Considerable variation of the object multinomial distribution across scene categories. (a) shows that our model detects, with lower confidence (<0.5), a book as a discriminative object in the Bookstore category, whereas in (b) the model easily detects shoes (Shoe Store), and in (c) several other objects can be detected, but jewelry has the highest probability (>0.5).

Where \( p\left( {O|C_{j} } \right) \) is computed as in Eq. (11) for the scene class \( C_{j} \), and \( p\left( {C_{j} } \right) \) is the prior probability of the scene class.
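A short sketch (again with placeholder data, not actual results) shows how the class-conditional object distribution of Eq. (11) and the Bayes posterior of Eq. (12) could be computed once the accumulated scores are available.

```python
import numpy as np

def object_distribution(image_scores):
    """Eq. (11): average the accumulated object scores over all images of a class."""
    return np.mean(image_scores, axis=0)        # p(O | C), one value per object class

def scene_posterior(p_O_given_C, priors):
    """Eq. (12): posterior over scene classes for each observed object (Bayes rule)."""
    joint = p_O_given_C * priors[:, None]       # p(O|C_j) p(C_j), shape (classes, objects)
    return joint / joint.sum(axis=0, keepdims=True)

# toy example: 2 scene classes, 1000 object classes, uniform priors
p_O_given_C = np.vstack([np.random.dirichlet(np.ones(1000)) for _ in range(2)])
posterior = scene_posterior(p_O_given_C, priors=np.array([0.5, 0.5]))
```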

4 Experiments and Comparison with State-of-the-Art Methods

As mentioned in the introduction, we apply our method to three well-known datasets: Scene 15 [9], MIT Indoor 67 [10], and SUN 397 [11]. SUN 397 and Scene 15 are used to cover both indoor and outdoor scenes, while MIT Indoor 67 is used to confirm the accuracy of our method. The following parts describe the experiments performed with these datasets.

  • Scene 15 Dataset [9], which offers relevant images for indoor and outdoor scenes, contains 4485 gray-level pictures of 15 different scene categories. However, it does not include a predefined training and testing split, so we report the mean classification performance over five random splits. We use 100 training images per category and the remaining images for testing.

  • MIT Indoor 67 Dataset [10] is a considerable dataset containing 15,620 color images spanning 67 indoor scene categories, with substantial variation among categories and at least 100 images per category. Following the protocol described in [10], we use 80 images from each category for training.

  • SUN 397 Dataset [11] is a massive dataset offering 397 scene categories with at least 100 images per category. Following the protocol defined in [11], we train our model with 50 training images and 50 test images per category.

Several approaches have been proposed and evaluated on SUN 397, MIT Indoor 67, and Scene 15; we compare these methods with our proposed method in Tables 1, 2 and 3.

Table 1. Comparison of our approach with other approaches on the Scene 15 dataset.
Table 2. Comparison of our approach with other CNN-based approaches on MIT Indoor 67.
Table 3. Comparison of our approach with other approaches on the SUN 397 dataset.

5 Conclusion

In this paper, we have proposed TrainDetector, a novel semantic descriptor framework for scene recognition, in which the features of each modality are extracted independently and then combined to obtain the discriminative objects and the local and global representations across scenes. We evaluated our framework on three benchmark scene datasets and demonstrated the efficiency of our approach.