
1 Introduction

Computer vision poses several challenges, among them place recognition and scene recognition, which are often confused. Over the past few years, several scene recognition publications [1, 2, 3] have highlighted the need to better address scene classification and scene representation. One distinction between the two is that scene classification aims to build effective classifiers, whereas scene representation aims to extract discriminative features. Existing approaches can be divided into two main categories: learning-based features [4, 5] and hand-crafted features [6]. Hand-crafted methods include the census transform histogram (CENTRIST), GIST, and oriented texture curves (OTC) [7]; they capture low-level visual information such as textural and structural cues in scene images. Despite their quality, such features are not sufficient for more complex scene images. Moreover, some discriminating objects occur in one scene with high probability yet occasionally appear in other scenes, while other objects may appear in separate scenes with a similar chance. Our goal is to feed our fine-tuned TrainDetector model image patches containing relevant features. This research therefore proposes a semantic descriptor for scene recognition, which we name TrainDetector, based on the Single Shot Detector (SSD), a Multi-modal Local Receptive Field (MM-LRF), and the Extreme Learning Machine (ELM). The remainder of this paper is organized as follows: Sect. 2 introduces related studies; Sect. 3 describes the proposed model; Sect. 4 presents and discusses the experimental results; and Sect. 5 concludes our work.

2 Related Studies

In this section, we briefly review the three main topics relevant to our research: ELM, scene classification, and scene representation.

Extreme-Learning-Machine.

The Extreme Learning Machine (ELM), shown in Fig. 1, was first introduced in [8] as a single-hidden-layer feedforward neural network and achieves strong performance in image processing.

Fig. 1. The basic model of an ELM [8]

Suppose we want to learn N distinct samples \( \left\{ {X,\,T} \right\} = \left\{ {X_{j} ,t_{j} } \right\}_{j = 1}^{N} \), where \( X_{j} \in \,R^{n} \) and \( t_{j} \in \,R^{m} \), and to train a single-hidden-layer feedforward neural network (SLFN) with L hidden neurons and activation function g(x). Instead of being assigned as inputs, the hidden biases and weights are randomly generated in ELM. As a result, the nonlinear system can be converted into a linear one:

$$ Y_{j} = \mathop \sum \limits_{i = 1}^{L} \beta_{i} g_{i} \left( {X_{j} } \right) = \mathop \sum \limits_{i = 1}^{L} \beta_{i} g\left( {X_{j} W_{i}^{T} + b_{i} } \right) = t_{j} ,\quad j = 1,2, \ldots ,N $$
(1)

Where \( W_{i} \in \,R^{n} \) is the input weight vector connecting the input nodes to the \( i^{th} \) hidden neuron, \( b_{i} \) is the bias of the \( i^{th} \) hidden neuron, and \( Y_{j} \in \,R^{m} \) is the output vector of the \( j^{th} \) training sample. Furthermore, \( {\text{g}}\left( . \right) \) denotes the nonlinear activation function, and \( \beta_{i} = \left( {\beta_{i1} ,\beta_{i2} , \ldots , \beta_{im} } \right)^{T} \) is the weight vector connecting the \( i^{th} \) hidden neuron to the output neurons. The previous equations can then be rewritten as:

$$ H\beta = T $$
(2)

Where T is the target matrix, and H can be explicitly defined as:

$$ H = \left[ {\begin{array}{*{20}c} {g\left( {X_{1} W_{1}^{T} + b_{1} } \right)} & \ldots & {g\left( {X_{1} W_{L}^{T} + b_{L} } \right)} \\ \vdots & \ddots & \vdots \\ {g\left( {X_{N} W_{1}^{T} + b_{1} } \right)} & \ldots & {g\left( {X_{N} W_{L}^{T} + b_{L} } \right)} \\ \end{array} } \right] $$
(3)
$$ \beta = \, \,\left[ {\begin{array}{*{20}c} {\beta_{1}^{T} } \\ : \\ {\beta_{L}^{T} } \\ \end{array} } \right],\,\,T = \,\, \left[ {\begin{array}{*{20}c} {t_{1}^{T} } \\ : \\ {t_{N}^{T} } \\ \end{array} } \right] $$
(4)

Hence, computing the output weights \( \beta \) amounts to finding the least-squares (LS) solution of the linear system in (2):

$$ \hat{\beta } = H^{\dagger } T $$
(5)

Where \( H^{\dagger } \) is the Moore–Penrose (MP) generalized inverse of the matrix H. As mentioned by Huang et al., ELM achieves good generalization performance and a considerable increase in learning speed when such MP inverse methods are used.
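For readers who want to see Eqs. (1)–(5) in action, the following minimal NumPy sketch (our own illustration, not the authors' implementation) trains a basic ELM: the input weights and biases are drawn at random, the hidden-layer output matrix H is formed as in Eq. (3) with a sigmoid activation, and the output weights are recovered through the Moore–Penrose pseudoinverse as in Eq. (5).

```python
import numpy as np

def train_elm(X, T, L, rng=np.random.default_rng(0)):
    """Minimal ELM training sketch (Eqs. 1-5).

    X: (N, n) input samples, T: (N, m) targets, L: number of hidden neurons.
    Returns the random input weights W, biases b, and the output weights beta.
    """
    n = X.shape[1]
    W = rng.standard_normal((L, n))            # random input weights (never trained)
    b = rng.standard_normal(L)                 # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))   # hidden-layer output matrix, Eq. (3)
    beta = np.linalg.pinv(H) @ T               # Moore-Penrose solution, Eq. (5)
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Apply the trained ELM: Y = H beta, Eq. (2)."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```

Because only the linear output layer is solved for, training reduces to a single pseudoinverse, which is the source of the speed advantage noted above.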

3 Local Features and Discriminative Objects

We design TrainDetector so that the part responsible for extracting local features can be clearly distinguished from the part responsible for extracting global features. As shown in Fig. 2, the multi-modal training architecture presented in this study goes through three main procedures: unsupervised feature representation, which handles each modality separately; feature fusion, which outputs a feature matrix H obtained by combining the per-modality feature matrices \( \varvec{H}_{\varvec{i}} \), \( i \in \left[ {1,2} \right] \); and supervised feature classification, performed by the Single Shot Detector.

Fig. 2. The pipeline of the proposed TrainDetector approach for scene representation. We proceed as follows: first, we compute the feature maps by extracting relevant parts of the images; second, we pass the feature maps through the SSD to extract object score vectors, discriminative objects, and the local representation; third, through a Places-CNN, we retrieve general features and the global description; finally, we classify the scene.

Object Detection.

As mentioned before, each modality (RGB and depth) is handled separately. We feed both simultaneously to a single LRF-ELM layer, which tolerates some deformation of object parts and captures low-level features such as edges. The output of each LRF-ELM layer \( \text{(}\varvec{H}_{1}^{\varvec{c}} \,\,and\,\,\varvec{H}_{1}^{\varvec{d}} \text{)} \) has size \( K \times N \cdot \left( {1 - r + d} \right)^{2} \), where K is the number of feature maps, N is the number of input samples, r is the size of the receptive field, and d is the input size. \( \varvec{H}_{1}^{\varvec{c}} \,\,and\,\,\varvec{H}_{1}^{\varvec{d}} \) are both feature matrices, which are combined into a single matrix H as follows:

$$ H = \,\,\left[ {H_{1}^{c} ;\,\,H_{1}^{d} } \right]^{T} $$
(6)

This combined feature matrix H is then submitted to our last component, the SSD, which outputs a precise set of objects detected with high confidence.
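As a simple illustration of this fusion step (a sketch with placeholder shapes and random values, not actual model outputs), the per-modality output size \( K \times N \cdot \left( {1 - r + d} \right)^{2} \) and the stacking of Eq. (6) can be reproduced as follows:

```python
import numpy as np

K, N, d, r = 32, 4, 64, 5                 # feature maps, samples, input size, receptive field
feat_dim = K * (d - r + 1) ** 2           # per-sample feature length for one modality

H1_c = np.random.rand(feat_dim, N)        # RGB modality features (placeholder values)
H1_d = np.random.rand(feat_dim, N)        # depth modality features (placeholder values)

H = np.concatenate([H1_c, H1_d], axis=0).T   # combined matrix, Eq. (6)
print(H.shape)                             # (N, 2 * feat_dim)
```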

Unsupervised Feature Representations.

The local receptive fields in this framework are based on ELM and allow us to extract important features. As illustrated (see Fig. 3), the learning of the representation is obtained by processing the features of each modality. Our LRF-ELM can be divided into two main operations. First, we randomly generate the initial weight matrices \( \left( {\hat{W}_{init}^{c} \,\,and\,\,\hat{W}_{init}^{d} } \right) \) with receptive field size \( r^{2} \) and input size \( d^{2} \); hence, we obtain feature maps of size \( \left( {1 - r + d} \right) \times \left( {1 - r + d} \right) . \)

Fig. 3. A similar global layout across three different scenes containing common objects (e.g., shelves and people) and some discriminative objects (e.g., books in the bookstore and shoes in the shoe store).

$$ \hat{w}_{init}^{c} ,\hat{w}_{init}^{d} \in R^{{r^{2} }} ,\quad \hat{W}_{init}^{c} ,\hat{W}_{init}^{d} \in R^{{r^{2} \times T}} ,\quad t = 1,2,3, \ldots ,T $$
(7)

Thus, through the Singular Value Decomposition (SVD), we orthogonalize \( \hat{W}_{init}^{c} \,\,and\,\,\hat{W}_{init}^{d} \). Second, we generate the combinatorial nodes as follows: we assume the size of the feature map equals that of the pooling map, with e the pooling size, i.e., the distance between the edge of the pooling area and its center. Furthermore, \( w_{p,q,t}^{c} ,\,w_{p,q,t}^{d} \) denote the combinatorial node (p, q) in the \( t^{th} \) pooling map, and \( C_{i,j,t}^{c} \,\,and\,\,C_{i,j,t}^{d} \) denote the node (i, j) in the \( t^{th} \) feature map, as shown below:

$$ \left\{ {\begin{array}{*{20}c} {w_{p,q,t}^{c} = \sqrt {\mathop \sum \limits_{i = p - e}^{p + e} \mathop \sum \limits_{j = q - e}^{q + e} \left( {C_{i,j,t}^{c} } \right)^{2} } } \\ {w_{p,q,t}^{d} = \sqrt {\mathop \sum \limits_{i = p - e}^{p + e} \mathop \sum \limits_{j = q - e}^{q + e} \left( {C_{i,j,t}^{d} } \right)^{2} } } \\ \end{array} } \right. $$
(8)

Where \( {\text{p}},\,{\text{q}}\, = \, 1, \ldots ,\left( {1 - r + d} \right) \) and \( C_{i,j,t}^{c} = C_{i,j,t}^{d} = 0 \) when (i, j) lies outside the feature map.
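A compact sketch of these two operations for a single modality is given below; it is our own illustrative reconstruction under the stated definitions (random orthogonalized filters of Eq. (7), valid convolution, and square-root pooling with zero padding as in Eq. (8)), not the authors' code.

```python
import numpy as np

def lrf_elm_features(img, T=8, r=5, e=2, rng=np.random.default_rng(0)):
    """LRF-ELM feature extraction sketch for one modality (Eqs. 7-8).

    img: (d, d) single-channel image, T: number of feature maps,
    r: receptive-field size, e: pooling radius.
    """
    d = img.shape[0]
    m = d - r + 1                                   # feature-map side length (1 - r + d)

    W_init = rng.standard_normal((r * r, T))        # random initial weights, Eq. (7)
    U, _, _ = np.linalg.svd(W_init, full_matrices=False)
    W = U                                           # SVD-orthogonalized filters

    # valid convolution: every r x r patch against every orthogonal filter
    patches = np.stack([img[i:i + r, j:j + r].ravel()
                        for i in range(m) for j in range(m)])   # (m*m, r*r)
    C = (patches @ W).reshape(m, m, T)              # feature maps C_{i,j,t}

    # square-root pooling over an e-neighbourhood, zeros outside the map, Eq. (8)
    Cp = np.pad(C, ((e, e), (e, e), (0, 0)))
    pooled = np.empty_like(C)
    for p in range(m):
        for q in range(m):
            window = Cp[p:p + 2 * e + 1, q:q + 2 * e + 1, :]
            pooled[p, q, :] = np.sqrt((window ** 2).sum(axis=(0, 1)))
    return pooled.reshape(-1)                       # flattened feature vector
```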

Supervised Feature Classification.

Taking as input the combined features obtained in the previous step, we have sufficient parameters for the third step: the SSD receives the data directly at its convolutional classifier layers. This step performs the feature classification, so that we obtain a set of score vectors at the SoftMax layer.

Discriminative Objects.

Obtaining the objects of an image is just one step in the process. The next step is to select, among those objects, the ones with a high discriminative factor; but before elaborating on that, the reader needs to understand what discriminative objects means.

In this study, discriminative objects are therefore defined as objects that appear with a high probability of occurrence in one class but have a low probability of occurrence in the other classes of the dataset. An example is the objects marked with (+) in Fig. 3.

The multinomial object distribution for each category is derived from the object score vectors at the softmax layer of our network, and it gives the probability statistics of all object classes in a scene category. More precisely, at training time we supply to an ImageNet-CNN (e.g., the well-known VGGNet) a set of patches \( P = \left[ {p_{1} ,p_{2} , \ldots , p_{i} , \ldots , p_{N} } \right] \) coming from several images of the same category (e.g., kitchen). At the softmax layer, we obtain for each patch a 1000-dimensional score vector representing the occurrence probabilities of the object classes. Furthermore, to detect the occurrence of an object \( O_{i} \) in an image patch, we first compute the score vector \( S_{i} \) of each patch \( P_{i} \), where \( S = \left[ {S_{1} ,\, \ldots ,\,S_{i} ,\, \ldots S_{N} } \right] \), and we set a confidence level \( \sigma \) for S. We then perform the detection by applying the equation below:

$$ \delta \left( {x|\sigma } \right) = h\left[ {S_{i} \left( x \right) - \sigma } \right] $$
(9)

Where \( h\left( x \right) = 1 \) for \( x\, \ge \,0 \) and \( h\left( x \right) = 0 \) for \( x\, < \,0 \). To avoid missing infrequent classes, we apply the function \( f_{O} \) to a batch of images I to detect the occurrence of object \( O_{i} \) without the need for a confidence level, as follows:

$$ f_{O} \left( x \right) = \sum\nolimits_{{P_{i} \in I}} {S_{i} } $$
(10)

Where \( P_{i} \) is a patch of the image \( I \) and \( S_{i} \) is the score vector of the patch \( P_{i} \).
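The following snippet is a hypothetical illustration of Eqs. (9) and (10); in practice the score vectors \( S_{i} \) would come from the CNN softmax layer rather than the random placeholders used here.

```python
import numpy as np

def detect_with_threshold(S_i, sigma):
    """Eq. (9): 1 where the patch score exceeds the confidence level sigma, else 0."""
    return (S_i >= sigma).astype(float)

def accumulate_scores(patch_scores):
    """Eq. (10): sum the softmax score vectors of all patches of one image."""
    return np.sum(patch_scores, axis=0)

# toy example: 3 patches, 1000 object classes (placeholder random scores)
patch_scores = np.random.dirichlet(np.ones(1000), size=3)
detected = detect_with_threshold(patch_scores[0], sigma=0.01)   # thresholded detection
f_O = accumulate_scores(patch_scores)                           # threshold-free scores
```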

Considering a set of images \( I_{C} \) belonging to class C, we can compute the probability of occurrence of an object O in class C as:

$$ p\left( {O|C} \right) = \frac{1}{{N_{{I_{c} }} }} \sum\nolimits_{{X_{i} \in I_{C} }} {f_{O} \left( {X_{i} } \right)} $$
(11)

In this paper, we take \( p\left( {O|C} \right) \) as the object multinomial distribution of class C. Figure 4 shows various object distributions resulting from the computation of \( p\left( {O|C} \right) \) on different classes. Furthermore, we can now compute the posterior probability of the scene classes by taking into account the observation of all objects; applying Bayes' rule, we obtain:

$$ P\left( {C_{j} |O_{i} } \right) = \frac{{p\left( {O|C_{j} } \right)p\left( {C_{j} } \right)}}{{\mathop \sum \nolimits_{j} p\left( {O|C_{j} } \right)p\left( {C_{j} } \right)}} $$
(12)
Fig. 4. Considerable variation of the object multinomial distribution across scene categories. (a) shows that our model detects, with lower confidence (<0.5), a book as a discriminative object in the Bookstore category, whereas in (b) the model easily detects shoes (Shoe Store), and in (c) several other objects can be detected, but jewelry has the highest probability (>0.5).

Where \( p\left( {O|C_{j} } \right) \) is computed as in Eq. (11) for the scene class \( C_{j} \), and \( p\left( {C_{j} } \right) \) is the prior probability of the scene class.
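A short sketch (again with placeholder data, not actual results) shows how the class-conditional object distribution of Eq. (11) and the Bayes posterior of Eq. (12) could be computed once the accumulated scores are available.

```python
import numpy as np

def object_distribution(image_scores):
    """Eq. (11): average the accumulated object scores over all images of a class."""
    return np.mean(image_scores, axis=0)        # p(O | C), one value per object class

def scene_posterior(p_O_given_C, priors):
    """Eq. (12): posterior over scene classes for each observed object (Bayes rule)."""
    joint = p_O_given_C * priors[:, None]       # p(O|C_j) p(C_j), shape (classes, objects)
    return joint / joint.sum(axis=0, keepdims=True)

# toy example: 2 scene classes, 1000 object classes, uniform priors
p_O_given_C = np.vstack([np.random.dirichlet(np.ones(1000)) for _ in range(2)])
posterior = scene_posterior(p_O_given_C, priors=np.array([0.5, 0.5]))
```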

4 Experiments and Comparison with State-of-the-Art Methods

As mentioned in the introduction, we apply our method to three well-known datasets: Scene 15 [9], MIT Indoor 67 [10], and SUN 397 [11]. SUN 397 and Scene 15 are used to cover both indoor and outdoor scenes, while MIT Indoor 67 is used to confirm the accuracy of our method. The following parts describe the experiments performed with these datasets.

  • Scene 15 Dataset [9], which offers relevant images for indoor and outdoor scenes, contains 4485 gray-level pictures of 15 different scene categories. However, it does not include a predefined training and testing split, so we report the mean classification performance over five random splits. We use 100 training images per category and the remaining images for testing.

  • MIT Indoor 67 Dataset [10] is a considerable dataset containing 15,620 color images spanning 67 indoor scene categories, with substantial variation among categories and at least 100 images per category. Following the protocol described in [10], we use 80 images from each category for training.

  • SUN 397 Dataset [11] is a massive dataset offering 397 scene categories with at least 100 images per category. Following the protocol defined in [11], we train our model with 50 training images and 50 test images per category.

Several approaches have been proposed and evaluated on SUN 397, MIT Indoor 67, and Scene 15; we compare these methods with our proposed method in Tables 1, 2 and 3.

Table 1. Comparison of our approach with other approaches on the Scene 15 dataset.
Table 2. Comparison of our approach with other CNN-based approaches on MIT Indoor 67.
Table 3. Comparison of our approach with other approaches on the SUN 397 dataset.

5 Conclusion

In this paper, we have proposed TrainDetector, a novel semantic descriptor framework for scene recognition, in which the features of each modality are extracted independently and then combined to obtain the discriminative objects and the local and global representations across scenes. We evaluated our framework on three benchmark scene datasets and demonstrated the efficiency of our approach.