Abstract
Our ability to process images has kept improving since the introduction of deep learning. Lately, this has contributed to advances in object recognition through convolutional neural networks and in place recognition, which is the concern of this paper. In this research, we observed that extracting correct and relevant features for scene recognition is difficult. To address this issue, we extract, at the pixel level, several subareas that contain more color intensity than other parts of the image, and we pass through each image once to build its feature representation. We also noticed that several available models based on convolutional neural networks require a Graphics Processing Unit (GPU) for their implementation and are difficult to train. In this paper, we propose a novel scene recognition method using Single-Shot-Detector (SSD), Multi-modal Local-Receptive-Field (MM-LRF) and Extreme-Learning-Machine (ELM), which we named TrainDetector. It outperforms state-of-the-art techniques on three well-known scene recognition datasets.
1 Introduction
Computer vision comes with several challenges, such as place recognition and scene recognition, which are often confused with each other. Over the past few years, several scene recognition publications [1,2,3] have shown the need to better resolve scene classification and scene representation. The distinction between the two is that scene classification designs effective classifiers, whereas scene representation aims to extract discriminative features. Scene representations can be divided into two main categories: learning-based features, as mentioned in [4, 5], and hand-crafted features [6]. Hand-crafted methods include the census transform histogram (CENTRIST), GIST and oriented texture curves (OTC) [7]. Their components capture low-level visual information, such as textural and structural information in scene images. Despite their quality, these features are not sufficient for more complex scene images. Moreover, some discriminative objects can be found in one scene with high probability and only sometimes appear in other scenes, whereas other objects may appear in different scenes with similar probability. Our goal is to give as input to our fine-tuned TrainDetector model a set of image patches containing relevant features. Thus, this research proposes a semantic descriptor for scene recognition based on Single-Shot-Detector (SSD), Multi-modal Local-Receptive-Field (MM-LRF) and Extreme-Learning-Machine (ELM), which we named TrainDetector. The rest of this paper is organized as follows: Sect. 2 introduces the related studies; Sect. 3 presents the design of the proposed model; Sect. 4 presents and discusses the experimental results; and Sect. 5 concludes our work.
2 Related Studies
This section briefly reviews three main topics of our research: ELM, scene classification, and scene representation.
Extreme-Learning-Machine.
The Extreme Learning Machine (ELM) model (see Fig. 1) was first introduced in [8] for single-hidden-layer feedforward neural networks and achieves strong performance in image processing.
Suppose we want to learn N distinct samples \( \left\{ {X,\,T} \right\} = \left\{ {X_{j} ,t_{j} } \right\}_{j = 1}^{N} \), where \( X_{j} \in \,R^{n} \) and \( t_{j} \in \,R^{m} \), by training a single-hidden-layer feedforward neural network (SLFN) with K hidden neurons and activation function g(x). In ELM, the input weights and hidden biases are not assigned or tuned; they are randomly generated. As a result, the nonlinear system is converted into a linear system:
Where \( X_{i} \in \,\,R^{n} \) is the input weight vector connecting the input nodes to the \( i^{th} \) hidden neuron, and \( Y_{j} \in \,\,R^{m} \) is the output vector for the \( j^{th} \) training sample. Furthermore, \( {\text{g}}\left( . \right) \) is the nonlinear activation function, and \( \beta_{i} = \left( {\beta_{i1} ,\beta_{i2} , \ldots , \beta_{im} } \right)^{T} \) is the weight vector connecting the \( i^{th} \) hidden neuron to the output neurons. The previous equations can then be rewritten as:
Where T is the target matrix, and H can be explicitly defined as:
Hence, computing the value of Y (the output vector) amounts to finding the least-squares (LS) solution of the given linear system. Considering (1), the LS solution is:
Where \( H^{ - 1} \) denotes the Moore–Penrose (MP) generalized inverse of the matrix H. As reported by Huang et al. [8], ELM with the MP inverse achieves good generalization performance and a considerable increase in learning speed.
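The training procedure described above can be sketched in a few lines of NumPy. The sketch follows the standard ELM recipe of [8]: random hidden parameters, a hidden-layer output matrix H, and output weights obtained from the Moore–Penrose inverse (written \( H^{ - 1} \) above). The function names `elm_fit` and `elm_predict`, the sigmoid activation, and the toy data are our assumptions for illustration, not the configuration used in this paper.

```python
import numpy as np


def _hidden(X, W, b):
    # Sigmoid activation g(x); other bounded nonlinear functions could be used.
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))


def elm_fit(X, T, n_hidden, seed=0):
    """Fit a single-hidden-layer ELM and return (W, b, beta)."""
    rng = np.random.default_rng(seed)
    # Input weights and hidden biases are drawn at random and never tuned.
    W = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = _hidden(X, W, b)                 # hidden-layer output matrix
    # Least-squares output weights via the Moore-Penrose pseudoinverse.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Network output for the rows of X."""
    return _hidden(X, W, b) @ beta


# Toy usage (hypothetical data): XOR with one-hot targets.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
W, b, beta = elm_fit(X, T, n_hidden=20)
pred = elm_predict(X, W, b, beta).argmax(axis=1)
```

Only `beta` is learned, and it is obtained in a single linear-algebra step, which is where the speed advantage mentioned above comes from.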
3 Local Features and Discriminative Objects
We designed TrainDetector so that the part responsible for extracting local features can be easily distinguished from the part responsible for extracting global features. As shown in Fig. 2, the multi-modal training architecture presented in this study consists of three main procedures: unsupervised feature representation, which handles each modality; feature fusion, which outputs a feature matrix H obtained by combining the feature matrices \( \varvec{H}_{\varvec{i}} \), where \( i \in \left[ {1,2} \right] \); and supervised feature classification, performed by a Single Shot model.
Object Detection.
As mentioned before, each modality (RGB and depth) is handled separately. The two modalities are submitted simultaneously, each to a single LRF-ELM network layer, which tolerates some deformation of object parts and extracts low-level features such as edges. The output of each LRF-ELM layer \( \text{(}\varvec{H}_{1}^{\varvec{c}} \,\,and\,\,\varvec{H}_{1}^{\varvec{d}} \text{)} \) has size \( K \times N \cdot \left( {1 - r + d} \right)^{2} \), where K is the number of feature maps, N is the number of input samples, r is the size of the receptive field, and d is the input size. \( \varvec{H}_{1}^{\varvec{c}} \,\,and\,\,\varvec{H}_{1}^{\varvec{d}} \) are feature matrices that are combined into a single matrix H as follows:
The combined feature matrix H is then passed to our last component (SSD), which outputs a batch of detected objects with high precision.
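As a rough illustration of the fusion step, the sketch below joins the per-modality feature matrices into a single matrix. Because the fusion equation is not reproduced here, concatenation along the feature axis is our assumption, as are the function name `fuse_features`, the samples-by-features layout, and the toy sizes.

```python
import numpy as np


def fuse_features(H_c, H_d):
    """Concatenate RGB (H_c) and depth (H_d) features sample by sample."""
    if H_c.shape[0] != H_d.shape[0]:
        raise ValueError("both modalities must describe the same samples")
    return np.concatenate([H_c, H_d], axis=1)


# Hypothetical sizes: N samples, K feature maps, input size d, receptive field r.
N, K, d, r = 4, 3, 8, 3
side = d - r + 1                       # feature map side, i.e. (1 - r + d)
H_c = np.ones((N, K * side ** 2))      # stand-in for the RGB features
H_d = np.zeros((N, K * side ** 2))     # stand-in for the depth features
H = fuse_features(H_c, H_d)
```

With these toy sizes, each modality contributes 108 values per sample, so the fused matrix has 216 columns.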
Unsupervised Feature Representations.
The local receptive fields in this framework are based on ELM and allow us to extract important features. Fig. 3 illustrates the representation learning process applied to the features of each modality. Our LRF-ELM consists of two main operations. First, we randomly generate the initial weight matrices \( \left( {\hat{W}_{init}^{c} \,\,and\,\,\hat{W}_{init}^{d} } \right) \) with receptive field size \( r^{2} \) and input size \( d^{2} \). Hence, we obtain feature maps of size \( \left( {1 - r + d} \right) \times \left( {1 - r + d} \right) . \)
We then orthogonalize \( \hat{W}_{init}^{c} \,\,and\,\,\hat{W}_{init}^{d} \) through Singular Value Decomposition (SVD). Second, we generate the combinatorial nodes as follows: we assume that the pooling map has the same size as the feature map, with p being the pooling size, i.e., the distance between the center and the edge of the pooling area. Furthermore, \( w_{p,q,t}^{c} ,\,w_{p,q,t}^{d} \) denote the combinatorial node (p, q) in the \( t^{th} \) pooling map, and \( C_{i,j,t}^{c} \,\,and\,\,C_{i,j,t}^{d} \) denote the node (i, j) in the \( t^{th} \) feature map, as shown below:
Where \( {\text{p}},\,{\text{q}}\, = \, 1\ldots \left( {1 - r + d} \right) \), and \( C_{i,j,t}^{c} , C_{i,j,t}^{d} = 0 \) when the node (i, j) lies outside the feature map.
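The two operations can be sketched for a single modality as follows. The sketch follows the usual LRF-ELM recipe (random filters orthogonalized by SVD, a valid convolution, then pooling with zeros outside the map). Square-root pooling is assumed because the pooling equation is not reproduced here; the function names, the pooling size `e`, and the toy input are likewise our assumptions.

```python
import numpy as np


def orthogonal_filters(r, K, seed=0):
    """Draw K random r x r filters and orthogonalize them with SVD."""
    rng = np.random.default_rng(seed)
    A_init = rng.standard_normal((r * r, K))
    # For r*r >= K, the left singular vectors give K orthonormal columns.
    U, _, _ = np.linalg.svd(A_init, full_matrices=False)
    return U.T.reshape(K, r, r)


def feature_maps(x, filters):
    """Valid convolution: each map has size (d - r + 1) x (d - r + 1)."""
    K, r, _ = filters.shape
    side = x.shape[0] - r + 1
    C = np.empty((K, side, side))
    for i in range(side):
        for j in range(side):
            patch = x[i:i + r, j:j + r]
            C[:, i, j] = np.tensordot(filters, patch, axes=([1, 2], [0, 1]))
    return C


def sqrt_pool(C, e):
    """Square-root pooling; nodes outside the feature map count as 0."""
    K, side, _ = C.shape
    padded = np.pad(C ** 2, ((0, 0), (e, e), (e, e)))
    P = np.empty_like(C)
    for p in range(side):
        for q in range(side):
            window = padded[:, p:p + 2 * e + 1, q:q + 2 * e + 1]
            P[:, p, q] = np.sqrt(window.sum(axis=(1, 2)))
    return P


x = np.arange(64, dtype=float).reshape(8, 8)   # toy d x d input with d = 8
filters = orthogonal_filters(r=3, K=4)
C = feature_maps(x, filters)                   # K feature maps
P = sqrt_pool(C, e=1)                          # K pooling maps, same size
```

The pooling map keeps the size of the feature map, as assumed in the text, because the zero padding supplies the missing neighbours at the borders.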
Supervised Feature Classification.
Taking the combined features from the previous step as input, the third step (SSD) has sufficient parameters to proceed: it receives the data directly at its convolutional classifier layers. This step performs the feature classification, yielding a set of score vectors at the softmax layer.
Discriminative Objects.
Obtaining the objects in an image is just one step in the process. The next step is to select, among those objects, the ones with a high discriminative factor. Before elaborating on that, we need to define what discriminative objects are.
In this study, discriminative objects are defined as objects that appear with a high probability of occurrence in one class but with a low probability of occurrence in the other classes of the dataset. Examples are the objects marked with (+) (see Fig. 3).
The multinomial object distribution for each category is derived from the object score vectors at the softmax layer of our network; it gives the occurrence probabilities of all object classes in a scene category. More precisely, at training time, we supplied an ImageNet-CNN (e.g., the well-known VGGNet) with a set of patches \( P = \left[ {p_{1} ,p_{2} , \ldots , p_{i} , \ldots , p_{N} } \right] \) taken from several images of the same category (e.g., kitchen). At the softmax layer, we obtain for each patch a 1000-dimensional score vector whose entries represent the occurrence probability of each object class. Furthermore, to detect the occurrence of an object \( O_{i} \) in a set of image patches, we first compute the score vector of each patch \( P_{i} \), giving \( S = \left[ {S_{1} ,\, \ldots ,\,S_{i} ,\, \ldots S_{N} } \right] \), and we randomly set a confidence level \( \sigma \) for S. The detection is then obtained by applying the equation below:
Where \( h\left( x \right) = 1 ,\,\,x\, \ge \,0 \) and \( h\left( x \right) = 0 ,\,\,x\, < \,0 \). To avoid missing some infrequent classes, we also apply the function \( f_{O} \) to a batch of images I to detect the occurrence of object \( O_{i} \) without the need for a confidence level, as follows:
Where \( P_{i} \) is a patch of the images \( I \) and \( S_{i} \) is the score vector of the patch \( P_{i} \).
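The two detection rules can be sketched as below. Since the equations are not reproduced here, the sketch assumes that the thresholded rule marks an object as present when its score reaches \( \sigma \), and that the threshold-free rule marks only the top-scoring object of each patch. The function names and the toy scores are hypothetical.

```python
import numpy as np


def occurs_threshold(scores, sigma):
    """h(S - sigma): 1 where a score reaches the confidence level, else 0."""
    return (scores - sigma >= 0).astype(int)


def occurs_argmax(scores):
    """Threshold-free rule: mark only the highest-scoring object per patch."""
    occ = np.zeros(scores.shape, dtype=int)
    occ[np.arange(scores.shape[0]), scores.argmax(axis=1)] = 1
    return occ


# Toy softmax scores: 3 patches x 4 object classes (hypothetical values).
scores = np.array([
    [0.70, 0.10, 0.10, 0.10],
    [0.30, 0.30, 0.25, 0.15],
    [0.05, 0.05, 0.10, 0.80],
])
by_threshold = occurs_threshold(scores, sigma=0.5)
by_argmax = occurs_argmax(scores)
```

In this toy example the second patch has no score above 0.5, so the thresholded rule records nothing for it, whereas the threshold-free rule still records one object. This is the kind of case the second rule is meant to cover.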
Considering a set of images \( I_{C} \) belonging to class \( C \), we can compute the maximum probability of an object O in class C as:
In this paper, we take \( p\left( {O|C} \right) \) as the object multinomial distribution of class C. Fig. 4 shows various object distributions and the results of computing \( p\left( {O|C} \right) \) for different classes. Furthermore, we can now compute the posterior probability of the scene classes by taking into account the observation of all objects; applying Bayes' rule, we obtain:
Where \( p\left( {O|C_{j} } \right) \) is computed as in Eq. (11) for the scene class \( C_{j} \), and \( p\left( {C_{j} } \right) \) is the prior probability of scene class \( C_{j} \).
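A small numerical sketch of these last two steps: estimating \( p\left( {O|C} \right) \) from occurrence counts and applying Bayes' rule. Normalizing the counts per class is our assumption for how the distribution is obtained, and the posterior is computed for a single observed object rather than for all objects. The function names and the counts are hypothetical.

```python
import numpy as np


def object_distribution(counts):
    """Rows are scene classes, columns are objects; returns p(O | C) per row."""
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)


def scene_posterior(p_o_given_c, prior, obj):
    """Bayes' rule: p(C_j | O) is proportional to p(O | C_j) * p(C_j)."""
    joint = p_o_given_c[:, obj] * prior
    return joint / joint.sum()


# Hypothetical occurrence counts for 2 scene classes x 3 objects.
counts = [[8, 1, 1],    # e.g. kitchen
          [1, 1, 8]]    # e.g. bedroom
p_o_given_c = object_distribution(counts)
prior = np.array([0.5, 0.5])
post = scene_posterior(p_o_given_c, prior, obj=0)
```

Here object 0 is frequent in the first class and rare in the second, so observing it shifts the posterior strongly toward the first class. This matches the notion of a discriminative object defined earlier.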
4 Experiments and Comparison Table with the State-of-the-Art Methods
As mentioned in the introduction, we applied our method to three well-known datasets: Scene 15 [9], MIT Indoor 67 [10] and SUN 397 [11]. SUN 397 and Scene 15 are used to cover both indoor and outdoor scenes, and MIT Indoor 67 is used to confirm the accuracy of our method. The following parts describe the experiments performed with these datasets.
- Scene 15 Dataset [9], which offers relevant images for indoor and outdoor scenes, contains 4485 grayscale pictures of 15 different scenes. However, it does not come with a predefined training set and testing set, which is why we report the mean classification performance over five random splits. For each split, we use 100 training images per category and the remaining images for testing.
- MIT Indoor 67 Dataset [10] is a sizeable dataset that contains 15 620 color images in 67 scene categories. It offers substantial variation among categories, with at least 100 images per category. Following the protocol described in [10], we use 80 images from each category for training.
- SUN 397 Dataset [11] is a massive dataset that offers 397 scene categories with at least 100 images per category. Following the protocol defined in [11], we used 50 training images and 50 test images per category.
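The evaluation protocols above share one pattern: a fixed number of training images per category, with the split repeated when no official partition exists. A minimal sketch of that pattern, assuming integer class labels, is given below. The names `per_class_split` and `mean_accuracy_over_splits` and the `evaluate` callback are hypothetical and not part of any dataset toolkit.

```python
import numpy as np


def per_class_split(labels, n_train, seed):
    """Pick n_train random images per class for training; the rest are test."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx = []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train].tolist())
    train_idx = np.array(sorted(train_idx))
    test_idx = np.setdiff1d(np.arange(labels.size), train_idx)
    return train_idx, test_idx


def mean_accuracy_over_splits(labels, n_train, evaluate, n_splits=5):
    """Average evaluate(train_idx, test_idx) over several random splits."""
    accs = [evaluate(*per_class_split(labels, n_train, seed))
            for seed in range(n_splits)]
    return float(np.mean(accs))


# Toy labels: 3 classes with 6 images each (hypothetical).
labels = np.repeat([0, 1, 2], 6)
train_idx, test_idx = per_class_split(labels, n_train=4, seed=0)
```

For Scene 15, the described protocol corresponds to `n_train=100` with five splits, where `evaluate` would train a model on `train_idx` and return its accuracy on `test_idx`.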
Several approaches have been proposed and evaluated on SUN 397, MIT Indoor 67, and Scene 15. We compare those methods with our proposed method in Tables 1, 2 and 3.
5 Conclusion
In this paper, we have proposed TrainDetector, a novel semantic descriptor framework for scene recognition, in which the features of each modality are extracted independently and then combined to obtain discriminative objects and local and global representations across scenes. We evaluated our framework on three benchmark scene datasets and demonstrated the efficiency of our approach.
References
1. Luo, J., Boutell, M.: Natural scene classification using overcomplete ICA. Pattern Recognit. 38(10), 1507–1519 (2005)
2. Mundhenk, T., Flores, A., Hoffman, H.: Classification and segmentation of orbital space based objects against terrestrial distractors for the purpose of finding holes in Shape from Motion 3D reconstruction. In: Proceedings of SPIE, vol. 9025 (2014)
3. Wang, Q., Chen, L., Shen, D.: Group-wise registration of large image dataset by hierarchical clustering and alignment. In: Proceedings of SPIE, vol. 7259, no. 35 (2009)
4. Newsam, S., Kamath, C.: Comparing shape and texture features for pattern recognition in simulation data. Electron. Imaging 5672, 106–117 (2005)
5. Kunter, M., Knorr, S., Krutz, A., Sikora, T.: Unsupervised object segmentation for 2D to 3D conversion. In: Proceedings of SPIE, vol. 7237 (2009)
6. Yu, K., Lin, Y., Lafferty, J.: Learning image representations from the pixel level via hierarchical sparse coding. http://dblp.uni-trier.de/db/conf/cvpr/cvpr2011.html. Accessed 2011
7. Asif, U., Bennamoun, M., Sohel, F.: Efficient RGB-D object categorization using cascaded ensembles of randomized decision trees. http://dblp.uni-trier.de/db/conf/icra/icra2015.html. Accessed 2015
8. Huang, G.-B., Zhu, Q.-Y., Siew, C.-K.: Extreme learning machine: a new learning scheme of feedforward neural networks. http://ieeexplore.ieee.org/document/1380068. Accessed 2004
9. Lazebnik, S., Schmid, C., Ponce, J.: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR 2006 (2006)
10. Quattoni, A., Torralba, A.: Recognizing indoor scenes. http://people.csail.mit.edu/torralba/publications/indoor.pdf. Accessed 2009
11. Xiao, J., Hays, J., Ehinger, K., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. http://ieeexplore.ieee.org/document/5539970. Accessed 2010
12. Huang, G.-B., Bai, Z., Kasun, L., Vong, C.: Local receptive fields based extreme learning machine. IEEE Comput. Intell. Mag. 10(2), 18–29 (2015)
13. Preim, B., Botha, C.: Image analysis for medical visualization. https://sciencedirect.com/science/article/pii/b9780124158733000043. Accessed 2014
14. Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: an astounding baseline for recognition. In: Computer Vision and Pattern Recognition, pp. 512–519 (2014)
15. Zhou, B., Lapedriza, À., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. https://papers.nips.cc/paper/5349-learning-deep-features-for-scene-recognition-using-places-database. Accessed 2014
16. Cimpoi, M., Maji, S., Vedaldi, A.: Deep filter banks for texture recognition and segmentation. http://dblp.uni-trier.de/db/conf/cvpr/cvpr2015.html. Accessed 2015
17. Zuo, Z., Wang, G., Shuai, B., Zhao, L., Yang, Q., Jiang, X.: Learning discriminative and shareable features for scene classification. https://link.springer.com/chapter/10.1007/978-3-319-10590-1_36. Accessed 2014
18. Xie, G.-S., Zhang, X.-Y., Yan, S., Liu, C.-L.: Hybrid CNN and dictionary-based models for scene recognition and domain adaptation. IEEE Trans. Circuits Syst. Video Technol. 27, 1263–1274 (2016)
Acknowledgement
The work and the contribution were supported by the SPEV project “Smart Solutions in Ubiquitous Computing Environments 2018”, University of Hradec Kralove, Faculty of Informatics and Management, Czech Republic.
© 2019 Springer Nature Switzerland AG
Mambou, S., Krejcar, O. (2019). Novel Scene Recognition Using TrainDetector. In: Vera-Rodriguez, R., Fierrez, J., Morales, A. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2018. Lecture Notes in Computer Science(), vol 11401. Springer, Cham. https://doi.org/10.1007/978-3-030-13469-3_59
DOI: https://doi.org/10.1007/978-3-030-13469-3_59
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-13468-6
Online ISBN: 978-3-030-13469-3
eBook Packages: Computer Science, Computer Science (R0)