
1 Introduction

Extensive studies in human-machine interaction are necessary to present Traffic Sign Recognition (TSR) information carefully, informing the driver without causing distraction or confusion [3]. The visibility of traffic signs is crucial for drivers’ safety. In real environments, recognizing traffic signs in a timely and accurate manner is difficult because their visibility can be greatly reduced by unfavorable factors. For example, serious accidents can occur when drivers fail to notice a stop sign. An automatic TSR system is helpful for assisting drivers and is essential for autonomous cars [4].

Road scenes are also generally very cluttered and contain many strong geometric shapes that could easily be misclassified as road signs. Conventional techniques typically use color, shape information, or geometric features of traffic signs to perform detection and recognition. Road sign recognition systems are usually organized into two phases. First, in each frame, the detection stage locates candidate signs and identifies their categories based on shape (circular, rectangular, triangular, etc.). The second task is to classify the detected signs and send the processing results (i.e., the types of signs and their locations) to the display and control units of an Advanced Driver Assistance System (ADAS) [1].

Fatigue, diverted attention, and occlusion of signs by road obstructions and natural scenery may cause drivers to miss important traffic signs, which can result in severe accidents. Guiding the driver’s attention to an imminent danger around the car is a potential application. After the traffic signs are recognized, the driver can be notified through audio or visual cues. Correct and timely recognition of road traffic signs is essential for any driver to ensure a safe journey for themselves and their passengers. The proposed system may make driving more comfortable and help drivers receive important sign information in an easy and comprehensible way, even before the sign itself is visible to them.

Several methods have recently been proposed to incorporate spatial information into the bag-of-visual-words (BoVW) model, such as spatial pyramid matching [11] and spatio-temporal interest points [7]. Other recent work has focused on the local stability of traffic sign regions [2]. Further, [8] proposed a novel TSR classification technique based on probabilistic latent semantic analysis. The algorithm consists of two parts: (1) classifying the shape of the traffic sign and (2) classifying its actual class. To investigate the effect of coding methods and codebook size for local spatiotemporal features, [6] proposed a coding method that alleviates the negative effect of quantization error by assigning each local spatiotemporal feature to a few of the nearest visual words.

Considering processing time and classification accuracy together, we have developed a novel technique that incorporates the spatial information of visual words to improve accuracy while keeping retrieval time short. To achieve fast and robust TSR, we introduce a novel way to incorporate both distance and angle information in the BoVW representation. We also present a new approach to visual word construction that takes the spatial information of keypoints into account in order to enhance the quality of the visual words generated from the extracted keypoints. This demonstrates how the additional relative spatial information provided by our approach complements the standard representation, improving accuracy while maintaining short retrieval time.

In this paper, we propose a novel traffic sign recognition method based on the BoVW approach. We introduce a way to incorporate both distance and angle information in the BoVW representation, together with a computationally efficient method to model the global spatial distribution of visual words by taking into account the spatial relationships among them.

2 Traffic Signs Recognition

In both cases, the ability to recognize signs and their underlying information is highly desirable. This information can be used to warn the human driver of an oncoming change or, in more intelligent vehicle systems, to actually control the speed and/or steering of the vehicle. It is therefore necessary to classify the characteristics of the information given and find a way to represent the information according to these characteristics. To address this problem, we propose a novel approach that integrates spatial information into the BoVW model.

2.1 Enhanced BoVW Using Spatial Information

This paper presents a new approach that integrates spatial information into the BoVW model, with explicit local and global structure models. The key idea is to consider the spatial distribution of visual words in an image. In [10], a pairwise spatial histogram is defined according to a discretization of the spatial neighborhood into several bins encoding the relative spatial position (distance and angle) of two visual words. Combining the frequency of occurrence and the spatial information of visual words is therefore a promising direction for improving image characterization.

To address this issue, we introduce a novel way to incorporate both distance and angle information in the BoVW representation. This method exploits the spatial orientations and distances of all pairs of similar descriptors in the image. In the BoVW model, a visual vocabulary \(Voc = \{v_i\}, i = 1, \ldots, K\) is built by clustering the extracted features into K visual words. A given descriptor \(d_k\) is then mapped to a visual word \(v_i\) using the Euclidean distance, as in Eq. 1:

$$\begin{aligned} v(d_k) = \mathop {\arg \min }\limits _{v \in Voc} Dist(v, d_k) \end{aligned}$$
(1)

where \(v \in Voc\), \(d_k\) is the \(k^{th}\) descriptor in the ROI, and \(Dist(v, d_k)\) is the Euclidean distance between the descriptor and the visual word. In addition, we consider the weighted sum of ROIs to implicitly represent spatial information, which is important for measuring the similarity between images.
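To make the vocabulary construction and the assignment of Eq. 1 concrete, the following illustrative Python sketch (an assumption of this rewrite rather than the authors' C++ implementation; the function names, the value of K, and the k-means settings are placeholders) clusters training descriptors into visual words and maps each descriptor to its nearest word under the Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, K=200, seed=0):
    """Cluster all training descriptors (N x D) into K visual words."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(descriptors)
    return km.cluster_centers_               # Voc: a K x D matrix of word centroids

def assign_words(descriptors, vocabulary):
    """Eq. 1: map every descriptor d_k to its nearest word under the Euclidean distance."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    return np.argmin(dists, axis=1)          # index of v(d_k) for every descriptor
```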

In the training stage, SIFT features are extracted from all the training samples using a dense grid. Since we are interested in the sign contents, only descriptors that fall inside the sign contour are taken into account. Our system exploits SIFT features, which have shown high robustness to varied recording conditions. After SIFT features are extracted from all the training samples, the number of feature points varies from image to image, which complicates subsequent operations. The assignment of a visual feature to the vocabulary depends on the similarity metric. We propose a method that incorporates spatial information at the feature level; it exploits the distances and spatial orientations of all pairs of similar descriptors in the image.
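As a hedged illustration of this extraction step, the sketch below computes SIFT descriptors on a dense grid with OpenCV and keeps only the grid points that fall inside the detected sign contour; the grid step, patch size, and variable names are assumptions, not the paper's exact settings.

```python
import cv2
import numpy as np

def dense_sift_inside_sign(gray_img, sign_contour, step=8, size=16):
    """Dense-grid SIFT descriptors restricted to the interior of the sign contour."""
    sift = cv2.SIFT_create()
    h, w = gray_img.shape
    # Keep only grid points lying inside (or on) the detected sign contour.
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, h, step)
                 for x in range(step, w, step)
                 if cv2.pointPolygonTest(sign_contour, (float(x), float(y)), False) >= 0]
    keypoints, descriptors = sift.compute(gray_img, keypoints)
    positions = np.array([kp.pt for kp in keypoints])   # kept for the spatial encoding below
    return positions, descriptors
```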

2.2 Similarity Measure

To improve the similarity measurement between pairs of visual words, we propose a simple and efficient method to infuse spatial information. We measure the spatial relationships between visual words using distance and orientation. This is done by creating an additional dictionary comprising word pairs. This method is inspired by [19], which uses a log-polar quantization of the image spatial domain.

For each visual word, the average position and the standard deviation are computed based on all the occurrences of the visual word in the image. We consider the interaction between visual words by encoding their spatial distances, orientations, and alignments. Figure 1 shows an example to illustrate our approach. To encode spatial information, we use the distance (Fig. 1(a)) and orientation (Fig. 1(b)) information between pairs of patches in the image space.

Fig. 1.

Spatial histograms of similar pairwise patches using distance and orientation: (a) spatial distances of similar visual words, (b) spatial orientations of similar pairwise patches, (c) pairwise distance-orientation information of similar patches, (d) pairwise spatial histograms

More formally, we consider the set \(S_k\) of all pairs in which at least one patch belongs to the visual word \(w_k\). A given pair \((P_i, P_j) \in S_k\) is characterized both by a pair of descriptors \((d_i, d_j)\) and a pair of positions in the image space, denoted \((p_i, p_j)\), as illustrated in Fig. 1. Note that both \(d_i\) and \(p_i\) are vectors, with \(d_i \in R^{D}\) and \(p_i \in R^{2}\).

After clustering, the spatial information is implicitly included in the visual vocabulary. A pairwise spatial histogram (Fig. 1(d)) of similar patches is then defined by discretizing the image space into M bins denoted \(b_m\), \(m = 1, \ldots, M\), with the angle \(\theta \in [0, \pi)\) split into \(M_\theta\) angle bins and the radius \(r \in [0, R]\) split into \(M_r\) radial bins, so that \(M = M_\theta \cdot M_r\).
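A minimal Python sketch of this pairwise spatial histogram is given below, under the discretization just described; the choice of \(M_\theta\), \(M_r\), the fallback radial range, and the normalization are illustrative assumptions rather than the paper's exact parameters.

```python
import numpy as np

def pairwise_spatial_histogram(positions, M_theta=8, M_r=4, R=None):
    """positions: (n, 2) image coordinates of the patches paired with a word w_k."""
    positions = np.asarray(positions, dtype=float)
    hist = np.zeros(M_theta * M_r)
    n = len(positions)
    if n < 2:
        return hist
    if R is None:
        # fall back to the largest pairwise distance as the radial range
        R = np.max(np.linalg.norm(positions[:, None] - positions[None, :], axis=2)) + 1e-6
    for i in range(n):
        for j in range(i + 1, n):
            d = positions[j] - positions[i]
            r = np.linalg.norm(d)
            theta = np.arctan2(d[1], d[0]) % np.pi            # orientation folded into [0, pi)
            b_theta = min(int(theta / (np.pi / M_theta)), M_theta - 1)
            b_r = min(int(r / (R / M_r)), M_r - 1)
            hist[b_theta * M_r + b_r] += 1                    # one count per similar pair
    return hist / hist.sum()                                  # normalized spatial histogram
```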

To classify a new feature \(P_x\) into the discovered classes, it is compared with the words in the vocabulary using the distance described in Eq. 1. We assign it to the corresponding word \(i\) according to the nearest neighbor, but only if that distance is below a matching threshold \(th_M\):

$$\begin{aligned} W_i = \mathop {\arg \min }\limits _{i \in [1,K]} \big ( d(P_x, P_i) \mid d(P_x, P_i) < th_M \big ) \end{aligned}$$
(2)
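A small sketch of this thresholded assignment (Eq. 2) follows; the value of th_M and the behavior when no word matches are assumptions made for illustration.

```python
import numpy as np

def assign_with_threshold(p_x, vocabulary, th_M):
    """Return the index of the nearest word only if its distance is below th_M."""
    dists = np.linalg.norm(vocabulary - p_x, axis=1)
    i = int(np.argmin(dists))
    return i if dists[i] < th_M else None   # None: no word matches closely enough
```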

To represent an ROI in an image, we employ a spatial pyramid to enhance the voting and indexing criteria of the original inverted indexing technique. Based on the position of each visual word, this method exploits the spatial orientations and distances of all pairs of similar descriptors in the image. It is relatively efficient during classification and captures well the spatial information contained in the spatial pyramid features. The framework of our proposed BoVW system, including classifier training, is illustrated in Fig. 2.
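For illustration, a two-level spatial pyramid over the ROI could be formed as sketched below; the 1x1 and 2x2 cell layout is an assumption and does not reproduce the paper's exact voting and inverted-indexing scheme.

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, positions, roi_shape, K, levels=(1, 2)):
    """Concatenate per-cell BoVW histograms over a 1x1 and a 2x2 grid of the ROI."""
    word_ids = np.asarray(word_ids)
    positions = np.asarray(positions, dtype=float)
    h, w = roi_shape
    parts = []
    for g in levels:                                   # g x g cells at this pyramid level
        for cy in range(g):
            for cx in range(g):
                in_cell = (np.floor(positions[:, 0] / (w / g)) == cx) & \
                          (np.floor(positions[:, 1] / (h / g)) == cy)
                hist = np.bincount(word_ids[in_cell], minlength=K).astype(float)
                parts.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(parts)                       # length K * sum(g*g for g in levels)
```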

Fig. 2.

Proposed BoVW approach for classification

For this purpose, a novel structural relationship between patches is defined for evaluating the similarity of super-pixels. In particular, simple spatial relations between visual words are considered: the spatial locations of the words and the spatial relationships between them are added to describe images in the BoVW model. This histogram encodes the spatial information (distance and orientation, Fig. 1) of pairwise similar patches where at least one of the patches belongs to \(V_k\). To obtain a global representation, we replace each bin of the BoVW frequency histogram with the spatial histogram associated with \(w_i\). In this way, we keep the frequency information intact and add the spatial information. This modularity provides a simple way to assemble the spatial histograms and obtain the final representation.
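A sketch of how the final representation could be assembled is given below, reusing the pairwise_spatial_histogram sketch above; the per-word weighting and the resulting signature length are assumptions rather than the authors' exact formulation.

```python
import numpy as np

def build_final_representation(word_ids, positions, K, M_theta=8, M_r=4):
    """Concatenate, for every word k, its frequency and its pairwise spatial histogram."""
    word_ids = np.asarray(word_ids)
    positions = np.asarray(positions, dtype=float)
    freq = np.bincount(word_ids, minlength=K).astype(float)
    freq /= max(freq.sum(), 1.0)
    blocks = []
    for k in range(K):
        spatial = pairwise_spatial_histogram(positions[word_ids == k], M_theta, M_r)
        # the frequency of word k stays intact; its bin is expanded with the spatial histogram
        blocks.append(np.concatenate(([freq[k]], freq[k] * spatial)))
    return np.concatenate(blocks)          # final signature of length K * (1 + M_theta * M_r)
```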

3 Experiment Results

To evaluate the performance of the proposed algorithm, we implemented the proposed AR-TSR method on a Core i7 640LM 2.13 GHz machine running Windows 7, using Visual Studio 2010 and the OpenCV 2.4.8 library. In this paper, we focus on the detection of speed limit signs, unique signs, and danger signs. We implemented the suggested method in C++ and tested its performance on the German Traffic Sign Detection Benchmark (GTSDB) dataset [15]. The GTSRB dataset contains 51,839 German traffic signs in 43 classes. These classes have been divided into six subsets: speed limit signs, danger signs, mandatory signs, unique signs, derestriction signs, and other prohibitory signs [17]. The resolution of each image in the dataset is \(1366 \times 800\). Here, we present a quantitative analysis of the proposed system.

3.1 Performance of the Proposed Method

The database used to train the classifiers has been designed using the ROIs obtained from the detection step and the model fitting methods presented in the previous sections. In order to evaluate the occlusion robustness of the suggested classification method, the content of the detected ROI is identified using the tree classifiers. This classifier is tested on static, low-resolution sign images. To measure performance, experiments have also been conducted on the GTSRB; Table 1 shows the classification rates of the linear SVM.
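As an illustration of the classification stage, a linear SVM could be trained on the resulting signatures as sketched below; scikit-learn is assumed here purely for illustration (the paper's experiments use a C++/OpenCV implementation), and the regularization value C is a placeholder, not a reported parameter.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """X_*: rows are the enhanced BoVW signatures; y_*: traffic sign class labels."""
    clf = LinearSVC(C=1.0)                 # C is a placeholder, not a reported parameter
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```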

Table 1. Accuracy results for traffic sign classification

Furthermore, we evaluate the classification task on the detected signs returned by the previous detection module. As shown in Table 1, the overall classification accuracy is 99.31%. Note that only 3 out of 1500 speed limit signs and only 6 out of 890 danger signs are falsely classified. Experiments demonstrate that our approach succeeds in adding relative spatial information into the BoVW model by encoding both the global and local relative distribution of visual words over an image.

3.2 Comparisons with Other State-of-the-Art Methods

To verify the discriminative power and computational efficiency of the proposed feature for traffic sign detection, experiments are carried out on the publicly available traffic sign datasets. Because the training and testing samples in the GTSRB and GTSDB datasets are split according to a fixed rule, an absolute performance comparison with other reported approaches is possible. We report these results in Table 2, where the results of the winning system from the IJCNN challenge and some results reported at IJCNN 2011 are provided as references.

The performance results of the machine learning algorithms are all significantly different from each other. We have compared the suggested method with other state-of-the-art algorithms such as the committee of CNNs [5], Human Performance [16], Multi-Scale CNNs [14], Random Forests [8, 18], wgy@HIT501 [9], VISICS [13], LITS1 [12, 17], and Viola-Jones. The performance is analyzed in terms of detection and recognition accuracy.

Table 2. Performance comparison with other traffic sign recognition methods.

According to the results on the GTSRB data set, shown in Table 2, this work achieves a recognition accuracy of 99.31%, which is comparable to the work by [5] (0.24% lower), 0.17% higher than the work by [16], 0.69% higher than the work by [17], and 1.51% higher than the work by [14]. The accuracy of recognizing unique signs reaches 99.31%, which is comparable with the best reported result. The danger signs, which have a triangular shape, give the worst results compared with the other traffic sign categories.

To demonstrate the effectiveness of the proposed method, we compare its performance with the standard BoVW on the GTSRB data set. The comparison results are reported in Table 3, where we examine the benefit of integrating spatial information into the standard BoVW.

Table 3. Comparison of our approach with the standard BoVW methods.

As shown in Table 3, by combining spatial information with the traditional BoVW, our method outperforms the current state-of-the-art methods on the GTSRB database. Comparisons with the standard BoVW model show that our method is a richer alternative to the usual BoVW approach for building a visual vocabulary. Our method provides more representative semantic information about the traffic signs, including spatial information between visual words. We have shown that the spatial information between visual words is important for category-level recognition: the distribution of similar interest regions in the images is discriminative and can significantly improve the performance of the BoVW method.

4 Conclusions

Driving is a complex, continuous, and multitask process that involves the driver’s cognition, perception, and motor movements. The way road traffic signs and vehicle information are displayed strongly affects the driver’s attention; increased mental workload leads to safety concerns. We have developed a novel approach to visual word construction that takes the spatial information of keypoints into account in order to enhance the quality of the visual words generated from the extracted keypoints. In this paper, we have proposed a new computationally efficient method to model the global spatial distribution of visual words and improved the standard BoVW representation. The results clearly demonstrate that the additional relative spatial information provided by our approach complements the standard representation, improving accuracy while maintaining short retrieval time, and that it achieves better traffic sign classification accuracy than methods based on the traditional BoVW model. Experimental results show that the suggested method reaches performance comparable to state-of-the-art approaches with less computational complexity and shorter training time.