
1 Introduction

Extensive studies in human-machine interaction are necessary to present Traffic Sign Recognition (TSR) information carefully, informing the driver without causing distraction or confusion [3]. The visibility of traffic signs is crucial for drivers’ safety. In real environments, recognizing traffic signs in a timely and accurate manner is difficult because their visibility can be greatly reduced by unfavorable factors. For example, serious accidents can occur when drivers fail to notice a stop sign. An automatic TSR system is helpful for assisting drivers and is essential for autonomous cars [4].

Road scenes are also generally very cluttered and contain many strong geometric shapes that could easily be misclassified as road signs. Conventional techniques typically use color, shape information, or geometric features of traffic signs to perform detection and recognition. Road sign recognition systems are usually organized into two phases. First, in each frame, the detection stage locates candidate signs and identifies their categories based on shape (circular, rectangular, triangular, etc.). The second task is to classify the detected signs and send the processing results (i.e., the types of signs and their locations) to the display and control units of an Advanced Driver Assistance System (ADAS) [1].

Fatigue, diverted attention, and occlusion of signs by road obstructions and natural scenery may cause drivers to miss important traffic signs, which can result in severe accidents. Guiding the driver’s attention to an imminent danger around the car is a potential application. After the traffic signs are recognized, the driver can be notified through audio or visual cues. Correct and timely recognition of road traffic signs is essential for any driver to ensure a safe journey for themselves and their passengers. The proposed system may make driving more comfortable and help drivers receive important sign information in an easy and comprehensible way, even before the sign itself is visible to them.

Several methods have recently been proposed to incorporate spatial information into the bag-of-visual-words (BoVW) model, such as spatial pyramid matching [11] and spatio-temporal interest points [7]. Other recent work has focused on the local stability of traffic sign regions [2]. Further, [8] proposed a novel TSR classification technique based on probabilistic latent semantic analysis. The algorithm consists of two parts: (1) classifying the shape of the traffic sign and (2) classifying its actual class. To investigate the effect of coding methods and codebook size for local spatiotemporal features, [6] proposed a coding method that alleviates the negative effect of quantization error by assigning each local spatiotemporal feature to a few of the nearest visual words.

Considering processing time and classification accuracy together, we have developed a novel technique that incorporates the spatial information of visual words to improve accuracy while keeping retrieval time short. To achieve fast and robust TSR, we introduce a novel way to incorporate both distance and angle information in the BoVW representation. We also present a new approach to visual word construction that takes the spatial information of keypoints into account in order to enhance the quality of the visual words generated from the extracted keypoints. This demonstrates how the additional relative spatial information provided by our approach complements the standard representation, improving accuracy while maintaining short retrieval time.

In this paper, we propose a novel traffic sign recognition method based on the BoVW approach. We introduce a way to incorporate both distance and angle information in the BoVW representation, together with a computationally efficient method to model the global spatial distribution of visual words by taking into account the spatial relationships among them.

2 Traffic Signs Recognition

In both cases, the ability to recognize signs and their underlying information is highly desirable. This information can be used to warn the human driver of an oncoming change or, in more intelligent vehicle systems, to actually control the speed and/or steering of the vehicle. It is therefore necessary to classify the characteristics of the information given and find a way to represent the information according to these characteristics. To address this problem, we propose a novel approach that integrates spatial information into the BoVW model.

2.1 Enhanced BoVW Using Spatial Information

This paper presents a new approach that integrates spatial information into the BoVW model, with explicit local and global structure models. The key idea is to consider the spatial distribution of visual words in an image. In [10], a pairwise spatial histogram is defined according to a discretization of the spatial neighborhood into several bins encoding the relative spatial position (distance and angle) of two visual words. Combining the frequency of occurrence and the spatial information of visual words is therefore a promising direction for improving image characterization.

To address this issue, we introduce a novel way to incorporate both distance and angle information in the BoVW representation. This method exploits the spatial orientations and distances of all pairs of similar descriptors in the image. In the BoVW model, a visual vocabulary \(Voc = \{v_i\}, i = 1, \ldots, K\) is built by clustering the extracted features into K visual words. A given descriptor \(d_k\) is then mapped to a visual word \(v_i\) using the Euclidean distance, as in Eq. 1:

$$\begin{aligned} v(d_k) = \mathop {\arg \min }\limits _{v \in Voc} Dist(v, d_k) \end{aligned}$$
(1)

where \(v \in Voc\), \(d_k\) is the \(k^{th}\) descriptor in the ROI, and \(Dist(v, d_k)\) is the Euclidean distance between the descriptor and the visual word. In addition, we consider the weighted sum of ROIs to implicitly represent spatial information, which is important for measuring the similarity between images.
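To make the vocabulary construction and the assignment of Eq. 1 concrete, the following illustrative Python sketch (an assumption of this rewrite rather than the authors' C++ implementation; the function names, the value of K, and the k-means settings are placeholders) clusters training descriptors into visual words and maps each descriptor to its nearest word under the Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, K=200, seed=0):
    """Cluster all training descriptors (N x D) into K visual words."""
    km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(descriptors)
    return km.cluster_centers_               # Voc: a K x D matrix of word centroids

def assign_words(descriptors, vocabulary):
    """Eq. 1: map every descriptor d_k to its nearest word under the Euclidean distance."""
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    return np.argmin(dists, axis=1)          # index of v(d_k) for every descriptor
```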

In the training stage, SIFT features are extracted from all the training samples using a dense grid. Since we are interested in the sign contents, only descriptors that fall inside the sign contour are taken into account. Our system exploits SIFT features, which have shown high robustness to varied recording conditions. After SIFT features are extracted from all the training samples, the number of feature points varies from image to image, which complicates subsequent operations. The assignment of a visual feature to the vocabulary depends on the similarity metric. We propose a method that incorporates spatial information at the feature level; it exploits the distances and spatial orientations of all pairs of similar descriptors in the image.
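As a hedged illustration of this extraction step, the sketch below computes SIFT descriptors on a dense grid with OpenCV and keeps only the grid points that fall inside the detected sign contour; the grid step, patch size, and variable names are assumptions, not the paper's exact settings.

```python
import cv2
import numpy as np

def dense_sift_inside_sign(gray_img, sign_contour, step=8, size=16):
    """Dense-grid SIFT descriptors restricted to the interior of the sign contour."""
    sift = cv2.SIFT_create()
    h, w = gray_img.shape
    # Keep only grid points lying inside (or on) the detected sign contour.
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, h, step)
                 for x in range(step, w, step)
                 if cv2.pointPolygonTest(sign_contour, (float(x), float(y)), False) >= 0]
    keypoints, descriptors = sift.compute(gray_img, keypoints)
    positions = np.array([kp.pt for kp in keypoints])   # kept for the spatial encoding below
    return positions, descriptors
```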

2.2 Similarity Measure

To improve the similarity measurement between pairs of visual words, we propose a simple and efficient method to infuse spatial information. We measure the spatial relationships between visual words using distance and orientation. This is done by creating an additional dictionary comprising word pairs. This method is inspired by [19], which uses a log-polar quantization of the image spatial domain.

For each visual word, the average position and the standard deviation are computed based on all the occurrences of the visual word in the image. We consider the interaction between visual words by encoding their spatial distances, orientations, and alignments. Figure 1 shows an example to illustrate our approach. To encode spatial information, we use the distance (Fig. 1(a)) and orientation (Fig. 1(b)) information between pairs of patches in the image space.

Fig. 1.

Spatial histograms of similar pairwise patches using distance and orientation: (a) spatial distances of similar visual words, (b) spatial orientations of similar pairwise patches, (c) pairwise distance-orientation information of similar patches, (d) pairwise spatial histograms

More formally, we consider the set \(S_k\) of all pairs in which at least one patch belongs to the visual word \(w_k\). A given pair \((P_i, P_j) \in S_k\) is characterized both by a pair of descriptors \((d_i, d_j)\) and a pair of positions in the image space, denoted \((p_i, p_j)\), as illustrated in Fig. 1. Note that both \(d_i\) and \(p_i\) are vectors, with \(d_i \in R^{D}\) and \(p_i \in R^{2}\).

After clustering, the spatial information is implicitly included in the visual vocabulary. A pairwise spatial histogram (Fig. 1(d)) of similar patches is then defined by discretizing the image space into M bins denoted \(b_m\), \(m = 1, \ldots, M\), with the angle \(\theta \in [0, \pi)\) split into \(M_\theta\) angle bins and the radius \(r \in [0, R]\) split into \(M_r\) radial bins, so that \(M = M_\theta \cdot M_r\).
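A minimal Python sketch of this pairwise spatial histogram is given below, under the discretization just described; the choice of \(M_\theta\), \(M_r\), the fallback radial range, and the normalization are illustrative assumptions rather than the paper's exact parameters.

```python
import numpy as np

def pairwise_spatial_histogram(positions, M_theta=8, M_r=4, R=None):
    """positions: (n, 2) image coordinates of the patches paired with a word w_k."""
    positions = np.asarray(positions, dtype=float)
    hist = np.zeros(M_theta * M_r)
    n = len(positions)
    if n < 2:
        return hist
    if R is None:
        # fall back to the largest pairwise distance as the radial range
        R = np.max(np.linalg.norm(positions[:, None] - positions[None, :], axis=2)) + 1e-6
    for i in range(n):
        for j in range(i + 1, n):
            d = positions[j] - positions[i]
            r = np.linalg.norm(d)
            theta = np.arctan2(d[1], d[0]) % np.pi            # orientation folded into [0, pi)
            b_theta = min(int(theta / (np.pi / M_theta)), M_theta - 1)
            b_r = min(int(r / (R / M_r)), M_r - 1)
            hist[b_theta * M_r + b_r] += 1                    # one count per similar pair
    return hist / hist.sum()                                  # normalized spatial histogram
```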

To classify a new feature \(P_x\) into the discovered classes, it is compared with the words in the vocabulary using the distance described in Eq. 1. We assign it to the corresponding word \(i\) according to the nearest neighbor, but only if that distance is below a matching threshold \(th_M\):

$$\begin{aligned} W_i = \mathop {\arg \min }\limits _{i \in [1,K]} \big ( d(P_x, P_i) \mid d(P_x, P_i) < th_M \big ) \end{aligned}$$
(2)
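A small sketch of this thresholded assignment (Eq. 2) follows; the value of th_M and the behavior when no word matches are assumptions made for illustration.

```python
import numpy as np

def assign_with_threshold(p_x, vocabulary, th_M):
    """Return the index of the nearest word only if its distance is below th_M."""
    dists = np.linalg.norm(vocabulary - p_x, axis=1)
    i = int(np.argmin(dists))
    return i if dists[i] < th_M else None   # None: no word matches closely enough
```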

To represent an ROI in an image, we employ a spatial pyramid to enhance the voting and indexing criteria of the original inverted indexing technique. Based on the position of each visual word, this method exploits the spatial orientations and distances of all pairs of similar descriptors in the image. It is relatively efficient during classification and captures well the spatial information contained in the spatial pyramid features. The framework of our proposed BoVW system, including classifier training, is illustrated in Fig. 2.
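For illustration, a two-level spatial pyramid over the ROI could be formed as sketched below; the 1x1 and 2x2 cell layout is an assumption and does not reproduce the paper's exact voting and inverted-indexing scheme.

```python
import numpy as np

def spatial_pyramid_histogram(word_ids, positions, roi_shape, K, levels=(1, 2)):
    """Concatenate per-cell BoVW histograms over a 1x1 and a 2x2 grid of the ROI."""
    word_ids = np.asarray(word_ids)
    positions = np.asarray(positions, dtype=float)
    h, w = roi_shape
    parts = []
    for g in levels:                                   # g x g cells at this pyramid level
        for cy in range(g):
            for cx in range(g):
                in_cell = (np.floor(positions[:, 0] / (w / g)) == cx) & \
                          (np.floor(positions[:, 1] / (h / g)) == cy)
                hist = np.bincount(word_ids[in_cell], minlength=K).astype(float)
                parts.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(parts)                       # length K * sum(g*g for g in levels)
```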

Fig. 2.

Proposed BoVW approach for classification

For this purpose, a novel structural relationship between patches is defined for evaluating the similarity of super-pixels. In particular, simple spatial relations between visual words are considered: the spatial locations of the words and the spatial relationships between them are added to describe images in the BoVW model. This histogram encodes the spatial information (distance and orientation, Fig. 1) of pairwise similar patches where at least one of the patches belongs to \(V_k\). To obtain a global representation, we replace each bin of the BoVW frequency histogram with the spatial histogram associated with \(w_i\). In this way, we keep the frequency information intact and add the spatial information. This modularity provides a simple way to assemble the spatial histograms and obtain the final representation.
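A sketch of how the final representation could be assembled is given below, reusing the pairwise_spatial_histogram sketch above; the per-word weighting and the resulting signature length are assumptions rather than the authors' exact formulation.

```python
import numpy as np

def build_final_representation(word_ids, positions, K, M_theta=8, M_r=4):
    """Concatenate, for every word k, its frequency and its pairwise spatial histogram."""
    word_ids = np.asarray(word_ids)
    positions = np.asarray(positions, dtype=float)
    freq = np.bincount(word_ids, minlength=K).astype(float)
    freq /= max(freq.sum(), 1.0)
    blocks = []
    for k in range(K):
        spatial = pairwise_spatial_histogram(positions[word_ids == k], M_theta, M_r)
        # the frequency of word k stays intact; its bin is expanded with the spatial histogram
        blocks.append(np.concatenate(([freq[k]], freq[k] * spatial)))
    return np.concatenate(blocks)          # final signature of length K * (1 + M_theta * M_r)
```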

3 Experiment Results

To evaluate the performance of the proposed algorithm, we implemented the proposed AR-TSR method on a Core i7 640LM 2.13 GHz machine running Windows 7, using Visual Studio 2010 and the OpenCV 2.4.8 library. In this paper, we focus on the detection of speed limit signs, unique signs, and danger signs. We implemented the suggested method in C++ and tested its performance on the German Traffic Sign Detection Benchmark (GTSDB) dataset [15]. The GTSRB dataset contains 51,839 German traffic signs in 43 classes. These classes have been divided into six subsets: speed limit signs, danger signs, mandatory signs, unique signs, derestriction signs, and other prohibitory signs [17]. The resolution of each image in the dataset is \(1366 \times 800\). Here, we present a quantitative analysis of the proposed system.

3.1 Performance of the Proposed Method

The database used to train the classifiers has been designed using the ROIs obtained from the detection step and the model fitting methods presented in the previous sections. In order to evaluate the occlusion robustness of the suggested classification method, the content of the detected ROI is identified using the tree classifiers. This classifier is tested on static, low-resolution sign images. To measure performance, experiments have also been conducted on the GTSRB; Table 1 shows the classification rates of the linear SVM.
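As an illustration of the classification stage, a linear SVM could be trained on the resulting signatures as sketched below; scikit-learn is assumed here purely for illustration (the paper's experiments use a C++/OpenCV implementation), and the regularization value C is a placeholder, not a reported parameter.

```python
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

def train_and_evaluate(X_train, y_train, X_test, y_test):
    """X_*: rows are the enhanced BoVW signatures; y_*: traffic sign class labels."""
    clf = LinearSVC(C=1.0)                 # C is a placeholder, not a reported parameter
    clf.fit(X_train, y_train)
    return clf, accuracy_score(y_test, clf.predict(X_test))
```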

Table 1. Accuracy results for traffic sign classification

Furthermore, we evaluate the classification task on the detected signs returned by the previous detection module. As shown in Table 1, the overall classification accuracy is 99.31%. Note that only 3 out of 1500 speed limit signs and only 6 out of 890 danger signs are falsely classified. Experiments demonstrate that our approach succeeds in adding relative spatial information into the BoVW model by encoding both the global and local relative distribution of visual words over an image.

3.2 Comparisons with Other State-of-the-Art Methods

To verify the discriminative power and computational efficiency of the proposed feature for traffic sign detection, experiments are carried out on the publicly available traffic sign datasets. Because the training and testing samples in the GTSRB and GTSDB datasets are split according to a fixed rule, an absolute performance comparison with other reported approaches is possible. We report these results in Table 2, where the results of the winning system from the IJCNN challenge and some results reported at IJCNN 2011 are provided as references.

The performance results of the machine learning algorithms are all significantly different from each other. We have compared the suggested method with other state-of-the-art algorithms such as the committee of CNNs [5], Human Performance [16], Multi-Scale CNNs [14], Random Forests [8, 18], wgy@HIT501 [9], VISICS [13], LITS1 [12, 17], and Viola-Jones. The performance is analyzed in terms of detection and recognition accuracy.

Table 2. Performance comparison with other traffic sign recognition methods.

According to the results on the GTSRB data set, shown in Table 2, this work achieves a recognition accuracy of 99.31%, which is comparable to the work by [5] (0.24% lower), 0.17% higher than the work by [16], 0.69% higher than the work by [17], and 1.51% higher than the work by [14]. The accuracy of recognizing unique signs reaches 99.31%, which is comparable with the best reported result. The danger signs, which have a triangular shape, give the worst results compared with the other traffic sign categories.

To demonstrate the effectiveness of the proposed method, we compare its performance with the standard BoVW on the GTSRB data set. The comparison results are reported in Table 3, where we examine the benefit of integrating spatial information into the standard BoVW.

Table 3. Comparison of our approach with the standard BoVW methods.

As shown in Table 3, by combining spatial information with the traditional BoVW, our method outperforms the current state-of-the-art methods on the GTSRB database. Comparisons with the standard BoVW model show that our method is a richer alternative to the usual BoVW approach for building a visual vocabulary. Our method provides more representative semantic information about the traffic signs, including spatial information between visual words. We have shown that the spatial information between visual words is important for category-level recognition: the distribution of similar interest regions in the images is discriminative and can significantly improve the performance of the BoVW method.

4 Conclusions

Driving is a complex, continuous, and multitask process that involves the driver’s cognition, perception, and motor movements. The way road traffic signs and vehicle information are displayed strongly affects the driver’s attention; increased mental workload leads to safety concerns. We have developed a novel approach to visual word construction that takes the spatial information of keypoints into account in order to enhance the quality of the visual words generated from the extracted keypoints. In this paper, we have proposed a new computationally efficient method to model the global spatial distribution of visual words and improved the standard BoVW representation. The results clearly demonstrate that the additional relative spatial information provided by our approach complements the standard representation, improving accuracy while maintaining short retrieval time, and that it achieves better traffic sign classification accuracy than methods based on the traditional BoVW model. Experimental results show that the suggested method reaches performance comparable to state-of-the-art approaches with less computational complexity and shorter training time.