Abstract
Traffic sign recognition is among the major tasks of driver assistance systems. Convolutional neural networks (CNNs) play an important role in achieving high traffic sign recognition accuracy, which helps to limit dangerous driver behavior and to enforce road laws. The accuracy of detection and classification determines how powerful the technique used is. The SSD MultiBox (Single Shot MultiBox Detector), an approach based on the convolutional neural network paradigm, is adopted in this paper: firstly, because it is suitable for real-time applications, running at 59 FPS (frames per second); and secondly, because its use of multiple layers of a deeper CNN provides finer accuracy. Moreover, our experiment on the German Traffic Sign Recognition Benchmark (GTSRB) demonstrated that the proposed approach achieves competitive results (83.2% after 140,000 learning steps) using a GPU parallel system and TensorFlow.
1 Introduction
Traffic sign recognition is an active research area today; recently, it has drawn considerable attention as a way to limit dangerous driver behavior.
Although traffic signs are designed with simple geometric shapes and basic colors, blurry images, vehicle speed, and lighting conditions make the recognition process difficult [14]. The challenge is therefore to recognize them quickly by computer. Neural network methods open up broad research prospects, especially in the driver assistance field, so the transition from traditional traffic sign recognition methods to advanced ones forms a historical milestone. Among the former, we find HOG (Histogram of Oriented Gradients), HOF (Histogram of Oriented Flow), and other descriptors based on orientation histograms, as well as the Viola-Jones pattern (Haar features), which makes the detection task more flexible under certain instructions and constraints. More recently, convolutional neural networks have created a framework that makes traffic sign recognition accurate, scalable, and fault-tolerant [2, 3, 8, 17, 19].
In general, the traffic sign recognition paradigm depends on two major aspects: detection and classification. Classical object detection approaches include selective search and the deformable part model [5, 20]. More recently, sliding windows and region proposal classification [9, 11] were introduced with the objective of fast detection. On the classification side, the methods indicated previously are in some cases hybridized with machine learning techniques (logistic regression, kernel SVM, PCA, kNN, etc.) in order to build a good classification/clustering system [13, 21]. A simple example is R-CNN [6] and its extension Fast R-CNN.
Although deeper neural networks become harder to train as the number of layers grows, residual networks have enabled very deep network training without vanishing or exploding gradients [7]. Zeng et al. [22] combined a CNN with an ELM (Extreme Learning Machine) and achieved 99.40% accuracy on the GTSRB dataset. Even though this accuracy is good, the limitation of the approach is that the last convolutional layer has a 200-dimensional output: with 2,916,000 output parameters, a small CPU/GPU platform makes this approach hard to use. Indeed, the work in this paper was influenced significantly by the recent revolution in neural networks to provide extra help to the driver.
In this paper, we chose the Single Shot MultiBox Detector [12]: firstly, because object localization and classification are done in a single forward pass of the network; secondly, because its box matching strategy can handle objects of different input sizes; and thirdly, because the system runs at 59 FPS, outperforming Faster R-CNN (7 FPS) [16] and YOLO (45 FPS) [15], and can operate even on 4K clips. Our application is implemented on the TensorFlow architecture using the Jetson TX2 AI supercomputer. This paper does not yet present final results; the research is still in progress. So far, our approach reaches an accuracy of 83.2%.
The remainder of this paper is organized as follows: the model is described in Sect. 2, the experimental results are presented in Sect. 3, and the conclusion is given in Sect. 4.
2 The SSD MultiBox Model
The SSD MultiBox architecture (Fig. 1) is based on the VGG-16 network (a feed-forward CNN) and produces a fixed-size collection of bounding boxes. Moreover, the use of multiple feature layers in SSD allows the network to learn much more during the training phase and provides finer accuracy across different scales.
SSD detection utilizes anchor boxes (Fig. 2(d)) to generate RoIs (Regions of Interest) at differing resolutions, capturing objects regardless of their size (Fig. 2(a)).
The simple scales and aspect ratios of SSD play a major role in getting the appropriate "Ground Truth" dimensions, which lead to the computation of the default boxes at each feature map location [10] (Eq. (1)).
Besides the \(38*38\) feature map of the VGG-16 network, the SSD architecture adds further layers, firstly to produce feature maps of sizes \(19*19\), \(10*10\), \(5*5\), \(3*3\), and \(1*1\), and secondly to predict the bounding boxes.
In Eq. (1), the scale of the default boxes for the k-th feature map is computed as \(scale_{k} = scale_{min} + \frac{scale_{max} - scale_{min}}{m-1}\left( k-1 \right) \), \(k \in \left[ 1, m \right] \), where \(scale_{min}\) = 0.2, \(scale_{max}\) = 0.9, and m is the number of prediction feature maps [12]. For each RoI, SSD classification uses \(3*3\) receptive fields: first, to estimate the 4 localization offsets (\(center-x, center-y, width, height\), i.e. \(\varDelta \left( cx, cy, w, h \right) \)); second, to estimate the category confidences for every object against the "Ground Truth" boxes.
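As an illustration, the per-layer scales of Eq. (1) can be computed with a few lines of Python (a minimal sketch; the function name is ours, and m = 6 corresponds to the six prediction feature maps listed above):

```python
def default_box_scales(m=6, s_min=0.2, s_max=0.9):
    """Scale assigned to each of the m prediction feature maps (Eq. (1) in the text).

    s_k = s_min + (s_max - s_min) / (m - 1) * (k - 1), for k in [1, m].
    """
    step = (s_max - s_min) / (m - 1)
    return [s_min + step * (k - 1) for k in range(1, m + 1)]

# Lowest scale is used on the largest feature map (small objects),
# highest scale on the 1*1 map (big objects).
scales = default_box_scales()
print([round(s, 2) for s in scales])
```

The scales grow linearly from 0.2 to 0.9, matching the intuition that early, high-resolution layers detect small signs and late layers detect large ones.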
We note that \(conv4\_3\) and \(conv11\_2\) are the key layers for detecting the smallest and the biggest objects, respectively. As an example, in Fig. 2 it is noteworthy that on the \(8*8\) feature map the cat (object) has one matched box (Fig. 2(b)) while the child (object) has none; the child is matched only on the \(4*4\) feature map (Fig. 2(c)).
Thus, the default boxes at each cell connect to form the output of this network [12], which consists of:
-
A probability vector of length c, where c is the number of object classes plus one background class indicating no object.
-
A vector of four elements (x, y, width, height) representing the offsets required to move the default box to the real object's position.
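The role of the offset vector can be illustrated with a small decoding sketch, assuming the usual SSD center-size encoding of [12] (the function name is ours):

```python
import math

def decode_box(default, offsets):
    """Apply predicted offsets (dcx, dcy, dw, dh) to a default box (cx, cy, w, h).

    Following the SSD encoding [12]: center shifts are scaled by the default
    box size, and width/height corrections are applied exponentially.
    """
    d_cx, d_cy, d_w, d_h = default
    o_cx, o_cy, o_w, o_h = offsets
    cx = d_cx + o_cx * d_w
    cy = d_cy + o_cy * d_h
    w = d_w * math.exp(o_w)
    h = d_h * math.exp(o_h)
    return (cx, cy, w, h)

# With all-zero offsets the prediction coincides with the default box itself.
print(decode_box((0.5, 0.5, 0.2, 0.2), (0.0, 0.0, 0.0, 0.0)))
```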
In the training phase, a combined MultiBox loss function is calculated as a combination of the localization and classification errors [4], \(L\left( x,c,l,g \right) = \frac{1}{N}\left( L_{conf}\left( x,c \right) + \alpha L_{loc}\left( x,l,g \right) \right) \), which measures how far off the prediction "landed"; \(\alpha \) balances the contribution of the two loss terms.
The details are explained below:
Localization Loss

\(L_{loc}\left( x,l,g \right) = \sum _{i\in Pos}^{N}\sum _{m\in \left\{ cx,cy,w,h \right\} }x_{ij}^{k}\, smooth_{L1}\left( l_{i}^{m}-\hat{g}_{j}^{m} \right) \) [12], where \(\hat{g}\) denotes the ground truth offsets encoded relative to the default box d.
Classification Loss

\(L_{conf}\left( x,c \right) = -\sum _{i\in Pos}^{N}x_{ij}^{p}\log \left( \hat{c}_{i}^{p} \right) -\sum _{i\in Neg}\log \left( \hat{c}_{i}^{0} \right) \) with \(\hat{c}_{i}^{p}=\frac{\exp \left( c_{i}^{p} \right) }{\sum _{p}\exp \left( c_{i}^{p} \right) }\) [12].
-
l: the predicted box,
-
g: the ground truth box,
-
\(c_{xy}\): offsets center (cx, cy),
-
x: an indicator for matching the i-th default box to the j-th ground truth box of category p, \(x_{ij}^{p}\in \left\{ 0, 1 \right\} \) and \(\sum _{i}x_{ij}^{p}\ge 1\),
-
d: the default bounding box,
-
c: the class confidences,
-
w: width,
-
h: height,
-
N: the number of matched default boxes; if N = 0, the loss is set to 0.
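As a scalar sketch (not the TensorFlow implementation used in our experiments), the smooth L1 term and the combined MultiBox loss can be written as follows; the function names are ours:

```python
def smooth_l1(x):
    """Smooth L1 used by the localization loss: quadratic near 0, linear elsewhere."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def multibox_loss(loc_loss, conf_loss, n_matched, alpha=1.0):
    """Combined loss L = (1/N) * (L_conf + alpha * L_loc); set to 0 when N = 0."""
    if n_matched == 0:
        return 0.0
    return (conf_loss + alpha * loc_loss) / n_matched
```

The quadratic region of smooth L1 keeps gradients small for well-localized boxes, while the linear region limits the influence of outliers.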
Through this mechanism, the aim is to find the parameter values that best reduce the loss function, thereby bringing the predictions closer to the ground truth. In general, the model detects as many objects as possible per category. Then, at prediction time, a non-maximum suppression algorithm filters the multiple boxes per object that may arise across layers and produces the final detections. Finally, the model reports a similarity percentage for each object.
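Greedy non-maximum suppression with Jaccard overlap can be sketched in a few lines (a minimal illustration; the box format (x1, y1, x2, y2) and function names are our assumptions):

```python
def iou(a, b):
    """Jaccard overlap of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it above threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep
```

The 0.45 threshold corresponds to the Jaccard overlap value used in our experiments (Sect. 3).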
This model was trained on the GTSRB dataset [18]. The implementation script is written in Python running on the TensorFlow architecture, on a Jetson TX2 AI supercomputer with an NVIDIA Pascal GPU (256 CUDA cores), an HMP Dual Denver 2/2 MB L2 + Quad ARM A57/2 MB L2 CPU, and 8 GB of 128-bit LPDDR4 memory at 59.7 GB/s. The GPU is a large array of small processors originally designed to process images in NVIDIA video cards.
3 Experimental Results
The main aim of our experiment is to study the performance of the SSD approach in recognizing traffic signs using the GTSRB dataset. GTSRB contains more than 50,000 images, 39,209 for training and 12,630 for testing; they were recorded in Germany under various weather conditions, with image sizes between 15 * 15 and 250 * 250 pixels.
The GTSRB dataset was adapted to our system with few modifications. Thus, the annotation file of the dataset is represented in this format:
(Filename, Width, Height, ClassName, ROI.x1, ROI.y1, ROI.x2, ROI.y2) (Fig. 3). We also extended the GTSRB dataset (43 classesFootnote 1) by adding two traffic sign classes (Pedestrians crossing, No motor vehicles), Figs. 4 and 5, typical traffic signs in the Czech Republic (the images were collected from different regions of the city of Olomouc), with 400 training images and 100 testing images; the image sizes are between 100 * 100 and 600 * 600 pixels.
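A possible parsing sketch of this annotation format is shown below (the ';' delimiter, the sample filename, and the field values are assumptions for illustration; GTSRB ships ';'-separated annotation files):

```python
import csv
from io import StringIO

# Hypothetical annotation row in the format described above:
# (Filename, Width, Height, ClassName, ROI.x1, ROI.y1, ROI.x2, ROI.y2)
sample = "00000.ppm;53;54;Speed limit 20 km/h;5;6;48;49"

def parse_annotation(line, delimiter=";"):
    """Parse one annotation row into a dict with the image size, class, and RoI."""
    row = next(csv.reader(StringIO(line), delimiter=delimiter))
    return {
        "filename": row[0],
        "width": int(row[1]),
        "height": int(row[2]),
        "class_name": row[3],
        "roi": tuple(int(v) for v in row[4:8]),
    }

print(parse_annotation(sample)["roi"])
```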
We used a batch size of 24, Smooth L1 as the localization loss, non-maximum suppression with a Jaccard overlap threshold of 0.45 per class, Softmax as the activation function for the confidence loss, and a learning rate of \(10^{-3}\). We preserved the top 100 detections per image in the learning phase.
In Table 1, we present five stages of training as a first result (5K, 45K, 70K, 100K, and 140K learning steps). During training, the images are resized to 300 * 300 pixels, and every learning step takes around 4 s on average.
According to our experiment and the curves of Figs. 6, 7, 8 and 9, no clear detection is indicated at 5K learning steps, while at 10K learning steps the localization curve begins to decline, meaning that the localization of objects begins to improve; it then converges towards 0 as the number of steps grows. In parallel, the classification loss reaches the interval [1, 2] at 20K steps and the more interesting interval [0.5, 1.5] at 45K. We note that a high recognition accuracy requires more learning steps: at 140K learning steps, the recognition accuracy is 83.2%.
In comparison with the experiment of Alexander Shustanov and Pavel Yakimov [1], our approach runs at 59 FPS, and our recognition accuracy of 83.2% (not a final result) could outperform [1] with more training steps.
In Table 2, we present our first results in comparison with previous traffic sign recognition methods.
In light of these results, further improvement relies on better classification: traffic sign classes that share the same shape (triangle, circle, etc.) and have very small dimensions (e.g. 15 * 15) cannot be reliably distinguished by the extra symbol inside the road sign. Thus, the results for visually distinct signs are better than for similar ones (e.g. No overtaking versus No overtaking by lorries, or Road narrows (both sides) versus Road narrows (right side)). This kind of weakness shows only on small objects. Therefore, our prospect is to exclude from the training/testing dataset the noisy images that are misclassified by the network (threshold dimension: under 100 * 100 pixels) or that cannot present the objects clearly (bad quality). We will replace them with other, high-quality road sign images prepared in advance, to understand how much each ingredient impacts the final results. After that, we will test the system with different types of image quality and angles of view.
In addition, this study presents first results (Figs. 10 and 11), and our perspective is to run the training for 200K steps. Thereby, in theory, this approach can outperform the other results indicated in Table 2. Furthermore, it would be valuable to connect this system to other big-data components such as GPS, street mapping values, traffic signal timing, and the edges of the road.
4 Conclusions and Perspectives
In this paper, we presented an automatic model for evaluating driving decisions, with SSD MultiBox as the base architecture of our experiment. Our traffic sign recognition model showed a good recognition accuracy of 83.2% and a good computational cost on the GTSRB dataset, pending the final results for the real-time application.
Clearly, 140,000 learning steps are not enough for the best traffic sign recognition system; however, with an internal memory that grows through more experience and the new database, the recognition accuracy will be further improved.
On the other side, the TensorFlow platform and the NVIDIA Jetson TX2 AI supercomputer provided considerable value to our implementation, especially in computation time. Our prospect is to improve this work and apply it to autonomous vehicles, adapting the driver/car behavior to the new specific situations we face rather than remaining standard.
Notes
- 1.
Speed limit 20 km/h, Speed limit 30 km/h, Speed limit 50 km/h, Speed limit 60 km/h, Speed limit 70 km/h, Speed limit 80 km/h, End of speed limit 80 km/h, Speed limit 100 km/h, Speed limit 120 km/h, No overtaking, No overtaking by lorries, Junction with minor roads, Main road, Give way, Stop and give way, No entry for vehicles (both directions), No lorries, No entry for vehicles, Other hazard, Curve to the left, Curve to the right, Double curve, first to the left, Bumpy road, Danger of skidding, Road narrows (right side), Roadworks, Traffic lights ahead, Caution for pedestrians, Caution school, Caution for bicyclists, Be careful in winter, Wild animals, End of all prohibitions, Turn right ahead, Turn left ahead, Ahead only, Ahead or right only, Ahead or left only, Keep right, Keep left, Roundabout, End of no-overtaking zone, End of no-overtaking zone for lorries.
References
Shustanov, A., Yakimov, P.: CNN design for real-time traffic sign recognition. Procedia Eng. 201, 718–725 (2017)
Caglayan, A., Can, A.B.: An empirical analysis of deep feature learning for RGB-D object recognition. In: Karray, F., Campilho, A., Cheriet, F. (eds.) ICIAR 2017. LNCS, vol. 10317, pp. 312–320. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-59876-5_35
Ciregan, D., Meier, U., Schmidhuber, J.: Multi-column deep neural networks for image classification. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3642–3649, June 2012. https://doi.org/10.1109/CVPR.2012.6248110
Erhan, D., Szegedy, C., Toshev, A., Anguelov, D.: Scalable object detection using deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2147–2154 (2014)
Felzenszwalb, P., McAllester, D., Ramanan, D.: A discriminatively trained, multiscale, deformable part model. In: 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2008. https://doi.org/10.1109/CVPR.2008.4587597
Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, pp. 580–587. IEEE Computer Society, Washington, DC (2014). http://dx.doi.org/10.1109/CVPR.2014.81
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR abs/1512.03385 (2015). http://arxiv.org/abs/1512.03385
He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV 2015, pp. 1026–1034. IEEE Computer Society, Washington, DC (2015). http://dx.doi.org/10.1109/ICCV.2015.123
Hosang, J.H., Benenson, R., Schiele, B.: How good are detection proposals, really? CoRR abs/1406.6962 (2014). http://arxiv.org/abs/1406.6962
Kosub, S.: A note on the triangle inequality for the Jaccard distance, December 2016
Lampert, C., Blaschko, M., Hofmann, T.: Beyond sliding windows: object localization by efficient subwindow search. In: CVPR 2008, pp. 1–8. Max-Planck-Gesellschaft, IEEE Computer Society, Los Alamitos, June 2008. Best paper award
Liu, W., et al.: SSD: single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21–37. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46448-0_2
Martinović, A., Glavaš, G., Juribašić, M., Sutić, D., Kalafatić, Z.: Real-time detection and recognition of traffic signs. In: The 33rd International Convention MIPRO, pp. 760–765, May 2010
Nguwi, Y., Kouzani, A.Z.: A study on automatic recognition of road signs. In: 2006 IEEE Conference on Cybernetics and Intelligent Systems, pp. 1–6, June 2006. https://doi.org/10.1109/ICCIS.2006.252289
Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788, June 2016. https://doi.org/10.1109/CVPR.2016.91
Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. In: Cortes, C., Lawrence, N.D., Lee, D.D., Sugiyama, M., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 28, pp. 91–99. Curran Associates, Inc. (2015). http://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556 (2014). http://arxiv.org/abs/1409.1556
Stallkamp, J., Schlipsing, M., Salmen, J., Igel, C.: The German traffic sign recognition benchmark: a multi-class classification competition. In: IEEE International Joint Conference on Neural Networks, pp. 1453–1460 (2011)
Szegedy, C., et al.: Going deeper with convolutions. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–9, June 2015. https://doi.org/10.1109/CVPR.2015.7298594
Uijlings, J.R.R., van de Sande, K.E.A., Gevers, T., Smeulders, A.W.M.: Selective search for object recognition. Int. J. Comput. Vis. 104(2), 154–171 (2013). https://doi.org/10.1007/s11263-013-0620-5
Zaklouta, F., Stanciulescu, B.: Real-time traffic sign recognition in three stages. Robot. Auton. Syst. 62(1), 16–24 (2014). https://doi.org/10.1016/j.robot.2012.07.019
Zeng, Y., Xu, X., Fang, Y., Zhao, K.: Traffic sign recognition using deep convolutional networks and extreme learning machine. In: He, X., et al. (eds.) IScIDE 2015. LNCS, vol. 9242, pp. 272–280. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-23989-7_28
Copyright information
© 2018 Springer Nature Switzerland AG
Cite this paper
El Ouadrhiri, A.A., Burian, J., Andaloussi, S.J., El Morabet, R., Ouchetto, O., Sekkaki, A. (2018). Fast-Tracking Application for Traffic Signs Recognition. In: Chmielewski, L., Kozera, R., Orłowski, A., Wojciechowski, K., Bruckstein, A., Petkov, N. (eds) Computer Vision and Graphics. ICCVG 2018. Lecture Notes in Computer Science(), vol 11114. Springer, Cham. https://doi.org/10.1007/978-3-030-00692-1_34
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00691-4
Online ISBN: 978-3-030-00692-1