
SignParser: An End-to-End Framework for Traffic Sign Understanding

International Journal of Computer Vision

Abstract

In intelligent transportation systems, parsing traffic signs and conveying traffic information to humans is a pressing need. However, despite the success achieved in detecting and recognizing low-level circular or triangular traffic signs, parsing the more complex and informative rectangular traffic signs remains unexplored and challenging. Our work is devoted to the topic of "Traffic Sign Understanding (TSU)", which aims to parse various traffic signs and generate semantic descriptions for them. To achieve this goal, we propose an end-to-end framework that integrates component detection, content reasoning, and semantic description generation. The component detection module first detects initial components in the sign image. The content reasoning module then derives the detailed content of the sign, including the final components, their relations, and the layout category, which together provide local and global information for the subsequent module. Finally, the semantic description generation module mines relational attributes and text semantic attributes from the preceding results, embeds them together with the layout category, and transforms them into semantic descriptions through a dynamic prediction transformer. The three modules are trained jointly in an end-to-end manner to optimize overall performance. The method achieves state-of-the-art performance not only on the final semantic description generation stage but also on multiple subtasks of the CASIA-Tencent CTSU Dataset. Extensive ablation experiments demonstrate the effectiveness of the method.
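For readers who want a concrete picture of the pipeline, the following is a minimal PyTorch-style sketch of how the three modules could be composed and trained jointly. All class names, interfaces, and loss weights here are hypothetical illustrations based only on the abstract, not the authors' implementation.

    # Hypothetical sketch of the three-module pipeline described in the abstract.
    # Module interfaces and loss weights are placeholders, not the paper's design.
    import torch.nn as nn

    class SignParserSketch(nn.Module):
        def __init__(self, detector, reasoner, generator):
            super().__init__()
            self.detector = detector    # component detection module
            self.reasoner = reasoner    # content reasoning module
            self.generator = generator  # semantic description generation module

        def forward(self, sign_image):
            # 1. Detect initial components (e.g., texts, symbols, arrowheads).
            components = self.detector(sign_image)
            # 2. Reason over the detections to obtain final components, their
            #    relations, and the sign's layout category (local + global info).
            components, relations, layout = self.reasoner(sign_image, components)
            # 3. Mine relational and text semantic attributes, embed them with
            #    the layout category, and decode a semantic description.
            description = self.generator(components, relations, layout)
            return components, relations, layout, description

    def joint_loss(det_loss, reason_loss, gen_loss, weights=(1.0, 1.0, 1.0)):
        # End-to-end training optimizes a weighted sum of the three modules'
        # losses; the weights here are placeholders.
        return sum(w * l for w, l in zip(weights, (det_loss, reason_loss, gen_loss)))

Because all three losses flow through shared parameters in one computation graph, a single backward pass updates the detector, reasoner, and generator together, which is what "trained jointly in an end-to-end manner" implies.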


Data Availability

The CASIA-Tencent CTSU Dataset analysed during the current study is available at http://www.nlpr.ia.ac.cn/databases/CASIA-TencentCTSU/index.html. The Chinese Traffic Sign Database (CTSDB) analysed during the current study is available at http://www.nlpr.ia.ac.cn/pal/trafficdata/recognition.html.

Notes

  1. The dataset is available at http://www.nlpr.ia.ac.cn/databases/CASIA-TencentCTSU/index.html.


Author information

Corresponding author

Correspondence to Yunfei Guo.

Additional information

Communicated by Dimosthenis Karatzas.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Figs. 12 and 13.

Fig. 12

Reference translations of the Chinese text on the traffic signs in this paper. The signs are arranged in the order in which they appear in the paper. The Chinese text is highlighted by green boxes, below or beside which are the English translations (Color figure online)

Fig. 13

Two example English traffic signs with their relation annotations and semantic descriptions. Some of the auxiliary annotations are shown on the right, including components (text in green boxes, symbols in yellow boxes, and arrowheads in red boxes) and their relations (association relations shown as pink lines and pointing relations as light blue lines) (Color figure online)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, Y., Feng, W., Yin, F. et al. SignParser: An End-to-End Framework for Traffic Sign Understanding. Int J Comput Vis 132, 805–821 (2024). https://doi.org/10.1007/s11263-023-01912-9

