DOI:10.1145/3474085.3475362
research-article

Learning to Understand Traffic Signs

Published: 17 October 2021

Abstract

One of the critical tasks of an intelligent transportation system is to understand traffic signs and convey traffic information to humans. However, most related work focuses on detecting and recognizing the texts or symbols on traffic signs, which is not sufficient for understanding them. Moreover, no public dataset has been available for traffic sign understanding research. Our work takes the first step towards addressing this problem. First, we propose the "CASIA-Tencent Chinese Traffic Sign Understanding Dataset" (CTSU Dataset), which contains 5000 images of traffic signs with rich semantic descriptions. Second, we introduce a novel multi-task learning architecture that extracts text and symbol information from traffic signs, reasons about the relationships between texts and symbols, classifies signs into different categories, and finally composes descriptions of the signs. Experiments show that the task of traffic sign understanding is achievable and that our architecture achieves state-of-the-art performance. The CTSU Dataset is available at http://www.nlpr.ia.ac.cn/databases/CASIA-Tencent%20CTSU/index.html.
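
The final stage described in the abstract (composing a description from texts, symbols, their relationships, and the sign category) can be illustrated with a toy sketch. This is not the authors' architecture, which is a learned multi-task model; it is a rule-based stand-in that assumes the earlier stages have already produced their outputs. The names `Component`, `Relation`, `compose_description`, the `"points_to"` predicate, and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    kind: str   # "text" or "symbol" (hypothetical labels)
    value: str  # recognized string or symbol name


@dataclass
class Relation:
    subject: int    # index of the subject component
    obj: int        # index of the object component
    predicate: str  # hypothetical relation label, e.g. "points_to"


def compose_description(category: str,
                        components: List[Component],
                        relations: List[Relation]) -> str:
    """Turn symbol-to-text relations into a template-based description."""
    phrases = []
    for rel in relations:
        subj = components[rel.subject]
        obj = components[rel.obj]
        # Only one illustrative pattern is handled: a symbol pointing to a text.
        if (rel.predicate == "points_to"
                and subj.kind == "symbol" and obj.kind == "text"):
            phrases.append(f"{subj.value} to {obj.value}")
    if not phrases:
        return f"{category} sign"
    return f"{category} sign: " + "; ".join(phrases)


# Hypothetical outputs of the extraction and relation-reasoning stages.
components = [
    Component("symbol", "turn left"),
    Component("text", "Road A"),
    Component("symbol", "go straight"),
    Component("text", "Road B"),
]
relations = [Relation(0, 1, "points_to"), Relation(2, 3, "points_to")]

description = compose_description("guide", components, relations)
print(description)
```

A fixed template like this only covers one relation pattern; it is meant to show how the four intermediate outputs fit together, not how the paper's model generates text.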

Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
ISBN:9781450386517
DOI:10.1145/3474085
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. multi-task learning
  2. semantic description
  3. traffic sign understanding

Qualifiers

  • Research-article

Funding Sources

  • National Key Research and Development Program Grant
  • NSFC

Conference

MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

Article Metrics

  • Downloads (Last 12 months): 46
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 14 Feb 2025

Cited By

  • (2024) Trust Prophet or Not? Taking a Further Verification Step toward Accurate Scene Text Recognition. Proceedings of the 32nd ACM International Conference on Multimedia, 1741-1750. DOI:10.1145/3664647.3681551. Online publication date: 28-Oct-2024
  • (2024) Traffic Sign Interpretation via Natural Language Description. IEEE Transactions on Intelligent Transportation Systems 25:11, 18939-18953. DOI:10.1109/TITS.2024.3428619. Online publication date: Nov-2024
  • (2023) A Feedback-Driven DNN Inference Acceleration System for Edge-Assisted Video Analytics. IEEE Transactions on Computers 72:10, 2902-2912. DOI:10.1109/TC.2023.3275094. Online publication date: 1-Oct-2023
  • (2023) Towards the Semantic Interpretation of Arbitrary Traffic Signs: Semantic Parsing for Action-oriented Signs. 2023 International Conference on Machine Learning and Applications (ICMLA), 1771-1777. DOI:10.1109/ICMLA58977.2023.00269. Online publication date: 15-Dec-2023
  • (2023) Visual Traffic Knowledge Graph Generation from Scene Images. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 21547-21556. DOI:10.1109/ICCV51070.2023.01975. Online publication date: 1-Oct-2023
  • (2023) SignParser: An End-to-End Framework for Traffic Sign Understanding. International Journal of Computer Vision 132:3, 805-821. DOI:10.1007/s11263-023-01912-9. Online publication date: 17-Oct-2023
  • (2022) On Salience-Sensitive Sign Classification in Autonomous Vehicle Path Planning: Experimental Explorations with a Novel Dataset. 2022 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), 636-644. DOI:10.1109/WACVW54805.2022.00070. Online publication date: Jan-2022
