Abstract
The sharp rise in the number of vehicles on the road has led to numerous traffic violations (Road Transport and Highways 2021, https://morth.nic.in/road-accident-in-india). Detecting traffic violations in a dynamic environment is a complex task. This paper focuses on detecting one particular violation: riding a motorcycle without a helmet. We address the problem as an object detection task and propose a novel convolutional encoder–transformer decoder (CETD) architecture for it. The architecture comprises two primary modules: a convolutional neural network (CNN)-based encoder that extracts high-level features from input images, and a transformer-based decoder that uses attention mechanisms to identify relevant components in the image, such as helmets or missing helmets. It is designed to achieve accurate object detection and localization by combining advanced feature extraction with state-of-the-art attention mechanisms. The layer normalization module of the proposed architecture acts as an intermediate bias stabilizer for the encoder–decoder network. The design also includes a standard backbone feature extractor and a fused backbone feature extractor. The model detects occluded objects with higher confidence than other state-of-the-art models. The detector works in an end-to-end fashion with fewer handcrafted features. We studied the applicability of the model on a miniature version of the COCO dataset, where it performs competitively with models such as the Faster region-based convolutional neural network (Faster R-CNN) and the Mask region-based convolutional neural network (Mask R-CNN). The proposed model is also fine-tuned on traffic data with occlusion for helmet detection.
The model’s performance on helmet detection from traffic data is comparable with state-of-the-art real-time detectors such as EspiNet V2 (a modified Faster R-CNN network), the single-shot multibox detector (SSD), and the You Only Look Once version 5 (YOLO v5) detector. Specifically, the model outperforms EspiNet V2 by 0.47, SSD by 3.9, and YOLO v5 by 0.37 in terms of mean average precision. Object-aware copy–paste augmentation improves the model’s mean average precision by a further 0.87. The model’s average occlusion detection confidence is 5.1 percent higher than that of YOLO v5. Experimental results show that the proposed model adapts better to both specific object (helmet) detection and generic object detection tasks.
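The layer normalization step mentioned in the abstract can be illustrated with a minimal sketch. It follows the standard definition of Ba et al. (2016), not the authors' implementation: the abstract does not say where in the CETD network the module sits or how it is parameterized. The function name `layer_norm`, the `eps` value, and the toy `features` are assumptions made for illustration only.

```python
import math


def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance,
    then apply an element-wise scale (gamma) and shift (beta)."""
    n = len(x)
    gamma = gamma if gamma is not None else [1.0] * n  # identity scale by default
    beta = beta if beta is not None else [0.0] * n     # zero shift by default
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n          # biased variance over the features
    inv_std = 1.0 / math.sqrt(var + eps)               # eps avoids division by zero
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gamma, beta)]


# Hypothetical encoder output: one feature vector per spatial location.
features = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 50.0]]
normalized = [layer_norm(f) for f in features]
```

Each vector is normalized independently of the others in the batch, so the statistics do not depend on batch size. This property is why layer normalization is commonly preferred over batch normalization in transformer-style networks.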
Availability of data and materials
Data will be available on request.
References
Ahmed M, Hashmi KA, Pagani A et al (2021) Survey and performance analysis of deep learning based object detection in challenging environments. Sensors. https://doi.org/10.3390/s21155116
Anwer M, Shareef S, Ali A (2021) Accident vehicle types classification: a comparative study between different deep learning models. Indones J Electr Eng Comput Sci 21:1474–1484. https://doi.org/10.11591/ijeecs.v21.i3.pp1474-1484
Arman M, Hasan M, Sadia F et al (2020) Detection and classification of road damage using R-CNN and faster R-CNN: a deep learning approach, pp 730–741. https://doi.org/10.1007/978-3-030-52856-0_58
Ba J, Kiros J, Hinton GE (2016) Layer normalization. arXiv:1607.06450
Buch N, Velastin SA, Orwell J (2011) A review of computer vision techniques for the analysis of urban traffic. IEEE Trans Intell Transp Syst 12(3):920–939
Carion N, Massa F, Synnaeve G et al (2020) End-to-end object detection with transformers. arXiv:2005.12872
Carion N, Massa F, Synnaeve G et al (2020) End-to-end object detection with transformers. In: Vedaldi A, Bischof H, Brox T et al (eds) Computer vision—ECCV 2020. Springer, Cham, pp 213–229
Caron M, Misra I, Mairal J et al (2020) Unsupervised learning of visual features by contrasting cluster assignments. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS)
Chalavadi V, Singh D, Mohan CK et al (2017) Detection of motorcyclists without helmet in videos using convolutional neural network. In: 2017 International Joint Conference on Neural Networks (IJCNN), pp 3036–3041
Chen Q, Wang Y, Yang T et al (2021) You only look one-level feature. arXiv:2103.09460
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol 1, pp 886–893
Dequito C, Dichaves I, Juan R et al (2021) Vision-based bicycle and motorcycle detection using a yolo-based network. J Phys Conf Ser 1922:012003. https://doi.org/10.1088/1742-6596/1922/1/012003
Dhyanjith G, Manohar N, Raj AV (2021) Helmet detection using yolo v3 and single shot detector. In: 2021 6th International Conference on Communication and Electronics Systems (ICCES), pp 1844–1848. https://doi.org/10.1109/ICCES51350.2021.9489194
Espinosa JE, Velastín SA, Branch JW (2020) Detection of motorcycles in urban traffic using video analysis: a review. IEEE Trans Intell Transp Syst 22:6115–6130
Farhadi A, Redmon J (2018) Yolov3: an incremental improvement. arXiv:1804.02767
Felzenszwalb PF, Girshick RB, McAllester DA et al (2009) Object detection with discriminatively trained part based models. IEEE Trans Pattern Anal Mach Intell 32:1627–1645
Ghiasi G, Cui Y, Srinivas A et al (2020) Simple copy-paste is a strong data augmentation method for instance segmentation. arXiv:2012.07177
Ghiasi G, Cui Y, Srinivas A et al (2021) Simple copy-paste is a strong data augmentation method for instance segmentation. In: CVPR
Girshick R, Donahue J, Darrell T et al (2016) Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans Pattern Anal Mach Intell 38(1):142–158. https://doi.org/10.1109/TPAMI.2015.2437384
Girshick RB (2015) Fast r-cnn. In: 2015 IEEE International Conference on Computer Vision (ICCV), pp 1440–1448
Girshick RB, Felzenszwalb PF, McAllester D (2012) Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/
Guindel C, Martín D, Armingol JM (2017) Joint object detection and viewpoint estimation using cnn features. In: 2017 IEEE International Conference on Vehicular Electronics and Safety (ICVES), pp 145–150
Guo M, Xue D, Li P et al (2020) Vehicle pedestrian detection method based on spatial pyramid pooling and attention mechanism. Information. https://doi.org/10.3390/info11120583
Han C, Gao G, Zhang Y (2018) Real-time small traffic sign detection with revised faster-rcnn. Multimed Tools Appl 78:13263–13278
Haris M, Glowacz A (2021) Road object detection: a comparative study of deep learning-based algorithms. Electronics. https://doi.org/10.3390/electronics10161932
He K, Zhang X, Ren S et al (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37:1904–1916
Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167
Jia W, Xu S, Liang Z et al (2021) Real-time automatic helmet detection of motorcyclists in urban traffic using improved yolov5 detector. IET Image Process. https://doi.org/10.1049/ipr2.12295
Khan S, Rahmani H, Shah SAA et al (2018) A guide to convolutional neural networks for computer vision. In: A Guide to Convolutional Neural Networks for Computer Vision
Kumar A, Zhang ZJ, Lyu H (2020) Object detection in real time based on improved single shot multi-box detector algorithm. EURASIP J Wirel Commun Netw 2020(1):204. https://doi.org/10.1186/s13638-020-01826-x
Law H, Deng J (2018) Cornernet: detecting objects as paired keypoints. In: 15th European Conference on Computer Vision, ECCV 2018. Springer, pp 765–781
Li Y, Wei H, Han Z et al (2020) Deep learning-based safety helmet detection in engineering management based on convolutional neural networks. Adv Civ Eng 2020:9703560
Lin H, Deng JD, Albers D et al (2020) Helmet use detection of tracked motorcycles using cnn-based multi-task learning. IEEE Access 8:162073–162084. https://doi.org/10.1109/ACCESS.2020.3021357
Lin TY, Goyal P, Girshick RB et al (2017) Focal loss for dense object detection. In: 2017 IEEE International Conference on Computer Vision (ICCV), pp 2999–3007
Liu L, Ouyang W, Wang X et al (2020) Deep learning for generic object detection: a survey. Int J Comput Vis 128(2):261–318. https://doi.org/10.1007/s11263-019-01247-4
Liu W, Anguelov D, Erhan D et al (2016) Ssd: single shot multibox detector. In: ECCV
Mao L (2022) Layer normalization explained. https://leimao.github.io/blog/Layer-Normalization/
Messelodi S, Modena C, Zanin M (2005) A computer vision system for the detection and classification of vehicles at urban road intersections. Pattern Anal Appl 8:17–31. https://doi.org/10.1007/s10044-004-0239-9
Miller D (2020) Probabilistic object detection with an ensemble of experts. In: European Conference on Computer Vision, Springer, pp 46–55
Nguyen M, Dang TH, Nguyen TT et al (2022) Improve object detection performance with efficient task-alignment module. In: 2022 IEEE 11th Global Conference on Consumer Electronics (GCCE). IEEE, pp 930–933
World Health Organization (2018) Global status report on road safety 2018: summary. Technical report, World Health Organization
Oviedo JEE, Velastin SA, Bedoya JWB (2019) Espinet v2: a region based deep learning model for detecting motorcycles in urban scenarios. DYNA
Redmon J, Farhadi A (2017) Yolo9000: better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 6517–6525
Redmon J, Divvala S, Girshick RB et al (2016) You only look once: unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 779–788
Ren S, He K, Girshick RB et al (2015) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39:1137–1149
Road Transport and Highways (2021) Ministry of road transport and highways. https://morth.nic.in/road-accident-in-india
Samet N, Hicsonmez S, Akbas E (2020) Houghnet: integrating near and long-range evidence for bottom-up object detection. In: European Conference on Computer Vision (ECCV)
Silva R, Aires K, Santos T et al (2013) Automatic detection of motorcyclists without helmet. In: Proceedings of the 2013 39th Latin American Computing Conference, CLEI 2013, pp 1–7. https://doi.org/10.1109/CLEI.2013.6670613
Singh B, Davis L (2018) An analysis of scale invariance in object detection—snip. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 3578–3587
Sivasankaran SK, Rangam H, Balasubramanian V (2021) Investigation of factors contributing to injury severity in single vehicle motorcycle crashes in India. Int J Inj Control Saf Promot 28(2):243–254. https://doi.org/10.1080/17457300.2021.1908367
Song S, Que Z, Hou J et al (2019) An efficient convolutional neural network for small traffic sign detection. J Syst Archit 97:269–277
Soviany P, Ionescu RT (2018) Optimizing the trade-off between single-stage and two-stage deep object detectors using image difficulty prediction. In: 2018 20th International Symposium on Symbolic and Numeric Algorithms for Scientific Computing (SYNASC), pp 209–214. https://doi.org/10.1109/SYNASC.2018.00041
Sun P, Jiang Y, Xie E et al (2021) What makes for end-to-end object detection? arXiv:2012.05780
Tan M, Pang R, Le QV (2020) Efficientdet: scalable and efficient object detection. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10778–10787
Tian Z, Shen C, Chen H et al (2019) Fcos: fully convolutional one-stage object detection. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp 9626–9635
Ulyanov D, Vedaldi A, Lempitsky V (2016) Instance normalization: the missing ingredient for fast stylization. arXiv:1607.08022
Vaswani A, Shazeer N, Parmar N et al (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S et al (eds) Advances in Neural Information Processing Systems, vol 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
Velastin S, Fernandez R, Espinosa Oviedo J et al (2020) Detecting, tracking and counting people getting on/off a metropolitan train using a standard video camera. Sensors 20(21):6251. https://doi.org/10.3390/s20216251
Viola PA, Jones M (2001) Robust real-time object detection. In: Robust Real-time Object Detection
Wang J, Song L, Li Z et al (2020) End-to-end object detection with fully convolutional network. arXiv preprint arXiv:2012.03544
Wang W, Wu B, Yang S et al (2018) Road damage detection and classification with faster r-cnn. In: 2018 IEEE International Conference on Big Data (Big Data), pp 5220–5223. https://doi.org/10.1109/BigData.2018.8622354
Wen X, Yuan H, Song C et al (2007) An algorithm based on svm ensembles for motorcycle recognition. In: 2007 IEEE International Conference on Vehicular Electronics and Safety, pp 1–5
Wu X, Sahoo D, Hoi S (2020) Recent advances in deep learning for object detection. Neurocomputing 396:39–64
Wu Z, Shen C, Hengel AV (2019) Wider or deeper: revisiting the resnet model for visual recognition. Pattern Recognit 90:119–133
Xiong R, Yang Y, He D et al (2020) On layer normalization in the transformer architecture. arXiv:2002.04745
Xu J, Sun X, Zhang Z et al (2019) Understanding and improving layer normalization. arXiv:1911.07013
Yang H, Fan B, Guo L (2020) Anchor-free object detection with mask attention. EURASIP J Image Video Process 2020(1):29. https://doi.org/10.1186/s13640-020-00517-3
Zhao ZQ, Zheng P, Xu ST et al (2019) Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst 30(11):3212–3232
Zou Z, Shi Z, Guo Y et al (2019) Object detection in 20 years: a survey. arXiv:1905.05055
Acknowledgements
The authors express their deep gratitude to the Centre of Excellence Artificial Intelligence Lab, NIT, Tiruchirappalli, for providing GPU time for training.
Funding
No funding was provided to carry out this research work.
Author information
Contributions
Both authors contributed equally to this work.
Ethics declarations
Conflict of interest
The authors have no other conflict of interest to disclose.
Ethical approval
Not applicable.
About this article
Cite this article
Jayanthan, K.S., Domnic, S. An attentive convolutional transformer-based network for road safety. J Supercomput 79, 16351–16377 (2023). https://doi.org/10.1007/s11227-023-05293-1
DOI: https://doi.org/10.1007/s11227-023-05293-1