
ESDAR-net: towards high-accuracy and real-time driver action recognition for embedded systems

Published in: Multimedia Tools and Applications

Abstract

Existing driver action recognition approaches face a fundamental trade-off between recognition accuracy and computational efficiency: high-capacity spatial-temporal deep learning models cannot run in real time on vehicle-mounted devices. To overcome this limitation, this paper proposes ESDAR-Net, a driver action recognition solution designed for embedded systems. ESDAR-Net is a multi-branch deep learning framework that operates directly on compressed videos. To reduce computational cost, a lightweight 2D/3D convolutional network performs the spatial-temporal modeling. Two strategies boost recognition accuracy: (1) a cross-layer connection module (CLCM) and a spatial-temporal trilinear pooling module (STTPM) adaptively fuse appearance and motion information; (2) complementary knowledge from a high-capacity spatial-temporal deep learning model is distilled and transferred to ESDAR-Net. Experimental results show that ESDAR-Net achieves both high accuracy and real-time performance: 98.7% on SEU-DAR-V1 and 96.5% on SEU-DAR-V2, with 2.19M learnable parameters, 0.253G FLOPs, and a throughput of 27 clips/s on an NVIDIA Jetson TX2.
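The second strategy, transferring complementary knowledge from a high-capacity teacher to the lightweight ESDAR-Net student, follows the standard soft-target knowledge-distillation recipe. The abstract does not give the paper's exact objective, so the sketch below is only a minimal illustration; the temperature `T`, mixing weight `alpha`, and function name are assumptions, not the authors' settings.

```python
# Minimal sketch of a soft-target distillation loss (Hinton et al., 2015):
# the student (ESDAR-Net) is trained against both the hard action labels
# and the temperature-softened predictions of a high-capacity teacher.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 4.0,        # assumed temperature
                      alpha: float = 0.7):   # assumed soft/hard weighting
    # KL divergence between temperature-softened distributions; the T^2
    # factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # Ordinary cross-entropy against the ground-truth action labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```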
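For the first strategy, the STTPM fuses multiple streams via trilinear pooling. Its actual architecture is not described in the abstract, so the following is a hedged sketch of *factorized* trilinear pooling in the spirit of factorized bilinear pooling: three streams (for example, I-frame appearance, motion vectors, and residuals from the compressed video) are projected into a shared factor space, multiplied element-wise, and sum-pooled. The class name, stream choices, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedTrilinearPooling(nn.Module):
    """Hedged sketch of a factorized three-way feature fusion."""
    def __init__(self, d_app, d_mv, d_res, factor_dim=512, k=4):
        super().__init__()
        self.proj_app = nn.Linear(d_app, factor_dim * k)
        self.proj_mv = nn.Linear(d_mv, factor_dim * k)
        self.proj_res = nn.Linear(d_res, factor_dim * k)
        self.factor_dim, self.k = factor_dim, k

    def forward(self, app, mv, res):
        # Element-wise product in the shared factor space approximates a
        # trilinear interaction between the three streams.
        z = self.proj_app(app) * self.proj_mv(mv) * self.proj_res(res)
        # Sum-pool groups of k factors, then apply signed square-root and
        # L2 normalization to stabilize the fused feature.
        z = z.view(z.size(0), self.factor_dim, self.k).sum(dim=2)
        z = torch.sign(z) * torch.sqrt(torch.abs(z) + 1e-12)
        return F.normalize(z, dim=1)

# Example with assumed dimensions: fuse 1024-d appearance features with
# 256-d motion-vector and 256-d residual features into a 512-d descriptor.
# fusion = FactorizedTrilinearPooling(1024, 256, 256)
# out = fusion(torch.randn(8, 1024), torch.randn(8, 256), torch.randn(8, 256))
```

A full trilinear outer product of three d-dimensional features would be cubic in d; the factorized form keeps the fusion cost linear in the projection width, which matters on an embedded target such as the Jetson TX2.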


Data Availability

The data that support the findings of this study are not publicly available due to portrait-rights restrictions.


Acknowledgements

The authors would like to thank the editor and the anonymous reviewers for their valuable comments and constructive suggestions.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 62203012, No. 61871123 and No. 61901221), the Open Research Fund of the Anhui Key Laboratory of Detection Technology and Energy Saving Devices (No. JCKJ2022A07), the Anhui Polytechnic University Introduced Talent Research Startup Fund (No. 2022YQQ009), the Youth Foundation of Anhui Polytechnic University (No. Xjky2022039), the Anhui Province Higher Education Quality Engineering Project (No. 2022jyxm139 and No. 2022kcsz027), the Anhui University Collaborative Innovation Project (No. GXXT-2020-0069) and the Anhui Natural Science Foundation Project (No. 2108085MF220).

Author information


Corresponding authors

Correspondence to Huicheng Yang or Xiaobo Lu.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Hu, Y., Shuai, Z., Yang, H. et al. ESDAR-net: towards high-accuracy and real-time driver action recognition for embedded systems. Multimed Tools Appl 83, 18281–18307 (2024). https://doi.org/10.1007/s11042-023-15777-0
