Pattern Recognition

Volume 66, June 2017, Pages 229-238

DeepSafeDrive: A grammar-aware driver parsing approach to Driver Behavioral Situational Awareness (DB-SAW)

https://doi.org/10.1016/j.patcog.2016.11.028

Highlights

  • Incorporates grammatical structure, which encodes the relationships between parts, into CNNs.

  • Proposes a fast and effective segmentation method that uses the deep probability map as prior knowledge to define within-subgraph and between-subgraph connections.

  • The deep features are capable of representing both appearance and shape information.

  • To the best of our knowledge, this is the first automatic system to support Driver Behavioral Situational Awareness (DB-SAW). In particular, it is the first system that finds the parts of the driver, i.e. head, body, seat belt, hands, eyes, mouth and nose.

Abstract

This paper presents a Grammar-aware Driver Parsing (GDP) algorithm, with deep features, to provide a novel driver behavior situational awareness system (DB-SAW). A deep model is first trained to extract highly discriminative features of the driver. A grammatical structure is then defined on the deep features and used as prior knowledge for semi-supervised proposal candidate generation. The Regions with Convolutional Neural Networks (R-CNN) method is finally utilized to precisely segment parts of the driver. The proposed method not only automatically finds parts of the driver in challenging “drivers in the wild” databases, i.e. the standardized Strategic Highway Research Program (SHRP-2) and the challenging Vision for Intelligent Vehicles and Application (VIVA), but is also able to investigate seat belt usage and the position of the driver's hands (on a phone vs. on the steering wheel). We conduct experiments on various applications and compare our GDP method against other state-of-the-art detection and segmentation approaches, i.e. SDS [1], CRF-RNN [2], DJTL [3], and R-CNN [4], on the SHRP-2 and VIVA databases.

Introduction

Driver safety is one of the major concerns in today's world. With the number of vehicles on the road increasing daily, the ability to use computer vision and machine learning algorithms to automatically assess a driver's vigilance is extremely important. One of the biggest challenges in this area is that drivers in videos are usually recorded in weak lighting, at low resolution, and with poor illumination control across day and night modes. A driver passes through varied terrain, e.g. open roads, under trees, etc., causing constant and often stark variations in illumination conditions. Furthermore, in the automotive field, the lower-resolution cameras and lower-power embedded processors preferred due to cost constraints create additional challenges.

In order to meet the goals of assessing driver safety, we propose a fully automatic system that is able to (1) simultaneously detect and segment the seat belt of a driver to determine whether the driver is wearing it (as shown in Fig. 1(D)); (2) analyze the upper parts of the driver usually recorded by in-vehicle cameras, including the head, body (torso) and seat belt, in order to detect whether the driver is looking forward and keeping their eyes on the road; (3) determine whether the driver is tired or starting to fall asleep, or is distracted by a conversation or by using a hand-held device, e.g. a cell phone; and (4) detect whether the driver's hands are on the steering wheel.
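
To make these checks concrete, the sketch below shows one hypothetical way a rule layer could sit on top of the parsed parts; it is not the paper's implementation. The part names (including "phone", which the paper does not list among the parsed parts), the IoU thresholds, and the wheel region are illustrative assumptions.

```python
# Hypothetical rule layer over the parsing output (illustrative only;
# part names, thresholds and the wheel region are assumptions).
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

@dataclass
class FrameParse:
    # per-part (confidence, bounding box) from the detector/segmenter
    parts: Dict[str, Tuple[float, Box]]

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def db_saw_checks(parse: FrameParse, wheel_region: Box,
                  conf_thresh: float = 0.5) -> Dict[str, bool]:
    """Checks (1), (3) and (4) above; check (2) would additionally
    need head-pose and eye-state estimation, omitted here."""
    def found(name: str) -> Optional[Box]:
        entry = parse.parts.get(name)
        return entry[1] if entry and entry[0] >= conf_thresh else None

    belt, hand, phone = found("seat_belt"), found("hand"), found("phone")
    return {
        "seat_belt_worn": belt is not None,
        "hands_on_wheel": hand is not None and iou(hand, wheel_region) > 0.1,
        "hand_on_phone": hand is not None and phone is not None
                         and iou(hand, phone) > 0.1,
    }
```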

Motivation: The recently popular Simultaneous Detection and Segmentation (SDS) method [1] uses Convolutional Neural Networks (CNNs) to classify category-independent region proposals and aims to detect all instances of a category in an image. Their experiments showed that the method achieves state-of-the-art results in object detection and semantic segmentation. However, SDS does not model the grammatical structure of objects and thus produces erroneous detections and segmentations, as shown in Fig. 1(B), where the seat belt is the object of interest, and in Fig. 2, where the hands on the phone are the objects of interest. The segmentation results from SDS share similar shapes with the objects of interest, but their positions are incorrect. In addition, since “driver in the wild” videos are usually of low resolution and poorly illuminated, SDS is unable to generate accurate proposal candidates.

In our proposed system, the global relative structure of the parts of the driver (i.e. head, body, seat belt) and the local relative structure (i.e. eyes, nose, mouth) are first modeled using the Pictorial Structures (PS) approach [6]. The detected deep probability map of the driver is then used as prior knowledge for proposal candidate generation to achieve accurate detection and segmentation. The flowchart of our proposed system is shown in Fig. 3. Unlike previous PS methods [6], [7], which rely on Histogram of Oriented Gradients (HOG) or Scale-Invariant Feature Transform (SIFT) features, the proposed PS employs deep features learned from our trained deep model described in Section 3.1. These deep features allow our proposed PS approach to achieve higher detection results than previous PS methods [6], [7].
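
For readers unfamiliar with Pictorial Structures, the score of a configuration of parts is the sum of per-part appearance scores (here, the deep-feature scores) minus pairwise deformation penalties over a tree of parts. The following is a minimal, brute-force sketch with hypothetical names and a toy star tree (body, head, seat belt); real PS inference uses distance transforms for efficiency, and nothing here reproduces the paper's actual model.

```python
# Illustrative sketch of pictorial-structure inference on a tree of parts.
# Simplified and hypothetical: a small candidate set makes brute force
# feasible; practical PS implementations use distance transforms.
import itertools
import math
from typing import Dict, List, Tuple

Loc = Tuple[float, float]

def ps_best_configuration(
    candidates: Dict[str, List[Loc]],   # candidate locations per part
    unary: Dict[str, List[float]],      # deep-feature score per candidate
    edges: List[Tuple[str, str]],       # tree edges (parent, child)
    rest: Dict[Tuple[str, str], Loc],   # expected child offset per edge
    weight: float = 1.0,                # deformation penalty weight
) -> Dict[str, int]:
    """Score = sum of unaries - weight * squared deviation from the
    expected offsets; returns the best candidate index per part."""
    parts = list(candidates)
    best, best_score = None, -math.inf
    for choice in itertools.product(*(range(len(candidates[p])) for p in parts)):
        idx = dict(zip(parts, choice))
        score = sum(unary[p][idx[p]] for p in parts)
        for (u, v) in edges:
            (ux, uy), (vx, vy) = candidates[u][idx[u]], candidates[v][idx[v]]
            dx, dy = rest[(u, v)]
            score -= weight * ((vx - ux - dx) ** 2 + (vy - uy - dy) ** 2)
        if score > best_score:
            best, best_score = idx, score
    return best

# Toy example: a star tree rooted at the body.
cands = {"body": [(50, 80)], "head": [(48, 30), (90, 35)],
         "seat_belt": [(55, 100), (20, 100)]}
scores = {"body": [2.0], "head": [1.5, 1.4], "seat_belt": [1.0, 1.2]}
tree = [("body", "head"), ("body", "seat_belt")]
offsets = {("body", "head"): (0, -50), ("body", "seat_belt"): (5, 20)}
print(ps_best_configuration(cands, scores, tree, offsets, weight=0.001))
```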

The contribution of our work can be summarized as follows:

  • Incorporate grammatical structure, which describes the relationships between parts, into Convolutional Neural Networks (CNNs)

  • Propose a fast and effective partition method that utilizes the deep probability map as prior knowledge to define within-subgraph and between-subgraph connections (a minimal sketch of this idea follows this list)

  • The deep features are capable of representing both appearance and shape information

  • To the best of our knowledge, this is the first time an automatic system to support Driver Behavioral Situational Awareness (DB-SAW) has been presented. In particular, it is the first system that finds the parts of the driver, i.e. head, body, seat belt, hands, eyes, mouth and nose. The detection and segmentation results of our system can be used for numerous tasks, e.g. head detection and pose estimation, hands-on-wheel verification, hands-on-phone evaluation, etc.
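
As a minimal sketch of the idea in the second bullet (assumed details, not the paper's exact formulation): the deep probability map can weight pixel affinities so that edges between pixels with similar part probabilities are strong (within-subgraph) while edges across probability discontinuities are weak (between-subgraph), which a partitioning step can then cut.

```python
# Hedged sketch: probability-map-weighted pixel affinities for graph
# partitioning. Names, the 4-connectivity choice, and thresholds are
# illustrative assumptions.
import numpy as np

def probability_weighted_affinities(prob: np.ndarray, sigma: float = 0.1):
    """Affinities between 4-connected neighbors, weighted by a deep
    probability map `prob` (HxW, values in [0, 1] from the DCNN).
    High affinity marks within-subgraph edges; low affinity marks
    between-subgraph edges."""
    right = np.exp(-((prob[:, :-1] - prob[:, 1:]) ** 2) / (2 * sigma ** 2))
    down = np.exp(-((prob[:-1, :] - prob[1:, :]) ** 2) / (2 * sigma ** 2))
    return right, down  # shapes Hx(W-1) and (H-1)xW

# Edges whose affinity falls below a threshold are candidate cuts
# separating subgraphs (e.g. seat belt vs. background).
prob = np.random.rand(4, 5)                    # toy probability map
right, down = probability_weighted_affinities(prob)
cut_right, cut_down = right < 0.5, down < 0.5  # between-subgraph edges
```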

Section snippets

Related work

This section reviews previous work on driver activity analysis. Since driver hand detection is a part of this work, we also review recent studies in hand detection.

The multimodal vision method [8] was presented to characterize driver activity based on head, eye and hand cues. The cues from these three inputs, fused using hierarchical Support Vector Machines (SVMs), enrich the descriptions of the driver's state, allowing for evaluation of driver performance captured in on-road settings. However, this

Grammar-aware driver parsing (GDP)

This section presents our Grammar-aware Driver Parsing approach, which uses awareness of the grammatical structure of the driver to detect and segment parts of the driver. The proposed algorithm uses grammatical structure as guidance and refinement for faster search, more precise localization and more accurate segmentation.

Our proposed algorithm first develops a GPU-based Caffe framework [16] to train a Deep Convolutional Neural Network (DCNN) model for objects of interest, namely, face, torso and
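
As context for the training step just described, a minimal pycaffe sketch of GPU-based DCNN training might look as follows; the solver and pretrained-weight file names are placeholders, since the paper's actual network and solver definitions are not given here.

```python
# Minimal pycaffe training sketch (assumed workflow; 'solver.prototxt'
# and 'pretrained.caffemodel' are placeholder file names).
import caffe

caffe.set_mode_gpu()   # GPU-based training, matching the framework above
caffe.set_device(0)    # first GPU

solver = caffe.SGDSolver('solver.prototxt')    # solver + net definition
solver.net.copy_from('pretrained.caffemodel')  # warm-start from pretrained weights
solver.solve()                                 # run SGD until max_iter
```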

Experimental results

Section 4.1 briefly reviews the main features of the databases used in our evaluations. Section 4.2 presents our training and evaluation steps for the DCNN used to extract deep features for the later experiments. Then, in the next three sections, we evaluate our proposed method on various tasks. Section 4.3 presents our experiments on parsing parts of the driver on the “drivers in the wild” database. Section 4.4 presents the problem of hands-on-wheel detection. Finally, Section 4.5 presents

Conclusion

This paper has presented a Grammar-aware Driver Parsing approach with our trained deep model to solve the problem of driver behavioral situational awareness. The proposed approach first designs a deep model to extract highly discriminative features of parts of the driver. A Pictorial Structure is employed to build the grammatical structure of these parts. The deep probability maps are then used as prior knowledge for a semi-supervised segmentation method to generate high-accuracy proposal

Conflict of interest

None declared.


References (30)

  • B. Hariharan, P. Arbeláez, R. Girshick, J. Malik, Simultaneous detection and segmentation, in: ECCV, 2014, pp....
  • S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, P.H.S. Torr, Conditional random fields...
  • X. Wang, L. Zhang, L. Lin, Z. Liang, W. Zuo, Deep joint task learning for generic object extraction, in: NIPS, 2014,...
  • R. Girshick et al., Region-based convolutional networks for accurate object detection and segmentation, IEEE TPAMI (2015)
  • The National Academies of Sciences, Engineering, and Medicine, The Second Strategic Highway Research Program (2006–2015) (SHRP-2),...
  • M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: people detection and articulated pose estimation,...
  • L. Pishchulin, M. Andriluka, P. Gehler, B. Schiele, Poselet conditioned pictorial structures, in: CVPR, IEEE, 2013, pp....
  • E. Ohn-Bar, S. Martin, A. Tawari, M.M. Trivedi, Head, eye, and hand patterns for driver activity recognition, in: ICPR,...
  • E. Ohn-Bar et al., Hand gesture recognition in real time for automotive interfaces: a multimodal vision-based approach and evaluations, IEEE Trans. ITS (2014)
  • A. Mittal, A. Zisserman, P.H.S. Torr, Hand detection using multiple proposals, in: British Machine Vision Conference,...
  • X. Sun, Y. Wei, S. Liang, X. Tang, J. Sun, Cascaded hand pose regression, in: CVPR, 2015, pp....
  • C. Qian, X. Sun, Y. Wei, X. Tang, J. Sun, Realtime and robust hand tracking from depth, in: CVPR, 2015, pp....
  • S. Sridhar, F. Mueller, A. Oulasvirta, C. Theobalt, Fast and robust hand tracking using detection-guided optimization,...
  • E. Trulls, S. Tsogkas, I. Kokkinos, A. Sanfeliu, F. Moreno-Noguer, Segmentation-aware deformable part models, in: CVPR,...
  • B. Rothrock, S. Park, S.C. Zhu, Integrating grammar and segmentation for human pose estimation, in: CVPR, 2013, pp....

T. Hoang Ngan Le is a student member of the IEEE. She is currently a Ph.D. student in Electrical and Computer Engineering at Carnegie Mellon University (CMU) and has been a research assistant in the CMU CyLab Biometrics Center since Sep. 2011. She was a Ph.D. student and research assistant at the Centre for Pattern Recognition and Machine Intelligence (CENPARMI) at Concordia University, Montreal, Canada from May 2010 to Sep. 2011. She received her Bachelor's and Master's degrees from the Department of Computer Sciences, Faculty of Information Technology, University of Natural Sciences, HCM City, Vietnam National University, in 2005 and 2009, respectively. Her current research interests focus on image segmentation, facial recognition, classification, sparse representation, and compressed sensing. Homepage: http://www.andrew.cmu.edu/user/thihoanl/

Chenchen Zhu is a student member of the IEEE. He is currently a Ph.D. student at Carnegie Mellon University (CMU), majoring in Electrical and Computer Engineering. He also works as a research assistant in the CMU CyLab Biometrics Center. His research interests focus on deep learning, computer vision, facial analysis, and pattern recognition. He received the M.S. degree from the Department of Electrical and Computer Engineering at CMU in 2015. Before joining CMU, he received the B.S. degree in 2013 from the Department of Electronic Information and Science at Nanjing University, China.

Khoa Luu has been a postdoctoral research scientist at the CMU CyLab Biometrics Center since April 2011. He was a doctoral student and research assistant at the Centre for Pattern Recognition and Machine Intelligence (CENPARMI) at Concordia University, Montreal, Canada from Jan. 2008 to March 2011. He received his Bachelor's degree from the Department of Computer Sciences, Faculty of Information Technology, University of Natural Sciences, HCM City, Vietnam National University in 2005. His research interests focus on various topics, including biometrics, image processing, multifactor analysis, compressed sensing and machine learning. Homepage: http://www.andrew.cmu.edu/user/kluu/

Marios Savvides is the Founder and Director of the Biometrics Center at Carnegie Mellon University and an Associate Research Professor in the Electrical & Computer Engineering Department and CMU CyLab. He is also one of the researchers tapped to form the Office of the Director of National Intelligence (ODNI) 1st Center of Academic Excellence in Science and Technology. His research is mainly focused on developing algorithms for robust face and iris biometrics as well as pattern recognition, machine vision and computer image understanding for enhancing biometric systems performance. He is on the program committees of several biometrics conferences such as IEEE BTAS, ICPR, SPIE Biometric Identification, IEEE AutoID and others, and has also organized and co-chaired the Robust Biometrics: Understanding the Science & Technology (ROBUST 2008) conference. He is an annually invited speaker at IDGA's main conference on Biometrics for National Security and Defense (now IDGA Biometrics for National Security and Law Enforcement). He has authored and co-authored over 170 journal and conference publications, including several book chapters in the area of biometrics, and served as an area editor of Springer's Encyclopedia of Biometrics. He helped develop the IEEE Certified Biometrics Professional (CBP) program and served on the main steering committee of the IEEE CBP program. His achievements include leading the R&D in CMU's past participation in NIST's Face Recognition Grand Challenge 2005 (CMU ranked #1 among academia and industry in the hardest experiment, #4 open challenge) and in NIST's Iris Challenge Evaluation (CMU ranked #1 among academia and #2 against iris vendors). Prof. Savvides is listed in Marquis Who's Who in America and in Marquis Who's Who in Science & Engineering. He has filed over 20 patent applications in the area of biometrics and is the recipient of CMU's 2009 Carnegie Institute of Technology (CIT) Outstanding Faculty Research Award.
