Optimized structure learning of Bayesian Network for investigating causation of vehicles’ on-road crashes

https://doi.org/10.1016/j.ress.2022.108527

Highlights

  • A structure learning method is proposed to construct a Bayesian Network of vehicle crashes.

  • A robust feature selection method is applied to handle data with numerous features.

  • The generated Bayesian Network strikes a trade-off between complexity and interpretability.

  • The superior performance of the proposed structure learning method is verified.

  • Causal inferences on key features are conducted upon the generated Bayesian Network.

Abstract

A vehicle crash can be seen as a failure of the microscopic road transportation system. The causal investigation of vehicle crashes, which is of significance to road traffic safety, has drawn much attention from academia and industry alike. This study develops a structure learning method to construct a Bayesian Network (BN). The BN generated by the method can comprehensively illustrate the causal relationships between risk-contributing features and vehicles’ on-road risky events (i.e., near-crash and crash). The proposed structure learning method has three advantages: (1) it considers multiple categories of features; (2) it applies a robust feature selection method to improve prediction performance and facilitate the explanation of causation; and (3) it makes a trade-off between the complexity and interpretability of the BN structure. The method is applied to the Second Strategic Highway Research Program (SHRP2) Naturalistic Driving Study (NDS) database as a case study. The results show that the generated optimal BN achieves satisfactory performance in both structure complexity and prediction accuracy. Compared with the BNs built by other state-of-the-art structure learning methods, the optimal BN also presents superior causal interpretability. Finally, by performing causal inferences upon the optimal BN, this study examines and analyzes the contributions of several key features to the risky events. Several findings about these contributions are reported, which could provide valuable references for future road safety engineering.

Introduction

The microscopic road transportation system is complex because it involves factors from multiple aspects, such as human, vehicle, and environment [21]. A reliable microscopic transportation system ensures the safety of a single vehicle or a cluster of vehicles on the road. A vehicle crash can be seen as a failure of the microscopic transportation system, which can cause serious damage to property and great loss of life. As reported by the National Highway Traffic Safety Administration [1], in 2017, there were nearly 6,452,000 motor vehicle crashes and 37,247 motoring fatalities in the US. Moreover, such a failure can affect the reliability of the macroscopic road transportation system. For example, a serious vehicle crash might lead to congestion and reduce the traffic efficiency of the road network. Thus, investigating the causal relationships between the contributing factors and vehicle crashes is important for the remediation of crashes and the enhancement of road transportation reliability.

In recent years, previous studies have focused on contributing factors from various aspects, such as vehicles’ kinetic characteristics (e.g., position and velocity extracted from trajectories) [15,55,69,72], driving behaviors (e.g., improper driving maneuvers and aggressive driving) [4,44,71], surrounding environment (e.g., weather and road condition) [29,48,75,82], human factors (e.g., physiology, psychology, and social background) [3,5,25,26], and transportation infrastructure [8,83]. Since a vehicle crash can be regarded as a systematic failure that usually results from the effect of multiple factors rather than a single mistake [62], many researchers have attempted to synthesize factors from multiple aspects when investigating a vehicle crash [12,28,79]. With the rapid development of information technology, data with an enormous number of features have become increasingly available. Meanwhile, how to better exploit such data remains a persistent challenge for researchers who intend to understand the underlying causes of a crash or precisely predict a crash in a data-driven way.

The development of data science and machine learning has made data-driven techniques popular in the research area of driving safety and crash risk prevention. Gitelman et al. [29] investigated the relationship between the driving events collected by in-vehicle data recorders, road factors, and crashes, and identified high-risk locations on the road network. Yang et al. [79] developed a real-time crash evaluation model for urban expressways using a Bayesian dynamic logistic regression method based on in-field streaming traffic data. Bao et al. [6] explored the contributions of the trip pattern features extracted from a large-scale taxi GPS database to the spatially aggregated crashes in urban areas. Xu et al. [77] applied a four-stage random-parameters sequential logistic regression model to explore the relation between the probability of crash casualty and real-time multiple factors. Yang et al. [78] investigated time-dependent safety performance by using the dangerous driving event data captured by smartphones. Xie et al. [76] employed a deep learning model trained on empirical lane-changing data to predict vehicles’ lane-changing maneuvers. Shi et al. [69] developed a feature learning method that can evaluate car-following risk on the basis of a vehicle trajectory dataset. Osman et al. [53] applied a hierarchical machine learning classification method to identify the types of secondary tasks that drivers are engaged in based on driving behavior parameters. However, in order to focus on a specific problem or simplify the method validation, most previous studies considered only the features within a limited scope. Synthesizing the features belonging to multiple categories could provide a deeper insight into the causation of vehicle crashes.

As a state-of-the-art machine learning technique, the Bayesian Network (BN) is a robust probabilistic model with a graphical structure, which represents a set of features and specifies their conditional dependencies using a Directed Acyclic Graph (DAG) [56]. BNs have been widely applied in risk assessment owing to their capability of conducting comprehensive and precise analysis of a sophisticated system [38]. Several previous studies have applied BNs to investigate transportation systems and enhance transportation safety. Chen et al. [13] used a BN to explicitly investigate the statistical associations between injury severity outcomes and explanatory factors in rear-end crashes. Mbakwe et al. [49] combined the Delphi technique and BN to assess highway traffic safety in several developing countries. Chen et al. [12] proposed a probabilistic decision-making framework for a rear-end collision avoidance system based upon a BN with major collision-causing factors. Also, various novel extensions of BNs have been developed to better explain systematic failures and conduct risk assessment. El-Awady and Ponnambalam [24] employed simulation and Markov Chains to assist BN reasoning for probabilistic failure analysis of complex systems. Huang et al. [36] developed a BN-K2 Algorithm-Expectation Maximization approach to measure the intensity of coupling influence between systematic failures and describe the propagation chains of the failures. Ref. [61] proposed a Noisy-or Gate BN model combining the Noisy-or Gate model and Naive BN to better identify the joint probability distribution of a target risk system. Guo et al. [31] built a discrete-time BN model by considering the impact of common cause failures on system reliability. Zywiec et al. [84] developed a novel methodology to incorporate a neural network metamodel into BN-based probabilistic risk assessment for industrial facilities. Yin et al. [81] proposed a hybrid knowledge-based and data-driven approach to construct a BN for studying the resilience of urban rail systems. Nevertheless, now that numerous features can be collected, researchers face challenges such as how to handle input data with numerous features and how to optimize the BN structure given those features.
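To make the DAG-based definition concrete, the following minimal sketch encodes a three-node BN and answers a probabilistic query by enumeration. The node names (weather, speeding, risky) and every probability value are hypothetical and chosen only for illustration; they are not features or results from this study. The joint distribution factorizes along the DAG as P(weather) · P(speeding | weather) · P(risky | weather, speeding).

```python
from itertools import product

# Hypothetical DAG: weather -> speeding, and (weather, speeding) -> risky.
# All numbers below are made-up illustrative values, not results from the paper.
P_ADVERSE_WEATHER = {True: 0.2, False: 0.8}
# P(speeding | weather): outer key is weather, inner key is speeding.
P_SPEEDING_GIVEN_WEATHER = {
    True: {True: 0.1, False: 0.9},
    False: {True: 0.3, False: 0.7},
}
# P(risky = True | weather, speeding).
P_RISKY_GIVEN_PARENTS = {
    (True, True): 0.30,
    (True, False): 0.10,
    (False, True): 0.15,
    (False, False): 0.02,
}


def joint(weather, speeding, risky):
    """Joint probability as the product of each node's conditional table."""
    p_risky = P_RISKY_GIVEN_PARENTS[(weather, speeding)]
    return (
        P_ADVERSE_WEATHER[weather]
        * P_SPEEDING_GIVEN_WEATHER[weather][speeding]
        * (p_risky if risky else 1.0 - p_risky)
    )


def posterior_risky(evidence):
    """P(risky = True | evidence) by summing the joint over all assignments.

    `evidence` maps any of 'weather' / 'speeding' to an observed boolean.
    """
    numerator = 0.0
    denominator = 0.0
    for weather, speeding, risky in product([True, False], repeat=3):
        assignment = {"weather": weather, "speeding": speeding, "risky": risky}
        if any(assignment[name] != value for name, value in evidence.items()):
            continue  # inconsistent with the observed evidence
        p = joint(weather, speeding, risky)
        denominator += p
        if risky:
            numerator += p
    return numerator / denominator


if __name__ == "__main__":
    print("P(risky):", posterior_risky({}))
    print("P(risky | speeding):", posterior_risky({"speeding": True}))
```

Enumeration is exponential in the number of nodes, so it is suitable only for toy networks like this one; practical BN tools rely on more efficient inference algorithms.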

This study is aimed at addressing the research gaps inherent in the previous literature, which are summarized as follows:

  • (1) Most data-driven research on vehicle crashes has been conducted on datasets with limited feature categories. With the development of the information industry, the collection of data with multiple categories of features has become feasible. Hence, how to properly process a variety of features considering both feature-to-feature and category-to-category relationships poses an emerging challenge for research into vehicle crashes.

  • (2) Data with a large number of features typically include noisy features, which can cause serious overfitting and degrade the prediction performance of a model trained on such data. However, when investigating vehicle crash risk, few studies have attempted to effectively remove the noisy features from the dataset before training a model (especially a BN). Hence, how to precisely identify key features is a problem to be resolved in this study, which we believe is critical to data-driven research on vehicle crashes.

  • (3) In most studies, the structure of a BN is determined in one of two ways: first, pre-defined by prior knowledge; and second, identified by a conventional structure learning method based on probabilistic evidence. The former BN is easy to interpret but always complex in structure, while the latter BN has satisfactory structure performance but usually lacks interpretability and rationality. Therefore, it is necessary to develop a structure learning method that makes a trade-off between the complexity and interpretability of the BN structure.
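One common way to quantify the structural complexity mentioned in gap (3) is to count the free parameters of a discrete BN and penalize them in a score such as the Bayesian Information Criterion (BIC). The sketch below is an illustration of that standard idea only; this excerpt does not state which complexity measure or score the proposed method uses, and the feature names and cardinalities are hypothetical.

```python
import math


def free_parameters(cardinality, parents):
    """Number of free parameters of a discrete BN.

    Each node with r states and q joint parent configurations
    contributes (r - 1) * q parameters.
    """
    total = 0
    for node, r in cardinality.items():
        q = math.prod(cardinality[p] for p in parents.get(node, ()))
        total += (r - 1) * q
    return total


def bic_score(log_likelihood, cardinality, parents, n_samples):
    """BIC = log-likelihood minus (log N / 2) times the parameter count."""
    penalty = 0.5 * math.log(n_samples) * free_parameters(cardinality, parents)
    return log_likelihood - penalty


# Hypothetical features; "event" has three labels (non-crash, near-crash, crash).
cardinality = {"weather": 2, "speeding": 2, "event": 3}
sparse = {"event": ["speeding"]}
dense = {"speeding": ["weather"], "event": ["weather", "speeding"]}

if __name__ == "__main__":
    print("sparse structure parameters:", free_parameters(cardinality, sparse))
    print("dense structure parameters:", free_parameters(cardinality, dense))
```

For the same log-likelihood, the denser structure receives a lower BIC, which is how a score-based learner trades goodness of fit against complexity.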

To bridge the above research gaps, this study develops an optimized structure learning method to construct BN. The generated BN can be used to investigate the causal relationships between the features that have potential contributions to vehicles’ on-road risky events (i.e., near-crash and crash) and the causation of a risky event. As shown in Fig. 1, the research framework of this study has three phases, namely, data preprocessing, BN structure learning, and application and evaluation. Phase 1 is proposed to build a dataset of the samples (i.e. events) with candidate features and event label (i.e. non-crash, near-crash, and crash) of each sample in preparation for BN construction. Phase 2 develops the optimized structure learning method which is used to construct the candidate BNs based upon multiple feature categories and identify the optimal BN from the candidate BNs according to their performances. In Phase 3, a naturalistic driving database is applied for a case study, in which method comparisons are conducted to verify the performances of the proposed structure learning method and the causal inferences between the key features and the event are performed based upon the optimal BN.

This study makes two major contributions, from the perspectives of methodology and application, respectively. First, this study develops a structure learning method that better integrates prior knowledge with the statistical information collected from large-scale data to construct an optimal BN. As mentioned in Section 1.2, the conventional methods of BN construction depend on either prior knowledge or probabilistic evidence alone, which often prevents them from attaining a trade-off between complexity and interpretability. By encapsulating advanced machine learning techniques, the proposed structure learning method takes advantage of both prior knowledge and statistical information and resolves the above limitations of the conventional methods. We also compare the proposed structure learning method with several state-of-the-art structure learning methods developed in recent years, and thereby verify the superiority of the BN generated by the proposed method in both prediction performance and rationality. Therefore, the proposed structure learning method addresses the above research gaps in safety engineering by providing a comprehensive approach to precisely investigate the causality of hazards from multiple information sources. Second, we conduct causal inferences based on the optimal BN generated by the proposed structure learning method to explain the causation of a vehicle’s on-road risky event in a probabilistic way. Several findings are reported and discussed in this paper, which can provide insights into the causality of risky driving maneuvers on the road and further enhance road transportation safety. Moreover, future work could further verify the findings and incorporate them into automated technology such as Advanced Driver-Assistance Systems (ADAS).

This paper is organized as follows. Section 2 introduces the procedures of data preprocessing (i.e. Phase 1) and the procedures of developing the optimized BN structure learning method (i.e. Phase 2). Section 3 contains the network validation results, method comparison results and several interesting findings of the case study (i.e. Phase 3). The steps of each phase (as shown in Fig. 1) are introduced in detail within Sections 2 and 3, respectively. Section 4 covers the conclusions, limitations, and future work of this study.

Section snippets

Data preprocessing

As introduced in Fig. 1, the data preprocessing phase has three steps. First, candidate feature extraction is composed of two sub-steps, namely, data cleaning and feature categorization. Data cleaning detects and eliminates noise from the dataset to improve data quality. Herein, incomplete features with missing data as well as duplicate features are identified and removed from the dataset. A Savitzky-Golay filter [66] is employed to eliminate potential noises from the
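The two cleaning operations named above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' pipeline: the column layout (a dict of feature name to list of values, with None marking missing data), the example feature names, and the choice of a 5-point quadratic Savitzky-Golay window are all hypothetical, since the snippet does not give the filter settings or the signals being filtered.

```python
def drop_incomplete_and_duplicate(columns):
    """Keep features with no missing values, dropping exact duplicate columns.

    `columns` maps a feature name to its list of values; None marks missing data.
    The first occurrence of a duplicated column is retained.
    """
    kept = {}
    seen = set()
    for name, values in columns.items():
        if any(v is None for v in values):
            continue  # incomplete feature
        key = tuple(values)
        if key in seen:
            continue  # duplicate of an earlier feature
        seen.add(key)
        kept[name] = values
    return kept


# Savitzky-Golay smoothing weights for a 5-point window and a quadratic fit;
# the weights sum to the normalizer 35.
SG_WEIGHTS = (-3, 12, 17, 12, -3)
SG_NORMALIZER = 35.0


def savgol_5pt_quadratic(signal):
    """Smooth interior points; the two samples at each end are left unchanged."""
    smoothed = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2:i + 3]
        smoothed[i] = sum(w * x for w, x in zip(SG_WEIGHTS, window)) / SG_NORMALIZER
    return smoothed


if __name__ == "__main__":
    raw = {
        "speed": [10.0, 10.5, 30.0, 11.0, 11.2, 11.5],  # hypothetical trace with a spike
        "speed_copy": [10.0, 10.5, 30.0, 11.0, 11.2, 11.5],
        "gap": [20.0, None, 19.0, 18.5, 18.0, 17.5],
    }
    cleaned = drop_incomplete_and_duplicate(raw)
    print(sorted(cleaned))
    print(savgol_5pt_quadratic(cleaned["speed"]))
```

Where SciPy is available, `scipy.signal.savgol_filter(x, window_length, polyorder)` provides the same kind of smoothing with configurable window and edge handling.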

Data source

This study employs a dataset extracted from the Second Strategic Highway Research Program (SHRP2) Naturalistic Driving Study (NDS) database for validation. The SHRP2 NDS collected a total of 5,512,900 naturalistic driving trips from nearly 3,400 participant drivers in the United States between 2010 and 2013. The participant vehicles were equipped with a data acquisition system (DAS) to collect vehicle information from multiple aspects [33]. Besides, the SHRP2 NDS database also includes driver's

Summary and conclusions

This study proposes an optimized structure learning method for constructing BN, upon which the causal relationships between contributing features and vehicles’ on-road risky events (i.e. near-crash and crash) are investigated. The framework of this study comprises three phases, namely, data preprocessing, BN structure learning, and case study. Data preprocessing aims at preparing the dataset for BN construction. In this phase, candidate features are extracted and classified into seven

CRediT authorship contribution statement

Tianyi Chen: Conceptualization, Methodology, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Yiik Diew Wong: Supervision, Conceptualization, Writing – review & editing. Xiupeng Shi: Conceptualization, Formal analysis. Xueqin Wang: Formal analysis, Writing – review & editing.

Declaration of Competing Interest

There are no conflicts of interest associated with this study.

Acknowledgments

This paper presents a part of the first author's PhD research.

References (87)

  • T. Chen et al.

    A data-driven feature learning approach based on Copula-Bayesian Network and its application in comparative investigation on risky lane-changing and car-following maneuvers

    Accid. Anal. Prev.

    (2021)
  • A. El-Awady et al.

    Integration of simulation and Markov Chains to support Bayesian Networks for probabilistic failure analysis of complex systems

    Reliab. Eng. Syst. Saf.

    (2021)
  • A.J. Filtness et al.

    Sleep-related crash characteristics: implications for applying a fatigue definition to crash reports

    Accid. Anal. Prev.

    (2017)
  • G. Fountas et al.

    The effects of driver fatigue, gender, and distracted driving on perceived and observed aggressive driving behavior: a correlated grouped random parameters bivariate probit approach

    Anal. Methods Accid. Res.

    (2019)
  • A. Ghasemzadeh et al.

    Quantifying regional heterogeneity effect on drivers’ speeding behavior using SHRP2 naturalistic driving data: a multilevel modeling approach

    Transp. Res. Part C Emerg. Technol.

    (2019)
  • V. Gitelman et al.

    Exploring relationships between driving events identified by in-vehicle data recorders, infrastructure characteristics and road crashes

    Transp. Res. Part C Emerg. Technol.

    (2018)
  • Y. Guo et al.

    A discrete-time Bayesian Network approach for reliability analysis of dynamic systems with common cause failures

    Reliab. Eng. Syst. Saf.

    (2021)
  • J.L. Harbluk et al.

    An on-road assessment of cognitive distraction: impacts on drivers’ visual behavior and braking performance

    Accid. Anal. Prev.

    (2007)
  • L. Huang et al.

    A hybrid approach for identifying the structure of a Bayesian Network model

    Expert Syst. Appl.

    (2019)
  • W. Huang et al.

    Operational failure analysis of high-speed electric multiple units: a Bayesian Network-K2 algorithm-expectation maximization approach

    Reliab. Eng. Syst. Saf.

    (2021)
  • S. Kabir et al.

    Applications of Bayesian Networks and petri nets in safety, reliability, and risk assessments: a review

    Saf. Sci.

    (2019)
  • R. Kohavi et al.

    Wrappers for feature subset selection

    Artif. Intell.

    (1997)
  • J. Lee et al.

    A framework for evaluating aggressive driving behaviors based on in-vehicle driving records

    Transp. Res. Part F Traffic Psychol. Behav.

    (2019)
  • A. Likas et al.

    The global k-means clustering algorithm

    Pattern Recognit.

    (2003)
  • A.L. Madsen et al.

    A parallel algorithm for Bayesian Network structure learning from large data sets

    Knowl. Based Syst.

    (2017)
  • F. Malin et al.

    Accident risk of road and weather conditions on different road types

    Accid. Anal. Prev.

    (2019)
  • A.C. Mbakwe et al.

    Alternative method of highway traffic safety analysis for developing countries using delphi technique and Bayesian Network

    Accid. Anal. Prev.

    (2016)
  • A. Mellor et al.

    Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin

    ISPRS J. Photogramm. Remote Sens.

    (2015)
  • O.A. Osman et al.

    A hierarchical machine learning classification approach for secondary task identification from observed driving behavior data

    Accid. Anal. Prev.

    (2019)
  • Y. Pan et al.

    Modeling risks in dependent systems: a Copula-Bayesian approach

    Reliab. Eng. Syst. Saf.

    (2019)
  • H. Park et al.

    Development of a lane change risk index using vehicle trajectory data

    Accid. Anal. Prev.

    (2018)
  • B. Peralta et al.

    Embedded local feature selection within mixture of experts

    Inf. Sci.

    (2014)
  • J. Ren et al.

    A methodology to model causal relationships on offshore safety assessment focusing on human and organizational factors

    J. Saf. Res.

    (2008)
  • X. Shi et al.

    A feature learning approach based on XGBoost for driving assessment and risk prediction

    Accid. Anal. Prev.

    (2019)
  • D.I. Tselentis et al.

    Driving safety efficiency benchmarking using smartphone data

    Transp. Res. Part C Emerg. Technol.

    (2019)
  • L. Wang et al.

    Quasi-vehicle-trajectory-based real-time safety analysis for expressways

    Transp. Res. Part C Emerg. Technol.

    (2019)
  • J. Wang et al.

    Driving risk assessment using near-crash database through data mining of tree-based model

    Accid. Anal. Prev.

    (2015)
  • J. Weng et al.

    Effects of environment, vehicle and driver characteristics on risky driving behavior at work zones

    Saf. Sci.

    (2012)
  • Y. Wu et al.

    Effects of crash warning systems on rear-end crash avoidance behavior under fog conditions

    Transp. Res. Part C Emerg. Technol.

    (2018)
  • D.F. Xie et al.

    A data-driven lane-changing model based on deep learning

    Transp. Res. Part C Emerg. Technol.

    (2019)
  • C. Xu et al.

    Quantitative risk assessment of freeway crash casualty using high-resolution traffic data

    Reliab. Eng. Syst. Saf.

    (2018)
  • D. Yang et al.

    Modeling of time-dependent safety performance using anonymized and aggregated smartphone-based dangerous driving event data

    Accid. Anal. Prev.

    (2019)
  • K. Yang et al.

    A Bayesian dynamic updating approach for urban expressway real-time crash risk evaluation

    Transp. Res. Part C Emerg. Technol.

    (2018)