Optimized structure learning of Bayesian Network for investigating causation of vehicles’ on-road crashes
Introduction
A microscopic road transportation system is complex because it involves factors from multiple aspects, such as human, vehicle, and environment [21]. A reliable microscopic transportation system ensures the safety of a single vehicle or a cluster of vehicles on the road. A vehicle crash can be seen as a failure of the microscopic transportation system, which can cause serious damage to property and great loss of life. As reported by the National Highway Traffic Safety Administration [1], there were nearly 6,452,000 motor vehicle crashes and 37,247 motoring fatalities in the US in 2017. Moreover, such a failure can affect the reliability of the macroscopic road transportation system. For example, a serious vehicle crash might lead to congestion and reduce the traffic efficiency of the road network. Thus, investigating the causal relationships between the contributing factors and vehicles’ crashes is important for the remediation of crashes and the enhancement of road transportation reliability.
In recent years, previous studies have focused on contributing factors from various aspects, such as vehicles’ kinetic characteristics (e.g. position, velocity, etc., extracted from trajectory) [15,55,69,72], driving behaviors (e.g. improper driving maneuver, aggressive driving, etc.) [4,44,71], surrounding environment (e.g. weather, road condition, etc.) [29,48,75,82], human factors (e.g. physiology, psychology, social background, etc.) [3,5,25,26], and transportation infrastructure [8,83]. Since a vehicle crash can be regarded as a systematic failure that usually results from the effect of multiple factors rather than a single mistake [62], many researchers have attempted to synthesize factors from multiple aspects when investigating a vehicle crash [12,28,79]. With the rapid development of information technology, data with an enormous number of features are becoming increasingly available. Meanwhile, how to better exploit such data remains a persistent challenge for researchers who intend to understand the underlying causes of a crash or precisely predict a crash in a data-driven way.
The development of data science and machine learning has made data-driven techniques popular in research on driving safety and crash risk prevention. Gitelman et al. [29] investigated the relationship between the driving events collected by in-vehicle data recorders, road factors and crashes, and identified high-risk locations on the road network. Yang et al. [79] developed a real-time crash evaluation model for urban expressways using a Bayesian dynamic logistic regression method based on in-field streaming traffic data. Bao et al. [6] explored the contributions of trip pattern features extracted from a large-scale taxi GPS database to spatially aggregated crashes in urban areas. Xu et al. [77] applied a four-stage random-parameters sequential logistic regression model to explore the relation between the probability of crash casualty and multiple real-time factors. Yang et al. [78] investigated time-dependent safety performance using dangerous driving event data captured by smartphones. Xie et al. [76] employed a deep learning model trained on empirical lane-changing data to predict a vehicle's lane-changing maneuver. Shi et al. [69] developed a feature learning method that can evaluate car-following risk on the basis of a vehicles’ trajectory dataset. Osman et al. [53] applied a hierarchical machine learning classification method to identify the types of secondary tasks that drivers are engaged in based on driving behavior parameters. However, in order to focus on a specific problem or simplify method validation, most previous studies considered only features within a limited scope. Synthesizing features that belong to multiple categories could provide deeper insight into the causation of vehicles’ crashes.
As a state-of-the-art machine learning technique, a Bayesian Network (BN) is a robust probabilistic model with a graphical structure, which represents a set of features and specifies their conditional dependencies using a Directed Acyclic Graph (DAG) [56]. BNs have been widely applied in risk assessment given their capability to conduct comprehensive and precise analysis of a sophisticated system [38]. Several previous studies have applied BNs to investigate transportation systems and enhance transportation safety. Chen et al. [13] used a BN to explicitly investigate the statistical associations between injury severity outcomes and explanatory factors in rear-end crashes. Mbakwe et al. [49] combined the Delphi technique and a BN to assess highway traffic safety in several developing countries. Chen et al. [12] proposed a probabilistic decision-making framework for a rear-end collision avoidance system based upon a BN with major collision-causing factors. Also, various novel extensions of BNs have been developed to better explain systematic failures and conduct risk assessment. El-Awady and Ponnambalam [24] employed simulation and Markov Chains to assist BN reasoning for probabilistic failure analysis of complex systems. Huang et al. [36] developed a BN-K2 Algorithm-Expectation Maximization approach to measure the intensity of coupling influence between systematic failures and describe the propagation chains of the failures. Ref. [61] proposed a Noisy-or Gate BN model combining the Noisy-or Gate model and Naive BN to better identify the joint probability distribution of a target risk system. Guo et al. [31] built a discrete-time BN model by considering the impact of common cause failures on system reliability. Zywiec et al. [84] developed a novel methodology to incorporate a neural network metamodel into BN-based probabilistic risk assessment for industrial facilities. Yin et al. [81] proposed a hybrid knowledge-based and data-driven approach to construct a BN for studying the resilience of urban rail systems. Nevertheless, as the collection of various features becomes feasible, researchers face challenges such as how to handle input data with numerous features and how to optimize the BN structure given those features.
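To make the DAG-based definition above concrete, the following minimal sketch shows how a BN factorizes a joint distribution into one conditional probability per node given its parents, and how a query can be answered by enumeration. The three-node network (Rain, WetRoad, Crash) and all probability values are hypothetical illustration values; they are not taken from this study or its dataset.

```python
from itertools import product

# Hypothetical network: Rain -> WetRoad -> Crash (illustrative only).
parents = {"Rain": [], "WetRoad": ["Rain"], "Crash": ["WetRoad"]}

# cpt[node][parent_values] = P(node = True | parents); made-up numbers.
cpt = {
    "Rain": {(): 0.2},
    "WetRoad": {(True,): 0.9, (False,): 0.1},
    "Crash": {(True,): 0.05, (False,): 0.01},
}


def joint_probability(assignment):
    """P(assignment) as the product of P(node | parents) over all nodes."""
    prob = 1.0
    for node, pa in parents.items():
        p_true = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_true if assignment[node] else 1.0 - p_true
    return prob


def probability(query, evidence=None):
    """P(query | evidence) by brute-force enumeration of all assignments."""
    evidence = evidence or {}
    names = list(parents)
    numerator = denominator = 0.0
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint_probability(assignment)
        denominator += p
        if all(assignment[k] == v for k, v in query.items()):
            numerator += p
    return numerator / denominator


p_crash = probability({"Crash": True})
p_rain_given_crash = probability({"Rain": True}, {"Crash": True})
print(p_crash, p_rain_given_crash)
```

Enumeration is exponential in the number of nodes, so it is only suitable for toy networks like this one; practical BN tools rely on more efficient inference algorithms.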
This study is aimed at addressing the research gaps inherent in the previous literature, which are summarized as follows:
- (1) Most data-driven research on vehicles’ crashes has been conducted on datasets with limited feature categories. With the development of the information industry, collecting data with multiple categories of features has become feasible. Hence, how to properly process a variety of features considering both feature-to-feature and category-to-category relationships poses an emerging challenge for research into vehicles’ crashes.
- (2) Data with a large number of features always include noisy features, which could result in a serious overfitting problem and degrade the prediction performance of a model trained on such data. However, when investigating vehicles’ crash risk, few studies have attempted to effectively remove the noisy features from the dataset before training a model (especially a BN). Hence, how to precisely identify key features is a problem to be resolved in this study, which we believe is critical to data-driven research on vehicles’ crashes.
- (3) In most studies, the structure of a BN is determined in one of two ways: first, pre-defined by prior knowledge; and second, identified by a conventional structure learning method based on probabilistic evidence. The former BN is easy to interpret but always complex in structure, while the latter BN has satisfactory structure performance but usually lacks interpretability and rationality. Therefore, it is necessary to develop a structure learning method that makes a trade-off between the complexity and the interpretability of the BN structure.
To bridge the above research gaps, this study develops an optimized structure learning method to construct BN. The generated BN can be used to investigate the causal relationships between the features that have potential contributions to vehicles’ on-road risky events (i.e., near-crash and crash) and the causation of a risky event. As shown in Fig. 1, the research framework of this study has three phases, namely, data preprocessing, BN structure learning, and application and evaluation. Phase 1 is proposed to build a dataset of the samples (i.e. events) with candidate features and event label (i.e. non-crash, near-crash, and crash) of each sample in preparation for BN construction. Phase 2 develops the optimized structure learning method which is used to construct the candidate BNs based upon multiple feature categories and identify the optimal BN from the candidate BNs according to their performances. In Phase 3, a naturalistic driving database is applied for a case study, in which method comparisons are conducted to verify the performances of the proposed structure learning method and the causal inferences between the key features and the event are performed based upon the optimal BN.
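The idea of selecting an optimal BN from a set of candidate structures can be illustrated with a generic score-based comparison. The sketch below is not the method proposed in this study: it assumes the Bayesian Information Criterion (BIC) as the selection score, binary features, and a small hand-made dataset in which `crash` depends on `speeding` but not on `wet_road`. All feature names, counts, and candidate structures are hypothetical.

```python
import math
from collections import Counter


def make_rows():
    """Build a toy dataset; counts are per wet_road value (made-up numbers)."""
    counts = {(1, 1): 15, (1, 0): 5, (0, 1): 3, (0, 0): 27}
    rows = []
    for (speeding, crash), n in counts.items():
        for wet_road in (0, 1):
            row = {"speeding": speeding, "wet_road": wet_road, "crash": crash}
            rows += [row] * n
    return rows


def bic_score(data, structure):
    """BIC = maximized log-likelihood - (number of parameters / 2) * ln(N)."""
    log_lik = 0.0
    n_params = 0
    for node, pa in structure.items():
        joint = Counter((tuple(r[p] for p in pa), r[node]) for r in data)
        parent_counts = Counter(tuple(r[p] for p in pa) for r in data)
        for (pa_vals, _), count in joint.items():
            log_lik += count * math.log(count / parent_counts[pa_vals])
        # Binary node: one free parameter per parent configuration.
        n_params += 2 ** len(pa)
    return log_lik - 0.5 * n_params * math.log(len(data))


data = make_rows()
candidates = {
    "no_edges": {"speeding": [], "wet_road": [], "crash": []},
    "speeding_only": {"speeding": [], "wet_road": [], "crash": ["speeding"]},
    "speeding_and_wet_road": {
        "speeding": [],
        "wet_road": [],
        "crash": ["speeding", "wet_road"],
    },
}
scores = {name: bic_score(data, dag) for name, dag in candidates.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

The penalty term is what expresses the complexity side of the trade-off: an extra parent that adds parameters without improving the likelihood lowers the score.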
This study makes two major contributions, from the perspectives of methodology and application, respectively. First, this study develops a structure learning method that better integrates prior knowledge with the statistical information collected from large-scale data to construct an optimal BN. As mentioned in Section 1.2, the conventional methods of BN construction depend merely on prior knowledge or probabilistic evidence, so they always fail to attain a trade-off between complexity and interpretability. By encapsulating advanced machine learning techniques, the proposed structure learning method takes advantage of both prior knowledge and statistical information and resolves the above limitations of the conventional methods. Also, in this study, we compare the proposed structure learning method with several state-of-the-art structure learning methods developed in recent years, and thereby verify the superiority of the BN generated by the proposed method in both prediction performance and rationality. Therefore, the proposed structure learning method addresses the above research gaps in safety engineering by providing a comprehensive approach to precisely investigate the causality of hazards from multiple information sources. Second, in this study, we conduct causal inferences based on the optimal BN generated by the proposed structure learning method to explain the causation of a vehicle's on-road risky event in a probabilistic way. Several interesting findings are reported and discussed in this paper, which can provide insights into the causality of risky driving maneuvers on the road and further enhance road transportation safety. Moreover, future work could be conducted to further verify the findings and incorporate them into automated technology such as Advanced Driver-Assistance Systems (ADAS).
This paper is organized as follows. Section 2 introduces the procedures of data preprocessing (i.e. Phase 1) and the procedures of developing the optimized BN structure learning method (i.e. Phase 2). Section 3 contains the network validation results, method comparison results and several interesting findings of the case study (i.e. Phase 3). The steps of each phase (as shown in Fig. 1) are introduced in detail within Sections 2 and 3, respectively. Section 4 covers the conclusions, limitations, and future work of this study.
Data preprocessing
As introduced in Fig. 1, the data preprocessing phase has three steps. First, candidate feature extraction is composed of two sub-steps, namely, data cleaning and feature categorization. Data cleaning detects and eliminates noise from the dataset to improve the quality of the data. Herein, incomplete features with missing data as well as duplicate features are identified and removed from the dataset. A Savitzky-Golay filter [66] is employed to eliminate potential noise from the
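As an illustration of the Savitzky-Golay smoothing mentioned in this snippet, the sketch below applies the standard 5-point quadratic Savitzky-Golay kernel, with coefficients (-3, 12, 17, 12, -3)/35, to a short signal. The window length, polynomial order, edge handling, and the `speed` trace are assumptions made for illustration; the settings used in this study are not stated in the excerpt.

```python
# Standard tabulated coefficients for 5-point quadratic Savitzky-Golay smoothing.
SG5_QUADRATIC = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_smooth5(signal):
    """Smooth interior samples with a 5-point quadratic Savitzky-Golay kernel.

    The first and last two samples are copied unchanged, which is a
    simplification of the edge handling used by full implementations.
    """
    smoothed = list(signal)
    for i in range(2, len(signal) - 2):
        window = signal[i - 2 : i + 3]
        smoothed[i] = sum(c * x for c, x in zip(SG5_QUADRATIC, window))
    return smoothed


# Hypothetical speed trace with a noise spike at index 3.
speed = [20.0, 20.5, 21.0, 25.0, 22.0, 22.5, 23.0]
print(savgol_smooth5(speed))
```

Because the kernel comes from a local least-squares quadratic fit, it leaves any quadratic trend unchanged while attenuating isolated spikes, which is why this family of filters is common for cleaning kinematic signals.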
Data source
This study employs a dataset extracted from the Second Strategic Highway Research Program (SHRP2) Naturalistic Driving Study (NDS) database for validation. The SHRP2 NDS collected a total of 5,512,900 naturalistic driving trips from nearly 3,400 participant drivers in the United States between 2010 and 2013. The participant vehicles were equipped with a data acquisition system (DAS) to collect vehicles’ information from multiple aspects [33]. Besides, the SHRP2 NDS database also includes driver's
Summary and conclusions
This study proposes an optimized structure learning method for constructing BN, upon which the causal relationships between contributing features and vehicles’ on-road risky events (i.e. near-crash and crash) are investigated. The framework of this study comprises three phases, namely, data preprocessing, BN structure learning, and case study. Data preprocessing aims at preparing the dataset for BN construction. In this phase, candidate features are extracted and classified into seven
CRediT authorship contribution statement
Tianyi Chen: Conceptualization, Methodology, Validation, Formal analysis, Writing – original draft, Writing – review & editing. Yiik Diew Wong: Supervision, Conceptualization, Writing – review & editing. Xiupeng Shi: Conceptualization, Formal analysis. Xueqin Wang: Formal analysis, Writing – review & editing.
Declaration of Competing Interest
There are no conflicts of interest associated with this study.
Acknowledgments
This paper presents a part of the first author's PhD research.
References (87)
- et al. Multilevel analysis of the role of human factors in regional disparities in crash outcomes. Accid. Anal. Prev. (2017)
- et al. How instantaneous driving behavior contributes to crashes at intersections: extracting useful information from connected vehicle message data. Accid. Anal. Prev. (2019)
- et al. Crash prediction with behavioral and physiological features for advanced vehicle collision avoidance system. Transp. Res. Part C Emerg. Technol. (2017)
- et al. Understanding the effects of trip patterns on spatially aggregated crashes with large-scale taxi GPS data. Accid. Anal. Prev. (2018)
- et al. Integer linear programming for the Bayesian Network structure learning problem. Artif. Intell. (2017)
- et al. The role of transportation infrastructure on the impact of natural hazards on communities. Reliab. Eng. Syst. Saf. (2022)
- et al. Improving algorithms for structure learning in Bayesian Networks using a new implicit score. Expert Syst. Appl. (2010)
- et al. A multinomial logit model-Bayesian Network hybrid approach for driver injury severity analyses in rear-end crashes. Accid. Anal. Prev. (2015)
- et al. Key feature selection and risk prediction for lane-changing behaviors based on vehicles’ trajectory data. Accid. Anal. Prev. (2019)
- et al. Predicting lane-changing risk level based on vehicles’ space-series features: a pre-emptive learning approach. Transp. Res. Part C Emerg. Technol. (2020)
- A data-driven feature learning approach based on Copula-Bayesian Network and its application in comparative investigation on risky lane-changing and car-following maneuvers. Accid. Anal. Prev.
- Integration of simulation and Markov Chains to support Bayesian Networks for probabilistic failure analysis of complex systems. Reliab. Eng. Syst. Saf.
- Sleep-related crash characteristics: implications for applying a fatigue definition to crash reports. Accid. Anal. Prev.
- The effects of driver fatigue, gender, and distracted driving on perceived and observed aggressive driving behavior: a correlated grouped random parameters bivariate probit approach. Anal. Methods Accid. Res.
- Quantifying regional heterogeneity effect on drivers’ speeding behavior using SHRP2 naturalistic driving data: a multilevel modeling approach. Transp. Res. Part C Emerg. Technol.
- Exploring relationships between driving events identified by in-vehicle data recorders, infrastructure characteristics and road crashes. Transp. Res. Part C Emerg. Technol.
- A discrete-time Bayesian Network approach for reliability analysis of dynamic systems with common cause failures. Reliab. Eng. Syst. Saf.
- An on-road assessment of cognitive distraction: impacts on drivers’ visual behavior and braking performance. Accid. Anal. Prev.
- A hybrid approach for identifying the structure of a Bayesian Network model. Expert Syst. Appl.
- Operational failure analysis of high-speed electric multiple units: a Bayesian Network-K2 algorithm-expectation maximization approach. Reliab. Eng. Syst. Saf.
- Applications of Bayesian Networks and Petri nets in safety, reliability, and risk assessments: a review. Saf. Sci.
- Wrappers for feature subset selection. Artif. Intell.
- A framework for evaluating aggressive driving behaviors based on in-vehicle driving records. Transp. Res. Part F Traffic Psychol. Behav.
- The global k-means clustering algorithm. Pattern Recognit.
- A parallel algorithm for Bayesian Network structure learning from large data sets. Knowl. Based Syst.
- Accident risk of road and weather conditions on different road types. Accid. Anal. Prev.
- Alternative method of highway traffic safety analysis for developing countries using Delphi technique and Bayesian Network. Accid. Anal. Prev.
- Exploring issues of training data imbalance and mislabelling on random forest performance for large area land cover classification using the ensemble margin. ISPRS J. Photogramm. Remote Sens.
- A hierarchical machine learning classification approach for secondary task identification from observed driving behavior data. Accid. Anal. Prev.
- Modeling risks in dependent systems: a Copula-Bayesian approach. Reliab. Eng. Syst. Saf.
- Development of a lane change risk index using vehicle trajectory data. Accid. Anal. Prev.
- Embedded local feature selection within mixture of experts. Inf. Sci.
- A methodology to model causal relationships on offshore safety assessment focusing on human and organizational factors. J. Saf. Res.
- A feature learning approach based on XGBoost for driving assessment and risk prediction. Accid. Anal. Prev.
- Driving safety efficiency benchmarking using smartphone data. Transp. Res. Part C Emerg. Technol.
- Quasi-vehicle-trajectory-based real-time safety analysis for expressways. Transp. Res. Part C Emerg. Technol.
- Driving risk assessment using near-crash database through data mining of tree-based model. Accid. Anal. Prev.
- Effects of environment, vehicle and driver characteristics on risky driving behavior at work zones. Saf. Sci.
- Effects of crash warning systems on rear-end crash avoidance behavior under fog conditions. Transp. Res. Part C Emerg. Technol.
- A data-driven lane-changing model based on deep learning. Transp. Res. Part C Emerg. Technol.
- Quantitative risk assessment of freeway crash casualty using high-resolution traffic data. Reliab. Eng. Syst. Saf.
- Modeling of time-dependent safety performance using anonymized and aggregated smartphone-based dangerous driving event data. Accid. Anal. Prev.
- A Bayesian dynamic updating approach for urban expressway real-time crash risk evaluation. Transp. Res. Part C Emerg. Technol.