Elsevier

Knowledge-Based Systems

Volume 205, 12 October 2020, 106314
A proactive decision support system for predicting traffic crash events: A critical analysis of imbalanced class distribution

https://doi.org/10.1016/j.knosys.2020.106314

Abstract

Real-time crash prediction plays a key role in enhancing traffic safety and in mitigating disruptions to road users. Further improvements in predictability require a systemic analysis of crash likelihood within the driver–vehicle–environment triptych. This study presents a proactive decision support system that predicts crash events based on vehicle kinematics, driver inputs, roadway geometric features and real-time weather data. Random Forest (RF), Support Vector Machine (SVM) and Multilayer Perceptron (MLP) machine learning techniques were applied to establish efficient crash predictions. Because crash events are generally unexpected and rare, classification results can show deceptively high prediction performance that is driven by the majority class at the expense of poor performance on the crucial minority class. This paper therefore adds to current knowledge by investigating crash likelihood while comparing three data balancing techniques for their effect on predictive performance: over-sampling, under-sampling and the synthetic minority over-sampling technique (SMOTE). The highest performance was obtained with the SMOTE strategy: MLP achieved 94.5% precision, 94.2% F1-score, 93.7% AUC and 95.3% recall, while SVM achieved a 91.5% g-mean. Furthermore, more than 62% of all crashes were recorded on downhills and curved downhills, and 44% of all crash instances were recorded under snow and rain weather conditions. Overall, the findings highlight the significance of the explanatory variables associated with potential crash events and can offer decision-makers a safe and credible system for enhancing traffic safety.
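
As an illustration of why accuracy alone can mislead on rare crash events, the metrics named in the abstract can be computed from confusion-matrix counts. The sketch below is not the authors' code: the function name and the counts are hypothetical, and the g-mean is assumed to be the geometric mean of sensitivity and specificity, which is the common definition but is not spelled out in this excerpt.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute imbalance-aware metrics from confusion-matrix counts.

    The positive class is taken to be the rare crash class.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # Assumed definition: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "g_mean": g_mean,
    }


# Hypothetical data: 1000 samples, 50 crashes, and a model that
# always predicts "no crash".
trivial = classification_metrics(tp=0, fp=0, tn=950, fn=50)
print(trivial)
```

With these hypothetical counts the always-negative model reaches 95% accuracy while its recall and g-mean are both zero, which is the kind of deceptive result the abstract warns about.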

Introduction

Traffic accidents are among the most serious problems facing societies today, with a detrimental influence at both the personal and the community level, resulting in health issues, economic losses and fatalities. The World Health Organization [1] reports that 1.35 million people die in road traffic crashes every year, and a further 20–50 million are injured or disabled worldwide. Road geometry and weather conditions have been found to have major impacts on the identification of crash events [2], [3], [4]. Related research has also shown the importance of driver input responses and vehicle kinematics for safe driving and for the identification of crash and near-crash events [5], [6], [7]. However, little effort has been made to quantify the effects of real-time data that combine weather conditions and road geometry with driving maneuver inputs and vehicle kinematics on predicting the occurrence of crashes. A reliable system for crash forecasting and proactive safety analysis is therefore of great interest and necessity.

Crash analysis is a complex problem, affected by various contributing factors including vehicle telemetry, driver state and environmental factors [8], [9]. While related research has investigated unsafe driving behaviors in an attempt to characterize traffic crashes and develop real-time road management policies [6], [10], [11], exploration of the impact of real-time information acquired from the driver, vehicle, weather and road geometry remains relatively limited. Weather was found to contribute to more than 1.25 million accidents (21% of all vehicle crashes), leading to about 418,000 injuries (19% of crash injuries) and nearly 5000 casualties (16% of all casualties) [12]. However, most of the literature that adopted weather variables in crash assessment used data obtained from police crash reports, which can be inaccurate because the reported conditions may reflect what the person filling in the report observed rather than the actual weather at the time of the accident [5], [13]. Driver input actions and vehicle kinematics have also proven to have an essential impact on the recognition of crash and near-crash events [6], [14]. As for roadway geometry, scholars have studied the effects of route geometric characteristics and how they lead to substantial changes in driving behavior [4], [15]. In this study, real-time data were gathered using a driving simulator; transportation research has been actively adopting driving simulation experiments as they are much safer, with the major advantage of full empirical control over conditions and the capacity to explore multiple design structures [16]. The adopted route layout included various geometry types such as uphills, downhills, curves, straight segments and roundabout stretches, and simulations were conducted under three adverse weather conditions, namely fog, rain and snow. The driver input responses (e.g. pedal positions and wheel angles) along with vehicle kinematics (e.g. speed and Time-To-Collision) were systematically recorded during the trials and preprocessed in order to identify the best precursors for crash event evaluation.
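
Time-To-Collision is named above as one of the recorded vehicle kinematics, but this excerpt does not say how it was computed. A minimal sketch, assuming the common constant-speed definition (gap to the lead vehicle divided by the closing speed); the function and parameter names are hypothetical:

```python
def time_to_collision(gap_m, follower_speed_mps, leader_speed_mps):
    """Return the Time-To-Collision in seconds under constant speeds.

    Returns None when the follower is not closing in on the leader,
    because no collision would occur at the current speeds.
    """
    closing_speed = follower_speed_mps - leader_speed_mps
    if closing_speed <= 0:
        return None
    return gap_m / closing_speed


# Hypothetical reading: 30 m gap, follower at 25 m/s, leader at 20 m/s.
ttc = time_to_collision(30.0, 25.0, 20.0)
print(ttc)
```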

In crash prediction analysis, traditional statistical learning techniques such as logistic regression [10], quantile regression [17] and discriminant analysis [18] have been widely used. However, statistical models for crash prediction frequently suffer from poor data quality, require a great deal of historical data, and provide unsatisfying results when treating features with a high number of categories [19], [20]. Conversely, machine learning (ML)-based models have been shown to outperform statistical analysis in predicting forthcoming events and have reported satisfying results in many transportation systems [21], [22], [23]. The main advantages of ML models are (i) their ability to handle major non-linear problems autonomously using datasets from multiple sources, (ii) their ability to easily incorporate new data to improve estimation performance, and (iii) their predictive and explanatory ability through the extraction of rules. Support Vector Machines (SVM), Random Forest (RF) and Multilayer Perceptron (MLP) are among the most prominent machine learning techniques that have been used for crash event prediction [24], [25], [26], [27], [28]. SVM, applied to safety performance measures for vehicle crashes, has shown good handling of small data sizes, with fewer over-fitting issues and better generalization abilities [29]. Scholars have likewise affirmed the performance of RF in several fields, as it has been proven to reduce variance compared to a single decision tree and is robust to outliers and missing values [30]. MLP, for its part, gained popularity owing to its excellence in various complex tasks by learning data representations in both supervised and unsupervised settings, along with parallel processing, fault tolerance, and the ability to generalize to unseen data samples using hierarchical representations [31].

Another key determinant in the prediction of crash events is the ratio of crash to non-crash instances in the dataset. In road accident observations, there are usually far fewer crash data points than non-crash samples. Numerous researchers adopt the conventional proportion of 4 non-accident cases for each accident case [29], [32], [33]. However, this is likely to result in imbalance, since predictive learners prioritize the label with the greater number of instances, which biases them toward the majority class and leads to over-prediction of this class [5], [34]. Conventionally, oversampling and undersampling are the two essential techniques used to address the class-imbalance issue. Oversampling approaches prevent data loss by duplicating samples of the minority class. In contrast, undersampling strategies balance the ratio of the classes by eliminating data points from the majority class. The Synthetic Minority Oversampling Technique (SMOTE), regarded as one of the most powerful re-sampling algorithms, was presented by Chawla et al. [35] to solve the imbalance issue by producing synthetic instances from the minority class. On skewed data, SMOTE is capable of identifying similar but more specific regions in the feature space as the decision region for the minority class [36]. To enhance the performance of classification models on the imbalanced dataset, variations of the resampling techniques were employed along with the aforementioned classification models.
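
The three balancing strategies described above can be sketched in a few lines. This is an illustrative, standard-library-only sketch and not the authors' implementation: the function names, the toy points and the parameter `k` are assumptions, and `smote_like` shows only the core interpolation idea of SMOTE (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours). It assumes the majority class is at least as large as the minority class and that the minority class has at least two samples.

```python
import random


def random_oversample(minority, majority, rng):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return minority + extra, majority


def random_undersample(minority, majority, rng):
    """Keep a random majority subset with the same size as the minority class."""
    return minority, rng.sample(majority, len(minority))


def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic minority points by interpolation."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # The k nearest minority neighbours of the chosen base sample.
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical two-feature toy data: 3 crash samples, 9 non-crash samples.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
majority = [(5.0 + i, 5.0) for i in range(9)]
rng = random.Random(0)

over_min, over_maj = random_oversample(minority, majority, rng)
under_min, under_maj = random_undersample(minority, majority, rng)
synthetic = smote_like(minority, n_new=6, k=2, rng=rng)
print(len(over_min), len(under_maj), len(synthetic))
```

Oversampling repeats existing crash samples, undersampling discards non-crash samples, and the SMOTE-style step adds new points that lie between existing crash samples instead of copying them.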
Moreover, variable extraction was conducted using the widely employed principal component analysis (PCA), a dimension reduction approach that finds a linear transformation of the input data points, generating projections of the original features onto a new variable space [37]. PCA has proved to be an effective method of dimensionality reduction, as no prespecified structure is required for the input space and the orthogonal (i.e., uncorrelated) components maximize the amount of variance explained [38].
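
To make the PCA step concrete, the sketch below extracts the first principal component of a small dataset using power iteration on the sample covariance matrix. It is an assumption-laden illustration, not the procedure used in the study: the function name and the toy readings are hypothetical, only the first component is computed, and a library routine would normally be used in practice. It also assumes the data have non-zero variance and that the starting vector is not orthogonal to the leading eigenvector.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio) of the first component.

    rows is a list of equal-length lists of numbers (samples x features).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    total_variance = sum(cov[a][a] for a in range(d))
    return v, eigenvalue / total_variance


# Hypothetical readings of two strongly correlated features.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
direction, ratio = first_principal_component(rows)
print(direction, ratio)
```

Because the second feature is roughly twice the first, almost all of the variance lies along a single direction, so one component can stand in for both original features.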

The objective of this study is three-fold: (1) to identify the strongest factors contributing to the likelihood of crash instances across various route geometries under adverse weather conditions using comprehensive real-time data, (2) to examine the effects of different combinations of the four adopted feature categories – road geometric characteristics, weather patterns, vehicle telemetry and driver inputs – on the occurrence of crash events and (3) to develop multiple prediction models – RF, SVM and MLP – using different sampling strategies, namely under-sampling, over-sampling and SMOTE, to handle the data imbalance issue. We aimed to thoroughly understand the interrelationships among various combinations of input features, sampling techniques and learners in crash event prediction. To reduce the sampling bias in splitting the data between training and testing for each model building process, we employed 10-fold cross-validation. The remainder of this study is organized as follows. First, descriptions of the driving simulator and experimental protocol with data analysis are provided. Second, in the methodology section, the construction of the modeling techniques is presented in detail. Next, the results are reported and interpreted. Finally, conclusions and future directions of the present study are offered.
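
The 10-fold cross-validation mentioned above interacts with resampling: the section on building prediction models states that only the training set should be rebalanced while the test set is left intact. How the authors combined the two is not shown in this excerpt, so the following is a sketch under stated assumptions: binary labels with 1 for the minority crash class, stratified folds, random over-sampling as the balancing step, and hypothetical function names.

```python
import random


def stratified_folds(labels, k, rng):
    """Assign sample indices to k folds, keeping class ratios similar."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def cross_validation_splits(labels, k, rng):
    """Yield (train_indices, test_indices) with a rebalanced training part."""
    folds = stratified_folds(labels, k, rng)
    for f in range(k):
        test = folds[f]
        train = [i for g in range(k) if g != f for i in folds[g]]
        minority = [i for i in train if labels[i] == 1]
        majority = [i for i in train if labels[i] == 0]
        # Random over-sampling of the training part only; the test fold
        # keeps its original class distribution.
        extra = [rng.choice(minority)
                 for _ in range(len(majority) - len(minority))]
        yield train + extra, test


# Hypothetical dataset: 20 crash samples and 80 non-crash samples (1:4).
labels = [1] * 20 + [0] * 80
splits = list(cross_validation_splits(labels, k=10, rng=random.Random(0)))
train, test = splits[0]
print(len(train), len(test))
```

Resampling before splitting would let duplicated or synthetic crash samples leak into the test folds and inflate the scores, which is why the balancing step sits inside the loop here.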

Section snippets

Participants

A total of 107 volunteers (89 males and 18 females) between the ages of 20 and 45 (M = 34.5; SD = 2.60) participated in the study. All participants had a full driver’s license and had been driving for at least a year. Driving experience ranged from 1 to 23 years (M = 8.75; SD = 4.22), with daily driving time ranging from 1 to 5 h (M = 2.40; SD = 1.35). All were in good health and had normal or corrected-to-normal vision. In reference to the provided information about

Methodology

The main objective of this work is to develop crash prediction models by considering the most pertinent inputs, creating relevant feature combinations and developing efficient machine learning techniques. Three popular classification methods, Support Vector Machine (SVM), Random Forest (RF) and Multilayer Perceptron (MLP), along with three balancing techniques – random over-sampling (ROS), random under-sampling (RUS) and SMOTE – are used to build prediction models, and compared to each other

Building prediction models

To demonstrate the validity of the classifiers’ assessment, parameter optimization for each of the SVM, MLP and RF classifiers was carried out through cross-validation to select the best-performing penalty parameters. To cope with the imbalance issue, data balancing techniques were applied. The proper way to apply rebalancing strategies is to oversample or undersample only the training set while the test set is left intact [5], [34], [62],

Summary and conclusion

Road traffic crashes are considered one of the main causes of countless health issues, economic losses and fatalities; investigating and understanding the major contributors to road accidents is therefore of practical significance. Numerous studies on this topic adopted traditional statistical techniques, which frequently suffer from poor data quality and require a great deal of historical data. Conversely, machine learning models have proven to supersede

CRediT authorship contribution statement

Zouhair Elamrani Abou Elassad: Conception and design of study, Acquisition of data, Analysis and/or interpretation of data, Writing - original draft, Writing - review & editing. Hajar Mousannif: Acquisition of data, Analysis and/or interpretation of data, Writing - original draft. Hassan Al Moatassime: Conception and design of study, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research was jointly supported by (1) the Moroccan Ministry of Equipment, Transport and Logistics, and (2) the Moroccan National Center for Scientific and Technical Research (CNRST).

References (70)

  • Washington S. et al. Applying quantile regression for modeling equivalent property damage only crashes to identify accident blackspots. Accid. Anal. Prev. (2014)
  • Ba Y. et al. Crash prediction with behavioral and physiological features for advanced vehicle collision avoidance system. Transp. Res. C (2017)
  • Wang C. et al. A crash prediction method based on bivariate extreme value theory and video-based vehicle trajectory data. Accid. Anal. Prev. (2019)
  • Basso F. et al. Real-time crash prediction in an urban expressway using disaggregated data. Transp. Res. C (2018)
  • Li Y. et al. Identification of significant factors in fatal-injury highway crashes using genetic algorithm and neural network. Accid. Anal. Prev. (2018)
  • Yu R. et al. Utilizing support vector machine in real-time crash risk evaluation. Accid. Anal. Prev. (2013)
  • Basu S. Deep neural networks for texture classification - A theoretical analysis. Neural Netw. (2018)
  • Xu C. et al. Evaluation of the impacts of traffic states on crash risks on freeways. Accid. Anal. Prev. (2012)
  • Cervantes J. et al. PSO-based method for SVM classification on skewed data sets. Neurocomputing (2017)
  • Wold S. et al. Principal component analysis. Chemom. Intell. Lab. Syst. (1987)
  • Ward J.R. et al. Extending time to collision for probabilistic reasoning in general traffic scenarios. Transp. Res. C (2015)
  • Werneke J. et al. How to present collision warnings at intersections? - A comparison of different approaches. Accid. Anal. Prev. (2013)
  • Yan X. et al. The influence of in-vehicle speech warning timing on drivers’ collision avoidance performance at signalized intersections. Transp. Res. C (2015)
  • Fernández A. et al. Hierarchical fuzzy rule based classification systems with genetic rule selection for imbalanced data-sets. Internat. J. Approx. Reason. (2009)
  • Gao M. et al. A combined SMOTE and PSO based RBF classifier for two-class imbalanced problems. Neurocomputing (2011)
  • Ramedani Z. et al. Potential of radial basis function based support vector regression for global solar radiation prediction. Renew. Sustain. Energy Rev. (2014)
  • Basheer I. et al. Artificial neural networks: fundamentals, computing, design, and application. J. Microbiol. Methods (2000)
  • Schmidhuber J. Deep learning in neural networks: An overview. Neural Netw. (2015)
  • Kia A.N. et al. Network-based direction of movement prediction in financial markets. Eng. Appl. Artif. Intell. (2020)
  • Fernández A. et al. A study of the behaviour of linguistic fuzzy rule based classification systems in the framework of imbalanced data-sets. Fuzzy Sets and Systems (2008)
  • Davoudi Kakhki F. et al. Evaluating machine learning performance in predicting injury severity in agribusiness industries. Saf. Sci. (2019)
  • Ding Y. et al. Forecasting financial condition of Chinese listed companies based on support vector machine. Expert Syst. Appl. (2008)
  • Nafiah F. et al. Quantitative evaluation of crack depths and angles for pulsed eddy current non-destructive testing. NDT E Int. (2019)
  • West D. Neural network credit scoring models. Comput. Oper. Res. (2000)
  • Makond B. et al. Probabilistic modeling of short survivability in patients with brain metastasis from lung cancer. Comput. Methods Programs Biomed. (2015)