Abstract:
Understanding the distribution and propagation of train delays is crucial for railway management. This paper combines the interpretability of logistic regression with the robustness and accuracy of Random Forest to create a combined model, which is applied to predict punctuality. The data consist of the relative timetable deviation of train movements at all stations, as well as punctuality observations at destination stations, recorded for both passenger and freight trains in Sweden during 2017 and 2018. The data include many policy and categorical variables, such as train operator, which are known to indirectly affect delay risk but are labeled as insignificant in classical regression, making their coefficients unstable and difficult to interpret. For this reason, the study applies a logistic regression model to the variables of interest, such as train type and first registered delay (relative deviation from the timetable), together with Random Forest "bagging" to capture indirect and/or sensitive predictors. This semi-parametric logistic regression model was trained on the 2017 data and was accurate and robust when tested on the 2018 data. It proved capable of handling delays caused by unforeseen disruptions, such as abnormal weather, in the test year. We show that the semi-parametric model has significantly better prediction performance than linear models, Weibull distributions, binomial logistic regression, and Random Forest alone. Furthermore, the semi-parametric model maintains its interpretability whilst producing accurate predictions on new data.
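As a rough illustration, a semi-parametric logistic model of the kind described above is often written in partially linear form. This is a sketch under assumptions, not necessarily the paper's exact formulation: the symbols p_i, x_i, z_i, beta, f, T_b and B are introduced here for illustration and do not appear in the abstract.

```latex
% Assumed notation (illustrative only):
%   p_i  probability that train i is punctual at its destination
%   x_i  variables of interest (e.g. train type, first registered delay)
%   z_i  indirect or sensitive predictors (e.g. train operator)
%   T_b  a tree fitted on the b-th bootstrap sample (bagging), b = 1..B
\operatorname{logit}(p_i)
  = \log\frac{p_i}{1 - p_i}
  = \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta} + f(\mathbf{z}_i),
\qquad
f(\mathbf{z}) \approx \frac{1}{B}\sum_{b=1}^{B} T_b(\mathbf{z})
```

In this form the coefficients in beta keep their usual log-odds interpretation for the variables of interest, while the nonparametric term f absorbs the predictors whose individual coefficients would be unstable. How f is actually fitted and combined with the parametric part in the study is not stated in the abstract.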
Date of Conference: 27-30 October 2019
Date Added to IEEE Xplore: 28 November 2019