Elsevier

Computer Communications

Volume 153, 1 March 2020, Pages 353-366
Using data mining techniques for bike sharing demand prediction in metropolitan city

https://doi.org/10.1016/j.comcom.2020.02.007

Abstract

Rental bikes have been introduced in many cities to improve mobility and comfort. Making rental bikes available and accessible to the public at the right time matters because it reduces waiting time, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to keep that supply stable. Data mining techniques are employed to overcome the hurdles in predicting hourly rental bike demand. This paper discusses models for hourly rental bike demand prediction. The data used include weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour, and date information. The paper also explores a feature-filtering approach that eliminates parameters which are not predictive and ranks the features by their prediction performance. Five statistical regression models were trained with their best hyperparameters using repeated cross-validation, and their performance was evaluated on a testing set: (a) Linear Regression, (b) Gradient Boosting Machine, (c) Support Vector Machine (Radial Basis Function Kernel), (d) Boosted Trees, and (e) Extreme Gradient Boosting Trees. When all the predictors are employed, the best model, Gradient Boosting Machine, gives the highest R2 value: 0.96 on the training set and 0.92 on the test set. Furthermore, several analyses are carried out with Gradient Boosting Machine using different combinations of predictors to identify the most significant predictors and the relationships between them.

Introduction

Bike-sharing schemes are currently well received throughout the world. A bike-sharing scheme offers individuals shared bikes for short-term use, either free of charge or at a minimal rate. Most bike-sharing systems let people borrow a bike at one station and return it at another station that belongs to the same network. Bike-sharing has gained wide attention in recent years as part of initiatives to boost cycling, improve the first mile/last mile link to other modes of transportation, and minimize the negative effect of transport activities on the environment. Bike-sharing has significant impacts on establishing a larger cycling community, increasing the use of transportation, minimizing greenhouse gas emissions, improving public health, and easing traffic problems.

Bike-sharing programs progressed slowly at first, but after 1960 effective methods for tracking bikes were developed with improved technology [1]. During this decade, that development gave rise to the rapid spread of bike-sharing systems across several continents. South Korea is turning into a land of two wheels: cycling facilities provide transit flexibility, vehicle emission reductions, health advantages, lower congestion, fuel efficiency, and financial benefits for individuals, and cycle paths are being extended in cities across the nation as well as in rural areas.

Bike-sharing systems have been developed to make bikes accessible to residents of cities across South Korea. The benefit of bike-sharing over renting is that riders can take a bike from any station in the system and return it to any other station, which enhances mobility and benefits a greater number of users. Anyone can enjoy the bike-sharing facility by becoming a member of a bike-sharing program; members have access to a city-wide bike fleet for private use, either at minimal cost or free of charge. Many bike-sharing systems are automated and rely on the user's cell phone or smart card. The first Korean city to introduce a bike-sharing system was Changwon in Gyeongsangnam-do (Province of South Gyeongsang), which launched Nubija (Nearly Useful Bike, Fun Attraction) in 2008, with 230 autonomous bikes running over 4600 bikes by 2012. Seoul, Busan, and Daejeon, as well as Suncheon in Jeollanam-do (South Jeolla Province), also have bike-sharing services. Currently, bike-sharing systems offer 3200 bicycles in 17 districts of Seoul. District governments own and operate them, with the exception of the bike shares in Yeouido and Sangam World Cup Park [2]. Fig. 1 shows the Seoul Bike Ddareungi spot.

The constant rise in users makes it necessary to predict the number of rental bikes needed to keep the bike-sharing system working consistently. Therefore, this research aims to use machine learning and data mining algorithms to predict the number of rental bikes required at each hour. Data mining is used because it can reliably handle complicated problems. Across various cities, a growing body of research has investigated weather and climate impacts on cycling, usually in combination with several other factors that can affect cycling. The results differ in the degree to which climate influences use. Pucher et al. [3] show that U.S. cities with relatively high levels of cycling have mild winters and often little rain, in contrast to the extreme heat and moisture that disrupt cycling. Furthermore, in Pucher and Buehler’s analysis [4] estimating the percentage of cycling trips to work in U.S. and Canadian cities, rainfall and temperature are statistically significant variables associated with lower cycling rates. This shows the influence of weather on cycling patterns, and selected weather parameters are therefore used in this research.

The paper is structured as follows. Section 2 reviews the literature. Section 3 deals with the research methods. Section 4 provides the data set description, exploratory analysis, and feature filtering and importance. Section 5 discusses the metrics used for evaluating the models, Section 6 describes the model development process, Section 7 presents results and discussion, and Section 8 concludes the paper.

Section snippets

Related works

Multiple studies address bike-sharing demand prediction, and some of the important works are discussed in this section. For the Capital Bikeshare network in Washington D.C., multiple linear regression and random forest algorithms are used to predict demand for rental bikes [5]. A short-term prediction of docking-station use is implemented as a case study in Suzhou, China [6]. In docking stations with one-month historical data, LSTM and GRU are used for the prediction of

Linear regression

Linear regression (LM) is the simplest method; it models the relationship between a scalar output attribute Y and one or more input attributes X. The case of a single independent attribute is known as simple linear regression, and the method is called multiple linear regression when more than one independent attribute is considered. In linear regression, data are modeled using linear predictor functions, and from data, the unknown model

Data preparation

This work examines the relationship between the number of rental bikes used in each hour and different predictors such as weather information and time information. Additionally, it examines the efficiency of various regression models, (a) LM, (b) GBM, (c) SVM, (d) BT and (e) XGBTree, for predicting the demand for public rental bikes and for ranking the influence of predictors or parameters on the prediction.

Data for one year (December 2017 to November 2018) is downloaded from the Seoul Public Data Park website

Evaluation indices

Regression models are trained, and the best one is selected, using a repeated 10-fold cross-validation scheme. The doParallel package [34] is employed to accelerate computations. Various evaluation criteria are employed to test the performance of the regression models. The performance assessment indices used here are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R2) and Coefficient of Variation (CV).
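The four indices can be illustrated with a short sketch. The paper works in R; the Python below is only a stand-in, and the function name `regression_metrics` is hypothetical. Two definitions are assumptions rather than taken from this excerpt: R2 is computed as 1 minus the ratio of residual to total sum of squares (some packages report the squared correlation instead), and CV is taken as RMSE divided by the mean observed value, expressed in percent.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE, R2 and CV (%) for paired observed/predicted values."""
    n = len(observed)
    if n == 0 or n != len(predicted):
        raise ValueError("observed and predicted must be non-empty and equal length")

    errors = [o - p for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / n

    ss_res = sum(e * e for e in errors)                        # residual sum of squares
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)   # total sum of squares

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot            # assumed definition of R2
    cv = 100.0 * rmse / mean_observed     # assumed definition of CV, in percent

    return {"rmse": rmse, "mae": mae, "r2": r2, "cv": cv}


# Made-up hourly counts, for illustration only.
observed_counts = [100, 200, 300, 400]
predicted_counts = [110, 190, 310, 390]
metrics = regression_metrics(observed_counts, predicted_counts)
```

Lower RMSE, MAE and CV and a higher R2 all point to a better fit, so the indices can be compared side by side across models.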

The standard sample deviation between the observed and the predicted values of the

Model development

To reduce the error when fitting a model, it is necessary to find optimal tuning parameters for each of the regression algorithms. The caret package provides a grid search function for determining the best available parameter values for a model. The grid search tries all combinations of the candidate hyperparameter values and selects the combination that performs best.
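The idea behind a grid search can be sketched in a few lines. This is not caret's interface: the Python below is a stand-in, and the toy distance-weighted nearest-neighbour regressor, its hyperparameters `k` and `power`, the single hold-out split and the made-up data are all assumptions chosen to keep the example small. The paper scores candidates with repeated cross-validation rather than one hold-out set.

```python
import itertools
import math


def knn_predict(train, x, k, power):
    """Distance-weighted mean of the k training points nearest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    weights = [1.0 / (abs(px - x) + 1e-9) ** power for px, _ in nearest]
    return sum(w * py for w, (_, py) in zip(weights, nearest)) / sum(weights)


def holdout_rmse(train, valid, **params):
    """RMSE on a held-out set for one hyperparameter combination."""
    squared = [(y - knn_predict(train, x, **params)) ** 2 for x, y in valid]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(param_grid, score_fn):
    """Score every combination in the grid and keep the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, math.inf
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Made-up data: (hour, demand) pairs, split into training and validation halves.
data = [(hour, 50 + 10 * hour) for hour in range(24)]
train, valid = data[::2], data[1::2]

grid = {"k": [1, 2, 4], "power": [0, 1]}
best_params, best_score = grid_search(
    grid, lambda **params: holdout_rmse(train, valid, **params)
)
```

The cost grows with the product of the grid sizes (here 3 x 2 = 6 fits), which is one reason parallel back ends are used to speed up tuning.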

Fig. 9 exhibits a linear regression model residual map. In case of LM, residuals are determined as

Results and discussion

After training, each regression model has 30 outcomes, one for each fold of the 10-fold cross-validation repeated 3 times. For each model, caret uses these outcomes to plot R2, RMSE and MAE values along with their confidence intervals, as shown in Fig. 14. The best model is the one with the lowest MAE, RMSE and CV values and the highest R2 value: the error measures should be small, and R2 describes how much of the variation the fit explains, so it should be large.
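The count of 30 outcomes follows from 10 folds times 3 repeats. The sketch below shows how such resamples could be generated; it is a Python stand-in, not caret's implementation, and the function name `repeated_kfold_indices`, the seed and the sample size of 200 are hypothetical.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Yield (repeat, fold, training_indices, held_out_indices) tuples."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        # Reshuffle before each repeat so the folds differ between repeats.
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(n_folds):
            held_out = order[fold::n_folds]   # every n_folds-th shuffled index
            held_out_set = set(held_out)
            training = [i for i in order if i not in held_out_set]
            yield repeat, fold, training, held_out


# 10 folds x 3 repeats gives 30 train/held-out splits, one outcome per split.
splits = list(repeated_kfold_indices(n_samples=200))
```

Fitting a model on each training part and scoring it on the matching held-out part gives one R2, RMSE and MAE value per split, and the spread of those 30 values is what the confidence intervals summarize.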

Conclusion

The data analysis and prediction provide thought-provoking outcomes for both the exploratory data analysis and the prediction models. The pairwise plots generated from the correlations show relationships between parameters that can stay concealed in the most commonly used prediction models. The GBM and XGBTree models improve the R2, RMSE, MAE and CV of predictions relative to SVM, LM and BT. Temp and Hour are considered the most significant variables for the hourly rental bike count

CRediT authorship contribution statement

Sathishkumar V E: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation, Validation. Jangwoo Park: Supervision, Writing - review & editing. Yongyun Cho: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (35)

  • Fan Cheng et al.

    Development of prediction models for next-day building energy consumption and peak power demand using data mining techniques

    Appl. Energy

    (2014)
  • Bikesharing spreads in Korea, LINK:...
  • Feng YouLi et al.

    A forecast for bike rental demand based on random forests and multiple linear regression

  • Wu Xinhua

    Station-level hourly bike demand prediction for dynamic repositioning in bike-sharing systems

  • Yang Zidong

    Mobility modeling and prediction in bike-sharing systems

  • Zhang Lihuan

    Data analysis and visualization in bike-sharing systems

  • Tomaras Dimitrios et al.

    Modeling and predicting bike demand in large city situations
