Using data mining techniques for bike sharing demand prediction in metropolitan city

doi:10.1016/j.comcom.2020.02.007

Computer Communications

Volume 153, 1 March 2020, Pages 353-366

Abstract

Rental bikes have been introduced in many cities to improve mobility and comfort. Making rental bikes available and accessible to the public at the right time matters because it shortens waiting times, so providing the city with a stable supply of rental bikes becomes a major concern. The crucial part is predicting the bike count required at each hour to keep that supply stable. Data mining techniques are employed to address the prediction of hourly rental bike demand, and this paper discusses the models used for that prediction. The data include weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour, and date information. The paper also explores a feature-filtering approach that eliminates parameters which are not predictive and ranks the features by their prediction performance. Five statistical regression models were trained with their best hyperparameters using repeated cross-validation, and their performance was evaluated on a testing set: (a) Linear Regression, (b) Gradient Boosting Machine, (c) Support Vector Machine (Radial Basis Function Kernel), (d) Boosted Trees, and (e) Extreme Gradient Boosting Trees. When all the predictors are employed, the best model, Gradient Boosting Machine, gives the highest R² value: 0.96 on the training set and 0.92 on the test set. Furthermore, several analyses are carried out with Gradient Boosting Machine using different combinations of predictors to identify the most significant predictors and the relationships between them.

Introduction

Bike-sharing schemes are currently well received throughout the world. A bike-sharing scheme offers shared bikes to individuals on a short-term basis, either free of charge or at a minimal rate. Most bike-sharing systems let people borrow a bike at one station and return it at another station that belongs to the same network. Bike-sharing has gained wide attention in recent years as part of initiatives to boost cycling, improve the first-mile/last-mile link to other modes of transportation, and minimize the negative effect of transport activities on the environment. Bike-sharing has significant impacts: it helps establish a larger cycling community, increases transit use, reduces greenhouse gas emissions, improves public health, and eases traffic problems.

Progress in bike-sharing programs was slow initially, but after 1960 effective bike-tracking tactics were developed with improved technology [1]. During this decade, that development gave rise to the rapid spread of bike-sharing systems across several continents. South Korea is turning into a land of two wheels: cycling facilities provide transit flexibility, vehicle emission reductions, health advantages, reduced congestion, fuel efficiency, and financial benefits for individuals, and bike paths are being extended in cities across the nation as well as in rural areas.

Bike-sharing systems in cities across South Korea are designed to make bikes accessible to all residents. The benefit of bike-sharing over renting is that riders can take a bike from any station in the system and return it to any other station, which enhances mobility and benefits a greater number of users. Anyone can use a bike-sharing facility by becoming a member of a bike-sharing program, which gives the user access to a city-wide bike fleet for private use, either at minimal cost or free of charge. Many bike-sharing systems are automated and operated through the user's cell phone or smart card. The first Korean city to introduce a bike-sharing system was Changwon, Gyeongsangnam-do (South Gyeongsang Province), which launched Nubija (Nearby Useful Bike, Fun Attraction) in 2008; by 2012 it had 230 unmanned stations and over 4600 bikes in operation. Seoul, Busan, and Daejeon, as well as Suncheon, Jeollanam-do (South Jeolla Province), also have bike-sharing services. Currently, bike-sharing systems offer 3200 bicycles across 17 districts of Seoul. District governments own and operate them, with the exception of the bike shares in Yeouido and Sangam World Cup Park [2]. Fig. 1 shows a Seoul Bike Ddareungi station.

The steady rise in users makes it necessary to predict the number of rental bikes needed to keep the bike-sharing system working consistently. This research therefore aims to use machine learning and data mining algorithms to predict the number of rental bikes required at each hour. Data mining is used because it can reliably handle complex problems. Across various cities, a growing body of research has investigated weather and climate impacts on cycling, usually in combination with several other factors that can affect cycling. The results differ in the degree to which climate influences use. Pucher et al. [3] show that U.S. cities with relatively high levels of cycling have mild winters and often little rain, in contrast to the extreme heat and humidity that disrupt cycling. Furthermore, in Pucher and Buehler’s analysis [4] estimating the percentage of cycling trips to work in U.S. and Canadian cities, rainfall and temperature are statistically significant variables associated with lower cycling rates. These findings show the influence of weather on cycling patterns, and selected weather parameters are therefore used in this research.

The paper is structured as follows. Section 2 reviews the related literature. Section 3 deals with the research methods. Section 4 provides the data set description, exploratory analysis, and feature filtering and importance. Section 5 discusses the metrics used for evaluating the models. Section 6 describes the model development process. Section 7 presents the results and discussion, and Section 8 concludes the paper.

Section snippets

Related works

Multiple studies have been carried out on bike-sharing demand prediction, and some of the important works are discussed in this section. For the Capital Bikeshare network in Washington, D.C., multiple linear regression and random forest algorithms are used to predict demand for rental bikes [5]. A short-term prediction of docking-station usage is implemented for a case study in Suzhou, China [6]. In docking stations with one-month historical data, LSTM and GRU are used for the prediction of

Linear regression

Linear regression (LM) is the simplest method: it models the relationship between a scalar output attribute Y and one or more input attributes X. The case of a single independent attribute is known as simple linear regression, and the method is called multiple linear regression when more than one independent attribute is considered. In linear regression, data are modeled using linear predictor functions, and from data, the unknown model
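As an illustration of the simple (one-attribute) case described above, the following is a minimal Python sketch of an ordinary least-squares fit. It is not the authors' implementation (the paper's models are built in R); the function name `fit_simple_linear` and the toy temperature and rental-count values are hypothetical and chosen only so the result is easy to check by hand.

```python
from statistics import mean


def fit_simple_linear(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data (not from the paper): temperature vs. bikes rented.
temps = [0.0, 5.0, 10.0, 15.0, 20.0]
counts = [100.0, 200.0, 300.0, 400.0, 500.0]

intercept, slope = fit_simple_linear(temps, counts)
predicted_at_12 = intercept + slope * 12.0
```

Because the toy data lie exactly on a line, the fit recovers an intercept of 100 and a slope of 20 bikes per degree. Real hourly demand is far noisier and depends on several predictors, which is the multiple-regression case.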

Data preparation

This work examines the relationship between the number of rental bikes used in each hour and different predictors such as weather information and time information. Additionally, it examines the efficiency of various regression models, (a) LM, (b) GBM, (c) SVM, (d) BT and (e) XGBTree, for predicting the demand for public rental bikes, and it ranks the influence of the predictors or parameters on the prediction.

Data for one year (December 2017 to November 2018) is downloaded from the Seoul Public Data Park website

Evaluation indices

Regression models are trained, and the best one is selected, using a repeated 10-fold cross-validation scheme. The doParallel package [34] is employed to accelerate computations. Various evaluation criteria are employed to test the performance of the regression models. The performance assessment indices used here are Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), R-squared (R²) and Coefficient of Variation (CV).

The standard sample deviation between the observed and the predicted values of the
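The four indices named above can be sketched in a few lines of Python. The formulas below are the commonly used definitions and are an assumption here, since the snippet does not show the paper's own equations: R² is computed as 1 − SS_res/SS_tot (some toolkits report the squared correlation between observed and predicted values instead), and CV is taken as RMSE divided by the mean observed value, expressed as a percentage. The function name and the sample values are hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Return RMSE, MAE, R-squared and CV (%) for paired value lists."""
    n = len(observed)
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_obs = sum(observed) / n
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares

    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    r2 = 1.0 - ss_res / ss_tot     # assumed definition, see note above
    cv = 100.0 * rmse / mean_obs   # assumed definition: RMSE relative to mean
    return {"RMSE": rmse, "MAE": mae, "R2": r2, "CV": cv}


# Hypothetical hourly counts (not from the paper).
observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(observed, predicted)
```

Lower RMSE, MAE and CV indicate smaller errors, while a higher R² indicates that the model explains more of the variation in the observed counts.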

Model development

To reduce the error when fitting a model, it is necessary to find optimal tuning parameters for each of the regression algorithms. The caret package provides a grid search function for determining the best parameter values for a model. Grid search is a systematic approach that tries all combinations of the candidate hyperparameter values and selects the best-performing combination.

Fig. 9 shows a residual plot for the linear regression model. In case of LM, residuals are determined as
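The idea behind grid search can be sketched independently of any library. The Python example below is not caret's API: `grid_search`, the parameter names `n_trees`, `depth` and `shrinkage`, their candidate values, and the stand-in scoring function `toy_cv_rmse` are all hypothetical. In practice the scoring function would return a cross-validated error such as RMSE for a model trained with the given parameters.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every combination in param_grid; keep the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    n_evaluated = 0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


# Hypothetical grid; the paper's actual tuning ranges are not shown here.
grid = {
    "n_trees": [50, 100, 150],
    "depth": [1, 3, 5],
    "shrinkage": [0.01, 0.1],
}


def toy_cv_rmse(params):
    """Stand-in for a cross-validated RMSE; smallest at (100, 3, 0.1)."""
    return (
        abs(params["n_trees"] - 100)
        + abs(params["depth"] - 3)
        + 10 * abs(params["shrinkage"] - 0.1)
    )


best_params, best_score, n_evaluated = grid_search(grid, toy_cv_rmse)
```

The grid has 3 × 3 × 2 = 18 combinations, and every one of them is evaluated. This exhaustiveness is why grid search becomes expensive as the number of hyperparameters grows, and why parallel execution helps.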

Results and discussion

After training, each regression model has 30 outcomes, one for each held-out fold of the 10-fold cross-validation repeated 3 times. For each model, caret uses these outcomes to plot the R², RMSE and MAE values along with their confidence intervals, as shown in Fig. 14. The best model is the one with the lowest MAE, RMSE and CV values and the highest R² value: the error values should be as small as possible, while R² measures how much of the variation the fit explains and should therefore be as high as possible.
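To make the count of 30 outcomes concrete, here is a minimal Python sketch of how repeated 10-fold cross-validation generates its train/held-out splits. It illustrates the resampling scheme only and is not caret's implementation; the function name, the seed, and the sample size of 100 are hypothetical.

```python
import random


def repeated_kfold_indices(n_samples, n_folds=10, n_repeats=3, seed=0):
    """Return (train_indices, held_out_indices) pairs for repeated k-fold CV."""
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        # Reshuffle before each repeat so the folds differ between repeats.
        order = list(range(n_samples))
        rng.shuffle(order)
        # Deal the shuffled indices into n_folds roughly equal folds.
        folds = [order[i::n_folds] for i in range(n_folds)]
        for held_out in folds:
            held = set(held_out)
            train = [i for i in order if i not in held]
            splits.append((train, held_out))
    return splits


splits = repeated_kfold_indices(n_samples=100)
n_resamples = len(splits)  # 10 folds x 3 repeats
```

Each of the 30 splits yields one set of R², RMSE and MAE values on its held-out fold, and the spread of those values across splits is what the confidence intervals summarize.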

Conclusion

The data analysis and prediction yield thought-provoking outcomes for both the exploratory data analysis and the prediction models. The pairwise plots generated from the correlations show relationships between parameters that can be concealed in the most commonly used prediction models. The GBM and XGBTree models give better R², RMSE, MAE and CV values than SVM, LM and BT. Temp and Hour are considered as the most significant variables for the hourly rental bike count

CRediT authorship contribution statement

Sathishkumar V E: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation, Validation. Jangwoo Park: Supervision, Writing - review & editing. Yongyun Cho: Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (35)

DeMaio, Paul. Bike-sharing: History, impacts, models of provision, and future. J. Public Transp. (2009)
Pucher, John et al. Bicycling renaissance in North America?: Recent trends and alternative policies to promote bicycling. Transp. Res. A (1999)
Pucher, John et al. Why Canadians cycle more than Americans: a comparative analysis of bicycling trends and policies. Transp. Policy (2006)
Wang, Bo et al. Short-term prediction for bike-sharing service using machine learning. Transp. Res. Proced. (2018)
Lin, Lei et al. Predicting station-level hourly demand in a large-scale bike-sharing network: A graph convolutional neural network approach. Transp. Res. C (2018)
Pan, Yan. Predicting bike-sharing demand using recurrent neural networks. Proced. Comput. Sci. (2019)
Candanedo, Luis M. et al. Data driven prediction models of energy use of appliances in a low-energy house. Energy Build. (2017)
Dong, B. et al. Applying support vector machines to predict building energy consumption in tropical region. Energy Build. (2005)
Pashazadeh, V. et al. Data driven sensor and actuator fault detection and isolation in wind turbine using classifier fusion. Renew. Energy (2018)
Belgiu, M. et al. Random forest in remote sensing: A review of applications and future directions. ISPRS J. Photogramm. Remote Sens. (2016)
Fan, Cheng et al. Development of prediction models for next-day building energy consumption and peak power demand using data mining techniques. Appl. Energy (2014)
Bikesharing spreads in Korea, LINK:...
Feng, YouLi et al. A forecast for bike rental demand based on random forests and multiple linear regression
Wu, Xinhua. Station-level hourly bike demand prediction for dynamic repositioning in bike-sharing systems
Yang, Zidong. Mobility modeling and prediction in bike-sharing systems
Zhang, Lihuan. Data analysis and visualization in bike-sharing systems
Tomaras, Dimitrios et al. Modeling and predicting bike demand in large city situations
