Keywords

1 Introduction

Nowadays, technology has so much impact on society that influences everyday’s life decisions. Accordingly, the advent and massification of personal devices that incorporate global geopositioning technology (gps) have facilitated access to daily required services like public transportation. Hence, in the last years, new research opportunities have emerged intended for acquiring and processing information coming from such devices, and that might help to improve the operation of what has been called smart cities. According to [1], the main research topics using the above-mentioned technologies can be categorized as social dynamics [2], traffic dynamics [3], and operational dynamics [4, 5]. Social dynamics is devoted to the study of populations, its behavior and external factors that influence them. Traffic dynamics studies the population flow and their mobility patterns in order to estimate travel times, design real-time indicators of traffic, among many other tasks. Finally, operational dynamics focus on the taxi drivers operation modes, including collection points, unloading travel times, employed trajectories, among many other variables. As a result, the analysis of these types of dynamics can be directly applied to treat urban problems like traffic management, the design of improved displacement trajectories, ensuring a more fluid transit.

In recent years, operational dynamics has attracted growing attention, producing a diverse number of conducted studies around it. As an example, in Harbin, China, analysis and estimation of hotspots were carried out by means of dbscan clustering algorithms by using large-scale taxi gps data, taking into account features such as the grouping of pick-up and drop-off locations [5]. In Italy, a stochastic model employing a special version of accelerated random walks in combination with exponential decay distributions was introduced to describe the trajectories of 780,000 private vehicles [6]. In Copenhagen, Denmark, multi-scale spatiotemporal analysis of mobility patterns was made combining gps and Wi-Fi scans data [7]. However, it should be noticed that all the mentioned studies use different information highly correlated with the peculiarities of each city. Consequently, although the proposed methodologies may be scalable, the influence of endemic factors such as topography, demography, city distribution, as well as the implicit nature of the data hinders the reproducibility of the analyses and their transfer to other populations.

In this paper, we present a spatiotemporal analysis applied to the forecast of taxi service demands per communes for the city of Manizales (Colombia) by using generalized additive models (gam). The proposed analysis is based on three components: trend, seasonality, and holidays [8]. The study was conducted on data collected from taxi service requests for the second four-month period of 2016. The rest of this paper is organized as follows: Sect. 2 describes the theoretical aspects of the regressive model employed. Section 3 describes the employed data and its preprocessing, along with the experimental framework. Later, in Sect. 4 we describe and discuss the achieved results. Finally, Sect. 5 presents the conclusions of this work.

2 Methods

2.1 Time Series Forecasting Model

The approach of the forecasting model is based on the use of generalized additive models (gam) [9], being this a form of regressive model with non-linear smoothing. This model allows the addition of various functions of linear/non-linear nature as components in the final form of the equation. This allows a flexible model analysis by being treated as separate components. The forecasting model is decomposed into three components: trend, seasonality, and holidays [8]. The general model function is given by the following equation:

$$\begin{aligned} y(t) = g(t) + s(t) + h(t) + \epsilon _{t} \end{aligned}$$
(1)

Where g(t) describes the changes in trend of time series, s(t) represents the variations in seasonality along the weeks, while h(t) gives information related to holidays, and the \(\epsilon _{t}\) provides error attributed to external factors [8]. The following sections will give a more detailed description of each of the components involved in the model.

Trend Function. In contrast to trend models based on saturated growth, piecewise linear models often allow a simpler and more useful prediction when no abrupt or exponential growth is exhibited. The form that describes such a trend model corresponds to the following equation:

$$\begin{aligned} g(t) = ( k + a(t)^{T}\delta )t + (m + a(t)^{T}\gamma ) \end{aligned}$$
(2)

Being k the growth rate, \(\delta \) the adjustment rate, m the mismatch parameter, and \(\gamma \) is defined as the multiplication of the change points in time \(s_{j}\) and the adjustment rate \(\delta \), i.e. \(-s_{j}\delta _{j}\) to ensure the continuity of the trend function [8]. Finally, the component vector a(t) is defined as:

$$\begin{aligned} a_{j}(t)=\left\{ \begin{array}{lr} \,\,1, &{} \quad t\ge s_{j},\\ 0, &{} \text { otherwise}\end{array}\right. \end{aligned}$$
(3)

It is important to point out that to select the right number of changepoints, we assume a \(\delta _{j}\sim \) Laplace\((0,\tau )\), being \(\tau \) the parameter that directly controls the flexibility of the model.

Seasonality Function. When allusion is made to seasonality, it refers to the measurement of periodic behavior in a series of times. In our case, we use seasonality to measure whether there is any periodical behavior of the phenomenon to be evaluated over the course of the days of the week. In order to obtain a smoothed approximation of the effects of seasonality, a Fourier series estimation can be made [10], exhibiting the following equation:

$$\begin{aligned} X(t) = \sum _{i=1}^{N}\left( a_{n}cos\left( \frac{2\pi nt}{P} \right) + b_{n}sin\left( \frac{2\pi nt}{P} \right) \right) \end{aligned}$$
(4)

Where P is the period used in the time series and will depend on the analysis window, while Fourier coefficients can be expressed as \(\beta =[a_1, b_1, . . . , a_N; b_N]^T\). However, this form of estimation requires a parameter adjustment. For this purpose, it is necessary to construct a matrix composed by seasonality vectors for each time t from the Fourier series, obtaining:

$$\begin{aligned} X(t) = \left[ \text {cos} \left( \frac{2\pi (1)t}{P} \right) , \cdots , \text {sin}\left( \frac{2\pi (N)t}{P} \right) \right] \end{aligned}$$
(5)

Where N allows to adjust seasonality patterns to rapid changes. Finally, the estimation can be redrafted as \(s(t) = X(t)\beta \), being \(\beta \sim \) Normal\((0,\sigma ^{2})\).

Holiday Function can be considered as perturbation factors as these can usually affect the dynamics of taxi service requests by depending directly on the behaviour of the population during such dates, e.g. local migration, or leisure activities that would allow a decrease/increase in demand for taxis during the course of the day.

The way to model the holiday function is throughout the generation of a matrix of regressors, by using the following equation:

$$\begin{aligned} \begin{matrix} h(t) = Z(t)\kappa \text { Where } Z(t) = [ 1(t \in D_{1}), \cdots , 1(t \in D_{L})] \end{matrix} \end{aligned}$$
(6)

Being \(D_{i}\) is the set of past and future dates for each holiday i, and \(\kappa \sim \) Normal\((0, \nu ^{2})\) concurring with the changes in the forecasts.

2.2 Model Fitting

Due to the large number of parameters in all components, i.e. parameters existing in trend, seasonality and holidays functions; l-bfgs is used as optimizer. Limited-memory bfgs (l-bfgs) is an optimization technique derived from the quasi-Newton family of methods, which are known for their great versatility and ease of unrestricted optimization solutions [11,12,13]. l-bfgs is based on the approximation of the Broyden-Fletcher-Goldfarb-Shanno algorithm (bfgs) having as condition the minimization of completely differentiable functions. This algorithm is based on gradients and has been popular for parameter estimation in many machine learning problems [14, 15].

3 Experimental Setup

In this section, we present the experimental framework used for the predictive model generation that corresponds to the demand for taxi services requested by communes in the Manizales city. Initially, a description of the database is made. Then, two types of preprocessing are developed in this study: a spatial preprocessing which explains the form and criteria used to sectorize the city, and a temporal preprocessing which defines the time intervals used for the construction of the time series. Finally, a description of the nature of the model and its generation is carried out.

3.1 Database

The database is conformed by 709531 records of requests for taxi services provided by the company Mobility Solutions S.A.S. through the CityTaxi application. The data collected correspond to the period from April 1, 2016 to July 31, 2016. It should be noted that in accordance with the Colombian law 1581 for protection of personal data, the database does not include sensitive personal information of users and/or drivers. Table 1 shows the fields available in each of the records. It is important to highlight that such records corresponds to hour, date and place where the requests were generated, i.e. places in which the users solicited the service.

Table 1. Example of a record

3.2 Spatial Preprocessing

Usually, in order to facilitate data processing, a common practice is to decompose the area of a city by independent sectors. A wide variety of methodologies to carry out these decompositions have been proposed in the literature [4, 16]. However, in several of the approaches, fractionation is done in a uniform way and consequently, this does not always guarantee a correct distribution of the city, in addition to not contemplating information apriori of the city. In the present work we have used the territorial divisions proposed by the governmental entities, which for the case study are the territorial divisions proposed by the Manizales mayor’s office, available on the Manizales geoportal websiteFootnote 1. The information in the geoportal is in turn fed with data supplied by the Agustín Codazzi Geographic Institute, being the entity in charge of producing the official maps and basic cartography of Colombia. Although cities in Colombia present mainly two forms of partitioning, i.e. by commune and by neighbourhood, we have chosen to work with the division by communes, due to the risk of loss of temporal resolution that supposes the use of segmentation by neighborhoods. To make an initial estimation of the behavior regarding the demand in taxis by commune, it is necessary to know the critical mass present in each commune, i.e. the size of the sample, in this case, by sectors. That is why Table 2, presented below, reflects the number of dwellings present per commune.

Table 2. Number of dwellings per commune.

As can be seen, Table 2 also presents the segmentation of the number of dwellings segmented by socio-economic strata, the above for the purpose of obtaining information about the dynamics of transport according to socio-economic income for being used in the discussion.

3.3 Temporal Preprocessing

Variables such as the demand corresponding to taxi services requests can be considered time-dependent factors. Since the time variable presents a continuous nature and cannot be analyzed directly, a common practice in the literature is the adoption of time intervals. The definition of time intervals or also known as time windows are an essential part of the experimental process. In order to provide an adequate time window, the data were analyzed on the basis of the following criteria: (i) The temporal resolution must capture the population dynamics. (ii) A representative sample, enough large for each of the communes, must be ensured for a posterior spatiotemporal analysis to carried out. After evaluating these criteria, we empirically find that a time window of 1 h guarantees their fulfillment. Each record is assigned exclusively to one of the 2928 temporary windows.

3.4 Model Generation

The models for the forecast of demand of services by commune were generated, by means of gam as described in Sect. 2.1, using the prophet package [8] available inFootnote 2. Trend change points (changepoints) were automatically adjusted by algorithm. Nevertheless, the amount of changepoints used is determined by the parameter \(\tau \), where high values will cause that the trend be adjusted excessively while the choice of small values will increase the generalization capacity of the model. In consequence, a correct choice of \(\tau \) parameter will positively influence the performance of the model. We adjusted the value of \(\tau \) through an interval search using exponentially growing sequences \(\{10^{-3}, 5 \times 10^{-3}, 10^{-2}, . . . , 5 \times 10^{-1}, 10^{0}\}\), in order to minimize mean absolute error (mae). Additionally only the first 80\(\%\) of the time series was used to infer changepoints, with the aim of have plenty of runway for projecting the trend forward and to avoid overfitting fluctuations at the end of the time series. The models for each commune were fit using a initial history of 3 months, and a forecast was made on 5 weeks horizon with cutoff per week. We used the classical “rolling origin” forecast evaluation [17], because it is not possible to use cross validation method that is employed in classification problems, given the observations are not exchangeable.

4 Results and Discussion

The results in this section will be shown in the following order: Initially, a detailed explanation about the observations found in the spatial analysis will be given. For this, we are going to use heat maps taking into account the rate of services per commune. Subsequently, a temporal analysis will be carried out on the behavior of the demand from time series. Finally, a spatiotemporal analysis will be conducted, considering the two previous analyses, focusing more on the prediction of time series.

4.1 Spatial Analysis

The spatial analysis allows to have a wide knowledge about the influence of the population concentrations and their relationship with the demand of the taxi services. That being said, Fig. 1 shows a heat map of the concentrations in the demand for taxis by communes in the city of Manizales.

Fig. 1.
figure 1

Heat map of Manizales city with the rate of services per commune.

If we look more closely at the heat map, the colors are related to the demand for taxi services, the greater the intensity of color, the greater the number of requests present in a given commune. It is important to emphasize that the commune with the greatest amount of demand is the commune 6 (Palogrande), concentrating 17.03\(\%\) of the requests that were generated in the city, positioning itself as the main point of confluence. Presumably due to the fact that the commune 6 presents a greater concentration of banks than others communes, furthermore this holds the greatest quantity of commercial establishments with a total of 9801 entities, as can be evidenced in the Table 2. In contrast, it can be observed that commune 1 (San Jose), shown in the softest color, presents the lowest percentage of requests in the demand for taxis. Another factor regarding the commune 1 is that this commune has the least number of dwellings and these belong to low socioeconomic strata.

Fig. 2.
figure 2

Forecast and real time series of demand for services. (Color figure online)

4.2 Temporal Analysis

The temporal analysis of the present work considers the behavior over time of the dynamics of taxi requests throughout the city. Figure 2 shows the analysis between the original behavior (blue line) and the forecast values (yellow line) for the periods between July 27, 2016 and August 03, 2016, considering a one-week analysis window. It is easily noticeable that the predicted values have a very high correspondence with the original behaviour of the time series analysed in the above-mentioned time frame, evidencing a high level of accuracy in the prediction. However, it is important to make some additional observations shown in the Fig. 2. The additional observations turn around the highest values exhibited by the demand for taxi services in the time series. One of the explanations for this phenomenon is based on two facts of interest: Both days correspond to weekends and also point out the end of the month. In Colombia, both half and the end of the month are conventional pay days for workers, and if we also note that weekends are usually days of recreation and night-time activities, the peak values in the figure are completely predictable. The second additional observation highlights the non-existence of holidays during the time frame of analysis. In view of the above, a possible cause for the high values in the time series may lie in climatic factors that have propitiate such behaviour in the requests of taxi services.

Fig. 3.
figure 3

Demand on an average day.

In addition to the analysis on the dynamics of taxi services along the week, an additional analysis was performed which consisted in the observation of demand of services in range of hours for days contemplated in the analysis. All days in the analysis showed a periodic pattern in their dynamics, revealing low and high demand activities, as exhibit Fig. 3. If we observe in detail, Fig. 3 presents peak values in 3 time ranges: 7:00–8:00, 13:00–14:00, 19:00–20:00; these values coincide perfectly with the hours of entry to workday, lunch hours and end of workday respectively. In these time ranges the highest concentrations of vehicular traffic throughout the work week are commonly presented and therefore this phenomenon correlates with the demand for taxi services.

4.3 Spatiotemporal Analysis

After making the forecasts for communes were obtained the predictions with their respective confidence intervals. We computed the following statistics: root mean squared error (rmse), mean absolute error (mae), and coverage of the confidence intervals. Additionally, the mean absolute percent error (mape) also was computed. However, in order to avoid calculation problems when the signal takes zeros values, these were replaced with one. Each statistic is calculated through a rolling window including \(10\%\) of the predictions. Table 3 shows mean and standard deviation calculated between windows, where (\(\downarrow \)) indicates that the smaller is better, and (\(\uparrow \)) indicates that the larger is better.

Table 3. Performance measures per commune mean ± sd.

As can be observed in Table 3 the models by communes present a very good fit, given the low levels reported in mape, which oscillate around \(1\%\). It should also be noted that Palogrande commune has the highest values in rmse and mae, since this commune has the highest demand rate of the city. On the other hand, the zone with the lowest error rates is commune 9 (Tesorito). This may be due to the fact that in this area is located not only the most industrialized part of the city, but also two of the most important institutions of higher education, presenting very well defined work schedules.

5 Conclusions

In this paper, we presented a spatiotemporal analysis using taxi pick up points information of the second four-month period of 2016. We studied demand per communes and the predictability of services taxis demand. From the conducted analysis we can conclude that commune with high demand correspond with Zona rosa. On the other hand, the temporal analysis revealed three peaks of high demand during the day, these peak coincide perfectly with hours of entry to workday, lunch hours and end of workday. In addition, the good performance of the generalized additive models for prediction horizons of a few weeks could be evidenced.

As future work, climate information will be included in order to explain some high demand peaks that are not explained by the stationarity component of the signal. Moreover, we would like to expand analysis period to some years as well as improve spatial resolution.