
1 Introduction

Traffic flow forecasting is an important aspect of designing intelligent transportation systems for cities and highways. It is also of great interest to everyday travelers, who may wish to know in advance the congestion levels of roads and the amount of time it would take to reach their destinations. Much research has been devoted to this topic in recent years, as is evident in very recent work such as [12, 20, 27, 29]. The fruits of such studies can be of good use to city planners in governments, traffic app developers, as well as everyday commuters and travelers.

The dawn of the big data era has also greatly facilitated advancements in traffic forecasting research. Recognizing the need to collect large amounts of high-quality, high-resolution traffic data, numerous states in the United States have invested in deploying large numbers of traffic sensors on their busiest roads and highways. The Caltrans Performance Measurement System (PeMS) from the state of California is an example of such systems. High-resolution traffic data such as flow and speed are collected in real time from more than 39,000 sensors deployed in major urban areas and on highways across the state. This work is devoted to studying traffic flow forecasting using PeMS data collected in southern California during the entire year of 2018.

The vast majority of the existing literature on traffic forecasting is devoted to forecasting in the immediate short term, such as a couple of minutes ahead. This is certainly justified, as the immediate short term usually best captures the dynamic nature of traffic and is usually of the greatest interest. For example, a commuter would be very interested in the optimal routes to avoid the worst congestion during morning rush hour, or the amount of time, hopefully in minutes, needed to arrive at his or her workplace. Longer-term traffic forecasts can certainly be produced by relying more on the historical data of a particular location, but they may suffer from relatively poor accuracy due to larger time gaps (e.g., forecasting 24 h in advance may need to rely heavily on historical averages, but if exceptional circumstances occur, such as an accident or rainy weather, then clearly the forecasts produced 24 h ago may not be as reliable).

Many researchers have exclusively studied 1-step-ahead forecasting, such as in [7, 14, 15, 17]; in other words, if the data are of 15-min resolution, then forecasts are produced for only 15 min ahead. Others, especially recently, have also studied multi-step traffic forecasting, from minutes to a couple of hours ahead, such as in [12, 16, 29].

Univariate forecasting, meaning producing forecasts by relying on historical data from one particular sensor alone, is also the most prevalent in the literature, such as in [10, 12, 15, 17, 26]. Multivariate forecasting usually involves using data from multiple spatially dependent sensors to produce better forecasts than the univariate counterparts, such as in [3, 16, 29]. Very recently, some researchers have also chosen to simply feed data from very large numbers of sensors to a deep Neural Network and task it with determining and establishing any dependencies among the data [16, 29]. Some studies have also incorporated external variables such as weather data into their forecasting models, such as in [12, 14].

This work can be thought of as an extension of a very recent previous work in [21]. Additional popular forecasting models are included and a new multivariate forecasting experiment is conducted. More details on the improvements upon our previous work are included in the Related Work section. The contributions of this work are as follows: (1) to evaluate the effectiveness of commonly used statistical and machine learning models on univariate traffic flow forecasting using large amounts of temporal data; (2) to study the impacts of incorporating spatially dependent data into multivariate forecasting models; (3) to examine the performance of multi-step forecasts in the short term, which is generally very dynamic and volatile; and (4) to provide a reference for the relative performance of popular traffic flow forecasting models in both univariate and multivariate settings.

Most forecasting models used in this work are provided by the ScalaTion project [19]. It is an open source, MIT-licensed, Scala-based project designed for analytics and simulation using big data. For more details, please visit http://www.cs.uga.edu/~jam/scalation.html. The only exceptions are the Neural Network models, which are provided by Keras [5] using the TensorFlow [1] backend.

The rest of this paper is organized as follows: Sect. 2 discusses the basic background on various statistical and machine learning models included in this study. Section 3 is about Related Work in traffic forecasting. Section 4 explains the detailed experimental setup and performance evaluations. Finally, Sect. 5 concludes the paper and offers potential directions for future work.

2 Background

In general, a forecasting model may take on the form of

$$\begin{aligned} y_t = f(X, B) + \epsilon _t, \end{aligned}$$
(1)

where \(y_t\) is the response of interest at time t (e.g., the traffic flow at 8:00AM); X is the set of inputs (e.g., traffic flow data in the recent past); B is the set of parameters; f is a function that maps X and B to a forecasted value at time t, often denoted as \(\hat{y}_t\); and \(\epsilon _t\) is the residual at time t.

Though the exact form of the function f, the set of parameters B, and the set of inputs X can differ greatly for various forecasting models, their common goal is to produce forecasted values that are as close to the actual values as possible across multiple time instances and minimize some type of norm of the residuals such as the Sum of Squared Errors (SSE).

The model in Eq. 1 may be generalized into the multivariate case consisting of m time series as follows:

$$\begin{aligned} \mathbf y _t = g(X, B) + \varvec{\epsilon }_t , \end{aligned}$$
(2)

where the response \(\mathbf y _t\) and the residuals \(\varvec{\epsilon }_t\) have been generalized into dimension m. The function g(X, B) now maps X and B to a vector of forecasted values, one for each of the m time series.

2.1 Statistical Models

Statistical models generally involve the formalization of equations that try to explain the relationships among various variables based on certain assumptions. Commonly used statistical forecasting models include the seasonal Autoregressive Integrated Moving Average (ARIMA) model [2]; its multivariate generalization, the seasonal Vector Autoregressive Moving Average (VARMA) model [24]; exponential smoothing models [9, 28]; as well as regression models.

Seasonal ARIMA. In reference to Eq. 1, the seasonal ARIMA model expresses f(X, B) as

$$\begin{aligned} f(X, B) = c + \sum _{i = 1}^{p} \phi _{i}y_{t-i} + \sum _{i = 1}^{q} \theta _{i}\epsilon _{t-i} + \sum _{i = 1}^{P} \varPhi _{i}y_{t-il} + \sum _{i = 1}^{Q} \varTheta _{i}\epsilon _{t-il} , \end{aligned}$$
(3)

where the set of inputs X includes the i-th lagged values of the response \(y_{t-i}\) and residual \(\epsilon _{t-i}\) as well as the seasonal lagged values \(y_{t-il}\) and \(\epsilon _{t-il}\) of seasonal period l; the set of parameters B contains an intercept c, p autoregressive parameters \(\phi \)’s, q moving average parameters \(\theta \)’s, and their seasonal counterparts, P \(\varPhi \)’s and Q \(\varTheta \)’s. Differencing of order d or seasonal differencing of order D may also be applied to the time series to stabilize the mean before fitting the parameters. In terms of notation, it is common to express a seasonal ARIMA model as SARIMA \((p,d,q) \times (P,D,Q)_l\).
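As an illustrative sketch, a model of this form may be fit with the statsmodels library (the experiments in this paper use ScalaTion instead; the series and the seasonal period below are placeholders, with the period kept deliberately small for tractability):

```python
# Minimal sketch of fitting SARIMA (1,0,1) x (0,1,1)_l on hypothetical data.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

y = np.random.rand(1000)        # placeholder for one sensor's flow series
l = 48                          # deliberately small seasonal period for this demo
                                # (the experiments in Sect. 4 use l = 1440)

model = SARIMAX(y, order=(1, 0, 1), seasonal_order=(0, 1, 1, l))
fit = model.fit(disp=False)     # maximum likelihood estimation
print(fit.forecast(steps=12))   # 12-step-ahead forecasts
```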

Seasonal VARMA. The seasonal Vector Autoregressive Moving Average model is the multivariate generalization of the seasonal ARIMA model. Instead of relying only on the lagged values of one time series to make forecasts, the seasonal VARMA model incorporates lagged values from m time series to help make forecasts for each time series. The seasonal VARMA model expresses g(X, B) in Eq. 2 as

$$\begin{aligned} g(X, B) = \mathbf c + \sum _{i = 1}^{p} A_{i}\mathbf y _{t-i} + \sum _{i = 1}^{q} M_{i}\varvec{\epsilon }_{t-i} + \sum _{i = 1}^{P} U_{i}\mathbf y _{t-il} + \sum _{i = 1}^{Q} O_{i}\varvec{\epsilon }_{t-il} , \end{aligned}$$
(4)

where \(\mathbf c \) is a vector of dimension m representing the intercepts for the m time series in the model; the parameter matrices A’s, M’s, U’s, and O’s are all of dimension \(m \times m\) and are the multivariate generalizations of the \(\phi \)’s, \(\theta \)’s, \(\varPhi \)’s, and \(\varTheta \)’s in Eq. 3, respectively. Similarly, \(\mathbf y \) and \(\varvec{\epsilon }\) are both vectors of size m, representing the values and residuals of all m time series.
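For reference, a plain (non-seasonal) VARMA model may be sketched with statsmodels' VARMAX, a hedged stand-in for the ScalaTion implementation used here; statsmodels does not expose seasonal VARMA orders directly, so the seasonal terms of Eq. 4 would have to be supplied as additional lagged regressors:

```python
# Sketch of a VARMA(1,1) fit on m = 3 hypothetical sensor series.
import numpy as np
from statsmodels.tsa.statespace.varmax import VARMAX

Y = np.random.rand(500, 3)       # placeholder: columns are 3 nearby sensors
model = VARMAX(Y, order=(1, 1))  # p = 1 autoregressive, q = 1 moving average
fit = model.fit(disp=False)
print(fit.forecast(steps=12))    # joint 12-step forecasts for all 3 series
```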

Exponential Smoothing. Another univariate forecasting model is exponential smoothing, for which f(X, B) in Eq. 1, when given data up to time \(t-1\), may be calculated as

$$\begin{aligned} f(X, B) = s_{t-1} + d_{t-1} + a_{t-l} , \end{aligned}$$
(5)

where the smoothed value s, the trend factor d, and the additive seasonal factor a may be recursively computed as

$$\begin{aligned} \begin{gathered} s_{t_i} = \alpha (y_{t_i} - a_{t_i-l}) + (1-\alpha )(s_{t_i-1} + d_{t_i-1}) ,\\ d_{t_i} = \beta (s_{t_i} - s_{t_i-1}) + (1-\beta )d_{t_i-1} ,\\ a_{t_i} = \gamma (y_{t_i} - s_{t_i}) + (1-\gamma )a_{t_i-l} , \end{gathered} \end{aligned}$$
(6)

where \(t_i \in [1, t)\); the set of inputs X includes all values of the time series up to time \(t-1\); and the set of parameters B contains \(\alpha \), \(\beta \), and \(\gamma \), known as smoothing parameters, each bounded between 0 and 1.
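The recursions in Eq. 6 translate directly into code; the sketch below assumes additive Holt-Winters smoothing with a naive initialization (our own choice, not specified above):

```python
import numpy as np

def holt_winters_forecast(y, l, alpha, beta, gamma):
    """Run the recursions of Eq. 6 over y and return the Eq. 5 forecast."""
    n = len(y)
    s, d, a = np.zeros(n), np.zeros(n), np.zeros(n)
    s[:l] = y[:l]                      # naive initialization of the level
    for t in range(l, n):
        s[t] = alpha * (y[t] - a[t - l]) + (1 - alpha) * (s[t - 1] + d[t - 1])
        d[t] = beta * (s[t] - s[t - 1]) + (1 - beta) * d[t - 1]
        a[t] = gamma * (y[t] - s[t]) + (1 - gamma) * a[t - l]
    return s[-1] + d[-1] + a[n - l]    # forecast for time n given data to n - 1
```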

Regression. Regression models such as linear regression and polynomial regression are usually very efficient and effective. The linear regression model may express f(X, B) in Eq. 1 as

$$\begin{aligned} f(X, B) = \mathbf b \cdot \mathbf x _t , \end{aligned}$$
(7)

where the set of inputs X includes the input vector \(\mathbf x _t\), and the set of parameters B includes the parameter vector \(\mathbf b \). Polynomial regression includes additional input features such as powers of the original input features and products of two original input features, known as interaction terms.
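As a concrete example, quadratic regression with interaction terms (response surface regression) may be sketched with scikit-learn's feature expansion; the paper's implementation is in ScalaTion, and the data below are placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = np.random.rand(100, 4)       # placeholder input features
y = np.random.rand(100)          # placeholder response

X_quad = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
model = LinearRegression().fit(X_quad, y)   # adds squares and interaction terms
print(model.predict(X_quad[:1]))
```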

2.2 Machine Learning Models

Machine learning models place great emphasis on learning directly from the data. In general, no predefined forms of equations or other assumptions are needed for machine learning models. They are more flexible in the sense that they are not constrained to any predefined forms and are free to extract any knowledge and relationships among variables from the data.

To train a machine learning model, an \(n \times k_i\) input training matrix, where n is the number of instances and \(k_i\) is the number of input features, and an \(n \times k_o\) output/response matrix, where \(k_o\) is the number of outputs/responses, are usually required. Because of this design, it is very straightforward to train a machine learning model for multivariate, multi-step time series forecasting: simply append more columns to the input matrix and output matrix as needed. The training process usually considers one output feature at a time.
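A minimal sketch of building such matrices from a single series, assuming a simple sliding window (the actual feature set used in the experiments is richer; see Sect. 4.3):

```python
import numpy as np

def make_training_matrices(y, n_lags=12, horizon=12):
    """Build an n x k_i input matrix of lagged values and an
    n x k_o output matrix of the next `horizon` values."""
    X, Y = [], []
    for t in range(n_lags, len(y) - horizon + 1):
        X.append(y[t - n_lags:t])      # the 12 most recent observations
        Y.append(y[t:t + horizon])     # the next 12 steps to forecast
    return np.array(X), np.array(Y)

y = np.random.rand(1000)               # placeholder flow series
X, Y = make_training_matrices(y)
print(X.shape, Y.shape)                # (977, 12) (977, 12)
```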

Support Vector Regression. Support Vector Regression (SVR) [6, 25], in its simplest linear form, expresses f(X, B) in Eq. 1 as follows,

$$\begin{aligned} f(X, B) = \langle \mathbf x _t, \mathbf b \rangle + c , \end{aligned}$$
(8)

where \(\mathbf x _t\), during training, is the instance associated with the training output \(y_t\). Note that \(\mathbf x _t\) often contains information up until time t, but not at time t. The pair of angle brackets \(\langle \rangle \) denotes the inner product operator. The set of inputs X includes \(\mathbf x _t\); the set of parameters B includes a parameter vector \(\mathbf b \) and an intercept c. Optimization is performed to minimize

$$\begin{aligned} \frac{1}{2} ||\mathbf b ||^2 , \end{aligned}$$
(9)

subject to the constraint of

$$\begin{aligned} | y_t - (\langle \mathbf x _t, \mathbf b \rangle + c) | \le \epsilon , \end{aligned}$$
(10)

that is, the forecasted value must be within a threshold \(\epsilon \) of the observed value for all training instances. Oftentimes, a non-linear kernel function may be used to transform the training instances into a higher dimensional space in order to fit a curve rather than a line. The parameter vector \(\mathbf b \) may also be expressed as a linear combination of selected training instances, known as support vectors [25].
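A hedged sketch with scikit-learn's NuSVR (also LIBSVM-backed) in place of the ScalaTion implementation used in the experiments, with the hyperparameters reported in Sect. 4.3 and placeholder data:

```python
import numpy as np
from sklearn.svm import NuSVR

X = np.random.rand(200, 12)      # placeholder lagged-flow features
y = np.random.rand(200)          # placeholder 1-step-ahead targets

svr = NuSVR(nu=0.05, C=1.0)      # nu bounds the fraction of support vectors
svr.fit(X, y)
print(svr.predict(X[:3]))
```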

Neural Networks. Neural Networks (NN) have garnered much attention in recent years, primarily due to advancements in deep learning research. The standard 3-layer Neural Network, when containing more than one output neuron, may express g(X, B) in Eq. 2 as follows,

$$\begin{aligned} g(X, B) = f_1(B_1^T f_0(B_0 ^ T \mathbf x _t + \mathbf c _0) + \mathbf c _1) , \end{aligned}$$
(11)

where, letting \(k_h\) be the number of hidden nodes, the set of parameters B includes the \(k_i \times k_h\) parameter/weight matrix \(B_0\), the \(k_h\)-dimensional bias/intercept vector \(\mathbf c _0\), the \(k_h \times k_o\) parameter/weight matrix \(B_1\), and the \(k_o\)-dimensional bias vector \(\mathbf c _1\). The two activation functions, \(f_0\) and \(f_1\), produce the signals from the input layer to the hidden layer, and from the hidden layer to the output layer, respectively. Additional hidden layers may be added to a Neural Network, and its forecasted/predicted values may be produced in a similar layer-by-layer manner. Since information can only be passed in a forward manner, and every pair of adjacent layers is fully connected by edges, such Neural Networks are also more precisely called feedforward fully connected Neural Networks.
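Since the Neural Network models in this work are built with Keras, a minimal sketch consistent with the structure described in Sect. 4.3 (two leaky-ReLU hidden layers, each half the size of the previous layer, one linear output neuron per horizon) may look as follows; the feature count and data are placeholders:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

k_i = 37                                   # assumed: 12 + 12 + 12 lags + time of day
model = keras.Sequential([
    keras.Input(shape=(k_i,)),
    layers.Dense(18), layers.LeakyReLU(alpha=0.3),
    layers.Dense(9),  layers.LeakyReLU(alpha=0.3),
    layers.Dense(1, activation="linear"),  # one output neuron per horizon
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, k_i)               # placeholder training inputs
y = np.random.rand(256, 1)                 # placeholder training outputs
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```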

Long Short-Term Memory Neural Networks. The Long Short-Term Memory (LSTM) Neural Network [8] is a type of recurrent Neural Network designed to work with temporal data. The core of an LSTM NN is the LSTM unit, which may also be viewed as a special layer. The input to an LSTM unit/layer must add a temporal dimension to the standard instances \(\times \) features input matrix used in other machine learning models. In other words, a training input instance to an LSTM unit/layer contains the temporal evolution of the values of the features.

An LSTM unit/layer contains a cell state that maintains valuable information throughout time. Three gates exist within an LSTM unit/layer that affect the information stored in the cell state: (1) the input gate determines what new information needs to be added to the cell state; (2) the forget gate determines what old information in the cell state is no longer relevant; (3) the output gate determines what output signals to produce based on the contents of the cell state.

Once the final output from an LSTM unit/layer has been obtained at the last time step, the output may be fed into another Neural Network layer, such as the fully connected layer described in the previous section, and the final forecasted output may be obtained in a layer-by-layer manner similar to Eq. 11.
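A corresponding Keras sketch, where the first hidden layer of the network above is replaced by an LSTM layer and the inputs gain a temporal dimension (4 weekly groups, per Sect. 4.3; sizes and data remain placeholders):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

steps, k_i = 4, 37                         # 4 weekly groups x assumed 37 features
model = keras.Sequential([
    keras.Input(shape=(steps, k_i)),
    layers.LSTM(18),                       # emits the output at the last time step
    layers.Dense(9), layers.LeakyReLU(alpha=0.3),
    layers.Dense(1, activation="linear"),
])
model.compile(optimizer="adam", loss="mse")

X = np.random.rand(256, steps, k_i)        # placeholder 3-D training inputs
y = np.random.rand(256, 1)
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
```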

3 Related Work

Univariate time series forecasting of traffic flow is the most common in the literature. In [26], a Neural Network was used as a meta-learner trained on the outputs of ARIMA, MA, and exponential smoothing models. Its performance was better than ARIMA and a small NN. In [10], the authors’ proposed SVR model outperformed SARIMA, exponential smoothing, and a small Neural Network. A work in [15] compared SARIMA, SVR, and NN using 15-min resolution data collected over 9 months by 16 sensors from PeMS. The SARIMA model performed the best, but the authors’ proposed SVR model ran much faster without losing much accuracy. The NN structure was relatively small in size and did not perform well. In [17], both speed and flow data were used to forecast speed, and the performance of the LSTM NN was superior to ARIMA, SVR, and other Neural Networks. In another work [12], an LSTM NN outperformed ARIMA, NN, and Deep Belief Networks. Rainfall data were also included to improve forecasting accuracy.

Multivariate time series forecasting generally relies on using spatially dependent sensor data to improve performance. In [3], a VAR model took advantage of the spatial dependencies among sensors on the same freeway and outperformed univariate ARIMA and SARIMA models. In [16], the authors’ proposed deep Neural Network built with stacked autoencoders took 5-min resolution traffic flow data from the first two months of 2013 as training data. The deep Neural Network was responsible for learning any temporal and spatial dependencies in the data, and its performance was better than SVR and other types of Neural Networks. In another work [29], the authors’ proposed LSTM structure took advantage of both spatial and temporal dependencies and outperformed ARIMA, SVR, and other types of Neural Networks. The data were in 5-min resolution, collected during the first half of 2015 from 500 sensors on the 5th Ring (city bypass) in Beijing.

In comparison with our recent previous work in [21], this work improves upon the univariate aspect by including additional univariate forecasting models: a SARIMA model based on BIC, which generally works better than the SARIMA model in [21] for the southern CA datasets, various regression models, the SVR model, and the increasingly popular LSTM NN. In preliminary testing, we also improved upon the NN model with tanh activation functions found in [21] by using the leaky ReLU [18] activation function, which results in higher accuracy. The SARIMA and NN models were the top two performers in [21], and improvements are made upon both in this work. In addition, this work includes a new multivariate experiment to study the effects of spatial dependencies on traffic forecasting. Among the statistical models, two seasonal VARMA models and various regression models are included. The machine learning models from the univariate experiment are also used for multivariate traffic flow forecasting and are discussed in more detail in the Evaluations section.

4 Evaluations

This section provides details on the datasets, experimental setup, and evaluation results.

4.1 Dataset Description and Preprocessing

The traffic data are obtained from the Caltrans Performance Measurement System. This study focuses on southern California (San Diego and surrounding areas), or District 11 as classified by the California Department of Transportation. All data are from major highways, or Mainline (ML) according to the PeMS classification. The resolution of the data is 5 min. A total of 373 sensors are chosen for this study. The size of all the data is approximately 1.5 GB. Table 1 contains a summary of the number of sensors selected from each highway. All sensors chosen for this study must contain data for the entire year of 2018.

Table 1. Number of sensors from each highway

The quality control system of PeMS is very robust. If missing data arise due to sensor failure, PeMS automatically imputes the data and provides the imputed data to the users. The users are also given information on the percentage of observed (non-imputed) data across all lanes at any sensor location. On extremely rare occasions, the data provided by PeMS may still contain missing values for certain timestamps. In such cases, the missing data are imputed through linear interpolation. All imputed data are included to train models but excluded from performance evaluations.

4.2 Evaluation Metrics

Three normalized evaluation metrics are considered: Mean Absolute Percentage Error (MAPE), Normalized Root Mean Squared Error (NRMSE), and the coefficient of determination \(R^2\):

$$\begin{aligned} MAPE = \frac{1}{T}\sum _{t = 1}^{T}\Big |\frac{y_t - \hat{y}_t}{y_t}\Big | , \end{aligned}$$
(12)
$$\begin{aligned} NRMSE = \frac{T}{\sum _{t = 1}^{T} y_t} \sqrt{\frac{\sum _{t = 1}^{T} (y_t - \hat{y}_t)^2}{T}} , \end{aligned}$$
(13)
$$\begin{aligned} R^2 = 1 - \frac{\sum _{t = 1}^{T} (y_t - \hat{y}_t)^2}{\sum _{t = 1}^{T} y_t^2 - \frac{1}{T}(\sum _{t = 1}^{T} y_t)^2 } , \end{aligned}$$
(14)

where T is the total number of instances in the evaluation set. Two experiments are conducted, a univariate forecasting experiment and a multivariate one. All tests are performed on the Sapelo2 cluster from the Georgia Advanced Computing Resource Center.
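For reference, the three metrics translate directly into code (a numpy sketch of Eqs. 12–14):

```python
import numpy as np

def mape(y, yhat):    # Eq. 12
    return np.mean(np.abs((y - yhat) / y))

def nrmse(y, yhat):   # Eq. 13: RMSE divided by the mean of y
    return np.sqrt(np.mean((y - yhat) ** 2)) / np.mean(y)

def r2(y, yhat):      # Eq. 14
    sse = np.sum((y - yhat) ** 2)
    sst = np.sum(y ** 2) - np.sum(y) ** 2 / len(y)
    return 1 - sse / sst
```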

4.3 Problem Analysis and Modeling

Data from the first 8 months of 2018 are used to train the models and the last 4 months are used to evaluate their performance. Only data from work days are considered, as weekend data usually exhibit significantly different patterns. Such practice is common in the literature, as can be seen in [15, 16]. Furthermore, the evaluation is focused on daytime traffic from 7:00AM to 7:00PM, since traffic is most congested and dynamic during the daytime. Forecasts are produced for 12 steps ahead, or up to 1 h ahead since the data are in 5-min resolution. A baseline, weekly historical averages computed from the previous 4 weeks, is included to compare against the other forecasting models.

Univariate Experiment. In the univariate experiment, a model is trained only with historical data from one particular sensor. The SARIMA \((1,0,1) \times (0,1,1)_{1440}\), denoted as SARIMA in Fig. 1, is commonly used in the literature [15, 21, 23]. The seasonal period is one week (1440 = 5 work days per week \(\times \) 24 h per day \(\times \) 12 five-minute periods per hour). The SARIMA \((5,0,5) \times (2,1,1)_{1440}\), denoted as SARIMA2, is also used; its parameters are found using a grid-search-like algorithm proposed by [11] based on the BIC criterion [22] on a small subset of the data. Optimization of the exponential smoothing parameters is done by minimizing the one-step-ahead within-sample SSE.

The input features of the machine learning and regression models include the 12 most recent traffic flow observations, 12 observations from the previous seasonal period (i.e., when forecasting this coming Monday’s traffic from 8:00AM to 9:00AM, last Monday’s traffic flow data from the same time window are used), 12 historical averages computed from the previous 4 weeks (the baseline), and the time of the day. The training output matrix simply includes the traffic flow data for the next 12 steps (1 h). All data are normalized between 0 and 1.
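One possible way to assemble these features for a forecast origin t is sketched below; the index conventions and time-of-day encoding are our own assumptions (l = 1440 five-minute periods per work week, 288 per work day):

```python
import numpy as np

def features_at(y, t, l=1440, n=12):
    """Hypothetical feature vector for forecasting y[t : t + n] from origin t."""
    recent   = y[t - n:t]                            # 12 most recent flows
    seasonal = y[t - l:t - l + n]                    # same window last week
    hist_avg = np.mean([y[t - k * l:t - k * l + n]   # previous 4 weeks (baseline)
                        for k in range(1, 5)], axis=0)
    time_of_day = (t % 288) / 288.0                  # normalized 5-min slot of day
    return np.concatenate([recent, seasonal, hist_avg, [time_of_day]])
```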

Various regression models are considered, including linear regression (Reg), quadratic regression without interaction terms (QuadReg), quadratic regression with interaction terms, also known as response surface regression (RespSurf), and cubic regression without interaction terms (CubicReg). The \(\nu \)-SVR model, in which the parameter \(\nu \) controls the number of support vectors, is chosen for its efficiency. The ScalaTion implementation is based on the LIBSVM package [4]. The value of \(\nu \) is set to 0.05 and the cost is set to 1.0 through grid search. The NN model consists of a 4-layer structure. The two hidden layers use leaky ReLU activation functions [18]. The output layer uses the identity activation function. Since 12-step-ahead forecasts are desired, 12 NNs are trained, each with a single output neuron representing a particular forecasting horizon. The size of each layer is half the size of the previous layer. Through grid search, the number of training epochs is set to 300, the batch size to 32, and the alpha parameter in leaky ReLU to 0.3 (the latter two are also the default values in Keras). Optimization is done using the Adam algorithm [13]. The LSTM NN uses a setup very similar to the NN, except that the first hidden layer is replaced with the LSTM layer and the number of training epochs is only 100. In addition, the inputs to the LSTM NN require an additional temporal dimension; therefore, instances in the training input matrix are further grouped weekly over 4 weeks.

Multivariate Experiment. On a given highway, there can be many sensors. It is intuitive that data from traffic sensors in close proximity are spatially dependent. Data from an upstream sensor can provide information on upcoming congestion, while data from a downstream sensor reflect the rate of traffic flowing down the road. In this multivariate traffic flow forecasting experiment, the traffic sensors on a particular highway are first sorted by either longitude or latitude, depending on the direction of the highway; they are then divided into groups of 3, with a distance of about 5 miles between any neighboring sensors in the group. The central sensor in each group is the focus of the forecast, considering data from both the upstream sensor and the downstream sensor. In this experiment, both flow and speed from all 3 sensors are considered, and features are generated in a manner similar to the univariate experiment. Due to the large number of available features, a simple feature selection process is also conducted to extract the 72 most useful features, approximately one third of all the available features. The feature selection is based on a repeated forward selection process that picks the next best feature, i.e., the one that optimally improves the overall adjusted \(R^2\) when fitting a linear regression model; a sketch of this process is given below. The other setups of the multivariate models are kept the same as their univariate counterparts. No multivariate generalization of exponential smoothing is considered in this study; and preliminary results show that response surface regression performs poorly, possibly due to its large number of interaction terms, so it is also excluded from the multivariate experiment.
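A sketch of the forward selection loop, scored by the adjusted \(R^2\) of a linear regression fit as described above; the stopping rule and column handling are our own conventions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def adj_r2(X, y):
    """Adjusted R^2 of a linear regression fit on the columns of X."""
    n, k = X.shape
    r2 = LinearRegression().fit(X, y).score(X, y)
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def forward_select(X, y, n_features=72):
    """Repeatedly add the feature that best improves adjusted R^2."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features and remaining:
        best = max(remaining, key=lambda j: adj_r2(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```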

4.4 Forecasting Evaluations

The forecasting evaluation results are aggregated from all 373 sensors using a weighted average, since each sensor may have a slightly different number of observed (non-imputed) values in the evaluation set.

Fig. 1. Univariate models performance comparison

Fig. 2. Multivariate models performance comparison

In Fig. 1, the performance of the univariate forecasting models is compared. Since the baseline is independent of the forecasting horizon (the number of steps ahead to forecast), it is denoted as a flat line. In general, the machine learning models produce better forecasting accuracy, with the LSTM NN and NN leading in terms of performance. Response surface regression also performs well, followed by SVR and the other regression models. Exponential smoothing performs well in the initial steps, but its performance quickly degrades and becomes worse than the baseline around steps 7 to 9. The SARIMA \((5,0,5) \times (2,1,1)_{1440}\) generally performs better than the SARIMA \((1,0,1) \times (0,1,1)_{1440}\) model, except perhaps for forecasting the very first step, depending on the evaluation metric. It is also worth noting that comparing the time series models with the machine learning and regression models may not be completely fair, as the input features differ greatly; though this point may also be argued as a strength of the machine learning and regression models, which have the flexibility of incorporating various input features.

In Fig. 2, the multivariate forecasting models are evaluated. NN leads in overall performance, followed by LSTM NN, SVR, and the other regression models, though the gaps have narrowed compared with their univariate performances. The two VARMA models exhibit performance patterns similar to their SARIMA counterparts. Table 2 contains the detailed average improvements (across all 12 steps) from the univariate models to the multivariate models. The calculations are done by taking the differences between the univariate metrics and the metrics of their multivariate counterparts (e.g., SARIMA vs. VARMA), and then dividing by the univariate metrics. For the \(R^2\) metric, the sign is also flipped so that all positive values in Table 2 represent improvements of the multivariate models upon their univariate counterparts. Most models improve upon their univariate counterparts, though VARMA2 and LSTM NN suffer small losses, possibly due to overfitting.

Table 2. Average improvements of multivariate models on their univariate counterparts

It is also worth noting that the regression models perform reasonably well and are by far the most efficient models in terms of training time. The Neural Networks are generally the slowest, but exhibit great performance. In scenarios where training efficiency is highly valued, the regression models may be considered viable alternatives to Neural Networks.

5 Conclusion and Future Work

In this study, we focus on multi-step short-term forecasting of traffic flow using large amounts of sensor data from southern California. Improvements are made in comparison with our recent previous work [21] by including additional and improved univariate models as well as multivariate forecasting models that take advantage of spatial dependencies. In both the univariate and multivariate experiments, the two types of Neural Networks performed well, and the other machine learning and regression models also tend to perform better than the traditional time series models, which are simpler in terms of the number of parameters. The multivariate forecasting models, by taking advantage of spatial dependencies, generally perform better than their univariate counterparts.

As a direction for future work, we plan to include more spatially dependent sensors that cover a much longer segment of highway in order to test the performance improvements from relying on spatial dependencies. We are also considering the Seq2Seq LSTM NN, which should be more suitable for multi-step forecasting. Furthermore, it is also intuitive to include basic theories that involve speed and distance, or even simulation models, to help make more accurate forecasts. Many in the data science community in recent years have exclusively relied on deep Neural Networks, which often feel like black boxes, to learn any knowledge in the data. We believe that by combining well-established theories with the recent advancements in data science, we could more efficiently train better forecasting models.