Missing data imputation for traffic congestion data based on joint matrix factorization
Introduction
Traffic congestion has recently attracted widespread attention from researchers because it is closely tied to people’s daily life and urban development. Massive amounts of traffic data make research in the traffic domain possible, e.g., traffic flow prediction, travel time prediction and traffic congestion prediction, which can help alleviate congestion and manage traffic effectively [1], [2], [3], [4], [5], [6], [7], [8], [9]. For example, based on the predicted traffic conditions, authorities can optimize traffic signal timing and people can adjust their driving routes dynamically.
High-quality data is a prerequisite for accurate traffic prediction. However, due to sensor malfunction, transmission errors or other reasons, missing traffic data is pervasive and inevitable. For example, in Beijing, China, sensor readings typically have a missing rate of about 10%, and some sensors have a higher missing rate of 20% to 25% [10]. In this situation, it is extremely difficult to extract accurate and sufficient traffic information from the incomplete data, which is a notable obstacle for existing studies. For instance, Li et al. [11] show that missing values have a significant negative impact on model performance. Therefore, the motivation of this work is to propose an imputation algorithm that estimates missing data from the observed traffic data. Typically, three missing patterns have been investigated in existing studies [12], [13]: Missing Completely at Random (MCR), Missing at Random (MR) and Not Missing at Random (NMR). In this paper, we solve the imputation problem under the MCR assumption, that is, whether a value is missing is completely random and independent of both the observed and the unobserved variables.
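The MCR assumption can be made concrete with a small simulation. The sketch below uses NumPy; the function name `mcr_mask`, the toy matrix and the 20% missing rate are illustrative assumptions and are not taken from the paper. Each entry is hidden independently with the same probability, regardless of its value or position.

```python
import numpy as np

def mcr_mask(shape, missing_rate, seed=0):
    """Return a boolean mask: True = observed, False = missing.

    Every entry is dropped independently with probability `missing_rate`,
    so missingness depends on neither the observed nor the hidden values.
    """
    rng = np.random.default_rng(seed)
    return rng.random(shape) >= missing_rate

# Toy congestion matrix: 3 road segments x 6 time slots (made-up values).
levels = np.array([
    [1.0, 1.2, 1.8, 2.1, 1.6, 1.1],
    [1.1, 1.3, 1.9, 2.4, 1.7, 1.2],
    [1.0, 1.1, 1.4, 1.6, 1.3, 1.0],
])

mask = mcr_mask(levels.shape, missing_rate=0.2)
incomplete = np.where(mask, levels, np.nan)  # NaN marks a missing reading
```

An imputation method would then be evaluated by comparing its estimates at the hidden positions (`~mask`) with the held-out values in `levels`.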
Traffic conditions can be described by different indicators, e.g., traffic flow [14], traffic speed [15] and traffic congestion level [16]. In this paper, we take traffic congestion data as an example and address the problem of missing traffic data completion. The congestion level for a road segment during a given time slot is defined in terms of two quantities: the average travel time of vehicles on that segment in that slot, and the baseline travel time for the same road segment. The baseline travel time is chosen by the user; it could be the shortest travel time for that road segment or the average travel time under free-flow conditions (e.g., traffic data from 24:00 to 5:00).
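The two baseline choices mentioned above can be sketched as follows. The record layout (pairs of hour of day and travel time in seconds), the function names and the sample values are hypothetical and serve only to illustrate the two options.

```python
from statistics import mean

def baseline_shortest(travel_times):
    """Baseline = shortest travel time observed on the segment."""
    return min(travel_times)

def baseline_free_flow(records, start_hour=0, end_hour=5):
    """Baseline = mean travel time inside the free-flow window.

    `records` is a list of (hour_of_day, travel_time_seconds) pairs.
    The default window [0, 5) corresponds to 24:00 to 5:00.
    """
    free = [t for hour, t in records if start_hour <= hour < end_hour]
    if not free:
        raise ValueError("no records inside the free-flow window")
    return mean(free)

# Hypothetical travel-time records for one road segment.
records = [(1, 60.0), (3, 64.0), (8, 150.0), (18, 170.0), (22, 80.0)]

shortest = baseline_shortest([t for _, t in records])
free_flow = baseline_free_flow(records)
```

The shortest-time baseline is sensitive to a single unusually fast trip, whereas the free-flow average smooths over several night-time observations.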
Imputing the missing values of traffic congestion level, however, is very challenging, due to a series of complex factors. To demonstrate this, we carefully analyze the congestion time series with an urban traffic dataset collected in Jinan, China. Fig. 1(a) reports the traffic congestion levels of a sampled road segment from 19:00 to 20:30 on five consecutive workdays (from February 29, 2016 to March 4, 2016) and Fig. 1(b) shows the traffic congestion levels of two sampled segments from 6:00 to 21:00 on Wednesday (March 2, 2016). From these figures, we observe three features of traffic congestion patterns.
- Periodicity. The traffic congestion levels show periodicity in the consecutive working days, that is, the congestion levels during the same time slot in the consecutive working days are similar, while the levels in different time periods are disparate. For example, as shown in Fig. 1(a), the variation trend of traffic congestion levels is similar in the different working days.
- Temporal coherence. The traffic congestion level in a time slot has a strong correlation with those in the neighboring time slots, and the correlation diminishes as the temporal distance increases. For example, as depicted in Fig. 1(b), the traffic condition of 18:00 may be affected by the congestion occurring at 17:30, but can be considered free from the influence of the traffic at 8:00 of the same day.
- Road similarity. The variation trends of the traffic congestion levels of some road segments are similar, as the structure and the connectivity of these segments in the urban transportation system are the same. For example, as shown in Fig. 1(b), there is high similarity in the fluctuation of the two road segments’ congestion levels, e.g., the congestion levels of both segments show a clear upward trend from 7:00 to 8:00, and the tide begins to ebb at 8:00.
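One simple way to quantify the three features is with Pearson correlations. The sketch below uses a synthetic, noise-free daily profile rather than the Jinan dataset; the profile shape, the scaling factors and the helper names `pearson` and `lag_corr` are assumptions made for illustration.

```python
import numpy as np

T = 24                                  # time slots per day (illustrative)
t = np.arange(T)
profile = 1.0 + 0.5 * np.sin(2 * np.pi * t / T)   # made-up daily shape

# Two workdays of one segment, and a second segment with a similar trend.
day1 = profile
day2 = 1.05 * profile
other_segment = 0.8 * profile + 0.3

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D series."""
    return float(np.corrcoef(a, b)[0, 1])

def lag_corr(series, lag):
    """Correlation between a series and itself shifted by `lag` slots."""
    return pearson(series[:-lag], series[lag:])

periodicity = pearson(day1, day2)               # same slots, different days
road_similarity = pearson(day1, other_segment)  # same day, different segments
near = lag_corr(day1, 1)                        # adjacent time slots
far = lag_corr(day1, 6)                         # slots six steps apart
```

On real data these correlations would be computed per segment and per pair of days; a high `periodicity` and `road_similarity`, together with `near` exceeding `far`, would match the three observations above.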
To model the above features, we propose a novel missing data Imputation Model for traffic Congestion level data, CIM for short, which predicts the missing values from the observed congestion levels of all road segments via joint matrix factorization. We first convert the time-series traffic congestion data into an order-3 tensor, each element of which denotes the congestion level of one road segment in one time slot of one day. We then model the tensor with joint matrix factorization, because we want to focus on periodicity and road similarity. Joint matrix factorization lets us target these two aspects specifically by decomposing two matrices (namely, the day-time slot and the road-time slot matrices), instead of giving equal consideration to all interactions among the three factors of concern (day, time slot, and road segment). Specifically, to model periodicity, we extract the day-time slot matrix from the tensor and project both the days and the time slots into a common latent factor space of fixed dimensionality, such that day-time slot interactions are modeled as inner products in that space. Likewise, to capture the similarity between road segments, the road-time slot matrix is factorized into the product of two latent factor matrices. Further, we include a weight that balances the contributions of periodicity and road similarity in predicting the missing values. Finally, to ensure temporal coherence, we minimize the difference between the estimated congestion level in a time slot and those in the surrounding time slots. For the imputation task, CIM can complete the missing congestion levels efficiently and effectively even when a relatively large amount of data is missing.
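The idea can be sketched in a few lines of NumPy. This is a rough illustration under our own assumptions and not the CIM objective or its update rules: the tensor layout (days, time slots, roads), the per-road and per-day factor matrices, the plain gradient-descent optimizer, the synthetic data and all hyperparameter values (`k`, `w`, `lam`, `mu`, `lr`) are hypothetical choices.

```python
import numpy as np

def joint_mf_impute(X, mask, k=3, w=0.5, lam=0.1, mu=0.1,
                    lr=0.01, iters=500, seed=0):
    """Impute a (days, slots, roads) tensor; returns (estimate, loss history)."""
    D, T, R = X.shape
    rng = np.random.default_rng(seed)
    # Day-time slot factors, one pair per road (periodicity).
    P = rng.normal(scale=0.1, size=(R, D, k))
    Q = rng.normal(scale=0.1, size=(R, T, k))
    # Road-time slot factors, one pair per day (road similarity).
    S = rng.normal(scale=0.1, size=(D, R, k))
    G = rng.normal(scale=0.1, size=(D, T, k))
    Xz = np.where(mask, X, 0.0)   # missing entries never enter the loss

    def estimate():
        # Weighted sum of the two factorizations, shape (D, T, R).
        return (w * np.einsum('rdk,rtk->dtr', P, Q)
                + (1 - w) * np.einsum('drk,dtk->dtr', S, G))

    history = []
    for _ in range(iters):
        est = estimate()
        resid = mask * (est - Xz)                 # error on observed entries
        step = est[:, 1:, :] - est[:, :-1, :]     # neighbouring-slot differences
        reg = sum((M ** 2).sum() for M in (P, Q, S, G))
        history.append(float(0.5 * (resid ** 2).sum()
                             + 0.5 * mu * (step ** 2).sum()
                             + 0.5 * lam * reg))
        # Gradient of the loss with respect to the estimate.
        g = resid.copy()
        g[:, 1:, :] += mu * step
        g[:, :-1, :] -= mu * step
        # Chain rule through the two inner products, plus L2 regularization.
        dP = w * np.einsum('dtr,rtk->rdk', g, Q) + lam * P
        dQ = w * np.einsum('dtr,rdk->rtk', g, P) + lam * Q
        dS = (1 - w) * np.einsum('dtr,dtk->drk', g, G) + lam * S
        dG = (1 - w) * np.einsum('dtr,drk->dtk', g, S) + lam * G
        P -= lr * dP
        Q -= lr * dQ
        S -= lr * dS
        G -= lr * dG
    return estimate(), history

# Synthetic tensor: 5 days x 12 slots x 4 roads with a shared daily shape.
D, T, R = 5, 12, 4
_, t, r = np.meshgrid(np.arange(D), np.arange(T), np.arange(R), indexing='ij')
X = 1.5 + 0.5 * np.sin(2 * np.pi * t / T) + 0.1 * r
mask = np.random.default_rng(1).random(X.shape) >= 0.3   # about 30% hidden
estimate, history = joint_mf_impute(X, mask)
rmse = float(np.sqrt(np.mean((estimate[~mask] - X[~mask]) ** 2)))
```

Here `w` plays the role of the balancing weight and `mu` scales the temporal-coherence penalty; setting `w=1` or `w=0` reduces the sketch to a single factorization, which makes it easy to check how much each view contributes on a given dataset.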
We conduct extensive experiments on a real traffic dataset and compare the performance of CIM with baselines including naive statistical averaging approaches (e.g., NNI [17]), matrix factorization methods (e.g., MF [18]), tensor decomposition methods (e.g., LRTI [19], TDI [20], and BGCP [21]), and deep learning models (e.g., MLP and PCNN [16]). Experimental results show that CIM outperforms the baselines.
The main contributions of this paper are as follows.
- Different from existing methods that only consider part of traffic congestion features (including periodicity, road similarity and temporal coherence), we propose a model named CIM based on joint matrix factorization which models all the features to impute the missing traffic congestion values.
- We factorize the day-time slot matrix and the road-time slot matrix via introducing latent factors to capture the periodicity and the similarity between road segments, and incorporate the local constraints into the process of matrix factorization to model the temporal coherence.
The rest of this paper is organized as follows. Section 2 reviews the studies on missing traffic data imputation. Section 3 introduces the definition of congestion level and the problem solved in this paper. Section 4 describes the proposed model in detail. Section 5 reports a number of experiments on a real traffic dataset. Section 6 concludes this paper.
Related work
Recently, the imputation of traffic data has attracted widespread attention due to the ubiquity of missing traffic data and the importance of complete traffic data. A variety of methods have been proposed to solve this problem, which can be roughly divided into three categories: averaging historical traffic data, modeling traffic patterns of similar road segments and utilizing deep neural networks.
Preliminaries
We first introduce the definition of congestion level and then formally define the problem addressed in this paper. Table 1 lists the notations used in this paper and their descriptions. We use bold uppercase letters to represent tensors and regular uppercase letters to represent scalar constants. Individual entries of a tensor are identified by their indices.
Definition 1 (Congestion Level). The congestion level of a road segment in a given time slot of a given day is formally defined as
Methodology
In this section, we first give an overview of the proposed CIM method and then describe each component in detail. Finally, we introduce parameter estimation with the Alternating Least Squares (ALS) matrix factorization method.
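For readers unfamiliar with ALS, the following sketch shows the generic technique on a single matrix with missing entries. It is not the CIM update rule: the function name `als_factorize`, the regularization weight `lam`, the rank `k` and the synthetic data are all illustrative assumptions. ALS fixes one factor matrix, solves a small ridge regression for each row of the other in closed form, and then alternates.

```python
import numpy as np

def als_factorize(X, mask, k=2, lam=0.1, sweeps=20, seed=0):
    """Masked ALS: X ~ U @ V.T on observed entries.

    Returns (U, V, history), where history tracks the regularized objective.
    """
    n, m = X.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n, k))
    V = rng.normal(scale=0.1, size=(m, k))
    Xz = np.where(mask, X, 0.0)          # placeholder values at missing entries
    ridge = lam * np.eye(k)

    def objective():
        resid = mask * (U @ V.T - Xz)
        return float((resid ** 2).sum()
                     + lam * ((U ** 2).sum() + (V ** 2).sum()))

    history = [objective()]
    for _ in range(sweeps):
        # Fix V and solve a ridge regression for every row of U.
        for i in range(n):
            Vo = V[mask[i]]                          # factors of observed columns
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ Xz[i, mask[i]])
        # Fix U and solve a ridge regression for every row of V.
        for j in range(m):
            Uo = U[mask[:, j]]                       # factors of observed rows
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ Xz[mask[:, j], j])
        history.append(objective())
    return U, V, history

rng = np.random.default_rng(1)
truth = rng.random((6, 2)) @ rng.random((2, 10))     # synthetic rank-2 matrix
mask = rng.random(truth.shape) >= 0.2                # about 20% hidden
U, V, history = als_factorize(truth, mask)
completed = np.where(mask, truth, U @ V.T)           # keep observed, fill missing
```

Because every sub-problem is solved exactly, the regularized objective cannot increase from one sweep to the next, which is a convenient sanity check when adapting the scheme to a more elaborate objective.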
Experiments
To investigate the effectiveness of our proposed model, we carry out a series of experiments on a real traffic dataset. In this section, we first introduce our dataset and basic settings, and then compare the proposed model CIM with several baselines to demonstrate its performance. Finally, we evaluate CIM with different parameters.
Conclusion
In this paper, we have proposed a novel missing data imputation model (named CIM) for traffic congestion level data, which can generate an appropriate estimate of the missing values based on the observed traffic congestion data of all the road segments. In view of the characteristics of the traffic congestion data, CIM first organizes the time series data in the form of an order-3 tensor including the modes of days, time slots and road segments. Then, to model the temporal periodicity and the
CRediT authorship contribution statement
Xiaoyi Jia: Conceptualization, Methodology, Software, Writing - original draft. Xiaoyu Dong: Data curation. Meng Chen: Methodology, Supervision, Writing - review & editing. Xiaohui Yu: Writing - review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61906107, the Natural Science Foundation of Shandong Province of China under Grant No. ZR2019BF010, the Young Scholars Program of Shandong University, China, and the Open Fund of Key Laboratory of Urban Natural Resources Monitoring and Simulation, Ministry of Natural Resources, China.
References
- et al., Effective and unburdensome forecast of highway traffic flow with adaptive computing, Knowl.-Based Syst. (2021)
- et al., Estimation of missing values in heterogeneous traffic data: Application of multimodal deep learning model, Knowl.-Based Syst. (2020)
- et al., Learning traffic as a graph: A gated graph wavelet recurrent neural network for network-scale traffic prediction, Transp. Res. C (2020)
- et al., Algorithms and applications for approximate nonnegative matrix factorization, Comput. Statist. Data Anal. (2007)
- et al., A tensor-based method for missing traffic data completion, Transp. Res. C (2013)
- et al., A Bayesian tensor decomposition approach for spatiotemporal traffic data imputation, Transp. Res. C (2019)
- et al., Ensemble correlation-based low-rank matrix completion with applications to traffic data imputation, Knowl.-Based Syst. (2017)
- et al., K nearest neighbours with mutual information for simultaneous classification and missing data imputation, Neurocomputing (2009)
- et al., Similarity-learning information-fusion schemes for missing data imputation, Knowl.-Based Syst. (2020)
- et al., Mining moving patterns for predicting next location, Inf. Syst. (2015)
- Citywide traffic speed prediction: A geometric deep learning approach, Knowl.-Based Syst.
- A deep-learning model for urban traffic flow prediction with traffic events mined from twitter, World Wide Web
- Fine-grained traffic flow prediction of various vehicle types via fusion of multisource data and deep learning approaches, IEEE Trans. Intell. Transp. Syst.
- PPCA-based missing data imputation for traffic flow volume: A systematical approach, IEEE Trans. Intell. Transp. Syst.
- Short-term highway traffic flow prediction based on a hybrid strategy considering temporal–spatial information, J. Adv. Transp.