Elsevier

Knowledge-Based Systems

Volume 225, 5 August 2021, 107114

Missing data imputation for traffic congestion data based on joint matrix factorization

https://doi.org/10.1016/j.knosys.2021.107114

Highlights

  • Propose a matrix factorization-based model to impute the missing traffic values.

  • Consider periodicity, road similarity and temporal coherence in the imputation model.

  • Outperform the baselines in the task of traffic congestion value imputation.

Abstract

In practice, some traffic data is inevitably missing due to unexpected errors, which not only affects traffic management but also hinders the development of traffic data research. In this paper, we propose a novel Imputation Model for traffic Congestion data, CIM for short, based on joint matrix factorization. CIM jointly models the characteristics of traffic congestion patterns, including periodicity, road similarity and temporal coherence, to estimate the missing congestion values. In particular, we first construct an order-3 tensor based on the traffic congestion data. Then, we model the periodicity and road similarity via joint matrix factorization by exploiting the spatial and temporal information. Finally, we incorporate local constraints into the process of matrix factorization to ensure temporal coherence. Experimental results on a real traffic dataset indicate that modeling the three features of congestion patterns simultaneously is effective and that CIM outperforms the baselines for the task of missing traffic data imputation.

Introduction

Recently, traffic congestion has attracted widespread attention from researchers, as it is closely tied to people’s daily life and urban development. Massive amounts of traffic data make research in the traffic domain possible, e.g., traffic flow prediction, travel time prediction and traffic congestion prediction, which can help alleviate traffic congestion and manage traffic effectively [1], [2], [3], [4], [5], [6], [7], [8], [9]. For example, according to the predicted traffic conditions, authorities can optimize traffic signal timing and people can adjust their driving routes dynamically.

Accurate traffic prediction depends on high-quality data. However, due to sensor malfunction, transmission errors or other reasons, missing traffic data is pervasive and inevitable. For example, a sensor reading typically has a missing rate of about 10%, and some have a higher missing rate of 20% to 25% in Beijing, China [10]. In this situation, it is extremely difficult to extract accurate and sufficient traffic information from the incomplete traffic data, which becomes a notable obstacle for existing studies. For instance, Li et al. [11] show that missing values have a significant negative impact on the model’s performance. Therefore, the motivation of this work is to propose an imputation algorithm that estimates missing data from the observed traffic data. Typically, three missing patterns, namely Missing Completely at Random (MCR), Missing at Random (MR) and Not Missing at Random (NMR), have been investigated in existing studies [12], [13]. In this paper, we solve the imputation problem under the MCR assumption, that is, the data is missing completely at random, independently of both the observed and the unobserved variables.
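
As a minimal sketch of what the MCR assumption means in practice, the snippet below hides entries of a synthetic congestion array independently of their values. The function name `mcr_mask`, the array sizes and the 20% missing rate are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mcr_mask(shape, missing_rate, seed=0):
    """Return a boolean mask (True = observed, False = missing).

    Each entry is dropped independently with probability `missing_rate`,
    regardless of its value, which is what the MCR assumption requires.
    """
    rng = np.random.default_rng(seed)
    return rng.random(shape) >= missing_rate

# Synthetic congestion levels: 50 roads x 96 time slots x 5 days (hypothetical sizes).
data_rng = np.random.default_rng(1)
congestion = data_rng.random((50, 96, 5))
observed = mcr_mask(congestion.shape, missing_rate=0.2)

# Missing entries are marked with NaN in the incomplete data.
incomplete = np.where(observed, congestion, np.nan)
print(f"empirical missing rate: {np.isnan(incomplete).mean():.3f}")
```

Because the mask is drawn without looking at `congestion`, the observed entries form an unbiased sample of the full array, which is the property an MCR-based evaluation relies on.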

To describe the traffic conditions, there exist different traffic indicators, e.g., traffic flow [14], traffic speed [15] and traffic congestion level [16]. In this paper, we take the traffic congestion data as an example and address the problem of missing traffic data completion. The congestion level $c$ for a road segment during a given time slot is defined as $c = \max\left[0, \frac{t - \bar{t}}{\bar{t}}\right]$, where $t$ is the average travel time of vehicles for that segment, and $\bar{t}$ is the baseline travel time for the same road segment. The baseline travel time is determined by users, which could be the shortest travel time for that road segment or the average travel time under free traffic condition (e.g., traffic data from 24:00 to 5:00).
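
The definition can be checked on a few numbers. The sketch below assumes travel times in seconds and a hypothetical 60 s baseline; the function name `congestion_level` is illustrative.

```python
import numpy as np

def congestion_level(travel_time, baseline_time):
    """Congestion level c = max(0, (t - t_bar) / t_bar), applied elementwise."""
    travel_time = np.asarray(travel_time, dtype=float)
    return np.maximum(0.0, (travel_time - baseline_time) / baseline_time)

# Hypothetical segment with a 60 s baseline travel time.
levels = congestion_level([54.0, 60.0, 90.0, 150.0], baseline_time=60.0)
print(levels)
```

A travel time at or below the baseline is clipped to 0, while 90 s against a 60 s baseline gives a congestion level of 0.5, i.e., the trip takes 50% longer than the baseline.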

Imputing the missing values of traffic congestion level, however, is very challenging, due to a series of complex factors. To demonstrate this, we carefully analyze the congestion time series with an urban traffic dataset collected in Jinan, China. Fig. 1(a) reports the traffic congestion levels of a sampled road segment from 19:00 to 20:30 on five consecutive workdays (from February 29, 2016 to March 4, 2016) and Fig. 1(b) shows the traffic congestion levels of two sampled segments from 6:00 to 21:00 on Wednesday (March 2, 2016). From these figures, we observe three features of traffic congestion patterns.

  • Periodicity. The traffic congestion levels show periodicity in the consecutive working days, that is, the congestion levels during the same time slot in the consecutive working days are similar, while the levels in different time periods are disparate. For example, as shown in Fig. 1(a), the variation trend of traffic congestion levels is similar in the different working days.

  • Temporal coherence. The traffic congestion level in a time slot has a strong correlation with those in the neighboring time slots, and the correlation diminishes as the temporal distance increases. For example, as depicted in Fig. 1(b), the traffic condition of 18:00 may be affected by the congestion occurring at 17:30, but can be considered free from the influence of the traffic at 8:00 of the same day.

  • Road similarity. The variation trends of the traffic congestion levels of some road segments are similar, as the structure and the connectivity of these segments in the urban transportation system are the same. For example, as shown in Fig. 1(b), there is high similarity in the fluctuation of the two road segments’ congestion levels, e.g., the congestion levels of both segments show a clear upward trend from 7:00 to 8:00, and the tide begins to ebb at 8:00.

To model the above features, we propose a novel missing data Imputation Model for traffic Congestion level data, CIM for short, which predicts the missing values based on the observed congestion levels of all the road segments via joint matrix factorization. We first convert the time-series traffic congestion data into an order-3 tensor, whose (i,j,k)th element denotes the congestion level of the ith road segment in the jth time slot of the kth day. We then model the tensor with joint matrix factorization, because we want to focus on periodicity and road similarity. Joint matrix factorization lets us target these two aspects specifically by decomposing two matrices (namely, the day-time slot and the road-time slot matrices), instead of giving equal consideration to all interactions among the three factors of concern (day, time, and road segment). Specifically, to model the periodicity, we extract the day-time slot matrix from the tensor and project both the days and the time slots into a latent factor space of dimensionality F, such that the day-time slot interactions are modeled as inner products in that space. Likewise, the road-time slot matrix is factorized into the product of two latent factor matrices to capture the similarity between road segments. Further, we include a weight that balances the contributions of periodicity and road similarity in predicting the missing values. Finally, to ensure temporal coherence, we minimize the difference between the estimated congestion level in a time slot and those in the surrounding time slots. For the task of imputation, CIM can complete the missing congestion levels efficiently and effectively even when a relatively large amount of data is missing.
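
One way to write this description as a single objective is sketched below. It is an assumed form for illustration only, since this passage does not state the paper's exact loss. All symbols other than $c_{i,j,k}$ and $F$ are hypothetical names: $\mathbf{d}_k$ and $\mathbf{p}_j$ are $F$-dimensional factors for day $k$ and time slot $j$ in the day-time slot factorization, $\mathbf{r}_i$ and $\mathbf{q}_j$ are the factors for road $i$ and time slot $j$ in the road-time slot factorization, $\alpha \in [0,1]$ is the balancing weight, $\Omega$ is the set of observed entries, $w$ is a neighborhood size in time slots, and $\lambda$, $\mu$ are penalty coefficients.

```latex
\begin{aligned}
\hat{c}_{i,j,k} &= \alpha\,\mathbf{d}_k^{\top}\mathbf{p}_j + (1-\alpha)\,\mathbf{r}_i^{\top}\mathbf{q}_j, \\
\min_{\mathbf{d},\,\mathbf{p},\,\mathbf{r},\,\mathbf{q}}\;
&\sum_{(i,j,k)\in\Omega}\bigl(c_{i,j,k}-\hat{c}_{i,j,k}\bigr)^2
+ \lambda\sum_{i,j,k}\;\sum_{0<|j'-j|\le w}\bigl(\hat{c}_{i,j,k}-\hat{c}_{i,j',k}\bigr)^2 \\
&+ \mu\Bigl(\sum_k\|\mathbf{d}_k\|^2+\sum_j\|\mathbf{p}_j\|^2+\sum_i\|\mathbf{r}_i\|^2+\sum_j\|\mathbf{q}_j\|^2\Bigr)
\end{aligned}
```

In this reading, the first term fits the observed entries, the second penalizes differences between estimates in neighboring time slots (temporal coherence), and the third is a standard norm regularizer added here as an assumption. Setting $\alpha = 1$ would rely on periodicity alone and $\alpha = 0$ on road similarity alone.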

We conduct extensive experiments on a real traffic dataset and compare the performance of CIM with baselines including naive statistical averaging approaches (e.g., NNI [17]), matrix factorization methods (e.g., MF [18]), tensor decomposition methods (e.g., LRTI [19], TDI [20], and BGCP [21]), and deep learning models (e.g., MLP and PCNN [16]). Experimental results show that CIM outperforms the baselines.

The main contributions of this paper are as follows.

  • Different from existing methods that consider only some of the traffic congestion features (periodicity, road similarity and temporal coherence), we propose CIM, a model based on joint matrix factorization that models all three features to impute the missing traffic congestion values.

  • We factorize the day-time slot matrix and the road-time slot matrix via introducing latent factors to capture the periodicity and the similarity between road segments, and incorporate the local constraints into the process of matrix factorization to model the temporal coherence.

The rest of this paper is organized as follows. Section 2 reviews the studies on missing traffic data imputation. Section 3 introduces the definition of congestion level and the problem solved in this paper. Section 4 describes the proposed model in detail. Section 5 reports experiments on a real traffic dataset. Section 6 concludes this paper.

Section snippets

Related work

Recently, the imputation of traffic data has attracted widespread attention due to the ubiquity of missing traffic data and the importance of complete traffic data. A variety of methods have been proposed to solve this problem, which can be roughly divided into three categories: averaging historical traffic data, modeling traffic patterns of similar road segments and utilizing deep neural networks.

Preliminaries

We first introduce the definition of congestion level and then formally define the problem to be addressed in this paper. Table 1 shows the notations and descriptions used in this paper. We use bold uppercase letters such as $\mathbf{A}$ to represent tensors, and regular uppercase letters such as $B$ to represent scalar constants. For a tensor $\mathbf{A} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$, its $(i,j,k)$th entry is represented as $a_{i,j,k}$.
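
To make the tensor notation concrete, the following sketch arranges a handful of made-up records into an order-3 array indexed as (road, time slot, day). The record layout, the function name `build_tensor` and the use of NaN for unobserved entries are assumptions for illustration. Indices are 0-based here, whereas the text counts from 1.

```python
import numpy as np

def build_tensor(records, n_roads, n_slots, n_days):
    """Arrange (road, slot, day, level) records into an N1 x N2 x N3 array.

    Entries without a record stay NaN, i.e., they are treated as missing.
    """
    tensor = np.full((n_roads, n_slots, n_days), np.nan)
    for road, slot, day, level in records:
        tensor[road, slot, day] = level
    return tensor

# Made-up observations for 2 roads, 2 time slots and 2 days.
records = [(0, 0, 0, 0.10), (0, 1, 0, 0.35), (1, 0, 1, 0.00), (1, 1, 1, 0.80)]
C = build_tensor(records, n_roads=2, n_slots=2, n_days=2)
observed = ~np.isnan(C)

# Two 2-D views of the same tensor:
road_by_slot = C[:, :, 0]    # roads x time slots, for day 0
day_by_slot = C[0, :, :].T   # days x time slots, for road 0
print(C.shape, int(observed.sum()))
```

The two slices only show how road-time slot and day-time slot views relate to the tensor layout; how the paper actually derives its two matrices is not specified in this excerpt.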

Definition 1 Congestion Level

The congestion level $c_{i,j,k}$ of a road $i$ in time slot $j$ of day $k$ is formally defined as $c_{i,j,k} = \max[0, (t_{i,j,k} - \bar{t})$ ...

Methodology

In this section, we first elaborate the proposed CIM method, followed by the details of each component. Finally, parameter estimation will be introduced with the Alternating Least Squares (ALS) matrix factorization method.
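
For readers unfamiliar with ALS, the sketch below shows the generic technique on a single matrix with missing entries: the row factors and the column factors are updated in turn, each by solving small ridge regressions over the observed entries only. It is not the CIM estimator, which factorizes two matrices jointly with additional constraints. The function name `als_impute`, the rank, the regularization strength and the synthetic data are assumptions.

```python
import numpy as np

def als_impute(X, mask, rank=3, reg=0.01, n_iters=30, seed=0):
    """Low-rank reconstruction of X fitted by ALS on entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = X.shape
    U = 0.1 * rng.standard_normal((n_rows, rank))
    V = 0.1 * rng.standard_normal((n_cols, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Fix V and solve a ridge regression for each row factor.
        for i in range(n_rows):
            Vo = V[mask[i]]
            U[i] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ X[i, mask[i]])
        # Fix U and solve a ridge regression for each column factor.
        for j in range(n_cols):
            Uo = U[mask[:, j]]
            V[j] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ X[mask[:, j], j])
    return U @ V.T

# Synthetic rank-3 matrix with about 30% of entries hidden at random.
rng = np.random.default_rng(0)
truth = rng.random((30, 3)) @ rng.random((3, 40))
mask = rng.random(truth.shape) >= 0.3
X = np.where(mask, truth, np.nan)

imputed = als_impute(X, mask)
rmse_als = np.sqrt(np.mean((imputed[~mask] - truth[~mask]) ** 2))
rmse_mean = np.sqrt(np.mean((X[mask].mean() - truth[~mask]) ** 2))
print(f"ALS RMSE: {rmse_als:.4f}, global-mean RMSE: {rmse_mean:.4f}")
```

Each inner solve is a closed-form least-squares step, so the regularized fitting error on the observed entries never increases from one update to the next. The comparison against filling with the global mean is only a sanity check on the hidden entries.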

Experiments

To investigate the effectiveness of our proposed model, we carry out a series of experiments on a real traffic dataset. In this section, we first introduce our dataset and basic settings, and then compare the proposed model CIM with several baselines to demonstrate its performance. Finally, we evaluate CIM with different parameters.

Conclusion

In this paper, we have proposed a novel missing data imputation model (named CIM) for traffic congestion level data, which can generate an appropriate estimate of the missing values based on the observed traffic congestion data of all the road segments. In view of the characteristics of the traffic congestion data, CIM first organizes the time series data in the form of an order-3 tensor including the modes of days, time slots and road segments. Then, to model the temporal periodicity and the

CRediT authorship contribution statement

Xiaoyi Jia: Conceptualization, Methodology, Software, Writing - original draft. Xiaoyu Dong: Data curation. Meng Chen: Methodology, Supervision, Writing - review & editing. Xiaohui Yu: Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grant No. 61906107, the Natural Science Foundation of Shandong Province of China under Grant No. ZR2019BF010, the Young Scholars Program of Shandong University, China, and the Open Fund of Key Laboratory of Urban Natural Resources Monitoring and Simulation, Ministry of Natural Resources, China.

References (36)

  • James J., Citywide traffic speed prediction: A geometric deep learning approach, Knowl.-Based Syst. (2021)
  • Essien A. et al., A deep-learning model for urban traffic flow prediction with traffic events mined from Twitter, World Wide Web (2020)
  • Z. Pan, Y. Liang, W. Wang, Y. Yu, Y. Zheng, J. Zhang, Urban traffic prediction from spatio-temporal data using deep...
  • C. Zheng, X. Fan, C. Wang, J. Qi, GMAN: A graph multi-attention network for traffic prediction, in: Proceedings of the...
  • H. Yao, X. Tang, H. Wei, G. Zheng, Z. Li, Revisiting spatial–temporal similarity: A deep learning framework for traffic...
  • Wang P. et al., Fine-grained traffic flow prediction of various vehicle types via fusion of multisource data and deep learning approaches, IEEE Trans. Intell. Transp. Syst. (2020)
  • Qu L. et al., PPCA-based missing data imputation for traffic flow volume: A systematical approach, IEEE Trans. Intell. Transp. Syst. (2009)
  • Li L. et al., Short-term highway traffic flow prediction based on a hybrid strategy considering temporal–spatial information, J. Adv. Transp. (2016)