Cleaning timestamps with temporal constraints

  • Regular Paper
  • The VLDB Journal

Abstract

Timestamps are often found to be dirty in various scenarios, e.g., in distributed systems with clock synchronization problems or unreliable RFID readers. Without cleaning the imprecise timestamps, temporal-related applications such as provenance analysis or pattern queries are not reliable. To evaluate the correctness of timestamps, temporal constraints could be employed, which declare the distance restrictions between timestamps. Guided by such constraints on timestamps, in this paper, we study a novel problem of repairing inconsistent timestamps that do not conform to the required temporal constraints. Following the same line of data repairing, the timestamp repairing problem is to minimally modify the timestamps towards satisfaction of temporal constraints. This problem is practically challenging, given the huge space of possible timestamps. We tackle the problem by identifying a concise set of promising candidates, where an optimal repair solution can always be found. Repair algorithms with efficient pruning are then devised over the identified candidates. Approximate solutions are also presented including simple heuristic and linear programming (LP) relaxation. Experiments on real datasets demonstrate the superiority of our proposal compared to the state-of-the-art approaches.

Notes

  1. Referring to Proposition 13.

References

  1. http://ampds.org/

  2. http://db.csail.mit.edu/labdata/labdata.html

  3. http://iot.ee.surrey.ac.uk:8080/datasets.html

  4. https://archive.ics.uci.edu/ml/datasets/gas+sensors+for+home+activity+monitoring

  5. https://github.com/rui-hrh/timestamp

  6. https://physionet.org/data/

  7. Barga, R.S., Goldstein, J., Ali, M.H., Hong, M.: Consistent streaming through time: a vision for event stream processing. In: CIDR 2007, Third Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, 7–10 Jan 2007, Online Proceedings, pp. 363–374 (2007)

  8. Bentley, J.L.: Multidimensional binary search trees used for associative searching. Commun. ACM 18(9), 509–517 (1975)

  9. Bohannon, P., Flaster, M., Fan, W., Rastogi, R.: A cost-based model and effective heuristic for repairing constraints by value modification. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, 14–16 June 2005, pp. 143–154 (2005)

  10. Cheng, D., Bahadori, M.T., Liu, Y.: FBLG: a simple and effective approach for temporal dependence discovery from time series data. In: Macskassy, S.A., Perlich, C., Leskovec, J., Wang, W., Ghani, R. (eds.) The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, New York, NY, USA, 24–27 Aug 2014, pp. 382–391. ACM (2014)

  11. Chomicki, J., Marcinkowski, J.: On the computational complexity of minimal-change integrity maintenance in relational databases. In: Inconsistency Tolerance [Result from a Dagstuhl Seminar], pp. 119–150 (2005)

  12. Chu, X., Ilyas, I.F., Papotti, P.: Discovering denial constraints. PVLDB 6(13), 1498–1509 (2013)

  13. Chu, X., Ilyas, I.F., Papotti, P.: Holistic data cleaning: putting violations into context. In: 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, 8–12 April 2013, pp. 458–469 (2013)

  14. Dechter, R., Meiri, I., Pearl, J.: Temporal constraint networks. Artif. Intell. 49(1–3), 61–95 (1991)

  15. Ding, L., Chen, S., Rundensteiner, E.A., Tatemura, J., Hsiung, W., Candan, K.S.: Runtime semantic query optimization for event stream processing. In: Proceedings of the 24th International Conference on Data Engineering, ICDE 2008, 7–12 April 2008, Cancún, Mexico, pp. 676–685 (2008)

  16. Duan, L., Pang, T., Nummenmaa, J., Zuo, J., Zhang, P., Tang, C.: Bus-OLAP: a data management model for non-on-time events query over bus journey data. Data Sci. Eng. 3(1), 52–67 (2018)

  17. Dyreson, C.E., Snodgrass, R.T.: Supporting valid-time indeterminacy. ACM Trans. Database Syst. 23(1), 1–57 (1998)

  18. Fan, W.: Dependencies revisited for improving data quality. In: Proceedings of the Twenty-Seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2008, 9–11 June 2008, Vancouver, BC, Canada, pp. 159–170 (2008)

  19. Fan, W.: Constraint-driven database repair, 2nd edn. In: Encyclopedia of Database Systems (2018)

  20. Jin, T., Wang, J., Wen, L.: Efficiently querying business process models with beehivez. In: Proceedings of the Demo Track of the Ninth Conference on Business Process Management 2011, Clermont-Ferrand, France, August 31st, 2011 (2011)

  21. Karp, R.M.: Reducibility among combinatorial problems. In: Proceedings of a symposium on the Complexity of Computer Computations, Held 20–22 March 1972, at the IBM Thomas J. Watson Research Center, Yorktown Heights, New York, USA, pp. 85–103 (1972)

  22. Rogge-Solti, A., Mans, R., van der Aalst, W.M.P., Weske, M.: Improving documentation by repairing event logs. In: The Practice of Enterprise Modeling—6th IFIP WG 8.1 Working Conference, PoEM 2013, Riga, Latvia, 6–7 Nov 2013, Proceedings, pp. 129–144 (2013)

  23. Song, S., Cao, Y., Wang, J.: Cleaning timestamps with temporal constraints. PVLDB 9(10), 708–719 (2016)

  24. Song, S., Zhang, A., Wang, J., Yu, P.S.: SCREEN: stream data cleaning under speed constraints. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Victoria, Australia, May 31–June 4, 2015, pp. 827–841 (2015)

  25. Sun, P., Liu, Z., Davidson, S.B., Chen, Y.: Detecting and resolving unsound workflow views for correct provenance analysis. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29–July 2, 2009, pp. 549–562 (2009)

  26. Tang, L., Li, T., Shwartz, L.: Discovering lag intervals for temporal dependencies. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, Beijing, China, 12–16 Aug 2012, pp. 633–641 (2012)

  27. Yakout, M., Berti-Équille, L., Elmagarmid, A.K.: Don’t be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, 22–27 June 2013, pp. 553–564 (2013)

  28. Zhang, H., Diao, Y., Immerman, N.: Recognizing patterns in streams with imprecise timestamps. PVLDB 3(1), 244–255 (2010)

Acknowledgements

This work is supported in part by the National Key Research and Development Plan (2019YFB1705301), the National Natural Science Foundation of China (62072265, 61572272, 71690231), and the MIIT High Quality Development Program 2020.

Author information

Corresponding author

Correspondence to Shaoxu Song.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Proofs

1.1 Proof of Theorem 1

To prove the NP-hardness of the repairing problem, we build a reduction from the 3-coloring problem, which is known to be NP-complete [21]. Given a connected graph \(G=(V,E)\), the 3-coloring problem is to determine whether there is a way of coloring the vertices in graph G such that no two adjacent vertices are of the same color, using at most 3 different colors.

Each vertex \(v_i\in V\) corresponds to a variable \(X_i\). Its assigned color \(C(v_i)\) maps to the assignment of \(X_i\). Let \(D=\{1,2,3,6\}\) be the timestamp domain, where the values 1, 2 and 3 stand for the three admissible colors in the coloring problem, and the value 6 is for the initial assignment. That is, we set \(x_i=6\) for all i at the beginning. For each edge \((v_i,v_j)\in E\), we associate a constraint \({S}_{ij}\) with multiple intervals {[-2,-2], [-1,-1], [1,1], [2,2]}. It restricts \(X_i\) and \(X_j\) to have different values, i.e., restricting the two adjacent vertices \(v_i\) and \(v_j\) to have different colors.

We show in the following that the tuple x of assignment has a repair \({x}'\) with cost \(\varDelta ({x},{x}')\ge 3n\) that satisfies all the constraints iff the graph G is 3-colorable.

First, let C be a feasible 3-coloring solution. For each edge \((v_i, v_j)\in E\), recall that \(v_i\) and \(v_j\) should not be the same color, and \(C(v_i),C(v_j) \in \{1, 2, 3\}\), which stand for the three admissible colors. We consider a repair \(x'_i=C(v_i)\) for all i. The difference between \(C(v_i)\) and \(C(v_j)\), i.e., \(x'_i-x'_j\) (or \(x'_j-x'_i\)), will always satisfy the constraint {[-2,-2], [-1,-1], [1,1], [2,2]}. That is, we get a repair \(x'\) with cost \(\varDelta ({x},{x}')\ge 3n\) that satisfies all the temporal constraints.

Conversely, suppose that there exists a feasible repair \(x'\) with cost \(\varDelta ({x},{x}')\ge 3n\). Apparently, we have \(x'_i \ne 6\) for all i, since any \(x'_i=6\) will definitely violate the constraints. For each edge \((v_i, v_j)\in E\), since \(x'_i - x'_j \ne 0\) and \(x'_i, x'_j\) can only take values from \(\{1,2,3\}\), the two adjacent vertices \(v_i\) and \(v_j\) do not share the same color. Thereby, we have a proper 3-coloring solution \(C(v_i)=x'_i\) for graph G.
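The reduction can be checked by brute force on small graphs. The following Python sketch is illustrative only and not part of the paper: the helper names (`build_instance`, `satisfies`, `feasible_repairs`, `is_3_colorable`, `cost`) are hypothetical, vertices are assumed to be numbered from 0, and the repair cost is assumed to be the sum of absolute timestamp changes.

```python
from itertools import product

DOMAIN = (1, 2, 3, 6)            # 1, 2, 3 encode colors; 6 is the initial value
ALLOWED_DIFFS = {-2, -1, 1, 2}   # intervals [-2,-2], [-1,-1], [1,1], [2,2]


def build_instance(n, edges):
    """Map a graph with n vertices to the initial tuple x and constraints S_ij."""
    x = [6] * n
    constraints = {(i, j): ALLOWED_DIFFS for (i, j) in edges}
    return x, constraints


def satisfies(assignment, constraints):
    """True if every edge constraint holds for the given assignment."""
    return all(assignment[j] - assignment[i] in diffs
               for (i, j), diffs in constraints.items())


def cost(x, repair):
    """Assumed repair cost: sum of absolute timestamp changes."""
    return sum(abs(a - b) for a, b in zip(x, repair))


def feasible_repairs(n, edges):
    """Enumerate all assignments over DOMAIN that satisfy the constraints."""
    _, constraints = build_instance(n, edges)
    return [r for r in product(DOMAIN, repeat=n) if satisfies(r, constraints)]


def is_3_colorable(n, edges):
    """Brute-force 3-coloring check, used only to compare against the repairs."""
    return any(all(c[i] != c[j] for i, j in edges)
               for c in product((1, 2, 3), repeat=n))
```

On a triangle, which is 3-colorable, every feasible repair assigns distinct values from {1, 2, 3}; on the complete graph with four vertices, which is not 3-colorable, no feasible repair exists. The enumeration is exponential in n and is meant only to make the correspondence concrete.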

1.2 Proof of Proposition 2

To prove this proposition, we consider two aspects:

First, \(\varDelta ({x}, {x}'')\le \varDelta ({x}, {x}')\) is ensured. By decreasing the assignment for \(|{N}_p|\ge |{N}_q|\) (and similarly, increasing for \(|{N}_p|<|{N}_q|\)), the repairing cost is non-increasing in each step, as illustrated in Eq. (2).

Second, by moving changed nodes to \({N}_u\), the conclusion is proved. In particular, a node i is added into \({N}_m\) if there is a tight edge \(i\rightarrow j\) or \(j\rightarrow i\) for some \(j\in {N}_m\). Moreover, nodes in \({N}_m\) are moved to \({N}_u\) if either they are connected to some node in \({N}_u\) or some node in \({N}_m\) itself becomes unchanged.

1.3 Proof of Proposition 3

To illustrate the correctness of Algorithm 1, we consider the following aspects.

First, \({x}''\) is a feasible solution. The bound \(\eta \) ensures that each modified assignment will not exceed the constraints specified by \({d}_{jk}\) in Eq. (3), or equivalently Lines 24 and 28 in Algorithm 1.

Second, \(\varDelta ({x}, {x}'')\le \varDelta ({x}, {x}')\) is ensured. By decreasing the assignment for \(|{N}_p|\ge |{N}_q|\) (and similarly, increasing for \(|{N}_p|<|{N}_q|\)), the repairing cost is non-increasing in each step, as illustrated in Eq. (2).

Third, the connectivity w.r.t. tight chains follows from the fact that nodes in \({N}_m\) are connected by tight edges. In particular, a node i is added into \({N}_m\) if there is a tight edge \(i\rightarrow j\) or \(j\rightarrow i\) for some \(j\in {N}_m\), according to Line 13. Moreover, nodes in \({N}_m\) are moved to \({N}_u\) if either they are connected to some node in \({N}_u\) (Line 16) or some node in \({N}_m\) itself becomes unchanged (Line 19).

Finally, to show the termination of the algorithm, we can see that after each step of modification (Lines 23–30), either some node becomes unchanged by variation \(\theta \) (with at least one node moved to \({N}_u\)) or some node reaches the bound by variation \(\eta \). For the latter case, at least one node is moved from \({N}_v\) to \({N}_m\) or from \({N}_m\) to \({N}_u\).

1.4 Proof of Lemma 4

Solely reducing \({x}^*_i\) without modifying the corresponding \({x}^*_j\) is forbidden. Otherwise, it leads to another solution with lower repairing cost, which is contradictory to the optimality of \({x}^*\) with the minimum repairing cost. In other words, there must exist some j such that \({x}^*_j-{x}^*_i={d}_{ij}\).

1.5 Proof of Corollary 5

Referring to Proposition 3, the conclusion is obvious by conducting \(\mathsf {Transform}({M},{x},{x}')\) for any optimal solution \({x}'\). It returns another optimal solution \({x}^*\) with the same optimal repairing cost and connecting changed nodes to unchanged ones via tight edges (chains).

1.6 Proof of Lemma 6

Since edges \(i\rightarrow j\) and \(j\rightarrow k\) are tight, we have \({x}'_k-{x}'_i={d}_{ij}+{d}_{jk}\). Referring to the temporal constraints, it follows \({d}_{ij}+{d}_{jk}= {x}'_k-{x}'_i\le {d}_{ik}\). According to the shortest paths in defining the minimal network M, we have \({d}_{ik}\le {d}_{ij}+{d}_{jk}.\) The conclusion is a direct consequence.
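The transitivity argument can be illustrated numerically. The sketch below is not the paper's code: it assumes that a constraint \({d}_{ij}\) bounds \(x_j - x_i \le {d}_{ij}\), that the minimal network is obtained by all-pairs shortest paths (Floyd–Warshall), and that missing constraints are encoded as infinity. The function names are hypothetical.

```python
import math


def minimal_network(d):
    """All-pairs shortest paths over the constraint matrix d (Floyd-Warshall).

    d[i][j] is the upper bound on x[j] - x[i]; math.inf means unconstrained.
    """
    n = len(d)
    m = [row[:] for row in d]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if m[i][k] + m[k][j] < m[i][j]:
                    m[i][j] = m[i][k] + m[k][j]
    return m


def is_tight(m, x, i, j):
    """Edge i -> j is tight when x[j] - x[i] equals the bound m[i][j]."""
    return x[j] - x[i] == m[i][j]


# Example: bounds 0->1 of 2, 1->2 of 3, and a loose direct bound 0->2 of 10.
INF = math.inf
d = [[0, 2, 10],
     [INF, 0, 3],
     [INF, INF, 0]]
m = minimal_network(d)
x = [0, 2, 5]  # makes both 0->1 and 1->2 tight
```

In this example the shortest-path step tightens the direct bound from 0 to 2 down to the sum of the two intermediate bounds, so the edge from 0 to 2 is tight whenever both intermediate edges are, which is the statement of the lemma.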

1.7 Proof of Proposition 7

For any pair of consecutive tight edges \(k_{y-1}\rightarrow k_{y}, k_{y}\rightarrow k_{y+1}\) in a tight chain that prevents it from being a provenance chain, according to the transitivity in Lemma 6, there must exist a tight edge \(k_{y-1}\rightarrow k_{y+1}\). In other words, the node \(k_{y}\) can be removed from the tight chain.

A similar conclusion applies to the case of \(k_{y-1}\leftarrow k_{y}, k_{y}\leftarrow k_{y+1}\). By removing all the aforesaid \(k_{y}\), the chain becomes a provenance chain.

1.8 Proof of Proposition 8

We prune the unused temporal constraints by comparing all the pairs of candidates across two nodes, where the maximum size of candidates of a node is a. Comparing all the pairs of candidates across two nodes needs \(O(a^2)\) comparisons, and we have \(n^2\) node pairs, so the whole time complexity is \(O(a^2 n^2)\).
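The counting argument can be made concrete with a short Python sketch. It is only an illustration under stated assumptions, not the paper's pruning procedure: a constraint \({d}_{ij}\) is assumed to bound \(x_j - x_i \le {d}_{ij}\), and a constraint is assumed to be "unused" when no pair of candidates of its two nodes can violate it. The function name `prune_unused_constraints` and the dictionary encoding of constraints are hypothetical.

```python
def prune_unused_constraints(candidates, constraints):
    """Keep only constraints that some pair of candidates can violate.

    candidates[i] is the candidate list of node i;
    constraints[(i, j)] is the upper bound on x_j - x_i.
    Returns the kept constraints and the number of pair comparisons made.
    """
    kept = {}
    comparisons = 0
    for (i, j), bound in constraints.items():
        violated = False
        # Compare every candidate of node i with every candidate of node j:
        # at most a * a comparisons for this node pair.
        for t_i in candidates[i]:
            for t_j in candidates[j]:
                comparisons += 1
                if t_j - t_i > bound:
                    violated = True
        if violated:
            kept[(i, j)] = bound
    return kept, comparisons
```

With at most a candidates per node, each ordered node pair costs at most \(a^2\) comparisons, and there are at most \(n^2\) such pairs, which matches the stated bound.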

1.9 Proof of Proposition 9

Referring to the branch and bound computation, the repairing procedure tries at most all the combinations of the candidates of n nodes, where the maximum size of candidates of a node is a. The time complexity of Algorithm 4 is therefore \(O(a^{n})\).
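To see where the exponential bound comes from, consider a generic branch and bound search over per-node candidate sets. The sketch below is a simplified stand-in and not Algorithm 4 itself: it assumes the cost is the sum of absolute timestamp changes, that `m[i][j]` bounds \(x_j - x_i\), and that candidates are given as lists; all names are hypothetical.

```python
import math


def branch_and_bound(x, candidates, m):
    """Search for the cheapest assignment that picks one candidate per node.

    x: original timestamps; candidates[i]: candidate values for node i;
    m[i][j]: upper bound on x_j - x_i (math.inf if unconstrained).
    Returns (best_assignment, best_cost), or (None, inf) if none is feasible.
    """
    n = len(x)
    best = None
    best_cost = math.inf
    partial = []

    def consistent(i, t):
        # Check the new value t for node i against all earlier nodes k < i.
        return all(t - partial[k] <= m[k][i] and partial[k] - t <= m[i][k]
                   for k in range(i))

    def search(i, cost):
        nonlocal best, best_cost
        if cost >= best_cost:      # bound: cannot improve on the incumbent
            return
        if i == n:                 # all nodes assigned: new incumbent
            best, best_cost = list(partial), cost
            return
        # Branch: try the cheapest candidates first.
        for t in sorted(candidates[i], key=lambda c: abs(c - x[i])):
            if consistent(i, t):
                partial.append(t)
                search(i + 1, cost + abs(t - x[i]))
                partial.pop()

    search(0, 0)
    return best, best_cost
```

In the worst case no branch is cut, and the search visits every combination of one candidate per node, i.e., up to \(a^{n}\) leaves.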

1.10 Proof of Proposition 10

First, given \({T}_i\subseteq \tilde{{T}}_i, \forall i\), it is obvious to see that all the solutions of subproblem \(\langle {x},{T}\rangle \) are also the solutions of \(\langle \tilde{{x}},\tilde{{T}}\rangle \). On the other hand, \(\tilde{{x}}'_i\in {T}_i, \forall i\), indicates that \(\tilde{{x}}'\) is a solution of \(\langle {x},{T}\rangle \) as well. If there exists another solution \({x}^*\) with lower cost for \(\langle {x},{T}\rangle \), it contradicts the optimality of \(\tilde{{x}}'\) to \(\langle \tilde{{x}},\tilde{{T}}\rangle \).

1.11 Proof of Lemma 11

The correctness is easy to see according to the candidate prune rule (2) in Sect. 5.1.2. It eliminates all the candidates in violation of \({T}_i=\{{t}_i\}\), i.e., all the \({t}_j\in {T}_j\) such that \(({t}_i,{t}_j)\not \vDash {M}_{ij}\).
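A minimal Python sketch of this pruning step follows. It is an illustration rather than the paper's implementation, and it assumes that \(({t}_i,{t}_j)\vDash {M}_{ij}\) means both \({t}_j-{t}_i\le {d}_{ij}\) and \({t}_i-{t}_j\le {d}_{ji}\). The function name and parameters are hypothetical.

```python
def prune_against_singleton(t_i, candidates_j, d_ij, d_ji):
    """Drop candidates of node j that conflict with the fixed value t_i.

    d_ij bounds t_j - t_i from above; d_ji bounds t_i - t_j from above.
    Returns the surviving candidates of node j as a set.
    """
    return {t_j for t_j in candidates_j
            if t_j - t_i <= d_ij and t_i - t_j <= d_ji}
```

For instance, with \({t}_i=3\), \({d}_{ij}=5\) and \({d}_{ji}=0\), only candidates of node j in the range from 3 to 8 survive.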

1.12 Proof of Proposition 12

Assume that \({x}'\) is not an optimal solution of \(\langle {x},{T} \rangle \), i.e., there exists an \({x}^*\) with \(\varDelta ({x},{x}^*) < \varDelta ({x}, {x}')\). According to Eq. (8), for any safe \({T}_i \in {T}\), \({x}'_i\) is the one with the lowest cost, i.e., \(|{x}^*_i - {x}_i| \ge |{x}'_i - {x}_i|\). Since \({x}'\) is not optimal, there must exist a non-safe \({T}_j\) such that \(|{x}^*_j - {x}_j| < |{x}'_j - {x}_j|\).

We construct a \(\tilde{{x}}''\) for \(\langle \tilde{{x}},\tilde{{T}} \rangle \), where \(\tilde{{x}}''_i=\tilde{{x}}'_i\) if \({T}_i\) is safe; otherwise, \(\tilde{{x}}''_j = {x}^*_j\) for non-safe \({T}_j\). The safe-subproblem definition requires \(\tilde{{T}}_i\) to be safe for each safe \({T}_i\), i.e., \(\tilde{{x}}''\) forms a feasible solution. It follows \(\varDelta (\tilde{{x}},\tilde{{x}}'') < \varDelta (\tilde{{x}}, \tilde{{x}}')\), referring to \( |\tilde{{x}}''_j-\tilde{{x}}_j|= |{x}^*_j - {x}_j| < |{x}'_j -{x}_j| = |\tilde{{x}}'_j-\tilde{{x}}_j| \) for some non-safe \({T}_j\). In other words, \(\tilde{{x}}''\) is a solution of \(\langle \tilde{{x}},\tilde{{T}} \rangle \) with cost lower than \(\tilde{{x}}'\), which is a contradiction.

1.13 Proof of Proposition 13

First, as illustrated in Lines 4–10 in Algorithm 1, when moving a node i from \({N}_v\) to \({N}_m\), we check whether it will introduce violations to the existing nodes in \({N}_v\cup {N}_u\) w.r.t. temporal constraints M. That is, a modification is made on \({x}'_i\) to ensure its \(\alpha _i\le 0\) and \(\beta _i\le 0\), where \(\alpha _i=\max _{k\in {N}_v\cup {N}_u, {d}_{ik}\in {M}} {x}'_k-{x}'_i-{d}_{ik}\) and \(\beta _i=\max _{k\in {N}_v\cup {N}_u, {d}_{ki}\in {M}} {x}'_i-{x}'_k-{d}_{ki}\). Similarly, Line 14 guarantees no violation when moving node i from \({N}_v\) to \({N}_m\).

Moreover, as presented in the proof of Proposition 3, with the bound \(\eta \), the modification in Lines 24 and 28 will not introduce violations to the temporal constraints M either. To sum up, Algorithm 1 always returns a feasible solution that satisfies the temporal constraints M, no matter whether the input \({x}'\) has violation or not.
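The two violation measures \(\alpha _i\) and \(\beta _i\) used in this proof can be computed directly from their definitions. The Python sketch below is illustrative only: the function name `alpha_beta`, the dictionary encoding of M, and the use of negative infinity when node i has no constraint to the given nodes are assumptions, and the sketch does not reproduce how Algorithm 1 then adjusts \({x}'_i\).

```python
import math


def alpha_beta(i, x_prime, constraints, others):
    """Compute (alpha_i, beta_i) for node i against the nodes in `others`.

    constraints[(i, k)] is d_ik, the upper bound on x'_k - x'_i.
    alpha_i > 0 means some bound d_ik is exceeded;
    beta_i > 0 means some bound d_ki is exceeded.
    """
    alpha = max((x_prime[k] - x_prime[i] - constraints[(i, k)]
                 for k in others if (i, k) in constraints),
                default=-math.inf)
    beta = max((x_prime[i] - x_prime[k] - constraints[(k, i)]
                for k in others if (k, i) in constraints),
               default=-math.inf)
    return alpha, beta
```

Both values being at most 0 corresponds to node i introducing no violation w.r.t. the nodes in \({N}_v\cup {N}_u\).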

About this article

Cite this article

Song, S., Huang, R., Cao, Y. et al. Cleaning timestamps with temporal constraints. The VLDB Journal 30, 425–446 (2021). https://doi.org/10.1007/s00778-020-00641-6

  • DOI: https://doi.org/10.1007/s00778-020-00641-6
