research-article

Open Access

Time Series Data Validity

Authors:
Yunxiang Su (Tsinghua University, Beijing, China; ORCID 0009-0004-7584-6913)
Yikun Gong (Tsinghua University, Beijing, China; ORCID 0009-0001-3112-8301)
Shaoxu Song (Tsinghua University, Beijing, China; ORCID 0000-0002-9503-2755)

Proceedings of the ACM on Management of Data, Volume 1, Issue 1, Article No. 85, pp. 1–26. https://doi.org/10.1145/3588939

Published: 30 May 2023

Abstract

As a key step of data preparation, data quality should be assessed before conducting any data application. Given a set of constraints, the validity measure evaluates the degree to which the data meet the constraints, e.g., whether the values stay within a specified range or fluctuate drastically over time in a series. Simply counting all the data points in violation of the constraints may overstate the validity issue. Following the minimum change criterion in data repairing, we propose to study the minimum number of data points that need to be changed in order to satisfy the constraints, or equivalently, the maximum rate of data that can be preserved without change, as the validity measure. To the best of our knowledge, this is the first study on defining and evaluating time series data validity. We devise algorithms for computing the validity measure in quadratic time and linear space. The validity measure has been deployed as a function available in SQL statements in Apache IoTDB, an open-source time series database. The algorithm fully adapts to the LSM-based storage of time series in multiple segments. Extensive experiments over 8 real-world datasets show up to 4 orders of magnitude improvement in time cost compared to the related method SCREEN.
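
To make the measure concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes the constraints are a value range plus a speed (rate-of-change) constraint, and that timestamps are strictly increasing. The function name `validity` and the parameters `s_min`, `s_max`, `v_min`, and `v_max` are hypothetical. Under these assumptions, the largest set of points that can stay unchanged is the longest chain in which every pair of consecutive kept points satisfies the speed bound, which a simple dynamic program finds in quadratic time and linear space.

```python
from typing import Sequence


def validity(
    times: Sequence[float],
    values: Sequence[float],
    s_min: float,
    s_max: float,
    v_min: float = float("-inf"),
    v_max: float = float("inf"),
) -> float:
    """Fraction of points that can be kept unchanged (illustrative sketch).

    Assumes strictly increasing timestamps, a value range [v_min, v_max],
    and a speed constraint s_min <= (x_i - x_j) / (t_i - t_j) <= s_max.
    """
    n = len(values)
    if n == 0:
        return 1.0  # an empty series violates nothing

    # best[i] = size of the largest consistent set of kept points ending at i;
    # 0 means point i cannot be kept because it is outside the value range.
    best = [0] * n
    for i in range(n):
        if not (v_min <= values[i] <= v_max):
            continue
        best[i] = 1
        for j in range(i):
            if best[j] == 0:
                continue
            speed = (values[i] - values[j]) / (times[i] - times[j])
            if s_min <= speed <= s_max:
                best[i] = max(best[i], best[j] + 1)

    return max(best) / n


# A single spike at t=3 puts its neighbours in violation too,
# yet changing that one point is enough to satisfy the constraints.
t = [1, 2, 3, 4, 5]
x = [10, 11, 50, 13, 14]
print(validity(t, x, s_min=-2, s_max=2))  # 0.8
```

In this example the pairs around the spike both break the speed bound, so counting every point involved in a violation flags three of the five points, whereas only one point needs to change. That gap is the overstatement the abstract warns about.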

Supplemental Material

PACMMOD-V1mod085.mp4: presentation video for SIGMOD 2023 (mp4, 32.7 MB)

Index Terms

Time Series Data Validity
1. Applied computing
  1. Enterprise computing
    1. Enterprise data management
2. Information systems
  1. Data management systems
    1. Information integration
      1. Data cleaning
  2. Information systems applications
    1. Enterprise information systems
      1. Enterprise applications

Published in
Proceedings of the ACM on Management of Data Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164
Editor:
Divyakant Agrawal
UC Santa Barbara, United States
Copyright © 2023 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
IoT
data quality
time series data