Abstract
As a key step of data preparation, it is always necessary to first assess the quality of data before conducting any data application. Given a set of constraints, the validity measure evaluates the degree to which the data meet the constraints, e.g., whether the values are in the specified range or fluctuate drastically over time in a series. It is worth noting that simply counting all the data points in violation of the constraints may overclaim the data validity issue. Following the minimum change criterion in data repairing, we propose to study the minimum number of data points that need to be changed in order to satisfy the constraints, or equivalently, the maximum rate of data that can be reserved without change, as the validity measure. To the best of our knowledge, this is the first study on defining and evaluating time series data validity. We devise algorithms for computing the validity measure in quadratic time and linear space. Remarkably, the validity measure has been deployed and included as a function in SQL statements in Apache IoTDB, an open-source time series database. The algorithm fully adapts to the LSM-based storage of time series in multiple segments. Extensive experiments over 8 real-world datasets show up to 4 orders of magnitude improvement in time cost compared to the related method SCREEN.
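To make the "maximum rate of data reserved without change" idea concrete, here is a minimal Python sketch. It is not the paper's algorithm. It assumes two illustrative constraint types suggested by the abstract's examples: a value range `[v_min, v_max]` and a speed range `[s_min, s_max]` on the rate of change between points. It also assumes strictly increasing timestamps and `s_min <= 0 <= s_max`. The function name `validity` and all parameter names are hypothetical. Under these assumptions, a set of points can be kept unchanged when every kept point is in range and every pair of consecutive kept points satisfies the speed constraint, since the points in between can then be repaired by interpolation. The largest such set can be found with a simple dynamic program in quadratic time and linear space.

```python
from typing import Sequence


def validity(
    times: Sequence[float],
    values: Sequence[float],
    s_min: float,
    s_max: float,
    v_min: float = float("-inf"),
    v_max: float = float("inf"),
) -> float:
    """Fraction of points that can be kept unchanged (illustrative sketch).

    Assumes strictly increasing timestamps and s_min <= 0 <= s_max.
    """
    n = len(values)
    if n == 0:
        return 1.0  # an empty series violates nothing

    # best[i] = size of the largest keepable set that ends at point i
    # (0 means point i itself must be changed). Linear space.
    best = [0] * n
    for i in range(n):
        if not (v_min <= values[i] <= v_max):
            continue  # out-of-range point can never be kept
        best[i] = 1
        for j in range(i):  # quadratic time overall
            if best[j] == 0:
                continue
            speed = (values[i] - values[j]) / (times[i] - times[j])
            if s_min <= speed <= s_max and best[j] + 1 > best[i]:
                best[i] = best[j] + 1

    return max(best) / n


if __name__ == "__main__":
    t = [1, 2, 3, 4, 5]
    x = [10, 11, 50, 13, 14]
    print(validity(t, x, s_min=-2, s_max=2))
```

In this example, the single spike at value 50 takes part in two violated adjacent pairs, so counting every point involved in a violation would flag three of the five points. Changing only the spike is enough to satisfy the speed constraint, so four of the five points are reserved.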
Supplemental Material
- https://archive.ics.uci.edu/
- https://github.com/apache/iotdb/tree/research/quality-validity
- https://github.com/iotdbValidity/validity-exp
- https://iotdb.apache.org
- https://iotdb.apache.org/UserGuide/Master/UDF-Library/Data-Quality.html#validity
Index Terms
- Time Series Data Validity