research-article
Open Access

Time Series Data Validity

Published: 30 May 2023

Abstract

As a key step of data preparation, it is necessary to assess the quality of data before conducting any data application. Given a set of constraints, the validity measure evaluates the degree to which the data meet the constraints, e.g., whether the values stay within a specified range and do not fluctuate drastically over time in a series. Simply counting all the data points that violate the constraints may overstate the validity issue. Following the minimum change criterion in data repairing, we propose to use as the validity measure the minimum number of data points that need to be changed in order to satisfy the constraints, or equivalently, the maximum fraction of data points that can be preserved without change. To the best of our knowledge, this is the first study on defining and evaluating time series data validity. We devise algorithms for computing the validity measure in quadratic time and linear space. The validity measure has been deployed in Apache IoTDB, an open-source time series database, where it is available as a function in SQL statements. The algorithm fully adapts to the LSM-based storage of time series in multiple segments. Extensive experiments over 8 real-world datasets show up to 4 orders of magnitude improvement in time cost compared to the related method SCREEN.
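To make the measure concrete, the following is a minimal sketch of one way to compute "the maximum fraction of points preserved without change" under a value range and a speed (rate-of-change) constraint. It is not taken from the paper: the function names, the parameters `s_min`, `s_max`, `v_min`, `v_max`, and the chain-style dynamic program are all assumptions made for illustration. The sketch assumes strictly increasing timestamps and `s_min <= 0 <= s_max`, so that any dropped point can be repaired, for instance by interpolating between its preserved neighbors. It runs in quadratic time and linear space, matching the complexity stated in the abstract, and it also shows why counting every point involved in a violation can overstate the problem.

```python
def compatible(t_i, v_i, t_j, v_j, s_min, s_max):
    """True if moving from point i to a later point j respects the speed bounds."""
    dt = t_j - t_i
    dv = v_j - v_i
    # Equivalent to s_min <= dv / dt <= s_max for dt > 0, without division.
    return s_min * dt <= dv <= s_max * dt


def max_preserved(times, values, s_min, s_max,
                  v_min=float("-inf"), v_max=float("inf")):
    """Largest number of points that can be kept unchanged (hypothetical helper).

    best[j] is the size of the largest consistent set of kept points ending at j.
    If consecutive kept points are compatible, every pair of kept points is too,
    so a chain-style dynamic program over earlier points is sufficient.
    """
    n = len(times)
    best = [0] * n  # linear space
    for j in range(n):
        if not (v_min <= values[j] <= v_max):
            continue  # an out-of-range point can never be kept
        best[j] = 1
        for i in range(j):  # quadratic time overall
            if best[i] and compatible(times[i], values[i],
                                      times[j], values[j], s_min, s_max):
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)


def validity(times, values, s_min, s_max,
             v_min=float("-inf"), v_max=float("inf")):
    """Fraction of points that can be preserved without change."""
    n = len(times)
    if n == 0:
        return 1.0
    return max_preserved(times, values, s_min, s_max, v_min, v_max) / n


def naive_violating_points(times, values, s_min, s_max):
    """Count points involved in any violation between consecutive points."""
    bad = set()
    for i in range(len(times) - 1):
        if not compatible(times[i], values[i],
                          times[i + 1], values[i + 1], s_min, s_max):
            bad.update((i, i + 1))
    return len(bad)


if __name__ == "__main__":
    t = [1, 2, 3, 4, 5]
    v = [1, 2, 10, 4, 5]  # only the value 10 is actually wrong
    print(max_preserved(t, v, -1, 1))           # 4 points can stay unchanged
    print(validity(t, v, -1, 1))                # 0.8
    print(naive_violating_points(t, v, -1, 1))  # 3 points flagged by naive counting
```

In this toy series a single spike makes three points participate in violations, yet changing one point is enough, so the sketch reports a validity of 0.8 where naive counting would suggest 0.4.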


Supplemental Material

PACMMOD-V1mod085.mp4: Presentation video for SIGMOD 2023 (mp4, 32.7 MB)



Published in

Proceedings of the ACM on Management of Data, Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN: 2836-6573
DOI: 10.1145/3603164

Copyright © 2023 Owner/Author

This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery
New York, NY, United States
