
Granulation of Large Temporal Databases: An Allan Variance Approach

  • Original Research
  • Published in SN Computer Science

Abstract

As the use of Big Data begins to dominate various scientific and engineering applications, the ability to conduct complex data analyses with speed and efficiency has become increasingly important. The availability of large amounts of data results in ever-growing storage requirements and magnifies issues related to query response times. In this work, we propose a novel methodology for granulation and data reduction of large temporal databases that addresses both issues simultaneously. While prior data reduction techniques rely on heuristics or may be computationally intensive, our work borrows the concept of Allan variance (AVAR) from the fields of signal processing and sensor characterization to efficiently and systematically reduce the size of temporal databases. Specifically, we use Allan variance to determine the temporal window length over which data remains relevant. Large temporal databases are then granulated using the AVAR-determined window length, and averaging over the resulting granules produces aggregate information for each granule, yielding significant data reduction. Query performance and data quality are evaluated using existing standard datasets, as well as two large temporal datasets of vehicular and weather data. Our results demonstrate that the AVAR-based data reduction approach is efficient and maintains data quality, while leading to an order-of-magnitude improvement in query execution times compared to three existing clustering-based data reduction methods.
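To make the workflow described above concrete, the Python sketch below illustrates one way AVAR-based granulation could proceed on a uniformly sampled series: compute the non-overlapping Allan variance for a set of candidate window lengths, select a characteristic window (taken here, as an assumption, at the AVAR minimum), and replace each granule with its average. The function names, the candidate-window list, and the synthetic data are illustrative and are not taken from the paper.

# Illustrative sketch of AVAR-based granulation (assumed workflow; not the authors' code).
import numpy as np

def allan_variance(y, m):
    # Non-overlapping Allan variance of signal y for a window of m samples:
    # half the mean squared difference between successive granule averages.
    n_granules = len(y) // m
    if n_granules < 2:
        return np.nan
    granule_means = y[: n_granules * m].reshape(n_granules, m).mean(axis=1)
    return 0.5 * np.mean(np.diff(granule_means) ** 2)

def avar_granulate(t, y, candidate_windows):
    # Choose the candidate window length (in samples) with the smallest AVAR,
    # then aggregate timestamps and values over each resulting granule.
    avars = [allan_variance(y, m) for m in candidate_windows]
    m = candidate_windows[int(np.nanargmin(avars))]
    n_granules = len(y) // m
    gt = t[: n_granules * m].reshape(n_granules, m).mean(axis=1)
    gy = y[: n_granules * m].reshape(n_granules, m).mean(axis=1)
    return m, gt, gy

# Synthetic example: a slow trend plus noise, reduced to one row per granule.
rng = np.random.default_rng(0)
t = np.arange(10_000, dtype=float)
y = np.sin(t / 500.0) + rng.normal(scale=0.3, size=t.size)
m, gt, gy = avar_granulate(t, y, candidate_windows=[2, 5, 10, 20, 50, 100, 200])
print(f"chosen window: {m} samples; reduced {y.size} rows to {gy.size} granules")

A production implementation would compute AVAR with a fast or incremental algorithm and handle irregular sampling, neither of which this sketch attempts.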



Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant No. 1932138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

Author information


Corresponding author

Correspondence to Kshitij Jerath.

Ethics declarations

Conflict of interest statement

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Knowledge Discovery, Knowledge Engineering and Knowledge Management” guest edited by Joaquim Filipe, Ana Fred, Jan Dietz, Ana Salgado and Jorge Bernardino.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Sinanaj, L., Haeri, H., Maddipatla, S.P. et al. Granulation of Large Temporal Databases: An Allan Variance Approach. SN COMPUT. SCI. 4, 7 (2023). https://doi.org/10.1007/s42979-022-01397-2
