Abstract
As Big Data comes to dominate scientific and engineering applications, the ability to conduct complex data analyses quickly and efficiently has become increasingly important. The availability of large amounts of data leads to ever-growing storage requirements and magnifies issues related to query response times. In this work, we propose a novel methodology for granulation and data reduction of large temporal databases that addresses both issues simultaneously. While prior data reduction techniques rely on heuristics or may be computationally intensive, our work borrows the concept of Allan variance (AVAR) from the fields of signal processing and sensor characterization to reduce the size of temporal databases efficiently and systematically. Specifically, we use AVAR to determine the temporal window length over which data remains relevant. Large temporal databases are then granulated using the AVAR-determined window length. Averaging over each resulting granule produces aggregate information for that granule, resulting in significant data reduction. Query performance and data quality are evaluated on existing standard datasets, as well as on two large datasets that include temporal information for vehicular and weather data. Our results demonstrate that the AVAR-based data reduction approach is efficient and maintains data quality, while yielding an order-of-magnitude improvement in query execution times compared to three existing clustering-based data reduction methods.
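The pipeline described in the abstract (estimate AVAR, choose a window length, granulate, average) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes regularly sampled scalar data, uses the textbook non-overlapping Allan variance estimator, and assumes the window length is chosen as the candidate with the smallest AVAR, which is one plausible reading that the abstract itself does not spell out. The function names `granule_means`, `allan_variance`, `select_window`, and `granulate` are hypothetical.

```python
from statistics import fmean


def granule_means(values, m):
    """Average consecutive, non-overlapping windows of m samples.

    Trailing samples that do not fill a complete window are dropped.
    """
    n_granules = len(values) // m
    return [fmean(values[i * m:(i + 1) * m]) for i in range(n_granules)]


def allan_variance(values, m):
    """Non-overlapping Allan variance at a window length of m samples.

    Computed as half the mean squared difference between the averages
    of adjacent windows.
    """
    means = granule_means(values, m)
    if len(means) < 2:
        raise ValueError("need at least two complete windows")
    squared_diffs = [(b - a) ** 2 for a, b in zip(means, means[1:])]
    return 0.5 * fmean(squared_diffs)


def select_window(values, candidates):
    """Pick the candidate window length with the smallest Allan variance.

    Assumption: minimizing AVAR is the selection rule; the abstract only
    says that AVAR determines the window length.
    """
    return min(candidates, key=lambda m: allan_variance(values, m))


def granulate(values, candidates):
    """Return the selected window length and one average per granule."""
    m = select_window(values, candidates)
    return m, granule_means(values, m)


# Toy series with a period of four samples: windows of four samples
# average the oscillation away, so m = 4 has the smallest AVAR.
series = [0, 0, 2, 2, 0, 0, 2, 2]
window, reduced = granulate(series, candidates=[1, 2, 4])
print(window, reduced)
```

In this sketch the reduced representation stores one value per granule, so the row count shrinks by roughly a factor of the selected window length. Irregularly sampled data would need time-based rather than sample-count windows.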














Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. 1932138. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Ethics declarations
Conflict of interest statement
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Additional information
This article is part of the topical collection “Knowledge Discovery, Knowledge Engineering and Knowledge Management” guest edited by Joaquim Filipe, Ana Fred, Jan Dietz, Ana Salgado and Jorge Bernardino.
About this article
Cite this article
Sinanaj, L., Haeri, H., Maddipatla, S.P. et al. Granulation of Large Temporal Databases: An Allan Variance Approach. SN COMPUT. SCI. 4, 7 (2023). https://doi.org/10.1007/s42979-022-01397-2
DOI: https://doi.org/10.1007/s42979-022-01397-2