Abstract
The probability density function (PDF) is an effective data model for a variety of stream mining tasks. As such, accurate estimates of the PDF are essential to reducing the uncertainties and errors associated with mining results. The nonparametric adaptive kernel density estimator (AKDE) provides accurate, robust, and asymptotically consistent estimates of a PDF. However, due to AKDE’s extensive computational requirements, it cannot be directly applied to the data stream environment. This paper describes the development of an AKDE approximation approach that heeds the constraints of the data stream environment and supports efficient processing of multiple queries. To this end, this work proposes (1) the concept of local regions to provide a partition-based variable bandwidth to capture local density structures and enhance estimation quality; (2) a suite of linear-pass methods to construct the local regions and kernel objects online; (3) an efficient multiple queries evaluation algorithm; (4) a set of approximate techniques to increase the throughput of multiple density queries processing; and (5) a fixed-size memory time-based sliding window that updates the kernel objects in linear time. Comprehensive experiments were conducted with real-world and synthetic data sets to validate the effectiveness and efficiency of the approach.















Similar content being viewed by others
References
Freeway performance measurement system (PeMS) [http://pems.eecs.berkeley.edu]
Aggarwal C (2003) A framework for diagnosing changes in evolving data streams. In: Proceedings of 2003 ACM SIGMOD international conference on management of data. San Diego, CA, pp 575–586
Aggarwal C, Yu PS (2007) Data streams: models and algorithms. In: Aggarwal C (ed) A survey of synopsis construction in data streams. Springer Science and Business Media, New York, pp 69–202
Asuncion A, Newman DJ (2007) UCI machine learning repository. [http://www.ics.uci.edu/~mlearn/MLRepository.html]
Babcock B, Babu S, Datar M, Motwani R, Widom J (2002) Models and issues in data stream systems. In: Proceedings of 21st ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems. Madison, WI, pp 1–16
Babcock B, Datar M, Motwani R (2002) Sampling from a moving window over streaming data. In: Proceedings of 13th Annual ACM-SIAM symposium on discrete algorithms. San Francisco, CA, pp 633–634
Chan CC, Batur C, Srinivasan A (1991) Determination of quantization intervals in rule based model for dynamic systems. In: Proceedings of IEEE conference of systems, man, and, cybernetics. pp 1719–1723
Clear R, Berman S (1988) Estimation of linear interpolation error. In: Proceedings of the annual illuminating engineering society conference
Dougherty J, Kohavi R, Sahami M (1995) Supervised and unsupervised discretization of continuous features. In: Proceedings of 12th international conference on machine learning. pp 194–202
Duoandikoetxea J (2001) Fourier analysis: American mathematical society
Gibbons P, Matias Y, Poosala V (2002) Fast incremental maintenance of approximate histograms. ACM Trans Database Syst 27:261–298
Gilbert A, Kotidis Y, Muthukrishan S, Strauss MJ (2002) How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of the 28th international conference of very large data bases. Hong Kong, China, pp 454–465
Gray A, Moore A (2003) Rapid evaluation of multiple density models. In: Proceedings of 9th international workshop on artificial intelligence and statistics. Key West, FL
Guha S, Koudas N, Shim K (2006) Approximation and streaming algorithms for histogram construction problems. ACM Trans Database Syst 31:396–438
Heinz C (2007) Density estimation over data streams. Phd, Mathematics, Phillipps-University Marburg
Heinz C, Seeger B (2008) Cluster kernels: resource-aware kernel density estimators over streaming data. IEEE Trans Knowl Data Eng 20:880–893
Heinz C, Seeger B (2006) Exploring data streams with nonparametric estimators. In: Proceedings of 18th international conference on statistical and scientific database management. Vienna, Austria, pp 261–264
Heinz C, Seeger B (2006) Resource-aware kernel density estimators over streaming data. In: Proceedings of 15th ACM international conference on information and knowledge management. Arlington, VA, pp 870–871
Heinz C, Seeger B (2006) Towards kernel density estimation over streaming data. In: Proceedings of 13th international conference on management of data. Delhi, pp 91–102
Hinneburg A, Keim D (1998) An efficient approach to clustering in large multimedia databases with noise, in proceedings of ACM Knowledge Discovery and Data Mining 58–65
Ioannidis Y (2003) The history of histograms (abridged). In: Proceedings of 29th international conference on very large databases. Berlin, pp 19–30
Keogh E, Xi X, Wei L, Ratanamahatana CA (2008) The UCR time series classification/clustering. [http://www.cs.ucr.edu/~eamonn/time_series_data]. Available: http://www.cs.ucr.edu/~eamonn/
Ledl T (2004) Kernel density estimation: theory and application in discriminant analysis. Aust J Stat 33:267–279
Liu H, Hussain F, Tan CL, Dash M (2002) Discretization: an enabling technique. Data Min Knowl Discov 6:393–423
Merckt TV (1993) Decision trees in numerical attribute spaces. In: Proceedings of the 13th international joint conference on artificial intelligence, pp 1016–1021
Nussbaumer HJ (1982) Fast Fourier transform and convolution algorithms, 2nd edn. Springer, New York
Parzen E (1962) On estimation of a probability density function and mode. Ann Math Stat 33:1065–1076
Sain SR, Scott DW (1996) On locally adaptive density estimation. J Am Stat Assoc 91:1525–1534
Scott DW (1992) Multivariate density estimation. Wiley, New York
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London
Smith JO (2011) Digital audio resampling home page. Available http://www-ccrma.stanford.edu/~jos/resample
Subramaniam S, Palpanas T, Papadopoulos D, Kalogeraki V, Gunopulos D (2006) Online outlier detection in sensor data using non-parametric models. In: Proceedings of the 32nd international conference on very large databases. Seoul, pp 187–198
Wand MP, Jones MC (1995) Kernel smoothing. CRC Press, Boca Raton
Wegman EJ, Marchette DJ (2003) On some techniques for streaming data: a case study of internet packet headers. J Comput Graph Stat 12:1–22
Weiss SM, Galen RS, Tadepalli PV (1991) Maximizing the predictive value of production rules, artificial intelligence, pp 47–71
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Proceedings of ACM SIGMOD international conference on management of data. Montreal, pp 103–114
Zhang T, Ramakrishnan R, Livny M (1999) Fast density estimation using CF-kernel for very large databases. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining. San Diego, CA, pp 312–316
Zhou A, Cai Z, Wei L, Qian W (2003) M-Kernel merging: towards density estimation over data streams. In: Proceedings of the 8th international conference on database systems for advanced applications. Kyoto, pp 285–292
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Boedihardjo, A.P., Lu, CT. & Chen, F. Fast adaptive kernel density estimator for data streams. Knowl Inf Syst 42, 285–317 (2015). https://doi.org/10.1007/s10115-013-0712-0
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10115-013-0712-0