Abstract
We consider the problem of density estimation when the data arrive as a continuous stream with no fixed length. In this setting, implementing the usual methods of density estimation, such as kernel density estimation, is problematic. We propose a method of density estimation for massive datasets that is based on taking the derivative of a smooth curve fit through a set of quantile estimates. To achieve this, we propose a low-storage, single-pass, sequential method for simultaneous estimation of multiple quantiles for massive datasets, which forms the basis of this method of density estimation. For comparison, we also consider a sequential kernel density estimator. The proposed methods are shown through a simulation study to perform well and to have several distinct advantages over existing methods.
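The idea of reading a density off a set of quantile estimates can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes the quantiles are computed from an in-memory sample with Python's `statistics.quantiles` rather than by a sequential estimator, and it uses a piecewise-linear interpolant of the distribution function as a stand-in for the smooth curve. The helper `density_from_quantiles` and all parameter values are hypothetical and chosen only for illustration.

```python
import random
import statistics


def density_from_quantiles(probs, quantiles):
    """Differentiate the piecewise-linear CDF through (quantile, prob) points.

    Returns a list of (midpoint, density) pairs, one per adjacent pair of
    quantile estimates. This is a stand-in for differentiating a smooth curve.
    """
    estimate = []
    points = list(zip(probs, quantiles))
    for (p0, q0), (p1, q1) in zip(points, points[1:]):
        if q1 <= q0:
            continue  # tied quantiles: the slope is undefined, so skip
        slope = (p1 - p0) / (q1 - q0)  # change in probability per unit of x
        estimate.append(((q0 + q1) / 2, slope))
    return estimate


# Assumption: a standard normal sample held in memory, not a real stream.
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(20000)]

n_bins = 20
quantiles = statistics.quantiles(sample, n=n_bins)  # 19 cut points
probs = [i / n_bins for i in range(1, n_bins)]      # 0.05, 0.10, ..., 0.95

density = density_from_quantiles(probs, quantiles)
for midpoint, value in density:
    print(f"x = {midpoint:6.3f}   estimated density = {value:.3f}")
```

Because the quantile estimates are denser where the data are denser, the resulting density estimate has finer resolution in high-probability regions. A smooth interpolant in place of the piecewise-linear one would give a continuous density rather than a step function.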
Cite this article
McDermott, J.P., Babu, G.J., Liechty, J.C. et al. Data skeletons: simultaneous estimation of multiple quantiles for massive streaming datasets with applications to density estimation. Stat Comput 17, 311–321 (2007). https://doi.org/10.1007/s11222-007-9021-3
DOI: https://doi.org/10.1007/s11222-007-9021-3