Data skeletons: simultaneous estimation of multiple quantiles for massive streaming datasets with applications to density estimation

Statistics and Computing

Abstract

We consider the problem of density estimation when the data arrive as a continuous stream with no fixed length. In this setting, the usual methods of density estimation, such as kernel density estimation, are problematic to implement. We propose a method of density estimation for massive datasets that is based on taking the derivative of a smooth curve fit through a set of quantile estimates. To obtain those estimates, we propose a low-storage, single-pass, sequential method for simultaneous estimation of multiple quantiles. For comparison, we also consider a sequential kernel density estimator. A simulation study shows that the proposed methods perform well and have several distinct advantages over existing methods.
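The idea of reading a density off a set of quantile estimates can be illustrated with a minimal Python sketch. It is not the authors' method. It makes two simplifying assumptions: exact sample quantiles of a stored batch stand in for the sequential single-pass estimates, and a piecewise-linear curve stands in for the smooth curve, so the derivative reduces to a difference quotient between adjacent quantiles. The function name `density_from_quantiles` is hypothetical.

```python
import random
import statistics


def density_from_quantiles(probs, quantiles):
    """Differentiate a piecewise-linear CDF drawn through (quantile, prob) points.

    Assumption: linear interpolation replaces the smooth curve described in
    the abstract. Returns one (midpoint, density) pair per interval between
    adjacent quantiles.
    """
    points = list(zip(probs, quantiles))
    estimate = []
    for (p0, q0), (p1, q1) in zip(points, points[1:]):
        if q1 <= q0:
            continue  # skip degenerate intervals caused by tied quantiles
        # Slope of the CDF on [q0, q1] is the density estimate there.
        estimate.append(((q0 + q1) / 2.0, (p1 - p0) / (q1 - q0)))
    return estimate


# Stand-in data: a stored batch, NOT a stream (assumption for illustration).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(20000)]

probs = [i / 20 for i in range(1, 20)]          # 0.05, 0.10, ..., 0.95
quantiles = statistics.quantiles(sample, n=20)  # 19 cut points
estimate = density_from_quantiles(probs, quantiles)

for midpoint, density in estimate:
    print(f"{midpoint:+.3f}  {density:.3f}")
```

For standard normal data, the values printed near a midpoint of zero should be close to the true density there, about 0.399. In a streaming setting the `quantiles` list would instead come from a sequential estimator that updates as each observation arrives.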



Author information


Corresponding author

Correspondence to James P. McDermott.


About this article

Cite this article

McDermott, J.P., Babu, G.J., Liechty, J.C. et al. Data skeletons: simultaneous estimation of multiple quantiles for massive streaming datasets with applications to density estimation. Stat Comput 17, 311–321 (2007). https://doi.org/10.1007/s11222-007-9021-3


  • DOI: https://doi.org/10.1007/s11222-007-9021-3
