Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction

Data Mining and Knowledge Discovery

Abstract

Squashing is a lossy data compression technique that preserves statistical information. Specifically, squashing compresses a massive dataset to a much smaller one so that outputs from statistical analyses carried out on the smaller (squashed) dataset reproduce outputs from the same statistical analyses carried out on the original dataset. Likelihood-based data squashing (LDS) differs from a previously published squashing algorithm insofar as it uses a statistical model to squash the data. The results show that LDS provides excellent squashing performance even when the target statistical analysis departs from the model used to squash the data.
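
The goal stated above (a small dataset whose analysis reproduces the analysis of the full dataset) can be illustrated with a minimal sketch. This is not the LDS algorithm, whose details are not given in the abstract; it assumes a much simpler stand-in in which points are grouped into bins on x and each bin is replaced by one weighted pseudo-point. The function names `fit_line` and `squash_by_bins`, the bin width, and the simulated data are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


def fit_line(xs, ys, ws=None):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    if ws is None:
        ws = [1.0] * len(xs)
    total = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / total
    y_bar = sum(w * y for w, y in zip(ws, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    b = sxy / sxx
    return y_bar - b * x_bar, b


def squash_by_bins(xs, ys, bin_width):
    """Replace the points in each x-bin by one pseudo-point.

    Each pseudo-point is (mean x, mean y) of its bin, weighted by the
    number of original points in that bin. Assumption: this binning rule
    is an illustrative stand-in, not the model-based grouping used by LDS.
    """
    bins = defaultdict(list)
    for x, y in zip(xs, ys):
        bins[int(x // bin_width)].append((x, y))
    px, py, pw = [], [], []
    for members in bins.values():
        n = len(members)
        px.append(sum(x for x, _ in members) / n)
        py.append(sum(y for _, y in members) / n)
        pw.append(float(n))
    return px, py, pw


# Simulated "massive" dataset (hypothetical): y = 2 + 3x + Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(10_000)]
ys = [2.0 + 3.0 * x + rng.gauss(0.0, 1.0) for x in xs]

# Same analysis on the full data and on the squashed, weighted data.
full_a, full_b = fit_line(xs, ys)
sx, sy, sw = squash_by_bins(xs, ys, bin_width=0.2)
squashed_a, squashed_b = fit_line(sx, sy, sw)

print("rows:", len(xs), "->", len(sx))
print("full fit:    ", full_a, full_b)
print("squashed fit:", squashed_a, squashed_b)
```

In this sketch the weighted fit on roughly fifty pseudo-points should land very close to the fit on all 10,000 rows, because the bin means and counts retain almost all of the information the regression uses. A model-based approach such as LDS would choose the groups and pseudo-points using a statistical model rather than fixed bins.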

Cite this article

Madigan, D., Raghavan, N., DuMouchel, W. et al. Likelihood-Based Data Squashing: A Modeling Approach to Instance Construction. Data Mining and Knowledge Discovery 6, 173–190 (2002). https://doi.org/10.1023/A:1014095614948

  • DOI: https://doi.org/10.1023/A:1014095614948
