Data Squashing by Empirical Likelihood

Abstract

Data squashing was introduced by W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon, in Proceedings of the 5th International Conference on KDD (1999). The idea is to scale data sets down to smaller representative samples instead of scaling up algorithms to very large data sets. They report success in learning model coefficients on squashed data. This paper presents a form of data squashing based on empirical likelihood. This method reweights a random sample of data to match certain expected values to the population. The computation required is a relatively easy convex optimization. There is also a theoretical basis to predict when it will and won't produce large gains. In a credit scoring example, empirical likelihood weighting also accelerates the rate at which coefficients are learned. We also investigate the extent to which these benefits translate into improved accuracy, and consider reweighting in conjunction with boosted decision trees.
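
The reweighting step the abstract describes admits a compact illustration. Given a sample x_1, ..., x_n and known population moments, empirical likelihood chooses weights w_i maximizing sum_i log(w_i) subject to sum_i w_i = 1 and sum_i w_i g(x_i) = 0, where g encodes the moment constraints (for instance g(x_i) = x_i - mu_pop so the weighted sample mean matches a known population mean). The sketch below is an assumed minimal implementation, not the paper's code: the function name el_weights, the constraint encoding g, and the damped Newton solver on the convex dual are all illustrative choices.

```python
import numpy as np


def el_weights(g, n_iter=50, tol=1e-10):
    """Empirical-likelihood-style weights: maximize sum(log w_i) subject to
    sum(w_i) = 1 and sum_i w_i * g[i] = 0.

    g : (n, d) array of constraint values per observation, e.g.
        g[i] = x[i] - mu_pop so the weighted sample mean matches a known
        population mean mu_pop.

    Illustrative damped-Newton solver on the convex dual; assumes the
    constraints are feasible (0 lies inside the convex hull of the rows of g).
    """
    n, d = g.shape
    lam = np.zeros(d)
    for _ in range(n_iter):
        denom = 1.0 + g @ lam                      # must stay > 0 so that w_i > 0
        grad = -(g / denom[:, None]).sum(axis=0)   # gradient of -sum log(1 + lam'g_i)
        if np.linalg.norm(grad) < tol:
            break
        hess = (g / denom[:, None] ** 2).T @ g     # positive semidefinite Hessian
        step = np.linalg.solve(hess + 1e-12 * np.eye(d), -grad)
        t = 1.0                                    # backtrack to keep denom > 0
        while np.any(1.0 + g @ (lam + t * step) <= 0.0):
            t *= 0.5
        lam = lam + t * step
    w = 1.0 / (n * (1.0 + g @ lam))
    return w / w.sum()                             # renormalize against round-off


if __name__ == "__main__":
    # Hypothetical example: reweight a skewed sample so its mean matches
    # assumed known population means of 1 in each coordinate.
    rng = np.random.default_rng(0)
    x = rng.exponential(size=(1000, 3))
    w = el_weights(x - 1.0)
    print(w @ x)   # close to [1, 1, 1]
```

A downstream learner would then treat these as case weights, which is roughly how a reweighted sample could feed the logistic-regression or boosted-tree experiments mentioned in the abstract.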

References

  • Baggerly, K.A. 1998. Empirical likelihood as a goodness-of-fit measure. Biometrika, 85(3):535–547.

  • Bradley, P.S., Fayyad, U., and Reina, C. 1998. Scaling clustering algorithms to large databases. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 9–15.

  • Bratley, P., Fox, B.J., and Schrage, L.E. 1987. A Guide to Simulation, 2nd edn. Berlin: Springer-Verlag.

  • Cochran, W.G. 1977. Sampling Techniques, 3rd edn. New York: John Wiley & Sons.

  • Davis, P.J. and Rabinowitz, P. 1984. Methods of Numerical Integration, 2nd edn. San Diego, CA: Academic Press.

  • DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. 1999. Squashing flat files flatter. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD), San Diego: ACM Press, pp. 6–15.

  • Freund, Y. and Schapire, R. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156.

  • Friedman, J. 1999a. Greedy function approximation: A gradient boosting machine. Technical Report, Department of Statistics, Stanford University.

  • Friedman, J. 1999b. Stochastic gradient boosting. Technical Report, Department of Statistics, Stanford University.

  • Friedman, J., Hastie, T., and Tibshirani, R. 1999. Additive logistic regression: A statistical view of boosting. Technical Report, Department of Statistics, Stanford University.

  • Hesterberg, T. 1995. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185–194.

  • Lohr, S. 1999. Sampling: Design and Analysis. Pacific Grove, CA: Duxbury Press.

  • Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., and Ridgeway, G. 2002. Likelihood-based data squashing: A modeling approach to instance construction. Data Mining and Knowledge Discovery, 6(2):173–190.

  • Owen, A. 1990. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18:90–120.

  • Owen, A. 1991. Empirical likelihood for linear models. The Annals of Statistics, 19:1725–1747.

  • Qin, J. and Lawless, J. 1994. Empirical likelihood and general estimating equations. The Annals of Statistics, 22:300–325.

  • Ripley, B.D. 1987. Stochastic Simulation. New York: John Wiley & Sons.

  • Rowe, N.C. 1983. Rule-based statistical calculations on a database abstract. Ph.D. Thesis, Department of Computer Science, Stanford University.

  • Wolff, G., Stork, D., and Owen, A. 1996. Empirical error-confidence curves for neural-network and Gaussian classifiers. International Journal of Neural Systems, (3):263–271.

Cite this article

Owen, A. Data Squashing by Empirical Likelihood. Data Mining and Knowledge Discovery 7, 101–113 (2003). https://doi.org/10.1023/A:1021568920107
