Abstract
Data squashing was introduced by W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon in the Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (1999). The idea is to scale data sets down to smaller representative samples instead of scaling algorithms up to very large data sets. They report success in learning model coefficients from squashed data. This paper presents a form of data squashing based on empirical likelihood, which reweights a random sample of data so that certain sample expected values match their population counterparts. The required computation is a relatively easy convex optimization, and there is a theoretical basis for predicting when the method will and will not produce large gains. In a credit scoring example, empirical likelihood weighting also accelerates the rate at which coefficients are learned. We also investigate the extent to which these benefits translate into improved accuracy, and consider reweighting in conjunction with boosted decision trees.
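The reweighting idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a scalar sample and a single known population mean `mu`, and the function name `el_weights` is illustrative. The weights maximize the empirical likelihood criterion Σ log w_i subject to Σ w_i = 1 and the weighted sample mean equaling `mu`; the Lagrangian conditions reduce the convex optimization to a one-dimensional root-finding problem.

```python
import numpy as np
from scipy.optimize import brentq

def el_weights(x, mu):
    """Empirical likelihood weights for a scalar sample x, chosen so
    that the weighted sample mean equals the population value mu.

    Maximizes sum(log w_i) subject to sum(w_i) = 1 and
    sum(w_i * x_i) = mu.  (Illustrative sketch, not the paper's code.)
    """
    d = np.asarray(x, dtype=float) - mu
    if d.max() <= 0 or d.min() >= 0:
        raise ValueError("mu must lie strictly inside the sample range")
    # Lagrangian stationarity gives w_i = 1 / (n * (1 + lam * d_i)),
    # where lam solves sum_i d_i / (1 + lam * d_i) = 0.
    # All weights stay positive for lam in (-1/max(d), -1/min(d)).
    lo = -(1.0 - 1e-9) / d.max()
    hi = -(1.0 - 1e-9) / d.min()
    f = lambda lam: np.sum(d / (1.0 + lam * d))
    lam = brentq(f, lo, hi)
    w = 1.0 / (len(d) * (1.0 + lam * d))
    return w

# Usage: tilt a sample with mean near 2.0 toward a population mean of 2.5.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, 200)
w = el_weights(x, 2.5)
```

With vector-valued constraints (several matched moments) the same convex program has a multivariate Lagrange parameter and is typically solved by Newton's method, but the scalar case above conveys the structure.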
References
Baggerly, K.A. 1998. Empirical likelihood as a goodness-of-fit measure. Biometrika, 85(3):535–547.
Bradley, P.S., Fayyad, U., and Reina, C. 1998. Scaling clustering algorithms to large databases. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 9–15.
Bratley, P., Fox, B.L., and Schrage, L.E. 1987. A Guide to Simulation, 2nd edn. Berlin: Springer-Verlag.
Cochran, W.G. 1977. Sampling Techniques, 3rd edn. New York: John Wiley & Sons.
Davis, P.J. and Rabinowitz, P. 1984. Methods of Numerical Integration, 2nd edn. San Diego, CA: Academic Press.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. 1999. Squashing flat files flatter. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD), San Diego: ACM Press, pp. 6–15.
Freund, Y. and Schapire, R. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156.
Friedman, J. 1999a. Greedy function approximation: A gradient boosting machine. Technical Report, Department of Statistics, Stanford University.
Friedman, J. 1999b. Stochastic gradient boosting. Technical Report, Department of Statistics, Stanford University.
Friedman, J., Hastie, T., and Tibshirani, R. 1999. Additive logistic regression: A statistical view of boosting. Technical Report, Department of Statistics, Stanford University.
Hesterberg, T. 1995. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185–194.
Lohr, S. 1999. Sampling: Design and Analysis. Pacific Grove, CA: Duxbury Press.
Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., and Ridgeway, G. 2002. Likelihood-based data squashing: A modeling approach to instance construction. Journal of Data Mining and Knowledge Discovery, 6(2):173–190.
Owen, A. 1990. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18:90–120.
Owen, A. 1991. Empirical likelihood for linear models. The Annals of Statistics, 19:1725–1747.
Qin, J. and Lawless, J. 1994. Empirical likelihood and general estimating equations. The Annals of Statistics, 22:300–325.
Ripley, B.D. 1987. Stochastic Simulation. New York: John Wiley & Sons.
Rowe, N.C. 1983. Rule-based statistical calculations on a database abstract. Ph.D. Thesis, Department of Computer Science, Stanford University.
Wolff, G., Stork, D., and Owen, A. 1996. Empirical error-confidence curves for neural-network and Gaussian classifiers. International Journal of Neural Systems, 7(3):263–271.
Cite this article
Owen, A. Data Squashing by Empirical Likelihood. Data Mining and Knowledge Discovery 7, 101–113 (2003). https://doi.org/10.1023/A:1021568920107