Abstract
Data squashing was introduced by W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon in the Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (1999). The idea is to scale data sets down to smaller representative samples instead of scaling algorithms up to very large data sets. They report success in learning model coefficients from squashed data. This paper presents a form of data squashing based on empirical likelihood, which reweights a random sample of data so that certain sample expected values match their population counterparts. The required computation is a relatively easy convex optimization, and there is a theoretical basis for predicting when the method will and will not produce large gains. In a credit scoring example, empirical likelihood weighting also accelerates the rate at which coefficients are learned. We also investigate the extent to which these benefits translate into improved accuracy, and consider reweighting in conjunction with boosted decision trees.
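The reweighting idea described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a scalar sample and a single known population mean `mu`, and the function name `el_weights` is illustrative. The weights maximize the empirical likelihood criterion Σ log w_i subject to Σ w_i = 1 and the weighted sample mean equaling `mu`; the Lagrangian conditions reduce the convex optimization to a one-dimensional root-finding problem.

```python
import numpy as np
from scipy.optimize import brentq

def el_weights(x, mu):
    """Empirical likelihood weights for a scalar sample x, chosen so
    that the weighted sample mean equals the population value mu.

    Maximizes sum(log w_i) subject to sum(w_i) = 1 and
    sum(w_i * x_i) = mu.  (Illustrative sketch, not the paper's code.)
    """
    d = np.asarray(x, dtype=float) - mu
    if d.max() <= 0 or d.min() >= 0:
        raise ValueError("mu must lie strictly inside the sample range")
    # Lagrangian stationarity gives w_i = 1 / (n * (1 + lam * d_i)),
    # where lam solves sum_i d_i / (1 + lam * d_i) = 0.
    # All weights stay positive for lam in (-1/max(d), -1/min(d)).
    lo = -(1.0 - 1e-9) / d.max()
    hi = -(1.0 - 1e-9) / d.min()
    f = lambda lam: np.sum(d / (1.0 + lam * d))
    lam = brentq(f, lo, hi)
    w = 1.0 / (len(d) * (1.0 + lam * d))
    return w

# Usage: tilt a sample with mean near 2.0 toward a population mean of 2.5.
rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, 200)
w = el_weights(x, 2.5)
```

With vector-valued constraints (several matched moments) the same convex program has a multivariate Lagrange parameter and is typically solved by Newton's method, but the scalar case above conveys the structure.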
References
Baggerly, K.A. 1998. Empirical likelihood as a goodness-of-fit measure. Biometrika, 85(3):535–547.
Bradley, P.S., Fayyad, U., and Reina, C. 1998. Scaling clustering algorithms to large databases. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), pp. 9–15.
Bratley, P., Fox, B.L., and Schrage, L.E. 1987. A Guide to Simulation, 2nd edn. Berlin: Springer-Verlag.
Cochran, W.G. 1977. Sampling Techniques, 3rd edn. New York: John Wiley & Sons.
Davis, P.J. and Rabinowitz, P. 1984. Methods of Numerical Integration, 2nd edn. San Diego, CA: Academic Press.
DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. 1999. Squashing flat files flatter. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining (KDD), San Diego: ACM Press, pp. 6–15.
Freund, Y. and Schapire, R. 1996. Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference, pp. 148–156.
Friedman, J. 1999a. Greedy function approximation: A gradient boosting machine. Technical Report, Department of Statistics, Stanford University.
Friedman, J. 1999b. Stochastic gradient boosting. Technical Report, Department of Statistics, Stanford University.
Friedman, J., Hastie, T., and Tibshirani, R. 1999. Additive logistic regression: A statistical view of boosting. Technical Report, Department of Statistics, Stanford University.
Hesterberg, T. 1995. Weighted average importance sampling and defensive mixture distributions. Technometrics, 37(2):185–194.
Lohr, S. 1999. Sampling: Design and Analysis. Pacific Grove, CA: Duxbury Press.
Madigan, D., Raghavan, N., DuMouchel, W., Nason, M., Posse, C., and Ridgeway, G. 2002. Likelihood-based data squashing: A modeling approach to instance construction. Journal of Data Mining and Knowledge Discovery, 6(2):173–190.
Owen, A. 1990. Empirical likelihood ratio confidence regions. The Annals of Statistics, 18:90–120.
Owen, A. 1991. Empirical likelihood for linear models. The Annals of Statistics, 19:1725–1747.
Qin, J. and Lawless, J. 1994. Empirical likelihood and general estimating equations. The Annals of Statistics, 22:300–325.
Ripley, B.D. 1987. Stochastic Simulation. New York: John Wiley & Sons.
Rowe, N.C. 1983. Rule-based statistical calculations on a database abstract. Ph.D. Thesis, Department of Computer Science, Stanford University.
Wolff, G., Stork, D., and Owen, A. 1996. Empirical error-confidence curves for neural-network and Gaussian classifiers. International Journal of Neural Systems, 7(3):263–271.
Cite this article
Owen, A. Data Squashing by Empirical Likelihood. Data Mining and Knowledge Discovery 7, 101–113 (2003). https://doi.org/10.1023/A:1021568920107