ABSTRACT
One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called "stationary distribution assumption" that the future and the past data are identical from a probabilistic standpoint. In many real-world applications, such as marketing solicitation, fraud detection, drug testing, loan approval, sub-population surveys, and school enrollment, this is rarely the case, because the only labeled sample available for training is biased in various ways owing to practical constraints and limitations. In these circumstances, traditional methods for evaluating the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually give poor estimates of which of several competing classification algorithms, when trained on the biased dataset, will be the most accurate on future unbiased data. Sometimes the estimated ranking of the learning algorithms' accuracy is so poor that it is no better than random guessing. Therefore, a method to determine the most accurate learner is needed for data mining under sample selection bias in many real-world applications. We present such an approach, which determines which learner will perform best on an unbiased test set, given a possibly biased training set, at a fraction of the computational cost of cross-validation-based approaches.
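As a rough illustration of the failure mode the abstract describes (not an implementation of the paper's reverse-testing procedure), the following Python sketch injects a feature-dependent selection bias into the training sample and compares, for a few off-the-shelf classifiers, the ten-fold cross-validation accuracy measured on the biased sample against the accuracy actually achieved on an unbiased test set. The dataset, the bias rule, and the candidate learners are all hypothetical choices made for the example; whether the two rankings actually disagree depends on the strength of the bias and the data.

```python
# Sketch: cross-validation on a biased training sample vs. accuracy on an
# unbiased test set. All modeling choices here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, n_features=10,
                           n_informative=5, random_state=0)

# Hold out an unbiased "future" test set.
X_test, y_test = X[:10000], y[:10000]
X_pool, y_pool = X[10000:], y[10000:]

# Feature-dependent selection bias: examples with a large value of the first
# feature are far more likely to be observed in the labeled training sample.
p_select = 1.0 / (1.0 + np.exp(-3.0 * X_pool[:, 0]))
mask = rng.random(len(p_select)) < p_select
X_train, y_train = X_pool[mask], y_pool[mask]

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    # Estimate accuracy by ten-fold cross-validation on the biased sample ...
    cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    # ... and compare with the accuracy on the unbiased test set.
    test_acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:20s} CV on biased train = {cv_acc:.3f}   unbiased test = {test_acc:.3f}")
```

Under this kind of bias, the classifier that looks best under cross-validation on the biased sample need not be the one that performs best on the unbiased test set, which is the selection problem the paper addresses.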