ABSTRACT
One of the most important assumptions made by many classification algorithms is that the training and test sets are drawn from the same distribution, i.e., the so-called "stationary distribution assumption" that the future and the past data are identical from a probabilistic standpoint. In many real-world applications, such as marketing solicitation, fraud detection, drug testing, loan approval, sub-population surveys, and school enrollment, this is rarely the case, because the only labeled sample available for training is biased in various ways owing to practical constraints and limitations. In these circumstances, traditional methods for evaluating the expected generalization error of classification algorithms, such as structural risk minimization, ten-fold cross-validation, and leave-one-out validation, usually give poor estimates of which of several competing classification algorithms, when trained on the biased dataset, will be the most accurate on future unbiased data. Sometimes the estimated ranking of the learning algorithms' accuracy is so poor that it is no better than random guessing. Therefore, a method to determine the most accurate learner is needed for data mining under sample selection bias in many real-world applications. We present such an approach, which determines which learner will perform best on an unbiased test set, given a possibly biased training set, at a fraction of the computational cost of cross-validation-based approaches.
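As a rough illustration of the failure mode the abstract describes (not an implementation of the paper's reverse-testing procedure), the following Python sketch injects a feature-dependent selection bias into the training sample and compares, for a few off-the-shelf classifiers, the ten-fold cross-validation accuracy measured on the biased sample against the accuracy actually achieved on an unbiased test set. The dataset, the bias rule, and the candidate learners are all hypothetical choices made for the example; whether the two rankings actually disagree depends on the strength of the bias and the data.

```python
# Sketch: cross-validation on a biased training sample vs. accuracy on an
# unbiased test set. All modeling choices here are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, n_features=10,
                           n_informative=5, random_state=0)

# Hold out an unbiased "future" test set.
X_test, y_test = X[:10000], y[:10000]
X_pool, y_pool = X[10000:], y[10000:]

# Feature-dependent selection bias: examples with a large value of the first
# feature are far more likely to be observed in the labeled training sample.
p_select = 1.0 / (1.0 + np.exp(-3.0 * X_pool[:, 0]))
mask = rng.random(len(p_select)) < p_select
X_train, y_train = X_pool[mask], y_pool[mask]

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    # Estimate accuracy by ten-fold cross-validation on the biased sample ...
    cv_acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    # ... and compare with the accuracy on the unbiased test set.
    test_acc = clf.fit(X_train, y_train).score(X_test, y_test)
    print(f"{name:20s} CV on biased train = {cv_acc:.3f}   unbiased test = {test_acc:.3f}")
```

Under this kind of bias, the classifier that looks best under cross-validation on the biased sample need not be the one that performs best on the unbiased test set, which is the selection problem the paper addresses.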