Abstract
In various applications, such as law enforcement and medical screening, one class outnumbers the other, a situation known as class imbalance. Inspection to recognize targets from the minority class is usually driven by experience and expert knowledge, so that targets are found well above the base rate and the inspection process remains feasible. To make the search for targets more efficient, the inspected samples can serve as a training set for a learning method. In this study, we show how the resulting selection bias can be remedied in several ways using unlabeled data. On a synthetic dataset and a real-world law enforcement dataset, we show that adding unlabeled data to the non-targets strongly improves ranking performance. Importantly, leaving out the labeled non-targets entirely and using only the unlabeled data as non-targets gives the best results.
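The best-performing setup described above can be illustrated with a small sketch. The snippet below is not the authors' implementation; it is a minimal, hypothetical example on synthetic 2-D data that treats the entire unlabeled pool as the negative class and scores by a simple nearest-centroid rule (a stand-in for any classifier), then measures ranking quality with a rank-based AUC.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D data: rare targets (positives) vs. abundant non-targets.
n_pos, n_neg = 50, 950
X_pos = rng.normal(loc=2.0, scale=1.0, size=(n_pos, 2))
X_neg = rng.normal(loc=0.0, scale=1.0, size=(n_neg, 2))

# Inspection confirms some targets; everything not inspected forms an
# unlabeled pool that still contains both classes.
X_labeled_pos = X_pos[:25]                    # inspected, confirmed targets
X_unlabeled = np.vstack([X_pos[25:], X_neg])  # uninspected mix

# PU-style training: use the unlabeled pool as the negative class.
mu_pos = X_labeled_pos.mean(axis=0)
mu_unl = X_unlabeled.mean(axis=0)

def score(X):
    """Higher score = closer to the target centroid than to the pool centroid."""
    d_pos = np.linalg.norm(X - mu_pos, axis=1)
    d_unl = np.linalg.norm(X - mu_unl, axis=1)
    return d_unl - d_pos

# Evaluate ranking quality on a fresh sample via the Mann-Whitney AUC:
# the probability that a random positive outscores a random negative.
X_test = np.vstack([rng.normal(2.0, 1.0, (100, 2)),
                    rng.normal(0.0, 1.0, (100, 2))])
y_test = np.array([1] * 100 + [0] * 100)
s = score(X_test)
order = np.argsort(s)
ranks = np.empty(len(s))
ranks[order] = np.arange(1, len(s) + 1)
auc = (ranks[y_test == 1].sum() - 100 * 101 / 2) / (100 * 100)
print(f"AUC of PU-style ranking: {auc:.2f}")
```

Because the unlabeled pool is dominated by non-targets under class imbalance, treating it as the negative class introduces only a small amount of label noise, while avoiding the selection bias carried by the inspected (hence non-randomly chosen) labeled non-targets.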
Notes
1. A strict application of this scenario would make it hard to discover new phenomena or trends, since the goal is to find targets similar to previous cases. This exploratory aspect is, however, not the subject of this study.
Acknowledgements
We would like to thank Gerard Meester and his colleagues from the Inspectorate SZW of the Ministry of Social Affairs and Employment, The Hague, The Netherlands, for their support and making their data available for our research. We also thank Dr. Marco Loog for his critical review of and extensive comments on a previous version of the paper.
Copyright information
© 2016 Springer International Publishing Switzerland
Cite this paper
Jacobusse, G., Veenman, C. (2016). On Selection Bias with Imbalanced Classes. In: Calders, T., Ceci, M., Malerba, D. (eds.) Discovery Science. DS 2016. Lecture Notes in Computer Science, vol. 9956. Springer, Cham. https://doi.org/10.1007/978-3-319-46307-0_21
DOI: https://doi.org/10.1007/978-3-319-46307-0_21
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46306-3
Online ISBN: 978-3-319-46307-0