Abstract
It is well known that in imbalanced classification problems, recall rather than precision is the performance measure to optimize, yet most existing methods that adjust for class imbalance do not explicitly address the optimization of recall. Here we propose an elegant and straightforward variant of the k-nearest neighbor classifier that balances the class prior probabilities internally, under a probabilistic interpretation, and we show how this relates to optimizing recall. We evaluate this novel method against popular k-nearest neighbor-based class imbalance handling algorithms and compare them all to general oversampling and undersampling techniques. We demonstrate that the performance of the proposed method is on par with SMOTE, yet our method is much simpler, outperforms several competitors over a large selection of real-world and synthetic datasets and parameter choices, and has the same complexity as the regular k-nearest neighbor classifier.
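The abstract does not reproduce the algorithm itself, but the general idea of balancing class priors in k-nearest neighbor classification can be sketched as weighting each neighbor's vote by the inverse frequency of its class, so that a rare class and a common class contribute equally under flat priors. This is a minimal, hypothetical illustration of that idea, not the paper's exact formulation:

```python
from collections import Counter
import math

def balanced_knn_predict(X_train, y_train, x, k=3):
    """k-NN vote where each neighbor's vote is weighted by the inverse
    frequency of its class, i.e. class priors are flattened.
    Illustrative sketch only -- not the paper's exact method."""
    # Euclidean distance from the query to every training point.
    dists = sorted(
        (math.dist(p, x), label) for p, label in zip(X_train, y_train)
    )
    neighbors = [label for _, label in dists[:k]]
    # Inverse-frequency weight per class: a neighbor from a class with
    # few training examples casts a proportionally heavier vote.
    freq = Counter(y_train)
    n = len(y_train)
    scores = Counter()
    for label in neighbors:
        scores[label] += n / freq[label]
    return scores.most_common(1)[0][0]

# Toy 4-vs-1 imbalanced training set (hypothetical data).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
y = [0, 0, 0, 0, 1]
# The 3 nearest neighbors of (3, 3) are two majority points and one
# minority point; plain majority voting would return class 0, while the
# inverse-frequency weights let the minority neighbor win.
print(balanced_knn_predict(X, y, (3, 3), k=3))
```

On this toy query the balanced vote predicts the minority class where an unweighted 3-NN majority vote would predict the majority class, which is exactly the shift toward minority-class recall that the abstract describes.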
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Gøttcke, J.M.N., Zimek, A. (2021). Handling Class Imbalance in k-Nearest Neighbor Classification by Balancing Prior Probabilities. In: Reyes, N., et al. (eds.) Similarity Search and Applications. SISAP 2021. Lecture Notes in Computer Science, vol. 13058. Springer, Cham. https://doi.org/10.1007/978-3-030-89657-7_19
Print ISBN: 978-3-030-89656-0
Online ISBN: 978-3-030-89657-7