Abstract
One of the current challenges for the supervised classification methods is to obtain acceptable values of the performance measures for an imbalanced dataset. There is a significant disproportion in the number of objects from different class labels in datasets with a high imbalanced ratio. This article analyzes the clustering and weighted scoring algorithm based on estimating the number of clusters that consider the minimum number of objects from the minority class label in each cluster. This algorithm uses the distance metric when determining the value of the score function. Therefore, this article aims to analyze the impact of selecting the distance metric on the six classification performance measures’ value. The performed experiments show that the Euclidean distance allows obtaining the best classification results for imbalanced datasets.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
Repository link: https://github.com/w4k2/cws-enc-cores21.
References
Abdallah, A., Maarof, M.A., Zainal, A.: Fraud detection system: a survey. J. Netw. Comput. Appl. 68, 90–113 (2016)
Abdulhammed, R., Faezipour, M., Abuzneid, A., AbuMallouh, A.: Deep and machine learning approaches for anomaly-based intrusion detection of imbalanced network traffic. IEEE Sens. Lett. 3(1), 1–4 (2018)
Alcalá-Fdez, J., et al.: Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. J. Multiple-Valued Log. Soft Comput. 17 (2011)
Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge (2014)
Basu, S., Banerjee, A., Mooney, R.: Semi-supervised clustering by seeding. In: Proceedings of 19th International Conference on Machine Learning (ICML-2002). Citeseer (2002)
Choraś, M., Pawlicki, M., Kozik, R.: Recognizing faults in software related difficult data. In: International Conference on Computational Science, pp. 263–272. Springer (2019)
Fotouhi, S., Asadi, S., Kattan, M.W.: A comprehensive data level analysis for cancer diagnosis on imbalanced data. J. Biomed. Inf. 90, 103,089 (2019)
Fred, A., Lourenço, A.: Cluster ensemble methods: from single clusterings to combined solutions. In: Supervised and Unsupervised Ensemble Methods and Their Applications, pp. 3–30. Springer (2008)
Galar, M., Fernandez, A., Barrenechea, E., Bustince, H., Herrera, F.: A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 42(4), 463–484 (2011)
Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., Bing, G.: Learning from class-imbalanced data: review of methods and applications. Expert Syst. Appl. 73, 220–239 (2017)
Kaufmann, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)
Klikowski, J., Burduk, R.: Clustering and weighted scoring algorithm based on estimating the number of clusters. In: International Conference on Computational Science. Springer (2021, accepted)
Klikowski, J., Ksieniewicz, P., Woźniak, M.: A genetic-based ensemble learning applied to imbalanced data classification. In: International Conference on Intelligent Data Engineering and Automated Learning, pp. 340–352. Springer (2019)
Koziarski, M., WoĹşniak, M., Krawczyk, B.: Combined cleaning and resampling algorithm for multi-class imbalanced data with label noise. arXiv preprint arXiv:2004.03406 (2020)
Kozik, R., Choras, M., Keller, J.: Balanced efficient lifelong learning (B-ELLA) for cyber attack detection. J. UCS 25(1), 2–15 (2019)
Krawczyk, B., Woźniak, M., Schaefer, G.: Cost-sensitive decision tree ensembles for effective imbalanced classification. Appl. Soft Comput. 14, 554–562 (2014)
Ksieniewicz, P., Burduk, R.: Clustering and weighted scoring in geometric space support vector machine ensemble for highly imbalanced data classification. In: International Conference on Computational Science, pp. 128–140. Springer (2020)
Ksieniewicz, P., Zyblewski, P.: stream-learn-open-source python library for difficult data stream batch analysis. arXiv preprint arXiv:2001.11077 (2020)
Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, Hoboken (2004)
Lopez-Garcia, P., Masegosa, A.D., Osaba, E., Onieva, E., Perallos, A.: Ensemble classification for imbalanced data based on feature space partitioning and hybrid metaheuristics. Appl. Intell. 49(8), 2807–2822 (2019)
Pedregosa, F., et al.: Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011)
RodrĂguez, J.J., Diez-Pastor, J.F., Arnaiz-Gonzalez, Kuncheva, L.I.: Random balance ensembles for multiclass imbalance learning. Knowl.-Based Syst. 193, 105,434 (2020). https://doi.org/10.1016/j.knosys.2019.105434. https://www.sciencedirect.com/science/article/pii/S0950705119306598
Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)
Szeszko, P., Topczewska, M.: Empirical assessment of performance measures for preprocessing moments in imbalanced data classification problem. In: IFIP International Conference on Computer Information Systems and Industrial Management, pp. 183–194. Springer (2016)
Trajdos, P., Kurzynski, M.: A correction method of a base classifier applied to imbalanced data classification. In: International Conference on Computational Science, pp. 88–102. Springer (2020)
Virtanen, P., et al.: SciPy 1.0 contributors: SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020). https://doi.org/10.1038/s41592-019-0686-2
Woźniak, M., Graña, M., Corchado, E.: A survey of multiple classifier systems as hybrid systems. Inf. Fusion 16, 3–17 (2014)
Zhang, C., et al.: Multi-imbalance: an open-source software for multi-class imbalance learning. Knowl.-Based Syst. 174, 137–143 (2019)
Zyblewski, P., Ksieniewicz, P., Woźniak, M.: Classifier selection for highly imbalanced data streams with minority driven ensemble. In: Rutkowski, L., Scherer, R., Korytkowski, M., Pedrycz, W., Tadeusiewicz, R., Zurada, J.M. (eds.) Artificial Intelligence and Soft Computing, pp. 626–635. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-20912-4_57
Acknowledgements
This work was supported by the Polish National Science Centre under the grant No. 2017/25/B/ST6/01750 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Klikowski, J., Burduk, R. (2022). Distance Metrics in Clustering and Weighted Scoring Algorithm. In: Choraś, M., Choraś, R.S., Kurzyński, M., Trajdos, P., Pejaś, J., Hyla, T. (eds) Progress in Image Processing, Pattern Recognition and Communication Systems. CORES IP&C ACS 2021 2021 2021. Lecture Notes in Networks and Systems, vol 255. Springer, Cham. https://doi.org/10.1007/978-3-030-81523-3_3
Download citation
DOI: https://doi.org/10.1007/978-3-030-81523-3_3
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81522-6
Online ISBN: 978-3-030-81523-3
eBook Packages: Intelligent Technologies and RoboticsIntelligent Technologies and Robotics (R0)