Distance Metrics in Clustering and Weighted Scoring Algorithm

Klikowski, Jakub; Burduk, Robert

doi:10.1007/978-3-030-81523-3_3

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 255)

Abstract

One of the current challenges for supervised classification methods is to obtain acceptable values of performance measures on imbalanced datasets. In datasets with a high imbalance ratio, the number of objects differs greatly between class labels. This article analyzes the clustering and weighted scoring algorithm based on estimating the number of clusters, where the estimate takes into account the minimum number of minority-class objects in each cluster. The algorithm uses a distance metric when determining the value of the score function. The article therefore analyzes how the choice of distance metric affects the values of six classification performance measures. The experiments show that the Euclidean distance yields the best classification results for imbalanced datasets.
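To illustrate why the choice of distance metric can change a distance-based score, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not give the score function, so `distance_weights` is a hypothetical inverse-distance weighting over cluster centroids. The Manhattan and Chebyshev metrics are illustrative alternatives to the Euclidean distance and are not confirmed by the abstract as the ones compared in the paper.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    # Largest absolute coordinate difference.
    return max(abs(x - y) for x, y in zip(a, b))

METRICS = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
}

def distance_weights(sample, centroids, metric):
    """Hypothetical score (an assumption, not taken from the paper):
    inverse-distance weights over centroids, normalized to sum to 1."""
    dist = METRICS[metric]
    raw = [1.0 / (1.0 + dist(sample, c)) for c in centroids]
    total = sum(raw)
    return [r / total for r in raw]

if __name__ == "__main__":
    centroids = [(0.0, 0.0), (3.0, 4.0)]  # made-up cluster centers
    sample = (1.0, 2.0)                   # made-up object to score
    for name in METRICS:
        weights = distance_weights(sample, centroids, name)
        print(name, [round(w, 3) for w in weights])
```

For this made-up sample, the Euclidean and Manhattan metrics favor the first centroid, while the Chebyshev metric places it equally far from both. The same object can therefore receive different weights depending only on the metric.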

Notes

1. Repository link: https://github.com/w4k2/cws-enc-cores21

Acknowledgements

This work was supported by the Polish National Science Centre under the grant No. 2017/25/B/ST6/01750 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.

Author information

Authors and Affiliations

Department of Systems and Computer Networks, Wroclaw University of Science and Technology, Wroclaw, Poland
Jakub Klikowski & Robert Burduk

Corresponding author

Correspondence to Robert Burduk.

Editor information

Editors and Affiliations

Institute of Computer Science and Telecommunications, University of Science and Technology, Bydgoszcz, Poland
Michal Choraś
Institute of Computer Science and Telecommunications, University of Science and Technology, Bydgoszcz, Poland
Ryszard S. Choraś
Faculty of Electronics, Wroclaw University of Science and Technology, Wrocław, Poland
Marek Kurzyński
Faculty of Electronics, Wroclaw University of Science and Technology, Wrocław, Poland
Paweł Trajdos
West Pomeranian University of Technology in Szczecin, Szczecin, Poland
Jerzy Pejaś
West Pomeranian University of Technology in Szczecin, Szczecin, Poland
Tomasz Hyla

About this paper

Cite this paper

Klikowski, J., Burduk, R. (2022). Distance Metrics in Clustering and Weighted Scoring Algorithm. In: Choraś, M., Choraś, R.S., Kurzyński, M., Trajdos, P., Pejaś, J., Hyla, T. (eds) Progress in Image Processing, Pattern Recognition and Communication Systems. CORES 2021, IP&C 2021, ACS 2021. Lecture Notes in Networks and Systems, vol 255. Springer, Cham. https://doi.org/10.1007/978-3-030-81523-3_3

DOI: https://doi.org/10.1007/978-3-030-81523-3_3
Published: 18 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-81522-6
Online ISBN: 978-3-030-81523-3
eBook Packages: Intelligent Technologies and Robotics (R0)
