Distance Metrics in Clustering and Weighted Scoring Algorithm

  • Conference paper
Progress in Image Processing, Pattern Recognition and Communication Systems (CORES 2021, IP&C 2021, ACS 2021)

Abstract

One of the current challenges for supervised classification methods is obtaining acceptable values of performance measures on imbalanced datasets. In datasets with a high imbalance ratio, the number of objects differs significantly between class labels. This article analyzes the clustering and weighted scoring algorithm based on estimating the number of clusters, where the estimate takes into account the minimum number of objects from the minority class in each cluster. The algorithm uses a distance metric when determining the value of the score function. This article therefore examines how the choice of distance metric affects the values of six classification performance measures. The experiments show that the Euclidean distance yields the best classification results for imbalanced datasets.
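To illustrate why the choice of distance metric can matter for a distance-based score, the following minimal sketch compares three common metrics on toy data. It is not the authors' algorithm or score function, which the abstract does not specify. The Manhattan and Chebyshev metrics, the helper `nearest_centroid`, and the centroid and query values are all assumptions chosen for illustration; only the Euclidean distance is named in the abstract.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (L1) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

def chebyshev(a, b):
    """Maximum coordinate difference (L-infinity) between two points."""
    return max(abs(x - y) for x, y in zip(a, b))

# Illustrative set of metrics; not necessarily those compared in the paper.
METRICS = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "chebyshev": chebyshev,
}

def nearest_centroid(point, centroids, metric):
    """Return the index of the centroid closest to `point` under `metric`.

    Hypothetical helper: stands in for any step in which a distance
    to a cluster influences a score.
    """
    dist = METRICS[metric]
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))

# Hypothetical toy data: two cluster centroids and one query object.
centroids = [(3.0, 3.0), (5.0, 0.0)]
query = (0.0, 0.0)

for name in METRICS:
    print(name, nearest_centroid(query, centroids, name))
```

On this toy data the Euclidean and Chebyshev distances select the first centroid, while the Manhattan distance selects the second, so any score derived from these distances would rank the clusters differently depending on the metric.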

Notes

  1. Repository link: https://github.com/w4k2/cws-enc-cores21

Acknowledgements

This work was supported by the Polish National Science Centre under the grant No. 2017/25/B/ST6/01750 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Electronics, Wroclaw University of Science and Technology.

Author information

Corresponding author

Correspondence to Robert Burduk.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Klikowski, J., Burduk, R. (2022). Distance Metrics in Clustering and Weighted Scoring Algorithm. In: Choraś, M., Choraś, R.S., Kurzyński, M., Trajdos, P., Pejaś, J., Hyla, T. (eds) Progress in Image Processing, Pattern Recognition and Communication Systems. CORES 2021, IP&C 2021, ACS 2021. Lecture Notes in Networks and Systems, vol 255. Springer, Cham. https://doi.org/10.1007/978-3-030-81523-3_3
