The Quality of Clustering Data Containing Outliers

  • Conference paper
Intelligent Information and Database Systems (ACIIDS 2022)

Abstract

This article evaluates the efficiency and performance of two clustering algorithms: agglomerative hierarchical clustering (AHC), with various linkage options and distance measures, and the K-Means algorithm. We assess clustering quality using the Davies-Bouldin and Dunn cluster validity indices. Our goal is to compare and analyze outlier detection algorithms depending on the clustering algorithm applied, and to verify whether clusters built without outliers are of higher quality than clusters built with them. We compare the LOF (Local Outlier Factor) and COF (Connectivity-based Outlier Factor) algorithms for detecting outliers, selecting the 1%, 5%, and 10% most outlying instances in a given dataset, and then analyze how clustering quality improves after these outliers are excluded. The experiments use three real datasets with different numbers of instances. We investigate whether the choice of clustering algorithm and outlier detection method matters, and whether the clustering parameters affect the results obtained. To the best of our knowledge, no previous study combines these issues.
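The workflow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes scikit-learn and SciPy, uses synthetic data rather than the three real datasets, covers only LOF (COF is not part of scikit-learn and is left out), and the helper names `dunn_index`, `drop_top_outliers`, and `evaluate`, as well as `n_neighbors=20` and `n_clusters=3`, are hypothetical choices made for the example.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score
from sklearn.neighbors import LocalOutlierFactor


def dunn_index(X, labels):
    """Dunn index: smallest between-cluster distance / largest cluster diameter."""
    clusters = [X[labels == c] for c in np.unique(labels)]
    # Singleton clusters have no pairwise distances, so they are skipped here.
    max_diameter = max(pdist(c).max() for c in clusters if len(c) > 1)
    min_separation = min(
        cdist(a, b).min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


def drop_top_outliers(X, fraction, n_neighbors=20):
    """Remove the given fraction of rows with the highest LOF scores."""
    lof = LocalOutlierFactor(n_neighbors=n_neighbors).fit(X)
    scores = -lof.negative_outlier_factor_  # larger means more outlying
    n_drop = int(round(fraction * len(X)))
    keep = np.argsort(scores)[: len(X) - n_drop]
    return X[np.sort(keep)]


def evaluate(X, n_clusters=3):
    """Return {algorithm: (Davies-Bouldin, Dunn)} for one dataset."""
    models = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=0),
        "ahc_ward": AgglomerativeClustering(n_clusters=n_clusters, linkage="ward"),
        "ahc_average": AgglomerativeClustering(n_clusters=n_clusters, linkage="average"),
    }
    results = {}
    for name, model in models.items():
        labels = model.fit_predict(X)
        results[name] = (davies_bouldin_score(X, labels), dunn_index(X, labels))
    return results


# Synthetic stand-in data: three blobs plus uniformly scattered noise points.
rng = np.random.default_rng(0)
blobs, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)
noise = rng.uniform(blobs.min(axis=0), blobs.max(axis=0), size=(15, 2))
X = np.vstack([blobs, noise])

# Compare validity indices before and after removing 1%, 5%, and 10% of rows.
for fraction in (0.0, 0.01, 0.05, 0.10):
    X_clean = drop_top_outliers(X, fraction)
    for name, (db, dunn) in evaluate(X_clean).items():
        print(f"removed={fraction:.0%} {name}: DB={db:.3f} Dunn={dunn:.3f}")
```

A lower Davies-Bouldin value and a higher Dunn value both indicate more compact, better separated clusters, so the two indices are read in opposite directions when comparing rows of the output.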

Author information

Corresponding author

Correspondence to Agnieszka Nowak-Brzezińska.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Nowak-Brzezińska, A., Gaibei, I. (2022). The Quality of Clustering Data Containing Outliers. In: Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, TP., Trawiński, B., Szczerbicki, E. (eds) Intelligent Information and Database Systems. ACIIDS 2022. Lecture Notes in Computer Science, vol 13758. Springer, Cham. https://doi.org/10.1007/978-3-031-21967-2_8

  • DOI: https://doi.org/10.1007/978-3-031-21967-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-21966-5

  • Online ISBN: 978-3-031-21967-2

  • eBook Packages: Computer Science, Computer Science (R0)
