Does Number of Clusters Effect the Purity and Entropy of Clustering?

  • Conference paper

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 549)

Abstract

Cluster analysis automatically partitions data into a number of meaningful groups, or clusters, using clustering algorithms. Every clustering algorithm produces its own type of clusters, so evaluating the clustering is important for identifying the better algorithm. A number of evaluation measures exist, which can be broadly divided into internal, external and relative measures. Internal measures, such as cluster cohesion and the number of clusters (NoC), assess the quality of the obtained clusters. External measures, such as purity and entropy, quantify the extent to which the clustering structure discovered by an algorithm matches some external structure, while relative measures compare two different clustering results using internal or external measures. To explore the effect of internal measures, specifically the NoC, on external evaluation measures such as purity and entropy, an empirical study is conducted. The idea stems from the observation that the NoC obtained in the clustering process is an indicator of the success of a clustering algorithm. In this paper, some necessary propositions are formulated, and four previously used test cases are then considered to validate the effect of the NoC on purity and entropy. The proofs and experimental results indicate that purity tends toward its maximum and entropy toward its minimum as the NoC increases.
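
The relationship described in the abstract can be illustrated with a minimal Python sketch. It assumes the widely used textbook definitions, since the paper's own formulas are not reproduced on this page: purity is the fraction of objects that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy (base 2). The function names and the toy labels are hypothetical and are not taken from the paper's test cases.

```python
from collections import Counter, defaultdict
from math import log2


def _class_counts(cluster_ids, class_labels):
    """Count, for every cluster, how many objects of each class it holds."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        counts[cluster][label] += 1
    return counts


def purity(cluster_ids, class_labels):
    """Fraction of objects that fall in the majority class of their cluster."""
    counts = _class_counts(cluster_ids, class_labels)
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(class_labels)


def entropy(cluster_ids, class_labels):
    """Size-weighted mean of the class entropy (in bits) of each cluster."""
    counts = _class_counts(cluster_ids, class_labels)
    n = len(class_labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((m / size) * log2(m / size) for m in c.values())
        total += (size / n) * h
    return total


# Toy data (assumed for illustration): six objects from two classes.
classes = [0, 0, 0, 1, 1, 1]

# Three nested clusterings with an increasing number of clusters (NoC).
one_cluster = [0, 0, 0, 0, 0, 0]   # NoC = 1
two_clusters = [0, 0, 1, 1, 1, 1]  # NoC = 2
singletons = [0, 1, 2, 3, 4, 5]    # NoC = 6

for name, ids in [("NoC=1", one_cluster),
                  ("NoC=2", two_clusters),
                  ("NoC=6", singletons)]:
    print(name, round(purity(ids, classes), 3), round(entropy(ids, classes), 3))
```

In this toy example, purity rises from 0.5 through about 0.83 to 1.0, and entropy falls from 1.0 through about 0.54 to 0, as the clustering is refined from one cluster to singletons. The singleton case shows why these measures should be read with care: a clustering with one object per cluster scores perfectly on both without being useful.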

Acknowledgment

The authors would like to thank Universiti Tun Hussein Onn Malaysia (UTHM) and Ministry of Higher Education (MOHE) Malaysia for financially supporting this research under the Fundamental Research Grant Scheme (FRGS), Vote No. 1235.

Author information

Corresponding author

Correspondence to Jamal Uddin.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Uddin, J., Ghazali, R., Deris, M.M. (2017). Does Number of Clusters Effect the Purity and Entropy of Clustering?. In: Herawan, T., Ghazali, R., Nawi, N.M., Deris, M.M. (eds) Recent Advances on Soft Computing and Data Mining. SCDM 2016. Advances in Intelligent Systems and Computing, vol 549. Springer, Cham. https://doi.org/10.1007/978-3-319-51281-5_36

  • DOI: https://doi.org/10.1007/978-3-319-51281-5_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-51279-2

  • Online ISBN: 978-3-319-51281-5

  • eBook Packages: Engineering, Engineering (R0)
