Does Number of Clusters Effect the Purity and Entropy of Clustering?

Uddin, Jamal; Ghazali, Rozaida; Deris, Mustafa Mat

doi:10.1007/978-3-319-51281-5_36

Conference paper
First Online: 29 December 2016

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 549)

Abstract

Cluster analysis automatically partitions data into a number of meaningful groups, or clusters, using clustering algorithms. Every clustering algorithm produces its own type of clusters, so evaluating the clustering is essential for identifying the better algorithm. Many evaluation measures exist, and they can be broadly divided into internal, external and relative measures. Internal measures, such as cluster cohesion and the number of clusters (NoC), assess the quality of the obtained clusters. External measures, such as purity and entropy, quantify the extent to which the clustering structure discovered by an algorithm matches some external structure, while relative measures compare two different clustering results using internal or external measures. To explore the effect of internal measures, specifically the NoC, on external evaluation measures such as purity and entropy, an empirical study is conducted. The idea stems from the fact that the NoC obtained in the clustering process is an indicator of how successful a clustering algorithm is. In this paper, some necessary propositions are formulated, and four previously used test cases are then considered to validate the effect of NoC on purity and entropy. The proofs and experimental results indicate that purity is maximized and entropy is minimized as the NoC increases.
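The trend described in the abstract can be illustrated with a small sketch. It assumes the standard textbook definitions: purity is the fraction of objects that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy (base 2). These may differ in detail from the formulas used in the paper, which the preview does not reproduce. The function names and the toy labels are hypothetical. The three clusterings are nested refinements of one another, which is the case in which purity cannot decrease and entropy cannot increase.

```python
from collections import Counter
from math import log2


def _group_by_cluster(labels_true, labels_pred):
    """Collect the true class labels of the objects in each cluster."""
    groups = {}
    for true_label, cluster_id in zip(labels_true, labels_pred):
        groups.setdefault(cluster_id, []).append(true_label)
    return groups


def purity(labels_true, labels_pred):
    """Fraction of objects that fall in the majority class of their cluster."""
    groups = _group_by_cluster(labels_true, labels_pred)
    majority_total = sum(max(Counter(m).values()) for m in groups.values())
    return majority_total / len(labels_true)


def entropy(labels_true, labels_pred):
    """Size-weighted average of the class entropy (base 2) of each cluster."""
    groups = _group_by_cluster(labels_true, labels_pred)
    n = len(labels_true)
    total = 0.0
    for members in groups.values():
        size = len(members)
        h = -sum((c / size) * log2(c / size) for c in Counter(members).values())
        total += (size / n) * h
    return total


# Hypothetical toy data: six objects with known classes.
classes = ["a", "a", "a", "b", "b", "c"]

# Three nested clusterings with an increasing number of clusters (NoC).
one_cluster = [0, 0, 0, 0, 0, 0]   # NoC = 1
two_clusters = [0, 0, 0, 1, 1, 1]  # NoC = 2
singletons = [0, 1, 2, 3, 4, 5]    # NoC = 6

for name, pred in [("NoC=1", one_cluster),
                   ("NoC=2", two_clusters),
                   ("NoC=6", singletons)]:
    print(name, round(purity(classes, pred), 3), round(entropy(classes, pred), 3))
```

In the extreme case where every object forms its own cluster, purity reaches 1 and entropy reaches 0 regardless of how meaningful the grouping is, which is why these measures are usually read together with the NoC.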

Acknowledgment

The authors would like to thank Universiti Tun Hussein Onn Malaysia (UTHM) and Ministry of Higher Education (MOHE) Malaysia for financially supporting this research under the Fundamental Research Grant Scheme (FRGS), Vote No. 1235.

Author information

Authors and Affiliations

Faculty of Computer Science and Information Technology, Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Jamal Uddin, Rozaida Ghazali & Mustafa Mat Deris

Corresponding author

Correspondence to Jamal Uddin.

Editor information

Editors and Affiliations

Department of Information System, University of Malaya, Kuala Lumpur, Malaysia
Tutut Herawan
Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Rozaida Ghazali
Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Nazri Mohd Nawi
Universiti Tun Hussein Onn Malaysia, Batu Pahat, Malaysia
Mustafa Mat Deris

About this paper

Cite this paper

Uddin, J., Ghazali, R., Deris, M.M. (2017). Does Number of Clusters Effect the Purity and Entropy of Clustering? In: Herawan, T., Ghazali, R., Nawi, N.M., Deris, M.M. (eds) Recent Advances on Soft Computing and Data Mining. SCDM 2016. Advances in Intelligent Systems and Computing, vol 549. Springer, Cham. https://doi.org/10.1007/978-3-319-51281-5_36

DOI: https://doi.org/10.1007/978-3-319-51281-5_36
Published: 29 December 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-51279-2
Online ISBN: 978-3-319-51281-5
eBook Packages: Engineering, Engineering (R0)
