ABSTRACT
Clustering is a relevant research area in Machine Learning. Its purpose is to group the objects in a dataset so that each group contains similar objects, while objects in different groups are dissimilar. Interpreting the resulting groups is fundamental to the practical applicability of clustering. Automatic labeling, as defined in this research, produces tuples composed of attributes and their respective value ranges. Each cluster must have a set of tuples that uniquely identifies all of its objects, so that clusters are distinguishable from each other either by different representative attributes or by different value ranges for the same attribute. This paper presents an unsupervised cluster labeling method that employs the CAIM (Class-Attribute Interdependency Maximization) discretization algorithm to find representative value ranges in the attributes that are relevant for cluster interpretation. The method aims to mitigate the limitations observed in earlier work on automatic cluster labeling. Tests carried out with three datasets (Seeds, Iris, and Glass) yield an average accuracy of 97.20% for the suggested labels. The suggested labels are made up of few attributes compared with previous labelers, and in most cases a single attribute is sufficient to define a label.
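The abstract does not spell out how CAIM is applied, so the following is only a minimal sketch of the underlying idea, assuming that cluster ids play the role of class labels. It computes the CAIM criterion of Kurgan and Cios for one attribute, that is, the average over intervals of (largest per-cluster count in the interval)² divided by the interval size, and then tries a single cut point. The function names `caim_score` and `best_single_cut` and the toy data are hypothetical and are not taken from the paper; the full CAIM algorithm adds boundaries greedily rather than stopping at one cut.

```python
from bisect import bisect_left
from collections import Counter


def caim_score(values, clusters, cut_points):
    """CAIM criterion for one attribute under the given interior cut points.

    Assumption: cluster ids are used in place of class labels.
    """
    cuts = sorted(cut_points)
    n_intervals = len(cuts) + 1
    # counts[r][c] = number of objects of cluster c whose value falls in interval r
    counts = [Counter() for _ in range(n_intervals)]
    for value, cluster in zip(values, clusters):
        # bisect_left puts a value equal to a cut point into the lower interval
        counts[bisect_left(cuts, value)][cluster] += 1
    total = 0.0
    for interval in counts:
        size = sum(interval.values())
        if size:  # empty intervals contribute nothing
            total += max(interval.values()) ** 2 / size
    return total / n_intervals


def best_single_cut(values, clusters):
    """Try every midpoint between consecutive distinct values; keep the best one."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda cut: caim_score(values, clusters, [cut]))


# Hypothetical toy attribute: two well separated clusters
values = [1.0, 1.2, 1.4, 3.0, 3.2, 3.4]
clusters = [0, 0, 0, 1, 1, 1]
cut = best_single_cut(values, clusters)
print(cut, caim_score(values, clusters, [cut]))
```

On this toy attribute, the cut that separates the two clusters scores higher than any cut that mixes them, which suggests why a range such as "attribute below the cut" could serve as a one-attribute label for the first cluster.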
Index Terms
- CAIBAL: cluster-attribute interdependency based automatic labeler