research-article

CAIBAL: cluster-attribute interdependency based automatic labeler

Authors:
Marcel Moura, Instituto Federal do Piauí (IFPI), Cocal, Piauí, Brazil
Rodrigo Veras, Universidade Federal do Piauí (UFPI), Teresina, Piauí, Brazil
Vinicius Machado, Universidade Federal do Piauí (UFPI), Teresina, Piauí, Brazil

SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, April 2022, Pages 1109–1116
https://doi.org/10.1145/3477314.3507140

Published: 06 May 2022

ABSTRACT

Clustering is a relevant research area in machine learning. Its purpose is to group the objects in a dataset so that each group consists of similar objects, while different groups remain dissimilar from one another. Interpreting the resulting groups is fundamental to the practical applicability of clustering. Automatic labeling, as defined in this research, produces tuples composed of attributes and their respective value ranges. Each cluster must have enough tuples to uniquely identify its objects, so that clusters are distinguishable from each other either by different representative attributes or by different value ranges of the same attribute. This paper presents an unsupervised cluster labeling method that employs the CAIM (Class-Attribute Interdependency Maximization) discretization algorithm to find representative value ranges in the attributes that are relevant for cluster interpretation. The method aims to mitigate the limitations observed in earlier work on automatic cluster labeling. Tests carried out on three datasets (Seeds, Iris, and Glass) yield an average accuracy of 97.20% for the suggested labels. The suggested labels consist of few attributes compared with those of previous labelers, and in most cases a single attribute is sufficient to define a label.
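The idea in the abstract can be illustrated with a minimal sketch. It treats cluster ids as the "classes" of CAIM, scores a set of cut points on one attribute with the CAIM criterion (the mean over intervals of max_r^2 / M_r, where max_r is the largest per-cluster count in interval r and M_r is the interval's total count), and greedily adds cut points. This is an illustration under stated assumptions, not the authors' CAIBAL implementation: the function names `caim_score`, `caim_cuts`, and `suggest_ranges`, the simplified stopping rule, and the toy data are all hypothetical.

```python
from bisect import bisect_left
from collections import Counter


def caim_score(values, clusters, cuts):
    """CAIM criterion for one attribute: mean over intervals of max_r**2 / M_r."""
    cuts = sorted(cuts)
    # One counter of cluster ids per interval (lo, hi]; len(cuts) + 1 intervals.
    per_interval = [Counter() for _ in range(len(cuts) + 1)]
    for v, c in zip(values, clusters):
        per_interval[bisect_left(cuts, v)][c] += 1
    total = sum(max(cnt.values()) ** 2 / sum(cnt.values())
                for cnt in per_interval if cnt)
    return total / len(per_interval)


def caim_cuts(values, clusters):
    """Greedy search: add the best midpoint while the criterion improves
    or while there are fewer intervals than clusters (simplified rule)."""
    n_clusters = len(set(clusters))
    uniq = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])]
    cuts, best = [], 0.0
    while candidates:
        score, cut = max((caim_score(values, clusters, cuts + [c]), c)
                         for c in candidates)
        if score > best or len(cuts) + 1 < n_clusters:
            cuts.append(cut)
            candidates.remove(cut)
            best = score
        else:
            break
    return sorted(cuts)


def suggest_ranges(values, clusters, cuts):
    """For each cluster, return (lo, hi, share): the interval holding most of
    its members and the fraction of the cluster that falls inside it."""
    cuts = sorted(cuts)
    edges = [min(values)] + cuts + [max(values)]
    votes = {}
    for v, c in zip(values, clusters):
        votes.setdefault(c, Counter())[bisect_left(cuts, v)] += 1
    ranges = {}
    for c, cnt in votes.items():
        idx, hits = cnt.most_common(1)[0]
        ranges[c] = (edges[idx], edges[idx + 1], hits / sum(cnt.values()))
    return ranges


if __name__ == "__main__":
    # Toy attribute and cluster assignment, invented for illustration only.
    values = [1.0, 1.2, 1.4, 5.0, 5.2, 5.4]
    clusters = [0, 0, 0, 1, 1, 1]
    cuts = caim_cuts(values, clusters)
    print(cuts, suggest_ranges(values, clusters, cuts))
```

On this toy data the search keeps a single cut near 3.2, so each cluster is described by one attribute range. Repeating the procedure per attribute and keeping the attribute whose ranges separate the clusters best would give label tuples of the kind the abstract describes, though how CAIBAL actually selects attributes is not specified here.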

Index Terms

CAIBAL: cluster-attribute interdependency based automatic labeler
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
      2. Unsupervised learning
        Cluster analysis
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information systems applications
    1. Data mining


Published in
SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing
April 2022
2099 pages
ISBN: 9781450387132
DOI: 10.1145/3477314
Conference Chairs: Jiman Hong (Soongsil University), Miroslav Bures (Czech Technical University, Czechia)
Program Chairs: Juw Won Park (University of Louisville), Tomas Cerny (Baylor University)
Copyright © 2022 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering
interpretation
labeling
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%