DOI: 10.1145/3477314.3507140
research-article

CAIBAL: cluster-attribute interdependency based automatic labeler

Published: 06 May 2022

ABSTRACT

Clustering is an important research area in machine learning. Its purpose is to group the objects in a dataset so that each group contains similar objects, while different groups remain dissimilar from one another. Interpreting the resulting groups is fundamental to the practical applicability of clustering. Automatic labeling, as defined in this research, produces tuples composed of attributes and their respective value ranges. Each cluster must have enough tuples to identify all of its objects uniquely, so that clusters are distinguishable from each other either by different representative attributes or by different value ranges for the same attribute. This paper presents an unsupervised cluster labeling method that employs the CAIM (Class-Attribute Interdependency Maximization) discretization algorithm to find representative value ranges in the attributes that are relevant for cluster interpretation. The method aims to mitigate the limitations observed in earlier work on automatic cluster labeling. Tests carried out with three datasets (Seeds, Iris and Glass) yield an average accuracy of 97.20% for the suggested labels. The suggested labels use few attributes compared with previous labelers, and in most cases a single attribute is sufficient to define a label.
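The abstract does not give the algorithm's details, so the following is only a minimal sketch of the two ideas it names, under stated assumptions. `caim_criterion` computes the CAIM criterion of Kurgan and Cios for one attribute, with cluster assignments standing in for class labels (an assumed reading of "cluster-attribute interdependency"). `label_accuracy` assumes that the accuracy of a label (attribute, range) is the fraction of a cluster's objects whose value falls inside the range; the paper may define it differently. All function names and the toy data are hypothetical.

```python
from collections import Counter


def caim_criterion(values, clusters, boundaries):
    """CAIM value for one attribute cut at the given sorted boundaries.

    Intervals are [b0, b1], (b1, b2], ..., (b_{n-1}, b_n].
    Cluster ids play the role of class labels (an assumption of this sketch).
    """
    n_intervals = len(boundaries) - 1
    total = 0.0
    for r in range(n_intervals):
        lo, hi = boundaries[r], boundaries[r + 1]
        # The first interval is closed on the left so the minimum is included.
        in_interval = [
            c for v, c in zip(values, clusters)
            if (lo <= v <= hi if r == 0 else lo < v <= hi)
        ]
        if not in_interval:
            continue  # an empty interval contributes nothing
        max_r = max(Counter(in_interval).values())  # dominant cluster count
        total += max_r ** 2 / len(in_interval)
    return total / n_intervals


def label_accuracy(values, clusters, cluster_id, lo, hi):
    """Fraction of the cluster's objects whose value lies in [lo, hi].

    Assumed definition of label accuracy; not taken from the paper.
    """
    members = [v for v, c in zip(values, clusters) if c == cluster_id]
    hits = sum(1 for v in members if lo <= v <= hi)
    return hits / len(members)


# Toy data (made up): one attribute, two well-separated clusters.
values = [1.0, 1.2, 1.4, 3.0, 3.2, 3.4]
clusters = [0, 0, 0, 1, 1, 1]

clean_cut = caim_criterion(values, clusters, [1.0, 2.0, 3.4])
mixed_cut = caim_criterion(values, clusters, [1.0, 3.1, 3.4])
accuracy = label_accuracy(values, clusters, 0, 1.0, 2.0)
print(clean_cut, mixed_cut, accuracy)
```

On this toy data, the cut that separates the two clusters scores higher than the cut that mixes them, which is the property a CAIM-style search exploits when choosing the value range for a label.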


Published in

SAC '22: Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing
April 2022
2099 pages
ISBN: 9781450387132
DOI: 10.1145/3477314

Copyright © 2022 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%