DOI: 10.1145/3289402.3289543
research-article

Finding disjoint clusters in a categorical data space

Published: 24 October 2018

ABSTRACT

In this paper we present a prototype method for segmenting high-dimensional categorical data using frequent patterns. The frequent patterns are mined with a conventional frequent pattern mining algorithm according to a predefined support threshold. In addition, we restrict the length of the frequent patterns to a predefined low value to keep the results understandable. Associations between the frequent patterns are then discovered to reveal containment and overlap between them. Segments are defined iteratively as the largest region of the data space covered by several frequent patterns. An illustrative example shows promising results in terms of the quality and understandability of the resulting segments.
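The pipeline outlined in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract names neither the mining algorithm nor the exact segment-selection rule, so the code below assumes a level-wise (Apriori-style) search, compares patterns by the sets of rows they cover, and simplifies segment construction to a greedy choice of one pattern per segment. All function names, parameter values, and the toy data are hypothetical.

```python
from itertools import combinations


def frequent_patterns(rows, min_support, max_len):
    """Return {pattern: cover} for patterns meeting min_support (> 0).

    A pattern is a frozenset of (attribute, value) pairs with at most
    max_len items; its cover is the set of row indices that match it.
    Level-wise search: only frequent patterns are extended.
    """
    n = len(rows)
    covers = {}
    for i, row in enumerate(rows):
        for item in row.items():
            covers.setdefault(frozenset([item]), set()).add(i)
    level = {p: c for p, c in covers.items() if len(c) / n >= min_support}
    result = dict(level)
    for _ in range(max_len - 1):
        nxt = {}
        for (p, cp), (q, cq) in combinations(level.items(), 2):
            cand = p | q
            if len(cand) != len(p) + 1 or cand in nxt:
                continue
            cover = cp & cq  # rows matching both parent patterns
            if len(cover) / n >= min_support:
                nxt[cand] = cover
        result.update(nxt)
        level = nxt
    return result


def relate(cover_a, cover_b):
    """Classify two covers as containment, overlap, or disjoint."""
    if cover_a <= cover_b or cover_b <= cover_a:
        return "containment"
    if cover_a & cover_b:
        return "overlap"
    return "disjoint"


def greedy_segments(patterns, n_rows):
    """Assumed simplification: repeatedly take the pattern covering the
    most still-unassigned rows, so the resulting segments are disjoint."""
    unassigned = set(range(n_rows))
    segments = []
    while unassigned:
        best, region = None, set()
        for p in sorted(patterns, key=sorted):  # deterministic tie-breaking
            r = patterns[p] & unassigned
            if len(r) > len(region):
                best, region = p, r
        if best is None:  # no pattern covers the remaining rows
            break
        segments.append((best, region))
        unassigned -= region
    return segments, unassigned


# Hypothetical toy data set: six records over three categorical attributes.
rows = [
    {"color": "red", "shape": "round", "size": "small"},
    {"color": "red", "shape": "round", "size": "large"},
    {"color": "red", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "square", "size": "large"},
    {"color": "blue", "shape": "square", "size": "small"},
    {"color": "blue", "shape": "round", "size": "large"},
]
patterns = frequent_patterns(rows, min_support=0.3, max_len=2)
segments, leftover = greedy_segments(patterns, len(rows))
for pattern, region in segments:
    print(sorted(pattern), sorted(region))
```

Capping `max_len` at a small value keeps each segment description to a handful of attribute-value pairs, which is the understandability argument the abstract makes. The greedy rule is only one possible reading of "largest region covered by several frequent patterns"; a fuller version would merge overlapping patterns before selecting a region.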


Published in

    SITA'18: Proceedings of the 12th International Conference on Intelligent Systems: Theories and Applications
    October 2018
    301 pages
    ISBN: 9781450364621
    DOI: 10.1145/3289402

    Copyright © 2018 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • research-article
    • Research
    • Refereed limited
