Squeezer: An efficient algorithm for clustering categorical data

Journal of Computer Science and Technology

Abstract

This paper presents Squeezer, a new efficient algorithm for clustering categorical data that produces high-quality clustering results while also scaling well. The Squeezer algorithm reads each tuple t in sequence, either assigning t to an existing cluster (initially there are none) or creating a new cluster from t; the choice is determined by the similarities between t and the existing clusters. Because of these characteristics, the proposed algorithm is well suited to clustering data streams, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence so far using a small amount of memory and time. Outliers can also be handled efficiently and directly in Squeezer. Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.
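The single-pass procedure described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not define the similarity measure or the assignment rule, so the `similarity` function (the average fraction of cluster members that share the tuple's value on each attribute), the `threshold` parameter, and the per-cluster summary layout are all assumptions introduced here.

```python
from collections import Counter


def similarity(summary, size, t):
    """Assumed measure: average, over attributes, of the fraction of
    cluster members that share t's value on that attribute."""
    return sum(summary[i][v] / size for i, v in enumerate(t)) / len(t)


def squeezer_sketch(tuples, threshold):
    """One pass over the data: assign each tuple to the most similar
    existing cluster, or start a new cluster when no cluster reaches
    the (hypothetical) similarity threshold."""
    clusters = []  # each cluster: {"size": int, "summary": [Counter per attribute]}
    labels = []
    for t in tuples:
        best, best_sim = None, -1.0
        for idx, c in enumerate(clusters):
            s = similarity(c["summary"], c["size"], t)
            if s > best_sim:
                best, best_sim = idx, s
        if best is None or best_sim < threshold:
            # No sufficiently similar cluster: create a new one for t.
            clusters.append({"size": 0, "summary": [Counter() for _ in t]})
            best = len(clusters) - 1
        # Update the chosen cluster's summary; raw tuples are not stored.
        c = clusters[best]
        c["size"] += 1
        for i, v in enumerate(t):
            c["summary"][i][v] += 1
        labels.append(best)
    return labels, clusters


data = [
    ("a", "x", "p"),
    ("a", "x", "q"),
    ("b", "y", "r"),
    ("a", "y", "p"),
    ("b", "y", "r"),
]
labels, clusters = squeezer_sketch(data, threshold=0.4)
print(labels)
```

Because each cluster is represented only by per-attribute value counts, memory in this sketch grows with the number of clusters and distinct attribute values rather than with the number of tuples read, which is the property that makes a one-pass scheme of this kind attractive for streams.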

Author information

Corresponding author

Correspondence to He Zengyou.

Additional information

This work was supported by the National Natural Science Foundation of China (Grant No. 60084004) and the IBM AS/400 Research Fund.

HE Zengyou received his M.S. degree in computer science from Harbin Institute of Technology (HIT) in 2002. He is currently a Ph.D. candidate in the Department of Computer Science and Engineering, HIT. His main research interests include data mining, multi-database systems and approximate query answering.

XU Xiaofei received his M.S. and Ph.D. degrees in computer science from HIT in 1985 and 1988 respectively. He is currently a professor in the Department of Computer Science and Engineering, HIT. His main research interests include CIMS and database systems.

DENG Shengchun received his Ph.D. degree in computer science from HIT in 2002. He is currently an associate professor in the Department of Computer Science and Engineering, HIT. His main research interests include data mining and data warehousing.

About this article

Cite this article

He, Z., Xu, X. & Deng, S. Squeezer: An efficient algorithm for clustering categorical data. J. Comput. Sci. & Technol. 17, 611–624 (2002). https://doi.org/10.1007/BF02948829

  • DOI: https://doi.org/10.1007/BF02948829
