Squeezer: An efficient algorithm for clustering categorical data

He, Zengyou; Xu, Xiaofei; Deng, Shengchun

doi:10.1007/BF02948829

Published: September 2002

Volume 17, pages 611–624, (2002)

Journal of Computer Science and Technology

Abstract

This paper presents Squeezer, a new efficient algorithm for clustering categorical data that produces high-quality clustering results while scaling well. The Squeezer algorithm reads each tuple t in sequence and either assigns t to an existing cluster (initially there are none) or creates a new cluster containing t; the choice is determined by the similarities between t and the existing clusters. Because it processes tuples one at a time in this way, the proposed algorithm is particularly well suited to clustering data streams, where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence so far using a small amount of memory and time. Outliers can also be handled efficiently and directly in Squeezer. Experimental results on real-life and synthetic datasets verify the superiority of Squeezer.
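
The one-pass loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not define the similarity measure or the assignment rule, so the sketch assumes a simple frequency-based similarity (for each attribute, the fraction of cluster members that share the tuple's value, summed over attributes) and a user-chosen cutoff. The names ClusterSummary, squeezer_sketch, and threshold are hypothetical.

```python
from collections import Counter


class ClusterSummary:
    """Per-attribute value counts for one cluster (hypothetical helper)."""

    def __init__(self, n_attrs):
        self.size = 0
        self.counts = [Counter() for _ in range(n_attrs)]

    def add(self, row):
        # Update the summary in place; the tuple itself is not stored.
        self.size += 1
        for counter, value in zip(self.counts, row):
            counter[value] += 1

    def similarity(self, row):
        # ASSUMED measure, not taken from the abstract: for each attribute,
        # the fraction of members sharing the tuple's value, summed.
        return sum(c[v] / self.size for c, v in zip(self.counts, row))


def squeezer_sketch(rows, threshold):
    """Single pass: assign each tuple to its most similar cluster,
    or start a new cluster if no similarity reaches `threshold`."""
    clusters = []
    labels = []
    for row in rows:
        best_idx, best_sim = -1, -1.0
        for idx, cluster in enumerate(clusters):
            sim = cluster.similarity(row)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx == -1 or best_sim < threshold:
            # No cluster is similar enough (or none exist yet).
            clusters.append(ClusterSummary(len(row)))
            best_idx = len(clusters) - 1
        clusters[best_idx].add(row)
        labels.append(best_idx)
    return labels, clusters


if __name__ == "__main__":
    data = [("a", "x"), ("a", "x"), ("b", "y"), ("b", "y"), ("a", "z")]
    labels, clusters = squeezer_sketch(data, threshold=0.9)
    print(labels, [c.size for c in clusters])
```

Under these assumptions each tuple is read once and only per-cluster value counts are kept, which is the property that makes this style of algorithm a natural fit for streams. Very small clusters left at the end could be inspected as outlier candidates, although the abstract does not say how Squeezer itself treats them.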

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Harbin Institute of Technology, 150001, Harbin, P.R. China
He Zengyou, Xu Xiaofei & Deng Shengchun

Corresponding author

Correspondence to He Zengyou.

Additional information

This work was supported by the National Natural Science Foundation of China (Grant No. 60084004) and the IBM AS/400 Research Fund.

HE Zengyou received his M.S. degree in computer science from Harbin Institute of Technology (HIT) in 2002. He is currently a Ph.D. candidate in the Department of Computer Science and Engineering, HIT. His main research interests include data mining, multi-database systems and approximate query answering.

XU Xiaofei received his M.S. and Ph.D. degrees in computer science from HIT in 1985 and 1988 respectively. He is currently a professor in the Department of Computer Science and Engineering, HIT. His main research interests include CIMS and database systems.

DENG Shengchun received his Ph.D. degree in computer science from HIT in 2002. He is currently an associate professor in the Department of Computer Science and Engineering, HIT. His main research interests include data mining and data warehousing.

About this article

Cite this article

He, Z., Xu, X. & Deng, S. Squeezer: An efficient algorithm for clustering categorical data. J. Comput. Sci. & Technol. 17, 611–624 (2002). https://doi.org/10.1007/BF02948829

Received: 20 November 2001
Revised: 04 March 2002
Issue Date: September 2002
DOI: https://doi.org/10.1007/BF02948829
