Feature Selection for Clustering

Dash, Manoranjan; Liu, Huan

doi:10.1007/3-540-45571-X_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1805)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Clustering is an important data mining task. Data mining often concerns large, high-dimensional data, but most clustering algorithms in the literature are sensitive to data size, high dimensionality, or both. Different features affect clusters differently: some are important for the clusters, while others may hinder the clustering task. An efficient way of handling this is to select a subset of important features. Doing so helps in finding clusters efficiently, understanding the data better, and reducing data size for efficient storage, collection, and processing. The task of finding the important original features for unsupervised data is largely untouched. Traditional feature selection algorithms work only for supervised data, where class information is available. For unsupervised data, which lack class information, principal components (PCs) are often used, but PCs still require all features and may be difficult to understand. In our approach, features are first ranked according to their importance to clustering, and then a subset of important features is selected. For large data we use a scalable method based on sampling. Empirical evaluation shows the effectiveness and scalability of our approach on benchmark and synthetic data sets.
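The rank-then-select idea in the abstract can be illustrated with a short sketch. The abstract does not state which importance measure or sampling scheme the chapter uses, so everything concrete below is an assumption made for illustration: a feature's importance is taken to be the entropy of pairwise similarities after that feature is removed (removing a feature that carries cluster structure should leave the data more disordered), the similarity is scaled so that it equals 0.5 at the mean pairwise distance, scores are summed over random samples to keep the quadratic pairwise step cheap, and the function names, the `k` cut-off, and the toy data are all hypothetical.

```python
import math
import random


def pairwise_entropy(rows, features):
    """Average entropy of pairwise similarities, using only the given feature indices."""
    dists = [
        math.sqrt(sum((a[f] - b[f]) ** 2 for f in features))
        for i, a in enumerate(rows)
        for b in rows[i + 1:]
    ]
    if not dists:
        return 0.0  # fewer than two rows: nothing to compare
    mean_d = sum(dists) / len(dists)
    if mean_d == 0:
        return 0.0  # all points coincide: no disorder to measure
    # Assumption: similarity exp(-alpha * d) equals 0.5 at the mean distance.
    alpha = math.log(2) / mean_d
    total = 0.0
    for d in dists:
        s = math.exp(-alpha * d)
        for p in (s, 1.0 - s):
            if 0.0 < p < 1.0:
                total -= p * math.log2(p)
    return total / len(dists)


def feature_scores(rows):
    """Score each feature by the entropy of the data with that feature left out."""
    n_features = len(rows[0])
    return [
        pairwise_entropy(rows, [g for g in range(n_features) if g != f])
        for f in range(n_features)
    ]


def rank_features(rows, n_samples=1, sample_size=None, seed=0):
    """Return feature indices, most important first.

    With sample_size set, scores are summed over n_samples random samples
    instead of being computed once on the full data.
    """
    rng = random.Random(seed)
    n_features = len(rows[0])
    totals = [0.0] * n_features
    for _ in range(n_samples):
        if sample_size is None:
            sample = rows
        else:
            sample = rng.sample(rows, min(sample_size, len(rows)))
        for f, score in enumerate(feature_scores(sample)):
            totals[f] += score
    # Higher entropy after removal means the removed feature mattered more.
    return sorted(range(n_features), key=lambda f: totals[f], reverse=True)


def select_features(rows, k, **kwargs):
    """Keep the k top-ranked features (k is a hypothetical user-chosen cut-off)."""
    return rank_features(rows, **kwargs)[:k]


def demo_rows():
    """Toy data: feature 0 splits the rows into two groups, feature 1 is evenly spread."""
    return [[float(i % 2), i / 19.0] for i in range(20)]
```

Calling `rank_features(demo_rows())` ranks the full toy data set, while `select_features(demo_rows(), k=1, n_samples=5, sample_size=10)` does the same on five random samples of ten rows each. The sketch assumes numeric features on comparable scales; real data would need normalisation first, and a principled way of choosing how many features to keep.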

Author information

Authors and Affiliations

School of Computing, National University of Singapore, Singapore
Manoranjan Dash & Huan Liu

Editor information

Editors and Affiliations

Graduate School of Systems Management, University of Tsukuba, 3-29-1 Otsuka, Bunkyo-ku, Tokyo, 112-0012, Japan
Takao Terano
Department of Computer Science and Engineering, Arizona State University, P.O. Box 875 406, Tempe, AZ, 85287-5406
Huan Liu
Department of Computer Science, National Tsing Hua University, Hsinchu, 300, Taiwan ROC
Arbee L. P. Chen

About this paper

Cite this paper

Dash, M., Liu, H. (2000). Feature Selection for Clustering. In: Terano, T., Liu, H., Chen, A.L.P. (eds) Knowledge Discovery and Data Mining. Current Issues and New Applications. PAKDD 2000. Lecture Notes in Computer Science, vol 1805. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45571-X_13

DOI: https://doi.org/10.1007/3-540-45571-X_13
Published: 24 March 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67382-8
Online ISBN: 978-3-540-45571-4
eBook Packages: Springer Book Archive