research-article

Towards improving subspace data analysis

Author:

Yong Shi

ACMSE '10: Proceedings of the 48th annual ACM Southeast Conference

Article No.: 63, Pages 1 - 4

https://doi.org/10.1145/1900008.1900093

Published: 15 April 2010

Abstract

In this paper, we continue our research on data analysis, building on our previous work on a cluster-outlier iterative detection approach in subspace. Based on the observation that, for noisy data sets, clusters and outliers cannot be processed efficiently when they are handled separately from each other, we previously proposed a cluster-outlier iterative detection algorithm in the full data space [22]. Because real data sets normally have high dimensionality, and natural clusters and outliers do not exist in the full data space, we then proposed an algorithm (SubCOID) to detect clusters and outliers in subspace [21]. However, associating each cluster and each outlier with its own subset of dimensions is not a trivial task. In this paper, we present an improved SubCOID algorithm that applies a novel approach to choosing a unique subset of dimensions for each cluster and each outlier. The selection is based on the intra-relationship within clusters, the intra-relationship within outliers, and the inter-relationship between clusters and outliers. This process is performed iteratively until a termination condition is reached. The algorithm can be applied in many fields, such as pattern recognition, data clustering, and signal processing.
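The abstract does not spell out the selection criterion or the termination condition, so the following is only a minimal sketch of the general idea of iterating between per-cluster subspace selection and cluster/outlier reassignment. It is not the SubCOID algorithm. Everything in it is an assumption made for illustration: the names `select_dims`, `subspace_distance`, and `detect`, the use of lowest variance as a stand-in for the intra-cluster relationship, the fixed distance `threshold` that separates outliers from cluster members, and the stopping rule (labels stop changing). Outliers receive no subspace of their own here, unlike in the paper.

```python
from statistics import pvariance

OUTLIER = -1  # hypothetical label for points assigned to no cluster


def select_dims(points, k_dims):
    """Return the k_dims dimensions in which the points are most compact."""
    n_dims = len(points[0])
    variances = [pvariance([p[d] for p in points]) for d in range(n_dims)]
    ranked = sorted(range(n_dims), key=lambda d: variances[d])
    return sorted(ranked[:k_dims])


def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def subspace_distance(p, q, dims):
    """Mean squared difference between p and q over the given dimensions."""
    return sum((p[d] - q[d]) ** 2 for d in dims) / len(dims)


def detect(data, labels, k_dims, threshold, max_iter=20):
    """Alternate subspace selection and reassignment until labels are stable.

    Assumes the initial labels contain at least one non-outlier cluster.
    """
    labels = list(labels)
    models = {}  # cluster id -> (centroid, selected dimensions)
    for _ in range(max_iter):
        # Step 1: refit each non-empty cluster and pick its own subspace.
        for c in set(labels) - {OUTLIER}:
            members = [p for p, lab in zip(data, labels) if lab == c]
            models[c] = (centroid(members), select_dims(members, k_dims))
        # Step 2: reassign every point, measuring each distance in the
        # candidate cluster's own subspace; far points become outliers.
        new_labels = []
        for p in data:
            best = min(models, key=lambda c: subspace_distance(p, *models[c]))
            near = subspace_distance(p, *models[best]) <= threshold
            new_labels.append(best if near else OUTLIER)
        if new_labels == labels:  # assumed termination condition
            break
        labels = new_labels
    return labels, {c: dims for c, (_, dims) in models.items()}
```

Because every point, including current outliers, is reassigned in each pass, a point flagged as an outlier early on can rejoin a cluster once that cluster's centroid and subspace have been refit without it.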

References

[1] C. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and J. Park. Fast algorithms for projected clustering. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 61--72, Philadelphia, PA, 1999.

[2] C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In SIGMOD Conference, 2001.

[3] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 94--105, Seattle, WA, 1998.

[4] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering Points To Identify the Clustering Structure. In Proc. ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'99), pages 49--60, Philadelphia, PA, 1999.

[5] P. S. Bradley and U. M. Fayyad. Refining initial points for K-Means clustering. In Proc. 15th International Conf. on Machine Learning, pages 91--99. Morgan Kaufmann, San Francisco, CA, 1998.

[6] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 93--104, Dallas, Texas, May 16--18 2000.

[7] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, 1996.

[8] U. Fayyad, C. Reina, and P. S. Bradley. Initialization of iterative refinement clustering algorithms. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 194--198, New York, August 1998.

[9] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI Press, 1996.

[10] T. Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical Computer Science, 38:311--322, 1985.

[11] S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 73--84, Seattle, WA, 1998.

[12] S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In Proceedings of the IEEE Conference on Data Engineering, 1999.

[13] A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 58--65, New York, August 1998.

[14] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume I: Statistics, 1967.

[15] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.

[16] E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB Conference, pages 392--403, New York, August 1998.

[17] R. T. Ng and J. Han. Efficient and Effective Clustering Methods for Spatial Data Mining. In Proceedings of the 20th VLDB Conference, pages 144--155, Santiago, Chile, 1994.

[18] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 427--438, Dallas, Texas, May 16--18 2000.

[19] T. Seidl and H. Kriegel. Optimal multi-step k-nearest neighbor search. In Proceedings of the ACM SIGMOD Conference on Management of Data, pages 154--164, Seattle, WA, 1998.

[20] G. Sheikholeslami, S. Chatterjee, and A. Zhang. Wavecluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the 24th International Conference on Very Large Data Bases, 1998.

[21] Y. Shi. Subcoid: Exploring cluster-outlier iterative detection approach to multi-dimensional data analysis in subspace. In ACMSE 2008: The 46th ACM Southeast Conference, 2008.

[22] Y. Shi and A. Zhang. Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis. In International Conference on Data Engineering (ICDE), 2005.

[23] W. Wang, J. Yang, and R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. In Proceedings of the 23rd VLDB Conference, pages 186--195, Athens, Greece, 1997.

[24] D. Yu, G. Sheikholeslami, and A. Zhang. Findout: Finding outliers in very large datasets. The Knowledge and Information Systems (KAIS), (4), October 2000.

[25] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 103--114, Montreal, Canada, 1996.


Published In

ACMSE '10: Proceedings of the 48th annual ACM Southeast Conference

April 2010

488 pages

ISBN: 9781450300643

DOI: 10.1145/1900008

Conference Chair: H. Conrad Cunningham (University of Mississippi)

Program Chairs: Paul Ruth, Nicholas A. Kraft

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

ACM: Association for Computing Machinery

Publisher

Association for Computing Machinery

New York, NY, United States


Conference

ACM SE '10: ACM Southeast Regional Conference

Sponsor: ACM

April 15 - 17, 2010

Oxford, Mississippi

Acceptance Rates

ACMSE '10 Paper Acceptance Rate: 48 of 94 submissions, 51%

Overall Acceptance Rate: 502 of 1,023 submissions, 49%
