Abstract
Most data mining algorithms focus on clustering alone, while many other approaches are designed solely for outlier detection. We observe that, in many situations, clusters and outliers are concepts whose meanings are inseparable from each other, especially in data sets with noise. It is therefore necessary to treat clusters and outliers as concepts of equal importance in data analysis. In this paper, we present our continuing work on the cluster–outlier iterative detection algorithm (Shi in SubCOID: exploring cluster-outlier iterative detection approach to multi-dimensional data analysis in subspace. Auburn, pp. 132–135, 2008; Shi and Zhang in Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis. IEEE Computer Society. Tokyo, pp. 518–519, 2005), which detects clusters and outliers from a different perspective for noisy data sets. In this algorithm, clusters are detected and adjusted according to the intra-relationship within clusters and the inter-relationship between clusters and outliers, and vice versa. The clusters and outliers are adjusted and modified iteratively until a termination condition is reached. The algorithm can be applied in many fields, such as pattern recognition, data clustering, and signal processing. Experimental results demonstrate the advantages of our approach.
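The alternation described above, in which clusters and outliers are adjusted in turn until a termination condition holds, can be illustrated with a minimal sketch. It is not the COID algorithm itself: the abstract does not specify the intra- and inter-relationship measures, so the sketch substitutes a simple assumed rule (a point is an outlier when its distance to its cluster centroid exceeds `outlier_factor` times the mean distance of that cluster's current inliers). The function name, parameters, and sample data are all hypothetical.

```python
import math


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def iterate_clusters_outliers(points, centroids, outlier_factor=2.0, max_iter=50):
    """Alternate cluster assignment and outlier flagging until nothing changes.

    Assumed rule (not from the paper): a point is an outlier when it lies
    farther than outlier_factor * (mean inlier distance) from its centroid.
    """
    centroids = list(centroids)
    outliers = set()
    labels = []
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Step 2: re-derive the outlier set from the current clusters.
        new_outliers = set()
        for k, c in enumerate(centroids):
            members = [i for i, lab in enumerate(labels) if lab == k]
            inliers = [i for i in members if i not in outliers] or members
            if not inliers:
                continue
            radius = sum(math.dist(points[i], c) for i in inliers) / len(inliers)
            for i in members:
                if radius > 0 and math.dist(points[i], c) > outlier_factor * radius:
                    new_outliers.add(i)
        # Step 3: re-derive the clusters (centroids) without the outliers.
        new_centroids = []
        for k, c in enumerate(centroids):
            kept = [points[i] for i, lab in enumerate(labels)
                    if lab == k and i not in new_outliers]
            new_centroids.append(mean(kept) if kept else c)
        # Termination condition: neither clusters nor outliers changed.
        if new_outliers == outliers and new_centroids == centroids:
            break
        outliers, centroids = new_outliers, new_centroids
    return labels, outliers, centroids


# Hypothetical toy data: two tight groups plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
          (5.0, 30.0)]
labels, outliers, centroids = iterate_clusters_outliers(
    points, [(0.0, 0.0), (10.0, 10.0)])
```

In this sketch the outlier set from the previous pass determines which points define each cluster's radius, and the new outlier set in turn determines which points define the next centroids, so each side is adjusted in light of the other.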
References
Achtert E, Kriegel H, Zimek A (2008) ELKI: a software system for evaluation of subspace clustering algorithms. In: Ludascher B, Mamoulis N (eds) Proceedings of the 20th international conference on scientific and statistical database management (SSDBM), Hong Kong, pp 580–585
Aggarwal C, Yu P (2001) Outlier detection for high dimensional data. In: Aref W (ed) Proceedings of the 2001 ACM SIGMOD international conference on management of data. ACM Press, Santa Barbara, pp 37–46
Agrawal R, Gehrke J, Gunopulos D et al (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, Seattle, pp 94–105
Aggarwal C, Hinneburg A, Keim D (2001) On the surprising behavior of distance metrics in high dimensional space. In: Bussche J, Vianu V (eds) Proceedings of the 8th international conference on database theory. Springer, London, pp 420–434
Aggarwal C, Procopiuc C, Wolf J et al (1999) Fast algorithms for projected clustering. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM Press, Philadelphia, pp 61–72
Ankerst M, Breunig M, Kriegel H et al (1999) OPTICS: ordering points to identify the clustering structure. In: Delis A, Faloutsos C, Ghandeharizadeh S (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM Press, Philadelphia, pp 49–60
Bay S (1999) The UCI KDD Archive [http://kdd.ics.uci.edu]. Department of Information and Computer Science, University of California, Irvine
Beyer K, Goldstein J, Ramakrishnan R et al (1999) When is “nearest neighbor” meaningful? In: Beeri C, Buneman P (eds) Proceedings of international conference on database theory. Springer, Jerusalem, pp 217–235
Bradley P, Fayyad U (1998) Refining initial points for K-Means clustering. In: Proceedings of 15th international conference on machine learning. Morgan Kaufmann, San Francisco, pp 91–99
Breunig M, Kriegel H, Ng R et al (2000) LOF: identifying density-based local outliers. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM, Dallas, pp 93–104
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Berkhin P, Caruana R, Wu X (eds) Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, San Jose, pp 133–142
Chen C, Lee J (2001) The validity measurement of fuzzy C-means classifier for remotely sensed images. In: Proceedings of 22nd Asian conference on remote sensing. Singapore
Ester M, Kriegel H, Sander J et al (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Simoudis E, Han J, Fayyad U (eds) Proceedings of 2nd international conference on knowledge discovery and data mining. AAAI Press, Portland, pp 226–231
Fayyad U, Piatetsky-Shapiro G, Smyth P et al (1996) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park
Fayyad U, Reina C, Bradley P (1998) Initialization of iterative refinement clustering algorithms. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, New York, pp 194–198
Gonzalez T (1985) Clustering to minimize the maximum intercluster distance. Theor Comput Sci 38: 311–322
Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas L, Tiwary A (eds) Proceedings of the ACM SIGMOD international conference on management of data. ACM Press, Seattle, pp 73–84
Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: Proceedings of the IEEE conference on data engineering. IEEE Computer Society Press, Sydney, pp 512–521
Hinneburg A, Keim D (1998) An efficient approach to clustering in large multimedia databases with noise. In: Agrawal R, Stolorz P, Piatetsky-Shapiro G (eds) Proceedings of the fourth international conference on knowledge discovery and data mining. AAAI Press, New York, pp 58–65
Halkidi M, Vazirgiannis M (2001) A data set oriented approach for clustering algorithm selection. In: Raedt L, Siebes A (eds) Proceedings of the 5th European conference on principles of data mining and knowledge discovery. Springer, Freiburg, pp 165–179
Hinneburg A, Aggarwal C, Keim D (2000) What is the nearest neighbor in high dimensional spaces? In: Abbadi A, Brodie M, Chakravarthy S (eds) Proceedings of 26th international conference on very large data bases. Morgan Kaufmann, Cairo, pp 506–515
Jain A, Murty M, Flynn P (1999) Data clustering: a review. ACM Comput Surv 31(3): 264–323
Karypis G, Han E, Kumar V (1999) Chameleon: hierarchical clustering using dynamic modeling. Computer 32: 68–75
Kaufman L, Rousseeuw P (1990) Finding groups in data: an introduction to cluster analysis. Wiley, Hoboken
Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large datasets. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of 24th international conference on very large data bases. Morgan Kaufmann, New York, pp 392–403
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. University of California Press, Berkeley 1:281–297
Ng R, Han J (1994) Efficient and effective clustering methods for spatial data mining. In: Bocca J, Jarke M, Zaniolo C (eds) Proceedings of the 20th international conference on very large data bases. Morgan Kaufmann, Santiago de Chile, pp 144–155
Nguyen M, Mark L, Omiecinski E (2008) Unusual pattern detection in high dimensions. In: Advances in knowledge discovery and data mining, 12th Pacific-Asia conference. Springer, Osaka, pp 247–259
Peterson G, McBride B (2008) The importance of generalizability for anomaly detection. Knowl Inf Syst 14(3): 377–392
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Chen W, Naughton J, Bernstein P (eds) Proceedings of the ACM SIGMOD conference on management of data. ACM, Dallas, pp 427–438
Rothman M (1963) The laws of physics. Basic Books, New York
Sheikholeslami G, Chatterjee S, Zhang A (1998) WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Gupta A, Shmueli O, Widom J (eds) Proceedings of 24th international conference on very large data bases. Morgan Kaufmann, New York, pp 428–439
Shi Y (2008a) Detecting clusters and outliers for multi-dimensional data. In: Proceedings of the 2008 international conference on multimedia and ubiquitous engineering. SERSC, Busan, pp 429–432
Shi Y (2008b) SubCOID: exploring cluster-outlier iterative detection approach to multi-dimensional data analysis in subspace. In: ACMSE 2008: the 46th ACM southeast conference. ACM, Auburn, pp 132–135
Shi Y, Zhang A (2005) Towards exploring interactive relationship between clusters and outliers in multi-dimensional data analysis. In: Proceedings of the 21st international conference on data engineering. IEEE Computer Society, Tokyo, pp 518–519
Shi Y, Song Y, Zhang A (2003) A shrinking-based approach for multi-dimensional data analysis. In: Freytag J, Lockemann P, Abiteboul S et al (eds) Proceedings of 29th international conference on very large data bases. ACM, Berlin, pp 440–451
Tao Y, Xiao X, Zhou S (2006) Mining distance-based outliers from large databases in any metric space. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 394–403
Wang J, Chiang J (2008) A cluster validity measure with outlier detection for support vector clustering. IEEE Trans Syst, Man, Cybernet, B 38(1): 78–89
Wang W, Yang J, Muntz R (1997) STING: a statistical information grid approach to spatial data mining. In: Jarke M, Carey M, Dittrich K et al (eds) Proceedings of 23rd international conference on very large data bases. Morgan Kaufmann, Athens, pp 186–195
Wu M, Jermaine C (2006) Outlier detection by sampling with accuracy guarantees. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 767–772
Wu X, Kumar V, Ross Q et al (2008) Top 10 algorithms in data mining. Knowl Inf Syst 14(1): 1–37
Xiong H, Steinbach M, Ruslim A et al (2008) Characterizing pattern preserving clustering. Knowl Inf Syst 19(3): 311–336
Xiong H, Wu J, Chen J (2006) K-means clustering versus validation measures: a data distribution perspective. In: Eliassi-Rad T, Ungar L, Craven M et al (eds) Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Philadelphia, pp 779–784
Yang J, Zhong N, Yao Y et al (2008) Local peculiarity factor and its application in outlier detection. In: Li Y, Liu B, Sarawagi S (eds) Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, Las Vegas, pp 776–784
Yu D, Sheikholeslami G, Zhang A (2000) FindOut: finding outliers in very large datasets. Knowl Inf Syst 4(4): 387–412
Zhang T, Ramakrishnan R, Livny M (1996) BIRCH: an efficient data clustering method for very large databases. In: Jagadish H, Mumick I (eds) Proceedings of the 1996 ACM SIGMOD international conference on management of data. ACM, Montreal, pp 103–114
Cite this article
Shi, Y., Zhang, L. COID: A cluster–outlier iterative detection approach to multi-dimensional data analysis. Knowl Inf Syst 28, 709–733 (2011). https://doi.org/10.1007/s10115-010-0323-y