Abstract
Many data clustering algorithms exist, but they cannot adequately handle the number of clusters or the cluster shapes, and their performance depends mainly on the choice of algorithm parameters. Our clustering approach and algorithm require no such parameter choice; the approach can be treated as a natural adaptation to the existing structure of distances between data points. The outlier factor introduced by the author specifies, for each data point, its degree of being an outlier. The notion is based on the difference between the frequency distribution of interpoint distances in a given dataset and the corresponding distribution for uniformly distributed points. Data clusters can then be determined by maximizing the outlier factor function: the data points in the dataset are divided into clusters according to the attractor regions of the local optima. An experimental evaluation shows that the proposed method can identify complex cluster shapes. The key advantages of the approach are good clustering properties for datasets with a comparatively large amount of noise (additional data points) and the absence of critical parameters whose choice determines the quality of the results.
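The division of points into attractor regions of local optima can be illustrated with a small sketch. The abstract does not give the formula of the outlier factor, so the per-point score below, `excess_short_distance_score`, is a hypothetical stand-in and not the author's definition: it only follows the stated idea of comparing a point's distances to the data with its distances to uniformly distributed points. The neighbourhood size `k`, the reference sample size `n_reference`, and the steepest-ascent rule on a nearest-neighbour graph are likewise assumptions made for illustration.

```python
import math
import random


def excess_short_distance_score(points, n_reference=200, seed=0):
    """Hypothetical per-point score (NOT the paper's outlier factor).

    For each point, the empirical distribution of its distances to the other
    data points is compared with the distribution of its distances to points
    drawn uniformly from the bounding box. The score is the largest amount by
    which the data CDF exceeds the uniform CDF.
    """
    rng = random.Random(seed)
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    reference = [tuple(rng.uniform(lo[d], hi[d]) for d in range(dims))
                 for _ in range(n_reference)]
    scores = []
    for i, p in enumerate(points):
        data_d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        ref_d = sorted(math.dist(p, q) for q in reference)
        best, k = 0.0, 0
        for rank, r in enumerate(data_d, start=1):
            # k counts reference distances not exceeding the data distance r.
            while k < len(ref_d) and ref_d[k] <= r:
                k += 1
            best = max(best, rank / len(data_d) - k / len(ref_d))
        scores.append(best)
    return scores


def attractor_labels(points, scores, k=5):
    """Label each point with the index of the local maximum of `scores`
    reached by steepest ascent on the k-nearest-neighbour graph."""
    n = len(points)
    neighbours = [
        sorted((j for j in range(n) if j != i),
               key=lambda j: math.dist(points[i], points[j]))[:k]
        for i in range(n)
    ]

    def climb(i):
        while True:
            best = max(neighbours[i], key=lambda j: scores[j])
            if scores[best] <= scores[i]:
                return i  # no neighbour scores higher: local maximum
            i = best      # score strictly increases, so the loop terminates

    return [climb(i) for i in range(n)]


# Two well-separated synthetic groups (illustrative data only).
rng = random.Random(1)
blob_a = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(30)]
blob_b = [(rng.gauss(10, 0.3), rng.gauss(10, 0.3)) for _ in range(30)]
points = blob_a + blob_b
labels = attractor_labels(points, excess_short_distance_score(points))
```

Points that share a label lie in the same attractor region. A single group may still split into several regions if the score has more than one local maximum inside it; how the original method treats such cases is not stated in the abstract.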
Cite this article
Saltenis, V. Data Clustering Based on Maximization of Outlier Factor. J Glob Optim 35, 625–635 (2006). https://doi.org/10.1007/s10898-005-5372-5
DOI: https://doi.org/10.1007/s10898-005-5372-5