Abstract
Outlier detection is becoming a hot issue in data mining, since outliers often carry useful information. In this paper, we propose an improved KNN-based outlier detection algorithm that proceeds through two stages of clustering. The first stage partitions the dataset into several clusters and then computes the Kth nearest neighbor within each cluster, which avoids scanning the entire dataset for each calculation. The second stage further partitions the clusters obtained in the first stage and prunes a partition as soon as it is determined that it cannot contain outliers, which yields substantial savings in computation. Experimental results on both synthetic and real-life datasets demonstrate that our algorithm is efficient on large datasets.
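To make the first stage concrete, the following is a minimal Python sketch of scoring points by their Kth-nearest-neighbor distance inside a cluster rather than over the whole dataset. It is an illustration under assumptions, not the paper's implementation: the abstract does not name the clustering method, so plain k-means with a deterministic farthest-first seeding is swapped in, and the function names (`farthest_first_centers`, `cluster_indices`, `kth_nn_distance`, `top_outliers`), the parameters `n_clusters`, `k`, `n`, and the full-scan fallback for very small clusters are all hypothetical. The second-stage pruning is omitted because the abstract does not state the bounds it uses.

```python
import math


def farthest_first_centers(points, n_clusters):
    """Deterministic seeding (an assumption): start at points[0], then
    repeatedly add the point farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < n_clusters:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def cluster_indices(points, n_clusters, n_iter=10):
    """Plain k-means (Lloyd's iterations); returns clusters as index lists."""
    centers = farthest_first_centers(points, n_clusters)
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in centers]
        for i, p in enumerate(points):
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(i)
        # Move each center to the mean of its members; empty clusters keep their center.
        for j, members in enumerate(clusters):
            if members:
                centers[j] = tuple(
                    sum(points[i][d] for i in members) / len(members)
                    for d in range(len(points[0]))
                )
    return [members for members in clusters if members]


def kth_nn_distance(i, candidates, points, k):
    """Distance from points[i] to its k-th nearest neighbor among candidates."""
    dists = sorted(math.dist(points[i], points[j]) for j in candidates if j != i)
    return dists[k - 1]


def top_outliers(points, n_clusters, k, n):
    """Rank points by in-cluster k-th NN distance and return the top n indices."""
    if k >= len(points):
        raise ValueError("k must be smaller than the number of points")
    everyone = range(len(points))
    scores = {}
    for members in cluster_indices(points, n_clusters):
        # Assumption: clusters too small to hold k neighbors fall back to a full scan.
        pool = members if len(members) > k else everyone
        for i in members:
            scores[i] = kth_nn_distance(i, pool, points, k)
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Two tight groups of four points each, plus one isolated point at index 8.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (30, 30)]
print(top_outliers(points, n_clusters=3, k=2, n=1))  # [8]
```

Searching only within a cluster can never underestimate a point's true Kth-nearest-neighbor distance, because the cluster is a subset of the dataset. The in-cluster score is therefore an upper bound on the exact one, which is the kind of bound a pruning stage could exploit.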
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, Q., Zheng, M. (2010). An Improved KNN Based Outlier Detection Algorithm for Large Datasets. In: Cao, L., Feng, Y., Zhong, J. (eds) Advanced Data Mining and Applications. ADMA 2010. Lecture Notes in Computer Science, vol 6440. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17316-5_56
DOI: https://doi.org/10.1007/978-3-642-17316-5_56
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-17315-8
Online ISBN: 978-3-642-17316-5
eBook Packages: Computer Science (R0)