Abstract
In this paper we propose a new definition of distance-based outlier that considers for each point the sum of the distances from its k nearest neighbors, called weight. Outliers are those points having the largest values of weight. In order to compute these weights, we find the k nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space filling curve. The algorithm consists of two phases, the first provides an approximated solution, within a small factor, after executing at most d + 1 scans of the data set with a low time complexity cost, where d is the number of dimensions of the data set. During each scan the number of points candidate to belong to the solution set is sensibly reduced. The second phase returns the exact solution by doing a single scan which examines further a little fraction of the data set. Experimental results show that the algorithm always finds the exact solution during the first phase after d- 《 d + 1 steps and it scales linearly both in the dimensionality and the size of the data set.
Chapter PDF
Similar content being viewed by others
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
References
C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proc. ACM Int. Conference on Managment of Data (SIGMOD’01), 2001.
F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In Tech. Report, n. 25, ISI-CNR, 2002.
A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In Proc. Int. Conf. on Knowledge Discovery and Data Mining, pages 164–169, 1996.
V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. Lof: Identifying density-based local outliers. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’00), 2000.
C. E. Brodley and M. Friedl. Identifying and eliminating mislabeled training instances. In Proc. National American Conf. on Artificial Intelligence (AAAI/IAAI 96), pages 799–805, 1996.
Yu D., Sheikholeslami S., and A. Zhang. Findout: Finding outliers in very large datasets. In Tech. Report, 99-03, Univ. of New York, Buffalo, pages 1–19, 1999.
C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In Proc. ACM Int. Conf. on Principles of Database Systems (PODS’89), pages 247–252, 1989.
J. Han and M. Kamber. Data Mining, Concepts and Technique. Morgan Kaufmann, San Francisco, 2001.
H. V. Jagadish. Linear clustering of objects with multiple atributes. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’90), pages 332–342, 1990.
H. V. Jagadish. Linear clustering of objects with multiple atributes. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’90), pages 332–342, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. Int. conf. on Very Large Databases (VLDB98), pages 392–403, 1998.
M. Lopez and S. Liao. Finding k-closest-pairs efficiently for high dimensional data. In Proc. 12th Canadian Conf. on Computational Geometry (CCCG), pages 197–204, 2000.
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’00), pages 427–438, 2000.
Hans Sagan. Space Filling Curves. Springer-Verlag, 1994.
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of olap data cubes. In Proc. Sixth Int. Conf on ExtendingD atabase Thecnology (EDBT), Valencia, Spain, March 1998.
J. Shepherd, X. Zhu, and N. Megiddo. A fast indexing method for multidimensional nearest neighbor search. In Proc. SPIE Conf. on Storage and Retrieval for image and video databases VII, pages 350–355, 1999.
Z. R. Struzik and A. Siebes. Outliers detection and localisation with wavelet based multifractal formalism. In Tech. Report, CWI, Amsterdam, INS-R0008, 2000.
K. Yamanishi and J. Takeuchi. Discovering outlier filtering rules from unlabeled data. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 389–394, 2001.
K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne. On-line unsupervised learning outlier detection using finite mixtures with discounting learning algorithms. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 250–254, 2000.
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Angiulli, F., Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_2
Download citation
DOI: https://doi.org/10.1007/3-540-45681-3_2
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive