Fast Outlier Detection in High Dimensional Spaces

Angiulli, Fabrizio; Pizzuti, Clara

doi:10.1007/3-540-45681-3_2

Fabrizio Angiulli⁴ &
Clara Pizzuti⁴

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 2431))

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

5852 Accesses
12 Altmetric

Abstract

In this paper we propose a new definition of distance-based outlier that considers for each point the sum of the distances from its k nearest neighbors, called weight. Outliers are those points having the largest values of weight. In order to compute these weights, we find the k nearest neighbors of each point in a fast and efficient way by linearizing the search space through the Hilbert space filling curve. The algorithm consists of two phases, the first provides an approximated solution, within a small factor, after executing at most d + 1 scans of the data set with a low time complexity cost, where d is the number of dimensions of the data set. During each scan the number of points candidate to belong to the solution set is sensibly reduced. The second phase returns the exact solution by doing a single scan which examines further a little fraction of the data set. Experimental results show that the algorithm always finds the exact solution during the first phase after d- 《 d + 1 steps and it scales linearly both in the dimensionality and the size of the data set.

Download to read the full chapter text

Chapter PDF

Automatic detection of outliers and the number of clusters in k-means clustering via Chebyshev-type inequalities

Article 20 January 2022

SDROF: outlier detection algorithm based on relative skewness density ratio outlier factor

Article 02 December 2024

Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles

Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

References

C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proc. ACM Int. Conference on Managment of Data (SIGMOD’01), 2001.
Google Scholar
F. Angiulli and C. Pizzuti. Fast outlier detection in high dimensional spaces. In Tech. Report, n. 25, ISI-CNR, 2002.
Google Scholar
A. Arning, C. Aggarwal, and P. Raghavan. A linear method for deviation detection in large databases. In Proc. Int. Conf. on Knowledge Discovery and Data Mining, pages 164–169, 1996.
Google Scholar
V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 1994.
Google Scholar
M. M. Breunig, H. Kriegel, R. T. Ng, and J. Sander. Lof: Identifying density-based local outliers. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’00), 2000.
Google Scholar
C. E. Brodley and M. Friedl. Identifying and eliminating mislabeled training instances. In Proc. National American Conf. on Artificial Intelligence (AAAI/IAAI 96), pages 799–805, 1996.
Google Scholar
Yu D., Sheikholeslami S., and A. Zhang. Findout: Finding outliers in very large datasets. In Tech. Report, 99-03, Univ. of New York, Buffalo, pages 1–19, 1999.
Google Scholar
C. Faloutsos and S. Roseman. Fractals for secondary key retrieval. In Proc. ACM Int. Conf. on Principles of Database Systems (PODS’89), pages 247–252, 1989.
Google Scholar
J. Han and M. Kamber. Data Mining, Concepts and Technique. Morgan Kaufmann, San Francisco, 2001.
Google Scholar
H. V. Jagadish. Linear clustering of objects with multiple atributes. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’90), pages 332–342, 1990.
Google Scholar
H. V. Jagadish. Linear clustering of objects with multiple atributes. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’90), pages 332–342, 1990.
Google Scholar
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. In Proc. Int. conf. on Very Large Databases (VLDB98), pages 392–403, 1998.
Google Scholar
M. Lopez and S. Liao. Finding k-closest-pairs efficiently for high dimensional data. In Proc. 12th Canadian Conf. on Computational Geometry (CCCG), pages 197–204, 2000.
Google Scholar
S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proc. ACM Int. Conf. on Managment of Data (SIGMOD’00), pages 427–438, 2000.
Google Scholar
Hans Sagan. Space Filling Curves. Springer-Verlag, 1994.
Google Scholar
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of olap data cubes. In Proc. Sixth Int. Conf on ExtendingD atabase Thecnology (EDBT), Valencia, Spain, March 1998.
Google Scholar
J. Shepherd, X. Zhu, and N. Megiddo. A fast indexing method for multidimensional nearest neighbor search. In Proc. SPIE Conf. on Storage and Retrieval for image and video databases VII, pages 350–355, 1999.
Google Scholar
Z. R. Struzik and A. Siebes. Outliers detection and localisation with wavelet based multifractal formalism. In Tech. Report, CWI, Amsterdam, INS-R0008, 2000.
Google Scholar
K. Yamanishi and J. Takeuchi. Discovering outlier filtering rules from unlabeled data. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 389–394, 2001.
Google Scholar
K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne. On-line unsupervised learning outlier detection using finite mixtures with discounting learning algorithms. In Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 250–254, 2000.
Google Scholar

Download references

Author information

Authors and Affiliations

ISI-CNR, c/o DEIS, Universitá della Calabria, 87036, Rende, CS, Italy
Fabrizio Angiulli & Clara Pizzuti

Authors

Fabrizio Angiulli
View author publications
You can also search for this author in PubMed Google Scholar
Clara Pizzuti
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science, University of Helsinki, P.O. Box 26, 00014, Helsinki, Finland
Tapio Elomaa , Heikki Mannila & Hannu Toivonen , &

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Angiulli, F., Pizzuti, C. (2002). Fast Outlier Detection in High Dimensional Spaces. In: Elomaa, T., Mannila, H., Toivonen, H. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 2002. Lecture Notes in Computer Science, vol 2431. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45681-3_2

Download citation

DOI: https://doi.org/10.1007/3-540-45681-3_2
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44037-6
Online ISBN: 978-3-540-45681-0
eBook Packages: Springer Book Archive

Publish with us

Policies and ethics