ABSTRACT
Outlier mining in d-dimensional point sets is a fundamental and well-studied data mining task with a wide variety of applications, most of which arise in high-dimensional domains. A bottleneck of existing approaches is that they rely, implicitly or explicitly, on notions of distance or nearest neighbor, and these notions deteriorate in high-dimensional data. Following up on the work of Kriegel et al. (KDD '08), we investigate the use of the angle-based outlier factor for mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic-time heuristic), we propose a novel random projection-based technique that estimates the angle-based outlier factor for all data points in time near-linear in the size of the data. Our approach is also well suited to parallel environments, where it can achieve a parallel speedup. We provide a theoretical analysis of the approximation quality to guarantee the reliability of our estimation algorithm. Experiments on synthetic and real-world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets.
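To make the cubic cost mentioned in the abstract concrete, here is a minimal brute-force sketch in Python. It is not the random projection estimator proposed in the paper, and it assumes a simplified score: the plain (unweighted) variance of the angles that a point forms with every pair of other points, used here as a stand-in for the angle-based outlier factor. The function name `angle_variance_scores` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def angle_variance_scores(points):
    """Brute-force variance of angles per point (assumed, simplified score).

    For each point p, collect the angle between (a - p) and (b - p) over all
    pairs (a, b) of other points, then return the variance of those angles.
    With n points this visits O(n^2) pairs per point, so O(n^3) overall.
    """
    scores = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        angles = []
        for a, b in combinations(others, 2):
            u = [x - y for x, y in zip(a, p)]
            v = [x - y for x, y in zip(b, p)]
            norm_u = math.sqrt(sum(x * x for x in u))
            norm_v = math.sqrt(sum(x * x for x in v))
            if norm_u == 0.0 or norm_v == 0.0:
                continue  # a duplicate of p: the angle is undefined, skip it
            cos = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
            # Clamp to [-1, 1] to guard against floating-point drift.
            angles.append(math.acos(max(-1.0, min(1.0, cos))))
        if not angles:
            scores.append(0.0)
            continue
        mean = sum(angles) / len(angles)
        scores.append(sum((t - mean) ** 2 for t in angles) / len(angles))
    return scores


# Toy data: five points near the unit square plus one distant point.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5), (10, 10)]
scores = angle_variance_scores(points)
# A low variance suggests a point that sees all others in a narrow cone.
most_outlying = scores.index(min(scores))
```

In this sketch a small score marks a likely outlier, because a point far from the rest of the data sees all other points within a narrow range of directions. The triple nesting (each point against every pair of others) is what makes the exact computation cubic in the number of points.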
References
- C. C. Aggarwal, A. Hinneburg, and D. A. Keim. On the surprising behavior of distance metrics in high dimensional spaces. In Proceedings of ICDT'01, pages 420--434, 2001.
- C. C. Aggarwal and P. S. Yu. Outlier detection for high dimensional data. In Proceedings of SIGMOD'01, pages 37--46, 2001.
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137--147, 1999.
- S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of KDD'03, pages 29--38, 2003.
- V. Braverman, K.-M. Chung, Z. Liu, M. Mitzenmacher, and R. Ostrovsky. AMS without 4-wise independence on product domains. In Proceedings of STACS'10, pages 119--130, 2010.
- M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of SIGMOD'00, pages 93--104, 2000.
- M. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of STOC'02, pages 380--388, 2002.
- D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009.
- A. Frank and A. Asuncion. UCI machine learning repository, 2010.
- A. Ghoting, S. Parthasarathy, and M. E. Otey. Fast mining of distance-based outliers in high-dimensional datasets. Data Mining and Knowledge Discovery, 16(3):349--364, 2008.
- M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM, 42(6):1115--1145, 1995.
- P. Indyk and A. McGregor. Declaring independence via the sketching of sketches. In Proceedings of SODA'08, pages 737--745, 2008.
- E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of VLDB'98, pages 392--403, 1998.
- H.-P. Kriegel, M. Schubert, and A. Zimek. Angle-based outlier detection in high-dimensional data. In Proceedings of KDD'08, pages 444--452, 2008.
- H.-P. Kriegel, M. Schubert, and A. Zimek. Outlier detection techniques. In Tutorial at KDD'10, 2010.
- V. Mahadevan, W. Li, V. Bhalodia, and N. Vasconcelos. Anomaly detection in crowded scenes. In Proceedings of CVPR'10, pages 1975--1981, 2010.
- E. Müller, M. Schiffer, and T. Seidl. Statistical selection of relevant subspace projections for outlier ranking. In Proceedings of ICDE'11, pages 434--445, 2011.
- S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. LOCI: Fast outlier detection using the local correlation integral. In Proceedings of ICDE'03, pages 315--326, 2003.
- S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of SIGMOD'00, pages 427--438, 2000.
- Y. Wang, S. Parthasarathy, and S. Tatikonda. Locality sensitive outlier detection: A ranking driven approach. In Proceedings of ICDE'11, pages 410--421, 2011.
- R. Wheeler and J. S. Aitken. Multiple algorithms for fraud detection. Knowledge-Based Systems, 13(2--3):93--99, 2000.
Index Terms
A near-linear time approximation algorithm for angle-based outlier detection in high-dimensional data