
Density-preserving projections for large-scale local anomaly detection

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Outlier or anomaly detection is a fundamental data mining task that aims to identify data points, events, or transactions which deviate from the norm. The identification of outliers can provide insight into the underlying data-generating process. In general, outliers are of two kinds: global and local. Global outliers are distinct with respect to the whole data set, while local outliers are distinct with respect to the data points in their local neighbourhood. While several approaches have been proposed to scale up global outlier discovery in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of the local outlier factor (LOF) for large and high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation of k-nearest-neighbour distances, which serves as the core density measurement within LOF. The reduced dimensionality allows for efficient indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that the density of the intrinsic manifold of the data set is preserved after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF on many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into high-dimensionality-specific indexing, such as the spatial approximate sample hierarchy (SASH), shows that our technique offers benefits even over these highly efficient indexing schemes. We cement the practical relevance of our technique with insights into what it means to find local outliers in real data, including image and text data, and discuss potential applications of this knowledge.
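The abstract describes a three-stage pipeline: randomly project the data into a low-dimensional space where indexing is cheap, retrieve an extended candidate neighbour set there, then refine the candidates against original-space distances to approximate the true k-nearest neighbours, whose distances feed the standard LOF score. The sketch below illustrates that pipeline in Python; it is a minimal illustration under stated assumptions, not the authors' implementation. The function name `pinn_lof`, the parameters `target_dim` and `candidates`, and the choice of a Gaussian projection with a KD-tree index are ours for exposition.

```python
import numpy as np
from scipy.spatial import cKDTree

def pinn_lof(X, k=10, target_dim=16, candidates=50, seed=0):
    """Sketch of projection-indexed nearest neighbours (PINN) feeding LOF.

    X: (n, d) data matrix. k: neighbourhood size for LOF.
    target_dim: reduced dimension of the random projection.
    candidates: size of the extended neighbour set retrieved in the
    projected space (must be >= k); larger values trade speed for
    better recall of the true k-NN.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Gaussian random projection (Johnson-Lindenstrauss style): pairwise
    # distances are approximately preserved with high probability.
    R = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    Y = X @ R

    # Index the low-dimensional projection; sub-quadratic queries are
    # feasible here, unlike in the original high-dimensional space.
    tree = cKDTree(Y)
    _, cand = tree.query(Y, k=candidates + 1)  # +1: each point finds itself

    # Refine: compute exact original-space distances to the candidate set
    # and keep the k closest, approximating the true k-NN.
    neigh_idx = np.empty((n, k), dtype=int)
    neigh_dist = np.empty((n, k))
    for i in range(n):
        c = cand[i][cand[i] != i]                 # drop the query point
        dists = np.linalg.norm(X[c] - X[i], axis=1)
        order = np.argsort(dists)[:k]
        neigh_idx[i] = c[order]
        neigh_dist[i] = dists[order]

    # Standard LOF (Breunig et al. 2000) on the approximate neighbourhoods:
    # reach-dist_k(p, o) = max(k-distance(o), d(p, o)),
    # lrd(p) = 1 / mean reach-dist, LOF(p) = mean(lrd(o)) / lrd(p).
    # (Ties beyond the k-th neighbour are ignored in this sketch.)
    k_dist = neigh_dist[:, -1]                    # k-distance of each point
    reach = np.maximum(neigh_dist, k_dist[neigh_idx])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)      # guard exact duplicates
    lof = lrd[neigh_idx].mean(axis=1) / lrd
    return lof
```

As a usage note, `pinn_lof(np.random.rand(2000, 500))` yields scores near 1 for inliers, and points scoring well above 1 are local-outlier candidates. The `candidates` parameter plays the role of the extended nearest-neighbour set in the abstract: retrieving more candidates in the projected space raises the chance that the true original-space k-NN are among them.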



Author information


Corresponding author

Correspondence to Timothy de Vries.


About this article

Cite this article

de Vries, T., Chawla, S. & Houle, M.E. Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32, 25–52 (2012). https://doi.org/10.1007/s10115-011-0430-4

