
Density-preserving projections for large-scale local anomaly detection

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Outlier or anomaly detection is a fundamental data mining task that aims to identify data points, events, or transactions which deviate from the norm. The identification of outliers can provide insight into the underlying data-generating process. In general, outliers are of two kinds: global and local. Global outliers are distinct with respect to the whole data set, while local outliers are distinct with respect to the data points in their local neighbourhood. While several approaches have been proposed to scale up global outlier discovery in large databases, this has not been the case for local outliers. We tackle this problem by optimising the use of the local outlier factor (LOF) for large and high-dimensional data. We propose projection-indexed nearest-neighbours (PINN), a novel technique that exploits extended nearest-neighbour sets in a reduced-dimensional space to create an accurate approximation of k-nearest-neighbour distances, which serves as the core density measurement within LOF. The reduced dimensionality allows for efficient indexing that is sub-quadratic in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of random projection (RP) and PINN shows that the density of the intrinsic manifold of the data set is preserved after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF on many high-dimensional real-world data sets of up to 300,000 elements and 102,600 dimensions. A further investigation into high-dimensionality-specific indexing, such as the spatial approximate sample hierarchy (SASH), shows that our technique offers benefits even over these highly efficient indexing schemes. We cement the practical relevance of our technique with insights into what it means to find local outliers in real data, including image and text data, and discuss potential applications of this knowledge.
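The abstract describes a three-stage pipeline: randomly project the data into a low-dimensional space where indexing is cheap, retrieve an extended candidate neighbour set there, then refine the candidates against original-space distances to approximate the true k-nearest neighbours, whose distances feed the standard LOF score. The sketch below illustrates that pipeline in Python; it is a minimal illustration under stated assumptions, not the authors' implementation. The function name `pinn_lof`, the parameters `target_dim` and `candidates`, and the choice of a Gaussian projection with a KD-tree index are ours for exposition.

```python
import numpy as np
from scipy.spatial import cKDTree

def pinn_lof(X, k=10, target_dim=16, candidates=50, seed=0):
    """Sketch of projection-indexed nearest neighbours (PINN) feeding LOF.

    X: (n, d) data matrix. k: neighbourhood size for LOF.
    target_dim: reduced dimension of the random projection.
    candidates: size of the extended neighbour set retrieved in the
    projected space (must be >= k); larger values trade speed for
    better recall of the true k-NN.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape

    # Gaussian random projection (Johnson-Lindenstrauss style): pairwise
    # distances are approximately preserved with high probability.
    R = rng.normal(0.0, 1.0 / np.sqrt(target_dim), size=(d, target_dim))
    Y = X @ R

    # Index the low-dimensional projection; sub-quadratic queries are
    # feasible here, unlike in the original high-dimensional space.
    tree = cKDTree(Y)
    _, cand = tree.query(Y, k=candidates + 1)  # +1: each point finds itself

    # Refine: compute exact original-space distances to the candidate set
    # and keep the k closest, approximating the true k-NN.
    neigh_idx = np.empty((n, k), dtype=int)
    neigh_dist = np.empty((n, k))
    for i in range(n):
        c = cand[i][cand[i] != i]                 # drop the query point
        dists = np.linalg.norm(X[c] - X[i], axis=1)
        order = np.argsort(dists)[:k]
        neigh_idx[i] = c[order]
        neigh_dist[i] = dists[order]

    # Standard LOF (Breunig et al. 2000) on the approximate neighbourhoods:
    # reach-dist_k(p, o) = max(k-distance(o), d(p, o)),
    # lrd(p) = 1 / mean reach-dist, LOF(p) = mean(lrd(o)) / lrd(p).
    # (Ties beyond the k-th neighbour are ignored in this sketch.)
    k_dist = neigh_dist[:, -1]                    # k-distance of each point
    reach = np.maximum(neigh_dist, k_dist[neigh_idx])
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)      # guard exact duplicates
    lof = lrd[neigh_idx].mean(axis=1) / lrd
    return lof
```

As a usage note, `pinn_lof(np.random.rand(2000, 500))` yields scores near 1 for inliers, and points scoring well above 1 are local-outlier candidates. The `candidates` parameter plays the role of the extended nearest-neighbour set in the abstract: retrieving more candidates in the projected space raises the chance that the true original-space k-NN are among them.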



Author information


Corresponding author

Correspondence to Timothy de Vries.


About this article

Cite this article

de Vries, T., Chawla, S. & Houle, M.E. Density-preserving projections for large-scale local anomaly detection. Knowl Inf Syst 32, 25–52 (2012). https://doi.org/10.1007/s10115-011-0430-4

