On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor Graphs

Okkels, Camilla Birch; Aumüller, Martin; Zimek, Arthur

doi:10.1007/978-3-031-75823-2_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15268)

Included in the following conference series:

International Conference on Similarity Search and Applications

Abstract

Efficient and reliable methods for distinguishing outliers in data remain crucial for data analysis. Although supervised methods based on neural networks have gained recent traction, unsupervised methods such as the kNN outlier method and local outlier factor (LOF) remain state-of-the-art solutions according to different standardized benchmarks. Unfortunately, exact outlier detection through nearest neighbor search queries poses a scalability bottleneck for the large, high-dimensional datasets that are routinely analyzed in data science applications. This paper explores the benefits and limitations of using approximate nearest neighbor search via Hierarchical Navigable Small World graphs (HNSW) to overcome this scalability barrier. We evaluate direct implementations that compute the kNN and LOF scores from approximate neighborhoods and show that the outlier detection remains robust even in settings where the approximation is far from the exact neighborhoods. Furthermore, we design white-box methods that compute the outlier scores directly from the underlying graph. These methods show much more variability in the quality of the outlier scores and open new ground for the development of task-aware tools based on approximate nearest neighbor search techniques.
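
As an illustration of the "direct" approach described in the abstract, the following minimal sketch computes the kNN outlier score and LOF purely from per-point neighbor lists, so the same two scoring functions could be fed by any neighbor source. It is not the authors' implementation. All function names are hypothetical, and the brute-force `exact_neighborhoods` helper is only an assumed stand-in for an approximate index such as HNSW, whose query results would replace it in practice.

```python
import numpy as np

def exact_neighborhoods(data, k):
    """Brute-force stand-in for an ANN index (assumption for this sketch).
    Returns (indices, distances) of the k nearest neighbors of every point,
    sorted by distance and excluding the point itself."""
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]
    return idx, np.take_along_axis(dist, idx, axis=1)

def knn_outlier_scores(nbr_dist):
    """kNN outlier score: distance to the k-th (farthest listed) neighbor."""
    return nbr_dist[:, -1]

def lof_scores(nbr_idx, nbr_dist):
    """LOF computed only from the supplied (possibly approximate) neighborhoods."""
    k_dist = nbr_dist[:, -1]                        # k-distance of each point
    reach = np.maximum(nbr_dist, k_dist[nbr_idx])   # reachability distances
    lrd = 1.0 / (reach.mean(axis=1) + 1e-12)        # local reachability density
    return lrd[nbr_idx].mean(axis=1) / lrd          # neighbors' density vs. own

# Toy data: a Gaussian cluster plus one planted outlier (index 200).
rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(200, 5))
outlier = np.full((1, 5), 10.0)
data = np.vstack([inliers, outlier])

nbr_idx, nbr_dist = exact_neighborhoods(data, k=10)
knn = knn_outlier_scores(nbr_dist)
lof = lof_scores(nbr_idx, nbr_dist)
```

Because both scores depend only on `nbr_idx` and `nbr_dist`, swapping the exact neighborhoods for approximate ones changes the input, not the scoring code, which is what makes it possible to study how approximation quality affects the resulting outlier ranking.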

Notes

1. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html
2. https://github.com/nmslib/hnswlib
3. https://github.com/scikit-learn/scikit-learn/blob/1.5.1/sklearn/neighbors/_lof.py
4. Building the HNSW index using hnswlib in a multicore environment led to slightly different graphs due to different pruning orders. This non-determinism resulted in different quality scores for the same hyperparameter settings.

References

1. Davídsson, ÓA., Henriksen, S.B., Davídsson, T.B.: Improving the Efficiency of Outlier Detection Using Approximate Nearest Neighbor Search. Master's thesis, IT University of Copenhagen (2023)
2. Angiulli, F., Fassetti, F.: DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM TKDD 3(1), 4, 1–57 (2009)
3. Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE TKDE 17(2), 203–215 (2005)
4. Aumüller, M., Bernhardsson, E., Faithfull, A.J.: Ann-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020)
5. Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of KDD, pp. 29–38 (2003)
6. Bhattacharya, A., Varambally, S., Bagchi, A., Bedathur, S.: Fast one-class classification using class boundary-preserving random projections. In: KDD (2021)
7. Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: SIGMOD Conference, pp. 93–104. ACM (2000)
8. Campos, G.O., et al.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Disc. 30(4), 891–927 (2016)
9. de Vries, T., Chawla, S., Houle, M.E.: Density-preserving projections for large-scale local anomaly detection. KAIS 32(1), 25–52 (2012)
10. Han, S., Hu, X., Huang, H., Jiang, M., Zhao, Y.: Adbench: anomaly detection benchmark. In: NeurIPS (2022)
11. Hautamäki, V., Kärkkäinen, I., Fränti, P.: Outlier detection using k-nearest neighbour graph. ICPR 3, 430–433 (2004)
12. Indyk, P., Xu, H.: Worst-case performance of popular approximate nearest neighbor search implementations: guarantees and limitations. In: NeurIPS (2023)
13. Iwasaki, M., Miyazaki, D.: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data (2018)
14. Jin, W., Tung, A.K., Han, J.: Mining top-n local outliers in large databases. In: Proceedings of KDD, pp. 293–298 (2001)
15. Kirner, E., Schubert, E., Zimek, A.: Good and bad neighborhood approximations for outlier detection ensembles. In: SISAP. Lecture Notes in Computer Science, vol. 10609, pp. 173–187. Springer (2017)
16. Kollios, G., Gunopulos, D., Koudas, N., Berchtold, S.: Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE 15(5) (2003)
17. Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE TPAMI 42(4) (2020)
18. Manohar, M.D., et al.: Parlayann: scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms. In: PPoPP, pp. 270–285. ACM (2024)
19. Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A.: Kitsune: an ensemble of autoencoders for online network intrusion detection. In: The Network and Distributed System Security Symposium (NDSS) (2018)
20. Nguyen, H.V., Gopalkrishnan, V.: Efficient pruning schemes for distance-based outlier detection. In: Proceedings of ECML PKDD, pp. 160–175 (2009)
21. Orair, G.H., Teixeira, C., Wang, Y., Meira, W., Jr., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 1469–1480 (2010)
22. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans. Knowl. Data Eng. 27(5) (2015)
23. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of SIGMOD (2000)
24. Schubert, E., Zimek, A., Kriegel, H.P.: Fast and scalable outlier detection with approximate nearest neighbor ensembles. In: Proceedings of DASFAA (2015)
25. Schubert, E., Zimek, A., Kriegel, H.: Generalized outlier detection with flexible kernel density estimates. In: SDM, pp. 542–550. SIAM (2014)
26. Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc. 28, 190–237 (2014)
27. Subramanya, S.J., Devvrit, F., Simhadri, H.V., Krishnaswamy, R., Kadekodi, R.: DiskANN: fast accurate billion-point nearest neighbor search on a single node. In: NeurIPS, pp. 13748–13758 (2019)
28. Wang, Y., Parthasarathy, S., Tatikonda, S.: Locality sensitive outlier detection: a ranking driven approach. In: Proceedings of ICDE (2011)
29. Zhang, X., et al.: LSHiForest: a generic framework for fast tree isolation based ensemble anomaly analysis. In: Proceedings of ICDE (2017)
30. Zhao, Y., Nasrullah, Z., Li, Z.: Pyod: a python toolbox for scalable outlier detection. J. Mach. Learn. Res. 20(96) (2019)
31. Zimek, A., Filzmoser, P.: There and back again: outlier detection between statistical reasoning and data mining algorithms. Data Mining Knowl. Discov. 8(6) (2018)
32. Zimek, A., Schubert, E., Kriegel, H.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)

Acknowledgments

The initial research question was investigated in a Master's thesis by Davídsson et al. [1]. We thank the anonymous reviewers for their careful comments on the paper. This project received funding from the Innovation Fund Denmark for the project DIREC (9142-00001B).

Author information

Authors and Affiliations

IT University of Copenhagen, Copenhagen, Denmark
Camilla Birch Okkels & Martin Aumüller
University of Southern Denmark, Odense, Denmark
Arthur Zimek

Corresponding authors

Correspondence to Camilla Birch Okkels or Martin Aumüller.

Editor information

Editors and Affiliations

Center for Scientific Research and Higher Education at Ensenada, Ensenada, Mexico
Edgar Chávez
Brown University, Providence, RI, USA
Benjamin Kimia
Charles University, Prague, Czech Republic
Jakub Lokoč
University of Bologna, Bologna, Italy
Marco Patella
Masaryk University, Brno, Czech Republic
Jan Sedmidubsky

About this paper

Cite this paper

Okkels, C.B., Aumüller, M., Zimek, A. (2025). On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor Graphs. In: Chávez, E., Kimia, B., Lokoč, J., Patella, M., Sedmidubsky, J. (eds) Similarity Search and Applications. SISAP 2024. Lecture Notes in Computer Science, vol 15268. Springer, Cham. https://doi.org/10.1007/978-3-031-75823-2_14

DOI: https://doi.org/10.1007/978-3-031-75823-2_14
Published: 25 October 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-75822-5
Online ISBN: 978-3-031-75823-2
eBook Packages: Computer Science, Computer Science (R0)
