Abstract
Efficient and reliable methods for distinguishing outliers in data remain crucial for data analysis. Although supervised methods based on neural networks have recently gained traction, unsupervised methods such as the kNN outlier method and the local outlier factor (LOF) remain state-of-the-art solutions according to several standardized benchmarks. Unfortunately, exact outlier detection through nearest neighbor search queries poses a scalability bottleneck for the large, high-dimensional datasets that are routinely analyzed in data science applications. This paper explores the benefits and limitations of using approximate nearest neighbor search via Hierarchical Navigable Small World graphs (HNSW) to overcome this scalability barrier. We evaluate direct implementations that compute the kNN and LOF scores from approximate neighborhoods and show that outlier detection remains robust even in settings where the approximate neighborhoods are far from the exact ones. Furthermore, we design white-box methods that compute the outlier scores directly from the underlying graph. These methods show much more variability in the quality of the outlier scores and open new ground for the development of task-aware tools based on approximate nearest neighbor search techniques.
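As a rough illustration of what computing the kNN and LOF scores from neighborhoods means, the following minimal sketch uses the standard definitions of both scores. It is not the implementation evaluated in the paper. The function names and the toy data are hypothetical, and the brute-force neighbor search is only a stand-in: in the setting described above, the neighbor lists would come from an approximate index such as HNSW.

```python
import math


def brute_force_neighbors(points, k):
    """Stand-in for an ANN index (assumption): exact k nearest neighbors.

    Returns, for every point, a list of (neighbor_index, distance) pairs
    sorted by increasing distance, with the point itself excluded.
    """
    result = []
    for i, p in enumerate(points):
        candidates = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append([(j, d) for d, j in candidates[:k]])
    return result


def knn_scores(neighbors):
    """kNN outlier score: distance to the k-th (last listed) neighbor."""
    return [nbrs[-1][1] for nbrs in neighbors]


def lof_scores(neighbors):
    """LOF computed from the given neighbor lists.

    Assumes every point has a positive mean reachability distance,
    i.e. the data is not dominated by exact duplicates.
    """
    kdist = knn_scores(neighbors)
    lrd = []
    for nbrs in neighbors:
        # reach-dist(p, o) = max(k-distance(o), d(p, o))
        reach = [max(kdist[j], d) for j, d in nbrs]
        lrd.append(len(reach) / sum(reach))  # local reachability density
    return [
        sum(lrd[j] for j, _ in nbrs) / (len(nbrs) * lrd[i])
        for i, nbrs in enumerate(neighbors)
    ]


# Toy data (hypothetical): a small cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
neighbors = brute_force_neighbors(points, k=2)
print("kNN scores:", knn_scores(neighbors))
print("LOF scores:", lof_scores(neighbors))
```

Because both scores depend only on the neighbor lists, swapping the exact search for an approximate one changes the input to `knn_scores` and `lof_scores` but not the scoring code itself.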
Notes
- 4. Building the HNSW index using hnswlib in a multicore environment led to slightly different graphs due to different pruning orders. This non-determinism resulted in different quality scores for the same hyperparameter settings.
References
Davídsson, ÓA., Henriksen, S.B., Davídsson, T.B.: Improving the Efficiency of Outlier Detection Using Approximate Nearest Neighbor Search, Master thesis, IT University of Copenhagen (2023)
Angiulli, F., Fassetti, F.: DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM TKDD 3(1), 4, 1–57 (2009)
Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE TKDE 17(2), 203–215 (2005)
Aumüller, M., Bernhardsson, E., Faithfull, A.J.: Ann-benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020)
Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of KDD, pp. 29–38 (2003)
Bhattacharya, A., Varambally, S., Bagchi, A., Bedathur, S.: Fast one-class classification using class boundary-preserving random projections. In: KDD (2021)
Breunig, M.M., Kriegel, H., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: SIGMOD Conference, pp. 93–104. ACM (2000)
Campos, G.O., et al.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Disc. 30(4), 891–927 (2016)
de Vries, T., Chawla, S., Houle, M.E.: Density-preserving projections for large-scale local anomaly detection. KAIS 32(1), 25–52 (2012)
Han, S., Hu, X., Huang, H., Jiang, M., Zhao, Y.: Adbench: anomaly detection benchmark. In: NeurIPS (2022)
Hautamäki, V., Kärkkäinen, I., Fränti, P.: Outlier detection using k-nearest neighbour graph. ICPR 3, 430–433 (2004)
Indyk, P., Xu, H.: Worst-case performance of popular approximate nearest neighbor search implementations: guarantees and limitations. In: NeurIPS (2023)
Iwasaki, M., Miyazaki, D.: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data (2018)
Jin, W., Tung, A.K., Han, J.: Mining top-n local outliers in large databases. In: Proceedings of KDD, pp. 293–298 (2001)
Kirner, E., Schubert, E., Zimek, A.: Good and bad neighborhood approximations for outlier detection ensembles. In: SISAP. Lecture Notes in Computer Science, vol. 10609, pp. 173–187. Springer (2017)
Kollios, G., Gunopulos, D., Koudas, N., Berchtold, S.: Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE 15(5) (2003)
Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE TPAMI 42(4) (2020)
Manohar, M.D., et al.: Parlayann: scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms. In: PPoPP, pp. 270–285. ACM (2024)
Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A.: Kitsune: an ensemble of autoencoders for online network intrusion detection. In: The Network and Distributed System Security Symposium (NDSS) (2018)
Nguyen, H.V., Gopalkrishnan, V.: Efficient pruning schemes for distance-based outlier detection. In: Proceedings of ECML PKDD, pp. 160–175 (2009)
Orair, G.H., Teixeira, C., Wang, Y., Meira, W., Jr., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 1469–1480 (2010)
Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE TKDE 27(5) (2015)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of SIGMOD (2000)
Schubert, E., Zimek, A., Kriegel, H.P.: Fast and scalable outlier detection with approximate nearest neighbor ensembles. In: Proceedings of DASFAA (2015)
Schubert, E., Zimek, A., Kriegel, H.: Generalized outlier detection with flexible kernel density estimates. In: SDM, pp. 542–550. SIAM (2014)
Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc. 28, 190–237 (2014)
Subramanya, S.J., Devvrit, F., Simhadri, H.V., Krishnaswamy, R., Kadekodi, R.: DiskANN: fast accurate billion-point nearest neighbor search on a single node. In: NeurIPS, pp. 13748–13758 (2019)
Wang, Y., Parthasarathy, S., Tatikonda, S.: Locality sensitive outlier detection: a ranking driven approach. In: Proceedings of ICDE (2011)
Zhang, X., et al.: LSHiForest: a generic framework for fast tree isolation based ensemble anomaly analysis. In: Proceedings of ICDE (2017)
Zhao, Y., Nasrullah, Z., Li, Z.: Pyod: a python toolbox for scalable outlier detection. J. Mach. Learn. Res. 20(96) (2019)
Zimek, A., Filzmoser, P.: There and back again: outlier detection between statistical reasoning and data mining algorithms. Data Mining Knowl. Discov. 8(6) (2018)
Zimek, A., Schubert, E., Kriegel, H.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)
Acknowledgments
The initial research question was investigated in a Master's thesis by Davídsson et al. [1]. We thank the anonymous reviewers for their careful comments on the paper. This project received funding from the Innovation Fund Denmark for the project DIREC (9142-00001B).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Okkels, C.B., Aumüller, M., Zimek, A. (2025). On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor Graphs. In: Chávez, E., Kimia, B., Lokoč, J., Patella, M., Sedmidubsky, J. (eds) Similarity Search and Applications. SISAP 2024. Lecture Notes in Computer Science, vol 15268. Springer, Cham. https://doi.org/10.1007/978-3-031-75823-2_14
DOI: https://doi.org/10.1007/978-3-031-75823-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-75822-5
Online ISBN: 978-3-031-75823-2
eBook Packages: Computer Science, Computer Science (R0)