On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor Graphs

Similarity Search and Applications (SISAP 2024)

Abstract

Efficient and reliable methods for detecting outliers remain crucial for data analysis. Although supervised methods based on neural networks have recently gained traction, unsupervised methods such as the kNN outlier method and the local outlier factor (LOF) remain state-of-the-art solutions according to several standardized benchmarks. Unfortunately, exact outlier detection through nearest neighbor search queries poses a scalability bottleneck for the large, high-dimensional datasets that are routinely analyzed in data science applications. This paper explores the benefits and limitations of using approximate nearest neighbor search via Hierarchical Navigable Small World (HNSW) graphs to overcome this scalability barrier. We evaluate direct implementations that compute the kNN and LOF scores from approximate neighborhoods and show that outlier detection remains robust even in settings where the approximate neighborhoods are far from the exact ones. Furthermore, we design white-box methods that compute the outlier scores directly from the underlying graph. These methods show much more variability in the quality of the outlier scores and open new ground for the development of task-aware tools based on approximate nearest neighbor search techniques.
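
The two scores named in the abstract can be computed from any set of neighbor lists, which is what makes it possible to swap exact search for an approximate index. The following is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the function names are hypothetical, the LOF formulas follow the standard definition of Breunig et al. [7], and an exact brute-force search stands in for the HNSW index so that the example has no dependencies beyond the Python standard library.

```python
import math


def knn_lists(points, k):
    """Exact brute-force k-nearest-neighbor lists.

    Assumption: this is only a stand-in for an approximate index such as HNSW.
    Returns, for each point, k (distance, neighbor_index) pairs sorted by
    distance, excluding the point itself.
    """
    result = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        result.append(dists[:k])
    return result


def knn_outlier_scores(neighbors):
    """kNN outlier score: distance to the k-th (last) neighbor in each list."""
    return [nbrs[-1][0] for nbrs in neighbors]


def lof_scores(neighbors):
    """Local outlier factor computed from precomputed neighbor lists."""
    k_dist = [nbrs[-1][0] for nbrs in neighbors]

    # Local reachability density: inverse of the mean reachability distance,
    # where reach-dist(p, o) = max(k-distance(o), d(p, o)).
    lrd = []
    for nbrs in neighbors:
        reach = [max(k_dist[j], d) for d, j in nbrs]
        # Small constant guards against division by zero for duplicate points.
        lrd.append(1.0 / (sum(reach) / len(reach) + 1e-10))

    # LOF: mean ratio of the neighbors' densities to the point's own density.
    return [
        sum(lrd[j] for _, j in nbrs) / (len(nbrs) * lrd[i])
        for i, nbrs in enumerate(neighbors)
    ]


# Toy data: five points in a tight cluster and one isolated point (index 5).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (8.0, 8.0)]
neighbors = knn_lists(points, k=3)
knn_scores = knn_outlier_scores(neighbors)
lof = lof_scores(neighbors)
print("kNN scores:", [round(s, 3) for s in knn_scores])
print("LOF scores:", [round(s, 3) for s in lof])
```

Because both score functions only consume neighbor lists, replacing `knn_lists` with the result of an approximate index changes the input neighborhoods but not the scoring code. On the toy data the isolated point receives the largest score under both methods.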

Notes

  1. https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.NearestNeighbors.html

  2. https://github.com/nmslib/hnswlib

  3. https://github.com/scikit-learn/scikit-learn/blob/1.5.1/sklearn/neighbors/_lof.py

  4. Building the HNSW index using hnswlib in a multicore environment led to slightly different graphs due to different pruning orders. This non-determinism resulted in different quality scores for the same hyperparameter settings.

References

  1. Davídsson, ÓA., Henriksen, S.B., Davídsson, T.B.: Improving the Efficiency of Outlier Detection Using Approximate Nearest Neighbor Search. Master's thesis, IT University of Copenhagen (2023)

  2. Angiulli, F., Fassetti, F.: DOLPHIN: an efficient algorithm for mining distance-based outliers in very large datasets. ACM TKDD 3(1), 4, 1–57 (2009)

  3. Angiulli, F., Pizzuti, C.: Outlier mining in large high-dimensional data sets. IEEE TKDE 17(2), 203–215 (2005)

  4. Aumüller, M., Bernhardsson, E., Faithfull, A.J.: ANN-Benchmarks: a benchmarking tool for approximate nearest neighbor algorithms. Inf. Syst. 87 (2020)

  5. Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: Proceedings of KDD, pp. 29–38 (2003)

  6. Bhattacharya, A., Varambally, S., Bagchi, A., Bedathur, S.: Fast one-class classification using class boundary-preserving random projections. In: KDD (2021)

  7. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: SIGMOD Conference, pp. 93–104. ACM (2000)

  8. Campos, G.O., et al.: On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min. Knowl. Disc. 30(4), 891–927 (2016)

  9. de Vries, T., Chawla, S., Houle, M.E.: Density-preserving projections for large-scale local anomaly detection. KAIS 32(1), 25–52 (2012)

  10. Han, S., Hu, X., Huang, H., Jiang, M., Zhao, Y.: ADBench: anomaly detection benchmark. In: NeurIPS (2022)

  11. Hautamäki, V., Kärkkäinen, I., Fränti, P.: Outlier detection using k-nearest neighbour graph. ICPR 3, 430–433 (2004)

  12. Indyk, P., Xu, H.: Worst-case performance of popular approximate nearest neighbor search implementations: guarantees and limitations. In: NeurIPS (2023)

  13. Iwasaki, M., Miyazaki, D.: Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-Dimensional Data (2018)

  14. Jin, W., Tung, A.K., Han, J.: Mining top-n local outliers in large databases. In: Proceedings of KDD, pp. 293–298 (2001)

  15. Kirner, E., Schubert, E., Zimek, A.: Good and bad neighborhood approximations for outlier detection ensembles. In: SISAP. Lecture Notes in Computer Science, vol. 10609, pp. 173–187. Springer (2017)

  16. Kollios, G., Gunopulos, D., Koudas, N., Berchtold, S.: Efficient biased sampling for approximate clustering and outlier detection in large datasets. IEEE TKDE 15(5) (2003)

  17. Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE TPAMI 42(4) (2020)

  18. Manohar, M.D., et al.: ParlayANN: scalable and deterministic parallel graph-based approximate nearest neighbor search algorithms. In: PPoPP, pp. 270–285. ACM (2024)

  19. Mirsky, Y., Doitshman, T., Elovici, Y., Shabtai, A.: Kitsune: an ensemble of autoencoders for online network intrusion detection. In: The Network and Distributed System Security Symposium (NDSS) (2018)

  20. Nguyen, H.V., Gopalkrishnan, V.: Efficient pruning schemes for distance-based outlier detection. In: Proceedings of ECML PKDD, pp. 160–175 (2009)

  21. Orair, G.H., Teixeira, C., Wang, Y., Meira, W., Jr., Parthasarathy, S.: Distance-based outlier detection: consolidation and renewed bearing. PVLDB 3(2), 1469–1480 (2010)

  22. Radovanovic, M., Nanopoulos, A., Ivanovic, M.: Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE TKDE 27(5) (2015)

  23. Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. In: Proceedings of SIGMOD (2000)

  24. Schubert, E., Zimek, A., Kriegel, H.P.: Fast and scalable outlier detection with approximate nearest neighbor ensembles. In: Proceedings of DASFAA (2015)

  25. Schubert, E., Zimek, A., Kriegel, H.P.: Generalized outlier detection with flexible kernel density estimates. In: SDM, pp. 542–550. SIAM (2014)

  26. Schubert, E., Zimek, A., Kriegel, H.P.: Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min. Knowl. Disc. 28, 190–237 (2014)

  27. Subramanya, S.J., Devvrit, F., Simhadri, H.V., Krishnaswamy, R., Kadekodi, R.: DiskANN: fast accurate billion-point nearest neighbor search on a single node. In: NeurIPS, pp. 13748–13758 (2019)

  28. Wang, Y., Parthasarathy, S., Tatikonda, S.: Locality sensitive outlier detection: a ranking driven approach. In: Proceedings of ICDE (2011)

  29. Zhang, X., et al.: LSHiForest: a generic framework for fast tree isolation based ensemble anomaly analysis. In: Proceedings of ICDE (2017)

  30. Zhao, Y., Nasrullah, Z., Li, Z.: PyOD: a Python toolbox for scalable outlier detection. J. Mach. Learn. Res. 20(96) (2019)

  31. Zimek, A., Filzmoser, P.: There and back again: outlier detection between statistical reasoning and data mining algorithms. WIREs Data Min. Knowl. Discov. 8(6) (2018)

  32. Zimek, A., Schubert, E., Kriegel, H.P.: A survey on unsupervised outlier detection in high-dimensional numerical data. Stat. Anal. Data Min. 5(5), 363–387 (2012)

Acknowledgments

The initial research question was investigated in a Master's thesis by Davídsson et al. [1]. We thank the anonymous reviewers for their careful comments on the paper. This project received funding from the Innovation Fund Denmark for the project DIREC (9142-00001B).

Author information

Correspondence to Camilla Birch Okkels or Martin Aumüller.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Okkels, C.B., Aumüller, M., Zimek, A. (2025). On the Design of Scalable Outlier Detection Methods Using Approximate Nearest Neighbor Graphs. In: Chávez, E., Kimia, B., Lokoč, J., Patella, M., Sedmidubsky, J. (eds) Similarity Search and Applications. SISAP 2024. Lecture Notes in Computer Science, vol 15268. Springer, Cham. https://doi.org/10.1007/978-3-031-75823-2_14

  • DOI: https://doi.org/10.1007/978-3-031-75823-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-75822-5

  • Online ISBN: 978-3-031-75823-2

  • eBook Packages: Computer Science (R0)
