Combining KNN with AutoEncoder for Outlier Detection

Regular Paper, published in Journal of Computer Science and Technology

Abstract

K-nearest neighbor (KNN) is one of the most fundamental methods for unsupervised outlier detection because of its various advantages, e.g., ease of use and relatively high accuracy. Currently, most data analytic tasks need to deal with high-dimensional data, and KNN-based methods often fail due to “the curse of dimensionality”. AutoEncoder-based methods have recently been introduced to use reconstruction errors for outlier detection on high-dimensional data, but the direct use of AutoEncoder typically does not preserve the data proximity relationships well enough for outlier detection. In this study, we propose to combine KNN with AutoEncoder for outlier detection. First, we propose the Nearest Neighbor AutoEncoder (NNAE), which preserves the original data proximity in a much lower-dimensional space that is more suitable for performing KNN. Second, we propose K-nearest reconstruction neighbors (KNRNs), which incorporate the reconstruction errors of NNAE with the K-distances of KNN to detect outliers. Third, we develop a method to automatically choose better parameters for optimizing the structure of NNAE. Finally, using five real-world datasets, we experimentally show that our proposed approach NNAE+KNRN is much better than existing methods, i.e., KNN, Isolation Forest, a traditional AutoEncoder using reconstruction errors (AutoEncoder-RE), and Robust AutoEncoder.
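The core idea of combining the two outlier signals can be illustrated with a minimal sketch. This is not the paper's NNAE+KNRN method: here a PCA projection stands in for the trained autoencoder, and a simple min-max-normalized average stands in for KNRN's combination rule; all function names and the toy dataset are hypothetical.

```python
import numpy as np

def knn_kdistance(X, k=3):
    # Euclidean distance from each point to its k-th nearest neighbor.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]  # column 0 is the zero self-distance

def reconstruction_error(X, n_components=1):
    # Linear stand-in for an autoencoder: project onto the top principal
    # components ("encode") and back ("decode"); the residual is the error.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:n_components].T       # encode
    Xhat = Z @ Vt[:n_components]       # decode
    return np.linalg.norm(Xc - Xhat, axis=1)

def combined_score(X, k=3, n_components=1):
    # Min-max normalize each signal, then average them.
    def norm(s):
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)
    return 0.5 * norm(knn_kdistance(X, k)) + \
           0.5 * norm(reconstruction_error(X, n_components))

# An elongated 2-D cluster plus one planted outlier off the principal axis.
rng = np.random.default_rng(0)
cluster = np.column_stack([rng.normal(0, 3.0, 20), rng.normal(0, 0.05, 20)])
X = np.vstack([cluster, [[0.0, 5.0]]])
scores = combined_score(X)
print(int(np.argmax(scores)))  # index 20: the planted outlier scores highest
```

The planted point is far from its neighbors (large K-distance) and lies off the principal axis (large reconstruction error), so both signals agree; the paper's contribution is learning a representation in which such agreement holds on real high-dimensional data.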



Author information

Correspondence to Shuai Ma (马 帅).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61925203 and U22B2021.

Shu-Zheng Liu received his M.S. degree in industrial engineering from Rutgers University, New Brunswick, in 2017, and his M.E. degree in computer science from Beihang University, Beijing, in 2021. He is currently working at Alibaba, Beijing. His research interests include data mining and outlier detection.

Shuai Ma received his Ph.D. degrees in computer science from Peking University, Beijing, in 2004, and from The University of Edinburgh, Edinburgh, in 2010. He is a professor with the School of Computer Science and Engineering, Beihang University, Beijing. He was a postdoctoral research fellow with the Database Group, The University of Edinburgh, a summer intern at Bell Labs, Murray Hill, NJ, and a visiting researcher at MSRA. His current research interests include big data, database theory and systems, graph and social data analysis, data cleaning, and data quality.

Han-Qing Chen received his B.S. degree in software engineering from Beihang University, Beijing, in 2019. He is a Ph.D. candidate in the School of Computer Science and Technology, Beihang University, Beijing. His research interests include big data and graph analytics.

Li-Zhen Cui received his Ph.D. degree in computer science from Shandong University, Jinan, in 2005. He is a professor and the president of the School of Software & C-FAIR, Shandong University, Jinan. His current research interests include recommender systems and data mining.

Jie Ding received his Ph.D. degree in computer science from The University of Edinburgh, Edinburgh, in 2010. He is a professor with the School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang. His current research interests include theoretical computer science, big data, and reinforcement learning.



Cite this article

Liu, SZ., Ma, S., Chen, HQ. et al. Combining KNN with AutoEncoder for Outlier Detection. J. Comput. Sci. Technol. 39, 1153–1166 (2024). https://doi.org/10.1007/s11390-023-2403-y
