Combining KNN with AutoEncoder for Outlier Detection

Liu, Shu-Zheng; Ma, Shuai; Chen, Han-Qing; Cui, Li-Zhen; Ding, Jie

doi:10.1007/s11390-023-2403-y

Regular Paper
Published: 05 December 2024

Volume 39, pages 1153–1166 (2024)

Journal of Computer Science and Technology

Shu-Zheng Liu (刘叔正)¹,
Shuai Ma (马帅)¹,
Han-Qing Chen (陈瀚清)¹,
Li-Zhen Cui (崔立真)²,³ &
Jie Ding (丁杰)⁴

Abstract

K-nearest neighbor (KNN) is one of the most fundamental methods for unsupervised outlier detection because of its advantages, e.g., ease of use and relatively high accuracy. However, most data analytics tasks today must deal with high-dimensional data, where KNN-based methods often fail due to “the curse of dimensionality”. AutoEncoder-based methods have recently been introduced to detect outliers in high-dimensional data using reconstruction errors, but a directly applied AutoEncoder typically does not preserve the proximity relationships among data points well enough for outlier detection. In this study, we propose to combine KNN with AutoEncoder for outlier detection. First, we propose the Nearest Neighbor AutoEncoder (NNAE), which preserves the original data proximity in a much lower-dimensional space that is more suitable for performing KNN. Second, we propose the K-nearest reconstruction neighbors (KNRNs), which incorporate the reconstruction errors of NNAE with the K-distances of KNN to detect outliers. Third, we develop a method to automatically choose better parameters for optimizing the structure of NNAE. Finally, using five real-world datasets, we experimentally show that our proposed approach NNAE+KNRN is much better than existing methods, i.e., KNN, Isolation Forest, a traditional AutoEncoder using reconstruction errors (AutoEncoder-RE), and Robust AutoEncoder.
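
The general idea of scoring points by both a reconstruction error and a K-distance measured in a lower-dimensional embedding can be illustrated with a minimal sketch. It is not the NNAE or KNRN method described above, whose definitions are not given in this abstract: PCA stands in for the AutoEncoder, the combined score is an assumed sum of robustly standardized reconstruction errors and K-distances, and every function name and parameter value (`n_components=2`, `k=5`) is hypothetical.

```python
import numpy as np


def k_distance(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point's zero distance to itself.
    return np.sort(dists, axis=1)[:, k]


def pca_embed_and_reconstruct(data, n_components):
    """Linear stand-in for an autoencoder (assumption): PCA encode and decode."""
    mean = data.mean(axis=0)
    centered = data - mean
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axes = vt[:n_components]              # top principal directions
    codes = centered @ axes.T             # low-dimensional embedding
    reconstruction = codes @ axes + mean  # mapped back to the input space
    return codes, reconstruction


def robust_z(values):
    """Standardize with median and MAD so one extreme value cannot set the scale."""
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    return (values - median) / (mad + 1e-12)


def combined_outlier_scores(data, n_components=2, k=5):
    """Assumed combination: standardized reconstruction error plus K-distance."""
    codes, reconstruction = pca_embed_and_reconstruct(data, n_components)
    recon_error = np.linalg.norm(data - reconstruction, axis=1)
    kdist = k_distance(codes, k)  # K-distance computed in the embedding
    return robust_z(recon_error) + robust_z(kdist)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    inliers = rng.normal(0.0, 1.0, size=(200, 20))
    planted = np.full((1, 20), 8.0)  # one synthetic outlier, stored at index 200
    data = np.vstack([inliers, planted])
    scores = combined_outlier_scores(data)
    print("highest-scoring index:", int(np.argmax(scores)))
```

Higher scores mean more outlying points. The median and MAD scaling is a design choice of this sketch: it keeps a single extreme K-distance from compressing all other scores toward zero, which a min-max scaling would do.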

Author information

Authors and Affiliations

State Key Laboratory of Software Development Environment, Beihang University, Beijing, 100191, China
Shu-Zheng Liu (刘叔正), Shuai Ma (马帅) & Han-Qing Chen (陈瀚清)
School of Software, Shandong University, Jinan, 250100, China
Li-Zhen Cui (崔立真)
Joint SDU-NTU Centre for Artificial Intelligence Research, Shandong University, Jinan, 250100, China
Li-Zhen Cui (崔立真)
School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang, 212003, China
Jie Ding (丁杰)

Corresponding author

Correspondence to Shuai Ma (马帅).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61925203 and U22B2021.

Shu-Zheng Liu received his M.S. degree in industrial engineering from Rutgers University, New Brunswick, in 2017, and his M.E. degree in computer science from Beihang University, Beijing, in 2021. He is currently working at Alibaba, Beijing. His research interests include data mining and outlier detection.

Shuai Ma received Ph.D. degrees in computer science from Peking University, Beijing, in 2004, and from The University of Edinburgh, Edinburgh, in 2010. He is a professor with the School of Computer Science and Engineering, Beihang University, Beijing. He was a postdoctoral research fellow with the Database Group, The University of Edinburgh, a summer intern at Bell Labs, Murray Hill, NJ, and a visiting researcher at MSRA. His current research interests include big data, database theory and systems, graph and social data analysis, data cleaning, and data quality.

Han-Qing Chen received his B.S. degree in software engineering from Beihang University, Beijing, in 2019. He is a Ph.D. candidate in the School of Computer Science and Technology, Beihang University, Beijing. His research interests include big data and graph analytics.

Li-Zhen Cui received his Ph.D. degree in computer science from Shandong University, Jinan, in 2005. He is a professor with, and the president of, the School of Software & C-FAIR, Shandong University, Jinan. His current research interests include recommender systems and data mining.

Jie Ding received his Ph.D. degree in computer science from The University of Edinburgh, Edinburgh, in 2010. He is a professor with the School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang. His current research interests include theoretical computer science, big data, and reinforcement learning.

Electronic supplementary material