Abstract
K-nearest neighbor (KNN) is one of the most fundamental methods for unsupervised outlier detection because of its advantages, e.g., ease of use and relatively high accuracy. Most data analytics tasks today involve high-dimensional data, on which KNN-based methods often fail due to "the curse of dimensionality". AutoEncoder-based methods have recently been introduced to detect outliers in high-dimensional data by their reconstruction errors, but the direct use of an AutoEncoder typically does not preserve the data proximity relationships well enough for outlier detection. In this study, we propose to combine KNN with AutoEncoder for outlier detection. First, we propose the Nearest Neighbor AutoEncoder (NNAE), which preserves the original data proximity in a much lower dimension that is more suitable for performing KNN. Second, we propose the K-nearest reconstruction neighbors (KNRNs), which incorporate the reconstruction errors of NNAE with the K-distances of KNN to detect outliers. Third, we develop a method to automatically choose better parameters for optimizing the structure of NNAE. Finally, using five real-world datasets, we experimentally show that our proposed approach NNAE+KNRN outperforms existing methods, namely KNN, Isolation Forest, a traditional AutoEncoder using reconstruction errors (AutoEncoder-RE), and Robust AutoEncoder.
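The general idea of scoring a point by both its reconstruction error and its K-distance in a low-dimensional space can be illustrated with a minimal sketch. This is not the NNAE or KNRN procedure, whose exact definitions the abstract does not give. As labeled assumptions, the sketch uses a linear PCA projection as a stand-in for a trained AutoEncoder, and it combines the two signals by summing their min-max normalized values. All function names and the toy data are hypothetical.

```python
import numpy as np

def pca_encode_decode(X, dim):
    """Linear stand-in for an autoencoder (assumption): project onto the
    top `dim` principal axes, then map back to the input space."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:dim].T              # shape (n_features, dim)
    Z = Xc @ W                  # low-dimensional codes
    X_hat = Z @ W.T + mean      # reconstructions
    return Z, X_hat

def k_distance(Z, k):
    """Distance from each row of Z to its k-th nearest other row (brute force)."""
    diff = Z[:, None, :] - Z[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting, column 0 is the zero distance of each point to itself.
    return np.sort(D, axis=1)[:, k]

def min_max(v):
    """Rescale a vector to [0, 1]; a constant vector maps to all zeros."""
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def combined_outlier_scores(X, dim=2, k=5):
    """Sum of normalized reconstruction error and normalized K-distance.
    The equal-weight sum is an illustrative choice, not the paper's rule."""
    Z, X_hat = pca_encode_decode(X, dim)
    recon_err = np.linalg.norm(X - X_hat, axis=1)
    return min_max(recon_err) + min_max(k_distance(Z, k))

# Toy data: 200 inliers that vary mostly along two axes, plus one point
# that sits far from the inliers along the remaining eight axes.
rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 10)) * np.array([3.0, 2.0] + [0.1] * 8)
outlier = np.full((1, 10), 4.0)
X = np.vstack([inliers, outlier])

scores = combined_outlier_scores(X, dim=2, k=5)
print("index of highest score:", int(np.argmax(scores)))
print("score of the injected point:", float(scores[-1]))
```

In this toy setting the injected point is poorly reconstructed from two dimensions, so its reconstruction term dominates its score. The K-distance term adds a proximity signal computed in the reduced space, where brute-force neighbor search is cheaper and less affected by dimensionality.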
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
This work was supported in part by the National Natural Science Foundation of China under Grant Nos. 61925203 and U22B2021.
Shu-Zheng Liu received his M.S. degree in industrial engineering from Rutgers University, New Brunswick, in 2017, and his M.E. degree in computer science from Beihang University, Beijing, in 2021. He is currently working at Alibaba, Beijing. His research interests include data mining and outlier detection.
Shuai Ma received Ph.D. degrees in computer science from Peking University, Beijing, in 2004, and from The University of Edinburgh, Edinburgh, in 2010. He is a professor with the School of Computer Science and Engineering, Beihang University, Beijing. He was a postdoctoral research fellow with the Database Group, The University of Edinburgh, a summer intern at Bell Labs, Murray Hill, NJ, and a visiting researcher at MSRA. His current research interests include big data, database theory and systems, graph and social data analysis, data cleaning, and data quality.
Han-Qing Chen received his B.S. degree in software engineering from Beihang University, Beijing, in 2019. He is a Ph.D. candidate in the School of Computer Science and Technology, Beihang University, Beijing. His research interests include big data and graph analytics.
Li-Zhen Cui received his Ph.D. degree in computer science from Shandong University, Jinan, in 2005. He is a professor with, and president of, the School of Software & C-FAIR, Shandong University, Jinan. His current research interests include recommender systems and data mining.
Jie Ding received his Ph.D. degree in computer science from The University of Edinburgh, Edinburgh, in 2010. He is a professor with the School of Computer Science, Jiangsu University of Science and Technology, Zhenjiang. His current research interests include theoretical computer science, big data, and reinforcement learning.
Cite this article
Liu, SZ., Ma, S., Chen, HQ. et al. Combining KNN with AutoEncoder for Outlier Detection. J. Comput. Sci. Technol. 39, 1153–1166 (2024). https://doi.org/10.1007/s11390-023-2403-y
DOI: https://doi.org/10.1007/s11390-023-2403-y