Abstract
This paper discusses outlier detection algorithms used in data mining systems. The basic approaches currently used to solve this problem are reviewed, along with their advantages and disadvantages. A new outlier detection algorithm is proposed. It is based on methods of fuzzy set theory and on kernel functions, and it has a number of advantages over existing methods. The performance of the proposed algorithm is studied on an applied anomaly detection problem that arises in computer security systems, namely in so-called intrusion detection systems.
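The abstract does not spell out the algorithm, so the following is only a rough illustration of the general idea it names (a kernel function combined with fuzzy membership degrees), not the method of the paper. It is a minimal sketch under stated assumptions: a Gaussian kernel with a hypothetical width `gamma`, the squared distance of each point to the data mean in the kernel-induced feature space, and an assumed membership function u = 1 / (1 + d²/η) with a hypothetical scale `eta` and cutoff `threshold`. All function and parameter names are illustrative.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel; gamma is an assumed width parameter."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq)


def kernel_sq_distances_to_mean(points, kernel=rbf_kernel):
    """Squared distance of each point to the mean of all points in feature space.

    Only kernel evaluations are needed:
    d2(x) = K(x, x) - (2/n) * sum_i K(x, x_i) + (1/n^2) * sum_ij K(x_i, x_j)
    """
    n = len(points)
    gram = [[kernel(p, q) for q in points] for p in points]
    mean_all = sum(sum(row) for row in gram) / (n * n)
    # Clamp at zero to absorb tiny negative values caused by rounding.
    return [max(0.0, gram[i][i] - 2.0 * sum(gram[i]) / n + mean_all)
            for i in range(n)]


def fuzzy_memberships(sq_dists, eta=None):
    """Map squared distances to membership degrees in (0, 1].

    Assumed form: u = 1 / (1 + d2 / eta). By default eta is the mean
    squared distance, which is an illustrative choice only.
    """
    if eta is None:
        eta = sum(sq_dists) / len(sq_dists)
    if eta <= 0.0:
        # All points coincide in feature space: nothing stands out.
        return [1.0 for _ in sq_dists]
    return [1.0 / (1.0 + d2 / eta) for d2 in sq_dists]


def detect_outliers(points, threshold=0.5):
    """Return indices whose membership is below the threshold, plus all memberships."""
    memberships = fuzzy_memberships(kernel_sq_distances_to_mean(points))
    flagged = [i for i, u in enumerate(memberships) if u < threshold]
    return flagged, memberships


points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers, memberships = detect_outliers(points)
print(outliers)  # → [4]
```

In this toy data set the four clustered points receive memberships close to 0.8, while the isolated point falls to about 0.2 and is the only one flagged. A graded membership degree, rather than a hard yes/no label, is what lets the cutoff be tuned after scoring.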
Cite this article
Petrovskiy, M.I. Outlier Detection Algorithms in Data Mining Systems. Programming and Computer Software 29, 228–237 (2003). https://doi.org/10.1023/A:1024974810270