Abstract
Finding unusual data objects with the distance-based outlier definition of Ramaswamy et al. is attractive because it requires only a metric distance function between two objects. Unlike many existing algorithms based on the definition of Knorr and Ng, it does not need a neighborhood distance threshold. Bay and Schwabacher proposed ORCA, an efficient algorithm for this task that can give near-linear time performance. To further reduce the running time, we propose two algorithms, RC and RS, which use the following techniques respectively: (i) faster cutoff update and (ii) space utilization after pruning. We tested RC, RS, and RCS (a hybrid approach combining RC and RS) on several large, high-dimensional real data sets with millions of objects. The experiments show that RCS runs 1.4 to 2.3 times as fast as ORCA and that this improvement is relatively insensitive to increases in data size.
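To make the setting concrete, the following is a minimal Python sketch of the baseline idea the abstract refers to: rank objects by the distance to their k-th nearest neighbor (the definition of Ramaswamy et al.) and abandon an object as soon as that distance is known to fall below the current cutoff, in the spirit of the simple pruning rule of Bay and Schwabacher. It is not RC, RS, or RCS, whose details the abstract does not give. The function name `top_n_outliers`, its parameters, and the use of Euclidean distance are assumptions made for illustration only.

```python
import heapq
import math
import random


def top_n_outliers(points, k, n, seed=0):
    """Return up to n (score, index) pairs, where score is the distance
    to the k-th nearest neighbor; larger scores are stronger outliers.
    Illustrative sketch only, assuming Euclidean distance."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # random scan order helps pruning trigger early

    top = []      # min-heap of (score, index); top[0] is the weakest current outlier
    cutoff = 0.0  # score of the weakest outlier once n have been found

    for i in range(len(points)):
        nearest = []  # negated distances: max-heap of the k smallest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # -nearest[0] is an upper bound on the final k-th NN distance,
            # so once it drops below the cutoff, point i cannot be a top-n outlier.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned or len(nearest) < k:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]  # cutoff only grows as stronger outliers are found
    return sorted(top, reverse=True)
```

The cutoff only increases as stronger outliers are found, so pruning becomes more aggressive over the course of the scan; the first technique named in the abstract, faster cutoff update, targets exactly this quantity.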
References
Hawkins, D.M.: Identification of outliers. Chapman and Hall, Boca Raton (1980)
Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In: VLDB 1998: Proceedings of the 24th International Conference on Very Large Data Bases, pp. 392–403. Morgan Kaufmann Publishers Inc., San Francisco (1998)
Tao, Y., Xiao, X., Zhou, S.: Mining distance-based outliers from large databases in any metric space. In: KDD 2006: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 394–403. ACM, New York (2006)
Angiulli, F., Fassetti, F.: Very efficient mining of distance-based outliers. In: CIKM 2007: Proceedings of the sixteenth ACM conference on information and knowledge management, pp. 791–800. ACM, New York (2007)
Ramaswamy, S., Rastogi, R., Shim, K.: Efficient algorithms for mining outliers from large data sets. SIGMOD Rec. 29(2), 427–438 (2000)
Bay, S.D., Schwabacher, M.: Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In: KDD 2003: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 29–38. ACM, New York (2003)
Ghoting, A., Parthasarathy, S., Otey, M.E.: Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov. 16(3), 349–364 (2008)
Asuncion, A., Newman, D.: UCI machine learning repository (2007)
Hettich, S., Bay, S.D.: The UCI KDD archive (1999)
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Szeto, CC., Hung, E. (2009). Mining Outliers with Faster Cutoff Update and Space Utilization. In: Theeramunkong, T., Kijsirikul, B., Cercone, N., Ho, TB. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2009. Lecture Notes in Computer Science, vol 5476. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-01307-2_85
DOI: https://doi.org/10.1007/978-3-642-01307-2_85
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-01306-5
Online ISBN: 978-3-642-01307-2
eBook Packages: Computer Science (R0)