Abstract
Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementations lay a foundation for important applications such as credit card fraud detection, the discovery of criminal behavior in e-commerce, and computer intrusion detection. In this paper, we first present a unified model for several existing outlier detection schemes and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of how well they match users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns have densities comparable to those of the outliers. We then introduce a connectivity-based scheme that improves the effectiveness of the density-based scheme when a pattern itself has a density similar to that of an outlier. We compare the density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features where each of them is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power and implementation-free metrics.
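As a rough illustration of two of the evaluation measures named above, the following minimal sketch computes the precision and recall of a ranked outlier list. It assumes that a detection scheme assigns each object a numeric score where a higher score means "more outlying", that the m highest-scoring objects are reported, and that the true outliers are known. The function name `precision_recall_at_m` and the toy scores are hypothetical and are not taken from the paper.

```python
def precision_recall_at_m(scores, true_outliers, m):
    """Precision and recall of the m highest-scoring objects.

    scores        -- dict mapping object id -> outlier score
                     (assumption: higher score = more outlying)
    true_outliers -- set of object ids known to be outliers
    m             -- number of top-ranked objects reported
    """
    # Rank object ids by descending outlier score and keep the top m.
    ranked = sorted(scores, key=scores.get, reverse=True)
    top_m = set(ranked[:m])

    # Count reported objects that are genuine outliers.
    hits = len(top_m & true_outliers)

    precision = hits / m if m else 0.0
    recall = hits / len(true_outliers) if true_outliers else 0.0
    return precision, recall


# Hypothetical scores for five objects; "c" and "d" are the true outliers.
scores = {"a": 0.9, "b": 1.1, "c": 3.2, "d": 1.0, "e": 2.7}
true_outliers = {"c", "d"}

precision, recall = precision_recall_at_m(scores, true_outliers, m=2)
```

With m=2 the reported objects are "c" and "e", so one of the two reported objects is a true outlier and one of the two true outliers is found. Raising m can only keep recall the same or increase it, while precision may fall as more normal objects are reported.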
Author information
Jian Tang received an MS degree from the University of Iowa in 1983 and a PhD from the Pennsylvania State University in 1988, both from the Department of Computer Science. He joined the Department of Computer Science, Memorial University of Newfoundland, Canada, in 1988, where he is currently a professor. He has visited a number of research institutions to conduct research on a variety of topics relating to the theory and practice of database management and systems. His current research interests include data mining, e-commerce, XML and bioinformatics.
Zhixiang Chen is an associate professor in the Computer Science Department, University of Texas-Pan American. He received his PhD in computer science from Boston University in January 1996, and his BS and MS degrees in software engineering from Huazhong University of Science and Technology. He also studied at the University of Illinois at Chicago. He taught at Southwest State University from Fall 1995 to September 1997, and at Huazhong University of Science and Technology from 1982 to 1990. His research interests include computational learning theory, algorithms and complexity, intelligent Web search, information retrieval, and data mining.
Ada Waichee Fu received her BSc degree in computer science from the Chinese University of Hong Kong in 1983, and her MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1986 and 1990, respectively. She worked at Bell Northern Research in Ottawa, Canada, from 1989 to 1993 on a wide-area distributed database project, and joined the Chinese University of Hong Kong in 1993. Her research interests are XML data, time series databases, data mining, content-based retrieval in multimedia databases, and parallel and distributed systems.
David Wai-lok Cheung received the MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1985 and 1989, respectively. He also received the BSc degree in mathematics from the Chinese University of Hong Kong. From 1989 to 1993, he was a member of the Scientific Staff at Bell Northern Research, Canada. Since 1994, he has been a faculty member of the Department of Computer Science at the University of Hong Kong. He is also the Director of the Center for E-Commerce Infrastructure Development. His research interests include data mining, data warehousing, XML technology for e-commerce and bioinformatics. Dr. Cheung was the Program Committee Chairman of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001) and Program Co-Chair of the Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005). Dr. Cheung is a member of the ACM and the IEEE Computer Society.
About this article
Cite this article
Tang, J., Chen, Z., Fu, A.W. et al. Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inf Syst 11, 45–84 (2007). https://doi.org/10.1007/s10115-005-0233-6