Capabilities of outlier detection schemes in large datasets, framework and methodologies

  • Regular Paper
Knowledge and Information Systems

Abstract

Outlier detection is concerned with discovering exceptional behaviors of objects. Its theoretical principles and practical implementation lay a foundation for important applications such as credit card fraud detection, discovery of criminal behavior in e-commerce, and detection of computer intrusion. In this paper, we first present a unified model for several existing outlier detection schemes and propose a compatibility theory, which establishes a framework for describing the capabilities of various outlier formulation schemes in terms of how well they match users' intuitions. Under this framework, we show that the density-based scheme is more powerful than the distance-based scheme when a dataset contains patterns with diverse characteristics. The density-based scheme, however, is less effective when the patterns are of comparable density to the outliers. We then introduce a connectivity-based scheme that improves on the effectiveness of the density-based scheme when a pattern itself is of similar density to an outlier. We compare the density-based and connectivity-based schemes in terms of their strengths and weaknesses, and demonstrate applications with different features in which each is more effective than the other. Finally, the connectivity-based and density-based schemes are comparatively evaluated on both real-life and synthetic datasets in terms of recall, precision, rank power and implementation-free metrics.
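
The contrast between distance-based and density-based scoring can be illustrated with a toy example. The sketch below is not the paper's formulation. It assumes the distance to the k-th nearest neighbour as the distance-based score, and a simplified ratio (a point's k-distance divided by the mean k-distance of its neighbours) as a stand-in for a density-based score such as LOF. The function names, the choice k = 3, and the two-grid dataset are all hypothetical and chosen only for illustration.

```python
import math


def knn(points, i, k):
    """Return sorted (distance, index) pairs for the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], q), j) for j, q in enumerate(points) if j != i
    )
    return dists[:k]


def kth_distance_scores(points, k):
    """Distance-based score (assumed form): distance to the k-th nearest neighbour."""
    return [knn(points, i, k)[-1][0] for i in range(len(points))]


def density_ratio_scores(points, k):
    """Simplified density-style score: own k-distance over the neighbours' mean k-distance.

    This is a stand-in for LOF-like scores, not LOF itself (no reachability distance).
    """
    kdist = kth_distance_scores(points, k)
    scores = []
    for i in range(len(points)):
        neighbours = [j for _, j in knn(points, i, k)]
        mean_neighbour_kdist = sum(kdist[j] for j in neighbours) / k
        scores.append(kdist[i] / mean_neighbour_kdist)
    return scores


def rank_of(scores, index):
    """1-based rank of `index` when points are sorted by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order.index(index) + 1


# Hypothetical dataset: a dense 5x5 grid (spacing 1), a sparse 5x5 grid
# (spacing 20), and one isolated point 5 units away from the dense grid.
dense = [(x, y) for x in range(5) for y in range(5)]
sparse = [(100 + 20 * x, 100 + 20 * y) for x in range(5) for y in range(5)]
outlier = (9, 2)
points = dense + sparse + [outlier]
outlier_index = len(points) - 1

K = 3
distance_rank = rank_of(kth_distance_scores(points, K), outlier_index)
density_rank = rank_of(density_ratio_scores(points, K), outlier_index)
print("rank under k-distance score:", distance_rank)
print("rank under density-ratio score:", density_rank)
```

In this layout every point of the sparse grid has a larger k-distance than the isolated point near the dense grid, so a single global distance threshold cannot flag that point without also flagging the whole sparse pattern. The ratio score compares each point with its own neighbourhood and ranks the isolated point first.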

Author information

Corresponding author

Correspondence to Jian Tang.

Additional information

Jian Tang received an MS degree from the University of Iowa in 1983 and a PhD from the Pennsylvania State University in 1988, both from the Department of Computer Science. He joined the Department of Computer Science, Memorial University of Newfoundland, Canada, in 1988, where he is currently a professor. He has visited a number of research institutions to conduct research on a variety of topics relating to the theory and practice of database management and systems. His current research interests include data mining, e-commerce, XML and bioinformatics.

Zhixiang Chen is an associate professor in the Computer Science Department, University of Texas-Pan American. He received his PhD in computer science from Boston University in January 1996, and his BS and MS degrees in software engineering from Huazhong University of Science and Technology. He also studied at the University of Illinois at Chicago. He taught at Southwest State University from Fall 1995 to September 1997, and at Huazhong University of Science and Technology from 1982 to 1990. His research interests include computational learning theory, algorithms and complexity, intelligent Web search, information retrieval, and data mining.

Ada Waichee Fu received her BSc degree in computer science from the Chinese University of Hong Kong in 1983, and her MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1986 and 1990, respectively. She worked at Bell Northern Research in Ottawa, Canada, from 1989 to 1993 on a wide-area distributed database project, and joined the Chinese University of Hong Kong in 1993. Her research interests are XML data, time series databases, data mining, content-based retrieval in multimedia databases, and parallel and distributed systems.

David Wai-lok Cheung received the MSc and PhD degrees in computer science from Simon Fraser University, Canada, in 1985 and 1989, respectively. He also received the BSc degree in mathematics from the Chinese University of Hong Kong. From 1989 to 1993, he was a member of the scientific staff at Bell Northern Research, Canada. Since 1994, he has been a faculty member of the Department of Computer Science at the University of Hong Kong. He is also the Director of the Center for E-Commerce Infrastructure Development. His research interests include data mining, data warehousing, XML technology for e-commerce, and bioinformatics. Dr. Cheung was the Program Committee Chairman of the Fifth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2001) and Program Co-Chair of the Ninth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2005). Dr. Cheung is a member of the ACM and the IEEE Computer Society.

About this article

Cite this article

Tang, J., Chen, Z., Fu, A.W. et al. Capabilities of outlier detection schemes in large datasets, framework and methodologies. Knowl Inf Syst 11, 45–84 (2007). https://doi.org/10.1007/s10115-005-0233-6

  • DOI: https://doi.org/10.1007/s10115-005-0233-6