Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

Data Mining and Knowledge Discovery

Abstract

Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.

Notes

  1. For all experiments, unless otherwise noted, we use the following parameter settings: s = 10, τ = 1.96, δ = 30%, ΔScore = 10, ScoreWindowSize = 40.

  2. Detection and false positive rates for the Adult data set are not available because the data is unlabeled.

Author information

Corresponding author

Correspondence to Matthew Eric Otey.

About this article

Cite this article

Otey, M.E., Ghoting, A. & Parthasarathy, S. Fast Distributed Outlier Detection in Mixed-Attribute Data Sets. Data Min Knowl Disc 12, 203–228 (2006). https://doi.org/10.1007/s10618-005-0014-6

  • DOI: https://doi.org/10.1007/s10618-005-0014-6
