Fast Distributed Outlier Detection in Mixed-Attribute Data Sets

Otey, Matthew Eric; Ghoting, Amol; Parthasarathy, Srinivasan

doi:10.1007/s10618-005-0014-6

Published: 07 April 2006

Data Mining and Knowledge Discovery, Volume 12, pages 203–228 (2006)

Matthew Eric Otey¹,
Amol Ghoting¹ &
Srinivasan Parthasarathy¹

Abstract

Efficiently detecting outliers or anomalies is an important problem in many areas of science, medicine and information technology. Applications range from data cleaning to clinical diagnosis, from detecting anomalous defects in materials to fraud and intrusion detection. Over the past decade, researchers in data mining and statistics have addressed the problem of outlier detection using both parametric and non-parametric approaches in a centralized setting. However, there are still several challenges that must be addressed. First, most approaches to date have focused on detecting outliers in a continuous attribute space. However, almost all real-world data sets contain a mixture of categorical and continuous attributes. Categorical attributes are typically ignored or incorrectly modeled by existing approaches, resulting in a significant loss of information. Second, there have not been any general-purpose distributed outlier detection algorithms. Most distributed detection algorithms are designed with a specific domain (e.g. sensor networks) in mind. Third, the data sets being analyzed may be streaming or otherwise dynamic in nature. Such data sets are prone to concept drift, and models of the data must be dynamic as well. To address these challenges, we present a tunable algorithm for distributed outlier detection in dynamic mixed-attribute data sets.

Notes

For all experiments, unless otherwise noted, we use the following parameter settings: s = 10, τ = 1.96, δ = 30%, ΔScore = 10, ScoreWindowSize = 40.
Detection and false positive rates for the Adult data set are not available since the data is unlabeled.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, The Ohio State University, 395 Dreese Labs, 2015 Neil Avenue, Columbus, Ohio, 43210, USA
Matthew Eric Otey, Amol Ghoting & Srinivasan Parthasarathy

Corresponding author

Correspondence to Matthew Eric Otey.

About this article

Cite this article

Otey, M.E., Ghoting, A. & Parthasarathy, S. Fast Distributed Outlier Detection in Mixed-Attribute Data Sets. Data Min Knowl Disc 12, 203–228 (2006). https://doi.org/10.1007/s10618-005-0014-6

Received: 04 April 2005
Accepted: 27 July 2005
Published: 07 April 2006
Issue Date: May 2006
DOI: https://doi.org/10.1007/s10618-005-0014-6
