Abstract
Outlier detection is useful in areas such as fraud detection, financial analysis and health monitoring. Many recent approaches detect outliers according to reasonable, pre-defined notions of an outlier (e.g., distance-based, density-based). However, what constitutes an outlier differs between users and even between datasets. This paper addresses the problem by incorporating input from users. Our OBE (Outlier By Example) system is the first that allows users to provide examples of outliers in low-dimensional datasets. From a small number of such examples, OBE learns a measure of outlierness and uses it to identify further outliers. Building such a system raises several algorithmic challenges and engineering decisions, and we describe the key design decisions and algorithms in this paper. To serve users with different degrees of domain knowledge, we develop two detection schemes: OBE-Fraction and OBE-RF. Our experiments on both real and synthetic datasets demonstrate that OBE can discover values that a user would consider outliers.
Notes
The Euclidean distance measure is used to compute MDEF values in OBE.
The value of α should be between 0 and 1. In all experiments, we set α = 0.5 as in Papadimitriou et al. (2003).
\(\mathrm{MDEF}(p_{i},\,r,\,\alpha)=0\) until \(n(p_{i},\,r)=nb\).
More precisely, if \(r_{j} \geq r_{\min,i}\), then \(m_{ij}=\mathrm{MDEF}(p_{i},\,r_{j},\,\alpha)\); otherwise \(m_{ij}=0\).
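As an illustrative sketch (not the paper's implementation), the MDEF value of a point follows the LOCI definition of Papadimitriou et al. (2003): it compares the point's own α-neighborhood count \(n(p_{i},\,\alpha r)\) against the average such count over its sampling neighborhood of radius r. The function below assumes a plain NumPy array of points and Euclidean distance, and omits the minimum-neighbors (nb) rule from the note above:

```python
import numpy as np

def mdef(points, i, r, alpha=0.5):
    """Multi-granularity deviation factor of points[i] at radius r,
    following the LOCI definition (Papadimitriou et al., 2003).
    n(p, s) counts points within Euclidean distance s of p, including
    p itself. Illustrative sketch only; the nb cutoff is omitted."""
    dists = np.linalg.norm(points - points[i], axis=1)
    # Sampling neighborhood: all points q with d(p_i, q) <= r.
    neighbors = np.where(dists <= r)[0]
    # Counting-neighborhood sizes n(q, alpha*r) for each sampling neighbor q.
    counts = np.array([
        np.sum(np.linalg.norm(points - points[q], axis=1) <= alpha * r)
        for q in neighbors
    ])
    n_i = np.sum(dists <= alpha * r)   # n(p_i, alpha*r)
    n_hat = counts.mean()              # average count over the r-neighborhood
    return 1.0 - n_i / n_hat
```

A point deep inside a cluster gets an MDEF near 0, while an isolated point whose own count falls well below its neighborhood's average gets an MDEF close to 1.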
We ran experiments varying n from 50 to 400; the results showed that OBE is insensitive to n. In all our experiments, n = 100.
S is the step size. For simplicity, we set S = 1 in our experiments.
In fact, in all of our experiments, we never observed an outlier example or outstanding outlier being misclassified as a negative.
For the polynomial kernel, we use the kernel function \((\mathit{u}' \ast \mathit{v}+1)^{2}\).
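A minimal sketch of this quadratic polynomial kernel, assuming plain NumPy vectors (in scikit-learn terms it corresponds to `kernel='poly'` with `degree=2`, `gamma=1`, `coef0=1`):

```python
import numpy as np

def poly_kernel(u, v):
    """Quadratic polynomial kernel K(u, v) = (u'v + 1)^2
    as named in the note above."""
    return (np.dot(u, v) + 1.0) ** 2
```

For example, with u = (1, 2) and v = (3, 4), the kernel evaluates to (3 + 8 + 1)^2 = 144.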
The wide range of candidate values is {2, 5, 10, 20, 30, 40, 50, 60, 80, 100, 150, 200, 250, 300, 400, 500, 700, 1000}.
This was the case in all our experiments.
References
Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection for high dimensional data. In Proc. SIGMOD.
Angiulli, F., & Pizzuti, C. (2005). Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2), 203–215.
Barbará, D., Domeniconi, C., & Rogers, J. P. (2006). Detecting outliers using transduction and statistical testing. In Proc. SIGKDD conf. (pp. 55–64).
Barnett, V., & Lewis, T. (1994). Outliers in statistical data. New York: Wiley.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? In Proc. international conf. on database theory (pp. 217–235).
Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proc. SIGMOD Conf. (pp. 93–104).
Goh, K., Chang, E., & Cheng, K. (2001). SVM binary classifier ensembles for image classification. In Proc. International conf. on information and knowledge management (pp. 395–402).
Hawkins, D. M. (1980). Identification of outliers. London, UK: Chapman and Hall.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Joachims, T. (1998). Text categorization with support vector machines. In Proc. European conf. machine learning (ECML) (pp. 137–142).
Johnson, T., Kwok, I., & Ng, R. T. (1998). Fast computation of 2-dimensional depth contours. In Proc. KDD (pp. 224–228).
Knorr, E. M., & Ng, R. T. (1997). A unified notion of outliers: Properties and computation. In Proc. KDD (pp. 219–222).
Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In Proc. VLDB (pp. 392–403).
Knorr, E. M., & Ng, R. T. (1999). Finding intentional knowledge of distance-based outliers. In Proc. VLDB (pp. 211–222).
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3–4), 237–253.
Markowetz, F. (2003). Support vector machines in bioinformatics. Ph.D. Thesis, University of Heidelberg.
Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: Fast outlier detection using the local correlation integral. In Proc. ICDE (pp. 315–326).
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proc. ACM SIGMOD international conference on management of data (pp. 427–438).
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.
Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.
Yamanishi, K., & Takeuchi, J. (2001). Discovering outlier filtering rules from unlabeled data. In Proc. KDD (pp. 389–394).
Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proc. KDD (pp. 250–254).
Yu, H., Han, J., & Chang, K. (2002). PEBL: Positive example based learning for web page classification using SVM. In Proc. KDD (pp. 239–248).
Zhu, C., Kitagawa, H., & Faloutsos, C. (2005). Example-based robust outlier detection in high dimensional datasets. In Proc. ICDM (pp. 829–832).
Zhu, C., Kitagawa, H., Papadimitriou, S., & Faloutsos, C. (2004). OBE: Outlier by example. In Proc. PAKDD (pp. 222–234).
Acknowledgements
This research was supported in part by the Japan-U.S. Cooperative Science Program of JSPS, U.S.-Japan Joint Seminar (NSF grant 0318547), the Grant-in-Aid for Scientific Research from JSPS (#15300027), and the Beijing outstanding talents training and subsidization program.
Cite this article
Zhu, C., Kitagawa, H., Papadimitriou, S. et al. Outlier detection by example. J Intell Inf Syst 36, 217–247 (2011). https://doi.org/10.1007/s10844-010-0128-1