Abstract
Outlier detection is useful in areas such as fraud detection, financial analysis and health monitoring. Many recent approaches detect outliers according to reasonable, pre-defined notions of an outlier (e.g., distance-based, density-based). However, what constitutes an outlier differs between users and even between datasets. This paper addresses the problem by incorporating input from users. Our OBE (Outlier By Example) system is the first that allows users to provide examples of outliers in low-dimensional datasets. From a small number of such examples, OBE learns a measure of outlierness and uses it to identify further outliers. Building such a system raises several algorithmic challenges and engineering decisions, and we describe the key design decisions and algorithms in this paper. To serve users with different degrees of domain knowledge, we develop two detection schemes: OBE-Fraction and OBE-RF. Our experiments on both real and synthetic datasets demonstrate that OBE can discover values that a user would consider outliers.
Notes
The Euclidean distance measure is used to compute MDEF values in OBE.
The value of α should be between 0 and 1. In all experiments, we set α = 0.5 as in Papadimitriou et al. (2003).
\(\mathrm{MDEF}(p_{i},\,r,\,\alpha)=0\) until \(n(p_{i},\,r)=nb\).
More precisely, if \(r_{j} \geq r_{\min,i}\), then \(m_{ij}=\mathrm{MDEF}(p_{i},\,r_{j},\,\alpha)\); otherwise \(m_{ij}=0\).
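As an illustrative sketch (not the paper's implementation), the MDEF value of a point follows the LOCI definition of Papadimitriou et al. (2003): it compares the point's own α-neighborhood count \(n(p_{i},\,\alpha r)\) against the average such count over its sampling neighborhood of radius r. The function below assumes a plain NumPy array of points and Euclidean distance, and omits the minimum-neighbors (nb) rule from the note above:

```python
import numpy as np

def mdef(points, i, r, alpha=0.5):
    """Multi-granularity deviation factor of points[i] at radius r,
    following the LOCI definition (Papadimitriou et al., 2003).
    n(p, s) counts points within Euclidean distance s of p, including
    p itself. Illustrative sketch only; the nb cutoff is omitted."""
    dists = np.linalg.norm(points - points[i], axis=1)
    # Sampling neighborhood: all points q with d(p_i, q) <= r.
    neighbors = np.where(dists <= r)[0]
    # Counting-neighborhood sizes n(q, alpha*r) for each sampling neighbor q.
    counts = np.array([
        np.sum(np.linalg.norm(points - points[q], axis=1) <= alpha * r)
        for q in neighbors
    ])
    n_i = np.sum(dists <= alpha * r)   # n(p_i, alpha*r)
    n_hat = counts.mean()              # average count over the r-neighborhood
    return 1.0 - n_i / n_hat
```

A point deep inside a cluster gets an MDEF near 0, while an isolated point whose own count falls well below its neighborhood's average gets an MDEF close to 1.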
We ran experiments varying n from 50 to 400; the results showed that OBE is insensitive to n. In all our experiments, n = 100.
S is the step size. For simplicity, we set S = 1 in our experiments.
In fact, in all of our experiments, we never observed an outlier example or outstanding outlier being misclassified as a negative.
For the polynomial kernel, we use the kernel function \((\mathit{u}' \ast \mathit{v}+1)^{2}\).
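A minimal sketch of this quadratic polynomial kernel, assuming plain NumPy vectors (in scikit-learn terms it corresponds to `kernel='poly'` with `degree=2`, `gamma=1`, `coef0=1`):

```python
import numpy as np

def poly_kernel(u, v):
    """Quadratic polynomial kernel K(u, v) = (u'v + 1)^2
    as named in the note above."""
    return (np.dot(u, v) + 1.0) ** 2
```

For example, with u = (1, 2) and v = (3, 4), the kernel evaluates to (3 + 8 + 1)^2 = 144.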
The wide range of candidate values is {2, 5, 10, 20, 30, 40, 50, 60, 80, 100, 150, 200, 250, 300, 400, 500, 700, 1000}.
This was the case in all our experiments.
References
Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection for high dimensional data. In Proc. SIGMOD.
Angiulli, F., & Pizzuti, C. (2005). Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2), 203–215.
Barbará, D., Domeniconi, C., & Rogers, J. P. (2006). Detecting outliers using transduction and statistical testing. In Proc. SIGKDD conf. (pp. 55–64).
Barnett, V., & Lewis, T. (1994). Outliers in statistical data. New York: Wiley.
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? In Proc. international conf. on database theory (pp. 217–235).
Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proc. SIGMOD Conf. (pp. 93–104).
Goh, K., Chang, E., & Cheng, K. (2001). SVM binary classifier ensembles for image classification. In Proc. International conf. on information and knowledge management (pp. 395–402).
Hawkins, D. M. (1980). Identification of outliers. London, UK: Chapman and Hall.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Joachims, T. (1998). Text categorization with support vector machines. In Proc. European conf. machine learning (ECML) (pp. 137–142).
Johnson, T., Kwok, I., & Ng, R. T. (1998). Fast computation of 2-dimensional depth contours. In Proc. KDD (pp. 224–228).
Knorr, E. M., & Ng, R. T. (1997). A unified notion of outliers: Properties and computation. In Proc. KDD (pp. 219–222).
Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In Proc. VLDB (pp. 392–403).
Knorr, E. M., & Ng, R. T. (1999). Finding intentional knowledge of distance-based outliers. In Proc. VLDB (pp. 211–222).
Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3–4), 237–253.
Markowetz, F. (2003). Support vector machines in bioinformatics. Ph.D. Thesis, University of Heidelberg.
Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: Fast outlier detection using the local correlation integral. In Proc. ICDE (pp. 315–326).
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proc. ACM SIGMOD international conference on management of data (pp. 427–438).
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.
Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.
Yamanishi, K., & Takeuchi, J. (2001). Discovering outlier filtering rules from unlabeled data. In Proc. KDD (pp. 389–394).
Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proc. KDD (pp. 250–254).
Yu, H., Han, J., & Chang, K. (2002). PEBL: Positive example based learning for web page classification using SVM. In Proc. KDD (pp. 239–248).
Zhu, C., Kitagawa, H., & Faloutsos, C. (2005). Example-based robust outlier detection in high dimensional datasets. In Proc. ICDM (pp. 829–832).
Zhu, C., Kitagawa, H., Papadimitriou, S., & Faloutsos, C. (2004). OBE: Outlier by example. In Proc. PAKDD (pp. 222–234).
Acknowledgements
This research was supported in part by the Japan-U.S. Cooperative Science Program of JSPS, U.S.-Japan Joint Seminar (NSF grant 0318547), the Grant-in-Aid for Scientific Research from JSPS (#15300027), and the Beijing outstanding talents training and subsidization program.
Cite this article
Zhu, C., Kitagawa, H., Papadimitriou, S. et al. Outlier detection by example. J Intell Inf Syst 36, 217–247 (2011). https://doi.org/10.1007/s10844-010-0128-1