
Outlier detection by example

Published in the Journal of Intelligent Information Systems.

Abstract

Outlier detection is a useful technique in areas such as fraud detection, financial analysis and health monitoring. Many recent approaches detect outliers according to reasonable, pre-defined concepts of an outlier (e.g., distance-based, density-based, etc.). However, the notion of an outlier differs between users, or even between datasets. This paper addresses the problem by incorporating input from users. Our OBE (Outlier By Example) system is the first that allows users to provide examples of outliers in low-dimensional datasets. From a small number of such examples, OBE can successfully learn the intended degree of outlierness and identify further outliers that exhibit it. Several algorithmic challenges and engineering decisions must be addressed in building such a system; we describe the key design decisions and algorithms in this paper. To interact with users having different degrees of domain knowledge, we develop two detection schemes: OBE-Fraction and OBE-RF. Our experiments on both real and synthetic datasets demonstrate that OBE can discover the values a user would consider outliers.




Notes

  1. The Euclidean distance measure is used to compute MDEF values in OBE.

  2. The value of α should be between 0 and 1. In all experiments, we set α = 0.5 as in Papadimitriou et al. (2003).

  3. \(\mathrm{MDEF}(p_{i},\, r,\, \alpha) = 0\) until \(n(p_{i},\, r) = n_{b}\).

  4. More precisely, if \(r_{j} \geq r_{\min,i}\), then \(m_{ij} = \mathrm{MDEF}(p_{i},\, r_{j},\, \alpha)\); otherwise \(m_{ij} = 0\).

  5. We ran experiments varying n from 50 to 400; the results showed that OBE is insensitive to n. In all our OBE experiments, n = 100.

  6. S is the step size of the forward search. For simplicity, we set S = 1 in our experiments.

  7. In fact, in all of our experiments, we never observed the outlier examples or outstanding outliers being misclassified as negatives.

  8. For the polynomial kernel, we use the kernel function \((u^{\top} v + 1)^{2}\).

  9. The candidate values span a large range: 2, 5, 10, 20, 30, 40, 50, 60, 80, 100, 150, 200, 250, 300, 400, 500, 700, and 1,000.

  10. This held in all our experiments.
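Notes 1–4 describe how MDEF values are computed. As a rough illustration only (not the authors' code), the multi-granularity deviation factor of Papadimitriou et al. (2003) can be sketched as below; the function name and the brute-force pairwise-distance computation are our own choices, made for clarity rather than efficiency:

```python
import numpy as np

def mdef(points, i, r, alpha=0.5):
    """Multi-granularity deviation factor of point i at radius r.

    MDEF(p_i, r, alpha) = 1 - n(p_i, alpha*r) / n_hat(p_i, r, alpha),
    where n(p, x) counts points within Euclidean distance x of p
    (including p itself), and n_hat averages n(q, alpha*r) over all
    points q in the r-neighborhood of p_i.
    """
    dists = np.linalg.norm(points - points[i], axis=1)
    r_neighbors = np.where(dists <= r)[0]  # r-neighborhood of p_i
    # n(q, alpha*r) for each q in the r-neighborhood
    counts = np.array([
        np.sum(np.linalg.norm(points - points[j], axis=1) <= alpha * r)
        for j in r_neighbors
    ])
    n_hat = counts.mean()               # average local neighborhood size
    n_pi = np.sum(dists <= alpha * r)   # n(p_i, alpha*r)
    return 1.0 - n_pi / n_hat
```

A point in a dense cluster has MDEF near 0 (its counting neighborhood looks like its sampling neighbors'), while an isolated point far from a cluster gets MDEF close to 1 once r is large enough for the cluster to enter its r-neighborhood, consistent with note 3.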

References

  • Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection for high dimensional data. In Proc. SIGMOD.

  • Angiulli, F., & Pizzuti, C. (2005). Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2), 203–215.


  • Barbará, D., Domeniconi, C., & Rogers, J. P. (2006). Detecting outliers using transduction and statistical testing. In Proc. SIGKDD conf. (pp. 55–64).

  • Barnett, V., & Lewis, T. (1994). Outliers in statistical data. New York: Wiley.


  • Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is "nearest neighbor" meaningful? In Proc. international conf. on database theory (pp. 217–235).

  • Breunig, M. M., Kriegel, H. P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proc. SIGMOD Conf. (pp. 93–104).

  • Goh, K., Chang, E., & Cheng, K. (2001). SVM binary classifier ensembles for image classification. In Proc. International conf. on information and knowledge management (pp. 395–402).

  • Hawkins, D. M. (1980). Identification of outliers. London, UK: Chapman and Hall.


  • Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.


  • Joachims, T. (1998). Text categorization with support vector machines. In Proc. European conf. machine learning (ECML) (pp. 137–142).

  • Johnson, T., Kwok, I., & Ng, R. T. (1998). Fast computation of 2-dimensional depth contours. In Proc. KDD (pp. 224–228).

  • Knorr, E. M., & Ng, R. T. (1997). A unified notion of outliers: Properties and computation. In Proc. KDD (pp. 219–222).

  • Knorr, E. M., & Ng, R. T. (1998). Algorithms for mining distance-based outliers in large datasets. In Proc. VLDB (pp. 392–403).

  • Knorr, E. M., & Ng, R. T. (1999). Finding intentional knowledge of distance-based outliers. In Proc. VLDB (pp. 211–222).

  • Knorr, E. M., Ng, R. T., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal, 8(3–4), 237–253.


  • Markowetz, F. (2003). Support vector machines in bioinformatics. Ph.D. Thesis, University of Heidelberg.

  • Papadimitriou, S., Kitagawa, H., Gibbons, P. B., & Faloutsos, C. (2003). LOCI: Fast outlier detection using the local correlation integral. In Proc. ICDE (pp. 315–326).

  • Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large data sets. In Proc. ACM SIGMOD international conference on management of data (pp. 427–438).

  • Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.


  • Tax, D. M. J., & Duin, R. P. W. (1999). Support vector domain description. Pattern Recognition Letters, 20(11–13), 1191–1199.


  • Yamanishi, K., & Takeuchi, J. (2001). Discovering outlier filtering rules from unlabeled data. In Proc. KDD (pp. 389–394).

  • Yamanishi, K., Takeuchi, J., Williams, G., & Milne, P. (2000). On-line unsupervised outlier detection using finite mixtures with discounting learning algorithms. In Proc. KDD (pp. 250–254).

  • Yu, H., Han, J., & Chang, K. (2002). PEBL: Positive example based learning for web page classification using SVM. In Proc. KDD (pp. 239–248).

  • Zhu, C., Kitagawa, H., & Faloutsos, C. (2005). Example-based robust outlier detection in high dimensional datasets. In Proc. ICDM (pp. 829–832).

  • Zhu, C., Kitagawa, H., Papadimitriou, S., & Faloutsos, C. (2004). OBE: Outlier by example. In Proc. PAKDD (pp. 222–234).


Acknowledgements

This research was supported in part by the Japan-U.S. Cooperative Science Program of JSPS, the U.S.-Japan Joint Seminar (NSF grant 0318547), a Grant-in-Aid for Scientific Research from JSPS (#15300027), and the Beijing Outstanding Talents Training and Subsidization program.

Author information

Correspondence to Cui Zhu.


Cite this article

Zhu, C., Kitagawa, H., Papadimitriou, S. et al. Outlier detection by example. J Intell Inf Syst 36, 217–247 (2011). https://doi.org/10.1007/s10844-010-0128-1

