
Exploiting domain knowledge to detect outliers

Data Mining and Knowledge Discovery

Abstract

We present a novel definition of outlier that embeds available domain knowledge in the process of discovering outliers. Specifically, given background knowledge, encoded as a set of first-order rules, and a set of positive and negative examples, our approach singles out the examples that show abnormal behavior. The proposed technique is unsupervised, since no examples of normal or abnormal behavior are given, although it is connected to supervised learning because it is based on induction from examples. We provide a notion of compliance of a set of facts with respect to background knowledge and a set of examples, and exploit it to detect the examples that prevent the induced hypothesis from generalizing further. By testing compliance with respect to both the direct and the dual concept, we can distinguish three kinds of abnormalities, namely irregular, anomalous, and outlier observations. This yields a finer characterization of the anomaly at hand and makes it possible to single out subtle forms of anomalies. Moreover, we can provide explanations for the abnormality of an observation that make the reasons for its exceptionality intelligible. We present both exact and approximate algorithms for mining abnormalities. The approximate algorithms improve execution time while guaranteeing good accuracy. Finally, we discuss peculiarities of the novel approach, present examples of the knowledge mined, analyze the scalability of the algorithms, and compare the approach with noise-handling mechanisms and some alternative approaches.
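The idea of testing each example against both the direct and the dual concept can be illustrated with a small propositional toy. The sketch below is not the paper's definition of compliance: it replaces first-order induction with a simple conjunctive rule (the attribute values shared by all target examples) and asks whether dropping one example turns an inconsistent rule into a consistent one. The function names (`lgg`, `covers`, `consistent`, `flags`) and the animal data are assumptions made for illustration, and the two flags are deliberately not mapped onto the irregular, anomalous, and outlier categories, whose formal definitions are given in the full article.

```python
from typing import Dict, List, Tuple

Example = Dict[str, str]


def lgg(examples: List[Example]) -> Example:
    """Most specific conjunction: attribute/value pairs shared by all examples."""
    if not examples:
        return {}
    shared = dict(examples[0])
    for ex in examples[1:]:
        shared = {a: v for a, v in shared.items() if ex.get(a) == v}
    return shared


def covers(rule: Example, ex: Example) -> bool:
    """A conjunctive rule covers an example if every condition holds."""
    return all(ex.get(a) == v for a, v in rule.items())


def consistent(target: List[Example], other: List[Example]) -> bool:
    """The rule induced from `target` must cover no example in `other`."""
    rule = lgg(target)
    return not any(covers(rule, ex) for ex in other)


def flags(i: int, pos: List[Example], neg: List[Example]) -> Tuple[bool, bool]:
    """Leave-one-out check of pos[i] against the direct and the dual concept."""
    rest = pos[:i] + pos[i + 1:]
    # Direct concept: positives are the target, negatives are counterexamples.
    direct = not consistent(pos, neg) and consistent(rest, neg)
    # Dual concept: roles are swapped, so pos[i] acts as a counterexample.
    dual = not consistent(neg, pos) and consistent(neg, rest)
    return direct, dual


# Hypothetical data: "bird" is the direct concept; the bat is mislabeled.
pos = [
    {"feathers": "yes", "flies": "yes", "eggs": "yes"},  # sparrow
    {"feathers": "yes", "flies": "yes", "eggs": "yes"},  # eagle
    {"feathers": "no", "flies": "yes", "eggs": "no"},    # bat
]
neg = [
    {"feathers": "no", "flies": "no", "eggs": "no"},     # dog
    {"feathers": "no", "flies": "yes", "eggs": "yes"},   # dragonfly
]

print(flags(2, pos, neg))  # bat: (True, True)
print(flags(0, pos, neg))  # sparrow: (False, False)
```

In this toy, the bat is the only example whose removal repairs the induced rule for both the concept and its complement, which mirrors, very loosely, the abstract's point that an abnormal example is one that blocks a better generalization.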

Notes

  1. www.comlab.ox.ac.uk/oucl/research/areas/machlearn/PProgol/pprogol.pl.

  2. http://archive.ics.uci.edu/ml/datasets/Zoo.

  3. http://archive.ics.uci.edu/ml.

  4. Data are available at http://www.comlab.ox.ac.uk/activities/machinelearning/mutagenesis.html.

  5. We employed Intel Xeon E5620 2.40GHz based computer with 4 GB of main memory and the Linux operating system.

References

  • Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 37–46

  • Angiulli F, Fassetti F (2009a) Dolphin: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans Knowl Discov Data (TKDD) 3(1):Article 4

  • Angiulli F, Fassetti F (2009b) Outlier detection using inductive logic programming. In: ICDM, pp 693–698

  • Angiulli F, Pizzuti C (2002) Fast outlier detection in large high-dimensional data sets. In: Proceedings of the international conference on principles of data mining and knowledge discovery (PKDD), pp 15–26

  • Angiulli F, Pizzuti C (2005) Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng pp 203–215

  • Angiulli F, Basta S, Pizzuti C (2006) Distance-based detection and prediction of outliers. IEEE Trans Knowl Data Eng 18(2):145–160

  • Angiulli F, Greco G, Palopoli L (2007) Outlier detection by logic programming. ACM Trans Comput Log 9(1):Article 7

  • Angiulli F, Ben-Eliyahu-Zohary R, Palopoli L (2008) Outlier detection using default reasoning. Artif Intell 172(16–17):1837–1872

  • Bain M, Srinivasan A (1995) Inductive logic programming with large-scale unstructured data. In: Furukawa K, Michie D, Muggleton S (eds) Machine intelligence 14. Clarendon Press, Oxford

  • Breunig MM, Kriegel H, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the international conference on management of data (SIGMOD), pp 93–104

  • Bruno G, Garza P, Quintarelli E, Rosato R (2007) Anomaly detection through quasi-functional dependency analysis. J Digit Inf Manag 5(4):190–200

  • Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):1–58

  • Chawla N, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6

  • Debnath A, de Compadre RL, Debnath G, Shusterman A, Hansch C (1991) The structure–activity relationship of mutagenic aromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem 34:786–797

  • Fassetti F, Fazzinga B (2007) Approximate functional dependencies for xml data. In: ADBIS research communications. Springer, Heidelberg, pp 86–95

  • He Z, Xu X, Huang J, Deng S (2005) Fp-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118

  • Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126

  • Kirsten M, Wrobel S, Horváth T (2001) Distance based approaches to relational learning and clustering. In: Džeroski S, Lavrač N (eds) Relational data mining. Springer, Berlin, pp 213–232

  • Kivinen J, Mannila H (1995) Approximate inference of functional dependencies from relations. TCS 149:129–149

  • Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the international conference on very large data bases (VLDB), pp 392–403

  • Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: KDD, pp 444–452

  • Lavrač N, Džeroski S (1994) Inductive logic programming: techniques and applications. Ellis Horwood, Chichester

  • Lavrač N, Džeroski S, Bratko I (1996) Handling imperfect data in inductive logic programming. In: Raedt LD (ed) Advances in inductive logic programming. IOS Press, Amsterdam, pp 48–64

  • Liu FT, Ting KM, Zhou ZH (2012) Isolation-based anomaly detection. TKDD 6(1):3

  • Lloyd JW (1987) Foundations of logic programming. Springer, Berlin

  • Mannila H, Räihä K (1987) Dependency inference. In: VLDB, pp 155–158

  • Muggleton S (1995) Inverse entailment and Progol. New Gen Comput 13(3–4):245–286

  • Muggleton S, Feng C (1990) Efficient induction of logic programs. In: First conference on algorithmic learning theory, pp 368–381

  • Muggleton S, Bain M, Hayes-Michie J, Michie D (1989) An experimental comparison of human and machine learning formalisms. In: Sixth international workshop on machine learning

  • Novelli N, Cicchetti R (2001) Functional and embedded dependency inference: a data mining point of view. IS 26(7):477–506

  • Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) LOCI: fast outlier detection using the local correlation integral. In: Proceedings of the international conference on data engineering (ICDE), pp 315–326

  • Plotkin G (1971) A further note on inductive generalization. In: Machine learning, vol 6, chap 8. American Elsevier, New York, pp 101–124

  • Quinlan J, Cameron-Jones R (1993) Foil: a midterm report. In: 6th European conference on machine learning, pp 3–20

  • Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the international conference on management of data (SIGMOD), pp 427–438

  • Schölkopf B, Burges C, Vapnik V (1995) Extracting support data for a given task. In: KDD, pp 252–257

  • Srinivasan A, Muggleton S, Sternberg M, King R (1996) Theories for mutagenicity: a study in first-order and feature-based induction. Artif Intell 85(1–2):277–299

Author information

Corresponding author

Correspondence to Fabrizio Angiulli.

Additional information

Responsible editor: Eamonn Keogh.

A preliminary version of this article appears under the title “Outlier Detection using Inductive Logic Programming” in the Proceedings of the IEEE International Conference on Data Mining (ICDM), Miami, Florida, December 6–9, 2009 (Angiulli and Fassetti 2009b).

About this article

Cite this article

Angiulli, F., Fassetti, F. Exploiting domain knowledge to detect outliers. Data Min Knowl Disc 28, 519–568 (2014). https://doi.org/10.1007/s10618-013-0310-5
