Abstract
We present a novel definition of outlier that embeds available domain knowledge in the process of discovering outliers. Specifically, given background knowledge, encoded as a set of first-order rules, and a set of positive and negative examples, our approach aims at singling out the examples that show abnormal behavior. The proposed technique is unsupervised, since no examples of normal or abnormal behavior are provided, although it has connections with supervised learning, since it is based on induction from examples. We provide a notion of compliance of a set of facts with respect to the background knowledge and a set of examples, which is exploited to detect the examples that prevent the induced hypothesis from generalizing better. By testing compliance with respect to both the direct and the dual concept, we are able to distinguish among three kinds of abnormalities, namely irregular, anomalous, and outlier observations. This allows us to provide a finer characterization of the anomaly at hand and to single out subtle forms of anomalies. Moreover, we are able to provide explanations for the abnormality of an observation, which make the reasons for its exceptionality intelligible. We present both exact and approximate algorithms for mining abnormalities. The approximate algorithms improve execution time while guaranteeing good accuracy. Finally, we discuss peculiarities of the novel approach, present examples of the knowledge mined, analyze the scalability of the algorithms, and compare the approach with noise handling mechanisms and some alternative approaches.
Notes
Data are available at http://www.comlab.ox.ac.uk/activities/machinelearning/mutagenesis.html.
We employed a computer based on an Intel Xeon E5620 2.40 GHz processor, with 4 GB of main memory and the Linux operating system.
References
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: Proceedings of the international conference on management of data (SIGMOD), pp 37–46
Angiulli F, Fassetti F (2009a) Dolphin: an efficient algorithm for mining distance-based outliers in very large datasets. ACM Trans Knowl Discov Data (TKDD) 3(1):Article 4
Angiulli F, Fassetti F (2009b) Outlier detection using inductive logic programming. In: ICDM, pp 693–698
Angiulli F, Pizzuti C (2002) Fast outlier detection in large high-dimensional data sets. In: Proceedings of the international conference on principles of data mining and knowledge discovery (PKDD), pp 15–26
Angiulli F, Pizzuti C (2005) Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng pp 203–215
Angiulli F, Basta S, Pizzuti C (2006) Distance-based detection and prediction of outliers. IEEE Trans Knowl Data Eng 18(2):145–160
Angiulli F, Greco G, Palopoli L (2007) Outlier detection by logic programming. ACM Trans Comput Log 9(1):Article 7
Angiulli F, Ben-Eliyahu-Zohary R, Palopoli L (2008) Outlier detection using default reasoning. Artif Intell 172(16–17):1837–1872
Bain M, Srinivasan A (1995) Inductive logic programming with large-scale unstructured data. In: Furukawa K, Michie D, Muggleton S (eds) Machine intelligence 14. Clarendon Press, Oxford
Breunig MM, Kriegel H, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: Proceedings of the international conference on management of data (SIGMOD), pp 93–104
Bruno G, Garza P, Quintarelli E, Rosato R (2007) Anomaly detection through quasi-functional dependency analysis. J Digit Inf Manag 5(4):190–200
Chandola V, Banerjee A, Kumar V (2009) Anomaly detection: a survey. ACM Comput Surv 41(3):1–58
Chawla N, Japkowicz N, Kotcz A (2004) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor 6(1):1–6
Debnath A, de Compadre RL, Debnath G, Shusterman A, Hansch C (1991) The structure–activity relationship of mutagenic aromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. J Med Chem 34:786–797
Fassetti F, Fazzinga B (2007) Approximate functional dependencies for XML data. In: ADBIS research communications. Springer, Heidelberg, pp 86–95
He Z, Xu X, Huang J, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118
Hodge V, Austin J (2004) A survey of outlier detection methodologies. Artif Intell Rev 22(2):85–126
Kirsten M, Wrobel S, Horváth T (2001) Distance based approaches to relational learning and clustering. In: Džeroski S, Lavrač N (eds) Relational data mining. Springer, Berlin, pp 213–232
Kivinen J, Mannila H (1995) Approximate inference of functional dependencies from relations. Theor Comput Sci 149:129–149
Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the international conference on very large data bases (VLDB), pp 392–403
Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: KDD, pp 444–452
Lavrač N, Džeroski S (1994) Inductive logic programming: techniques and applications. Ellis Horwood, Chichester
Lavrač N, Džeroski S, Bratko I (1996) Handling imperfect data in inductive logic programming. In: De Raedt L (ed) Advances in inductive logic programming. IOS Press, Amsterdam, pp 48–64
Liu FT, Ting KM, Zhou ZH (2012) Isolation-based anomaly detection. ACM Trans Knowl Discov Data (TKDD) 6(1):Article 3
Lloyd JW (1987) Foundations of logic programming. Springer, Berlin
Mannila H, Räihä K (1987) Dependency inference. In: VLDB, pp 155–158
Muggleton S (1995) Inverse entailment and Progol. New Gen Comput 13(3–4):245–286
Muggleton S, Feng C (1990) Efficient induction of logic programs. In: First conference on algorithmic learning theory, pp 368–381
Muggleton S, Bain M, Hayes-Michie J, Michie D (1989) An experimental comparison of human and machine learning formalisms. In: Sixth international workshop on machine learning
Novelli N, Cicchetti R (2001) Functional and embedded dependency inference: a data mining point of view. Inf Syst 26(7):477–506
Papadimitriou S, Kitagawa H, Gibbons PB, Faloutsos C (2003) LOCI: fast outlier detection using the local correlation integral. In: Proceedings of the international conference on data engineering (ICDE), pp 315–326
Plotkin G (1971) A further note on inductive generalization. In: Machine learning, vol 6, chap 8. American Elsevier, New York, pp 101–124
Quinlan J, Cameron-Jones R (1993) FOIL: a midterm report. In: 6th European conference on machine learning, pp 3–20
Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large data sets. In: Proceedings of the international conference on management of data (SIGMOD), pp 427–438
Schölkopf B, Burges C, Vapnik V (1995) Extracting support data for a given task. In: KDD, pp 252–257
Srinivasan A, Muggleton S, Sternberg M, King R (1996) Theories for mutagenicity: a study in first-order and feature-based induction. Artif Intell 85(1–2):277–299
Additional information
Responsible editor: Eamonn Keogh.
A preliminary version of this article appears under the title “Outlier Detection using Inductive Logic Programming” in the Proceedings of the IEEE International Conference on Data Mining (ICDM), Miami, Florida, December 6–9, 2009 (Angiulli and Fassetti 2009b).
About this article
Cite this article
Angiulli, F., Fassetti, F. Exploiting domain knowledge to detect outliers. Data Min Knowl Disc 28, 519–568 (2014). https://doi.org/10.1007/s10618-013-0310-5
DOI: https://doi.org/10.1007/s10618-013-0310-5