Abstract
When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a query object \(o\) in a multidimensional numeric data set \(O\), in which subspace is \(o\) most outlying? Technically, we use the rank of the probability density of an object in a subspace to measure the outlyingness of the object in the subspace. A minimal subspace where the query object is ranked the best is an outlying aspect. Computing the outlying aspects of a query object is far from trivial. A naïve method has to calculate the probability densities of all objects and rank them in every subspace, which is very costly when the dimensionality is high. We systematically develop a heuristic method that is capable of searching data sets with tens of dimensions efficiently. Our empirical study using both real data and synthetic data demonstrates that our method is effective and efficient.
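To make the problem statement concrete, here is a minimal Python sketch (ours, not the authors' implementation) of the naïve method described above: it computes an unnormalized product-Gaussian kernel density for every object in every subspace of up to `max_dim` dimensions, ranks the query object, and keeps the minimal subspaces achieving the best rank. The names `quasi_density` and `outlying_aspects` are illustrative; the Silverman-style bandwidth matches the rule used in Appendix 1.

```python
from itertools import combinations

import numpy as np

def quasi_density(X, q, dims, h):
    """Unnormalized product-Gaussian kernel density of point q in subspace dims."""
    d2 = sum(((X[:, i] - q[i]) / h[i]) ** 2 for i in dims)
    return np.exp(-d2 / 2).sum()

def outlying_aspects(X, q_idx, max_dim):
    """Best (smallest) density rank of object q_idx over all subspaces with at
    most max_dim dimensions, plus the minimal subspaces achieving that rank."""
    n, d = X.shape
    iqr = np.subtract(*np.percentile(X, [75, 25], axis=0))
    h = 1.06 * np.minimum(X.std(axis=0), iqr / 1.34) * n ** (-0.2)  # Silverman
    best_rank, best = n + 1, []
    for k in range(1, max_dim + 1):
        for dims in combinations(range(d), k):
            dens = [quasi_density(X, X[i], dims, h) for i in range(n)]
            # Rank 1 = lowest density = most outlying.
            rank = 1 + sum(f < dens[q_idx] for f in dens)
            if rank < best_rank:
                best_rank, best = rank, [list(dims)]
            elif rank == best_rank and not any(set(s) <= set(dims) for s in best):
                best.append(list(dims))  # keep only minimal subspaces
    return best_rank, best
```

The nested loops visit all \(\sum _{k \le \ell }\binom{d}{k}\) subspaces and compute \(n\) densities in each, which is exactly the cost the heuristic search in the paper is designed to avoid.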







Acknowledgments
The authors thank the editor and the anonymous reviewers for their invaluable comments, which helped improve this paper. Lei Duan's research is supported in part by the Natural Science Foundation of China (Grant No. 61103042) and the China Postdoctoral Science Foundation (Grant No. 2014M552371). Work by Lei Duan at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. Jian Pei's and Guanting Tang's research is supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey's work is supported by an ARC Future Fellowship (FT110100112). All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
Additional information
Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.
Appendices
Appendix 1: Proof of Proposition 1
Proof
For any dimension \(D_i \in S\,(1 \le i \le d)\), the mean value of \(\{o.D_i \mid o \in O\}\), denoted by \(\mu _i\), is \(\frac{1}{|O|}\sum \limits _{o \in O}o.D_i\); the standard deviation of \(\{o.D_i \mid o \in O\}\), denoted by \(\sigma _i\), is \(\sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(o.D_i - \mu _i)^2}\); and the bandwidth of \(D_i\), denoted by \(h_i\), is \(1.06\min \{\sigma _i, \frac{R}{1.34}\}|O|^{-\frac{1}{5}}\), where \(R\) is the difference between the third and the first quartiles of \(O\) in \(D_i\).
We perform the linear transformation \(g(o).D_i = a_io.D_i + b_i\,(a_i > 0)\) for any \(o \in O\). Then, the mean value of \(\{g(o).D_i \mid o \in O\}\) is \(\frac{1}{|O|}\sum \limits _{o \in O}(a_i o.D_i + b_i) = a_i \mu _i + b_i\), and the standard deviation of \(\{g(o).D_i \mid o \in O\}\) is \(\sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(a_i o.D_i + b_i - a_i \mu _i - b_i)^2} = a_i \sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(o.D_i - \mu _i)^2} = a_i \sigma _i\).
Correspondingly, after the linear transformation, the bandwidth of \(D_i\) is \(1.06\min \{a_i\sigma _i, \frac{a_iR}{1.34}\}|O|^{-\frac{1}{5}} = a_i h_i\). As the distance between two objects in \(D_i\) is also scaled by \(a_i\), the quasi-density calculated by Eq. 7 remains unchanged. Thus, the ranking is invariant under linear transformation. \(\square \)
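As a quick numerical sanity check of Proposition 1 (a sketch under the same assumptions as above, not part of the original paper), the following snippet verifies that the quasi-density ranks are unchanged under a positive per-dimension linear transformation; each bandwidth \(h_i\) scales by the same factor \(a_i\) as the distances, so the kernel sums coincide. The helper `density_ranks` is our own name.

```python
import numpy as np

def density_ranks(X):
    """Rank positions of all objects by quasi-density (Eq. 7 style kernel sum)."""
    n = X.shape[0]
    iqr = np.subtract(*np.percentile(X, [75, 25], axis=0))
    h = 1.06 * np.minimum(X.std(axis=0), iqr / 1.34) * n ** (-0.2)
    # Pairwise squared bandwidth-scaled distances, then kernel sums per object.
    d2 = (((X[:, None, :] - X[None, :, :]) / h) ** 2).sum(axis=2)
    dens = np.exp(-d2 / 2).sum(axis=1)
    return dens.argsort().argsort()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
a, b = np.array([2.0, 0.5, 3.0]), np.array([1.0, -4.0, 0.0])  # a_i > 0
assert (density_ranks(X) == density_ranks(a * X + b)).all()   # ranks invariant
```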
Appendix 2: Proof of Theorem 1
Proof
(i) Given an object \(o' \in TN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\min \limits _{o'' \in O}\{|o.D_i - o''.D_i|\} \le |o.D_i - o'.D_i| \le \epsilon _{D_i}\). Thus,
$$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}}. \end{aligned}$$
That is, \(dc_S^\epsilon \le dc_S(o, o') \le dc^{max}_S(o)\).
(ii) Given an object \(o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\min \limits _{o'' \in O}\{|o.D_i - o''.D_i|\} \le |o.D_i - o'.D_i| \le \max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}\). Thus,
$$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}}. \end{aligned}$$
That is, \(dc^{min}_S(o) \le dc_S(o, o') \le dc^{max}_S(o)\).
(iii) Given an object \(o' \in O \setminus LN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\epsilon _{D_i} < |o.D_i - o'.D_i| \le \max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}\). Thus,
$$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} < e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}}. \end{aligned}$$
That is, \(dc^{min}_S(o) \le dc_S(o, o') < dc_S^{\epsilon }\).
\(\square \)
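These bounds are easy to check numerically. The sketch below is ours and uses illustrative names; it assumes \(dc_S(o, o') = e^{-\sum _{D_i \in S} |o.D_i - o'.D_i|^2 / 2h_{D_i}^2}\) as in the displays above, with \(TN_S^{\epsilon ,o}\) taken as the objects within \(\epsilon \) of \(o\) in every dimension of \(S\) and \(LN_S^{\epsilon ,o}\) as those within \(\epsilon \) in at least one dimension, which is consistent with the three cases of the proof.

```python
import numpy as np

def check_theorem1(X, o, h, eps):
    """Check cases (i)-(iii) of Theorem 1 for query point o against data set X."""
    diff = np.abs(X - o)                             # |o.D_i - o'.D_i| per dimension
    dc = np.exp(-((diff / h) ** 2).sum(axis=1) / 2)  # dc_S(o, o') for every o' in O
    dc_eps = np.exp(-((eps / h) ** 2).sum() / 2)                # dc_S^eps
    dc_max = np.exp(-((diff.min(axis=0) / h) ** 2).sum() / 2)   # dc_S^max(o)
    dc_min = np.exp(-((diff.max(axis=0) / h) ** 2).sum() / 2)   # dc_S^min(o)
    tn = (diff <= eps).all(axis=1)   # TN: within eps in every dimension of S
    ln = (diff <= eps).any(axis=1)   # LN: within eps in at least one dimension
    assert (dc_eps <= dc[tn]).all() and (dc[tn] <= dc_max).all()              # (i)
    assert (dc_min <= dc[ln & ~tn]).all() and (dc[ln & ~tn] <= dc_max).all()  # (ii)
    assert (dc_min <= dc[~ln]).all() and (dc[~ln] < dc_eps).all()             # (iii)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
h = 1.06 * X.std(axis=0) * len(X) ** (-0.2)  # simplified Silverman bandwidths
check_theorem1(X, X[0], h, eps=np.full(4, 0.5))
```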
Appendix 3: Proof of Corollary 1
Proof
We divide \(O\) into three disjoint subsets \(TN_S^{\epsilon ,o}\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). By Theorem 1, for objects belonging to \(TN_S^{\epsilon ,o}\), we have
$$\begin{aligned} |TN_S^{\epsilon ,o}| \, dc_S^\epsilon \le \sum \limits _{o' \in TN_S^{\epsilon ,o}} dc_S(o, o') \le |TN_S^{\epsilon ,o}| \, dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |O| - |LN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
As
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in TN_S^{\epsilon ,o}} dc_S(o, o') + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o'), \end{aligned}$$
we have
$$\begin{aligned} |TN_S^{\epsilon ,o}| \, dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le |LN_S^{\epsilon ,o}| \, dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then the upper bound is strict:
$$\begin{aligned} \tilde{f}_S(o) < |LN_S^{\epsilon ,o}| \, dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
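Continuing the sketch after Theorem 1, the combined bound above can be checked in a few lines. The names are again illustrative: `dc`, `tn`, `ln`, `dc_eps`, `dc_min` and `dc_max` are the arrays computed inside `check_theorem1`, and the bracket is the one reconstructed in this proof, so treat it as a sketch of the idea rather than the paper's exact bound.

```python
def corollary1_bounds(dc, tn, ln, dc_eps, dc_min, dc_max):
    """Bracket the quasi-density f = dc.sum() using only neighborhood counts."""
    n, n_tn, n_ln = len(dc), int(tn.sum()), int(ln.sum())
    lower = n_tn * dc_eps + (n - n_tn) * dc_min
    upper = n_ln * dc_max + (n - n_ln) * dc_eps
    assert lower <= dc.sum() <= upper
    return lower, upper
```

Bounds of this form are what allow a rank comparison with the query object to be decided, in favorable cases, without computing an object's exact quasi-density.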
Appendix 4: Proof of Corollary 2
Proof
Since \(O' \subseteq TN_S^{\epsilon ,o}\), we divide the objects in \(O \setminus O'\) into \(TN_S^{\epsilon ,o} \setminus O'\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). Then
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in O'} dc_S(o, o') + \sum \limits _{o' \in TN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o'). \end{aligned}$$
By Theorem 1, for objects belonging to \(TN_S^{\epsilon ,o} \setminus O'\), we have
$$\begin{aligned} \left( |TN_S^{\epsilon ,o}| - |O'|\right) dc_S^\epsilon \le \sum \limits _{o' \in TN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') \le \left( |TN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |O| - |LN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Thus,
$$\begin{aligned} \sum \limits _{o' \in O'} dc_S(o, o') + \left( |TN_S^{\epsilon ,o}| - |O'|\right) dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then the upper bound is strict:
$$\begin{aligned} \tilde{f}_S(o) < \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
Appendix 5: Proof of Corollary 3
Proof
Since \(TN_S^{\epsilon ,o} \subset O' \subseteq LN_S^{\epsilon ,o}\), we divide the objects in \(O \setminus O'\) into \(LN_S^{\epsilon ,o} \setminus O'\) and \(O \setminus LN_S^{\epsilon ,o}\). Then
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in O'} dc_S(o, o') + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o'). \end{aligned}$$
By Theorem 1, for objects belonging to \(LN_S^{\epsilon ,o} \setminus O'\), we have
$$\begin{aligned} \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{min}_S(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') \le \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |O| - |LN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Thus,
$$\begin{aligned} \sum \limits _{o' \in O'} dc_S(o, o') + \left( |O| - |O'|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then the upper bound is strict:
$$\begin{aligned} \tilde{f}_S(o) < \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
Appendix 6: Proof of Corollary 4
Proof
Since \(LN_S^{\epsilon ,o} \subset O' \subseteq O\), we have \(O \setminus O' \subseteq O \setminus LN_S^{\epsilon ,o}\). Then
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in O'} dc_S(o, o') + \sum \limits _{o' \in O \setminus O'} dc_S(o, o'). \end{aligned}$$
By Theorem 1, for objects belonging to \(O \setminus O'\), we have
$$\begin{aligned} \left( |O| - |O'|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus O'} dc_S(o, o') \le \left( |O| - |O'|\right) dc_S^\epsilon . \end{aligned}$$
Thus,
$$\begin{aligned} \sum \limits _{o' \in O'} dc_S(o, o') + \left( |O| - |O'|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le \sum \limits _{o' \in O'} dc_S(o, o') + \left( |O| - |O'|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
Appendix 7: Proof of Theorem 2
Proof
We prove by contradiction.
Consider a set of objects \(O\), a subspace \(S\), and two neighborhood distances \(\epsilon _1\) and \(\epsilon _2\). Let \(q \in O\) be the query object. For an object \(o \in O\), denote by \(L_{\epsilon _1}\) the lower bound of \(\tilde{f}_S(o)\) estimated with \(\epsilon _1\), and by \(U_{\epsilon _2}\) the upper bound of \(\tilde{f}_S(o)\) estimated with \(\epsilon _2\).
Assume that \(\tilde{f}_S(q) < L_{\epsilon _1}\) and \(\tilde{f}_S(q) > U_{\epsilon _2}\).
As \(L_{\epsilon _1}\) is a lower bound of \(\tilde{f}_S(o)\) and \(U_{\epsilon _2}\) is an upper bound of \(\tilde{f}_S(o)\), we have \(L_{\epsilon _1} \le \tilde{f}_S(o) \le U_{\epsilon _2}\). Then, \(\tilde{f}_S(q) < L_{\epsilon _1} \le \tilde{f}_S(o)\) and \(\tilde{f}_S(o) \le U_{\epsilon _2} < \tilde{f}_S(q)\). Consequently, \(\tilde{f}_S(o) < \tilde{f}_S(q) < \tilde{f}_S(o)\), a contradiction.
Thus, \(rank^{\epsilon _1}_S(q) = |\{o \in O \mid \tilde{f}_S(o) < \tilde{f}_S(q)\}|+1 =rank^{\epsilon _2}_S(q)\). \(\square \)
Appendix 8: Proof of Theorem 3
Proof
We prove by contradiction.
Let \(Ans\) be the set of minimal outlying subspaces of \(q\) found by OAMiner, and \(r_{best}\) the best rank. Assume that some subspace \(S \notin Ans\) with \(S \subseteq D\) and \(0 < |S| \le \ell \) is a minimal outlying subspace of \(q\).
Recall that OAMiner searches subspaces by traversing the subspace enumeration tree in a depth-first manner. As \(S \notin Ans\), \(S\) is pruned by Pruning Rule 1 or Pruning Rule 2.
In the case that \(S\) is pruned by Pruning Rule 1, \(S\) is not minimal. A contradiction.
In the case that \(S\) is pruned by Pruning Rule 2, there exists a subspace \(S'\) such that \(S'\) is an ancestor of \(S\) in the subspace enumeration tree and \(|Comp_{S'}(q)| \ge r_{best}\). By the property of competitors, we have \(Comp_{S'}(q) \subseteq Comp_S(q)\). Correspondingly, \(rank_S(q) \ge |Comp_S(q)| \ge |Comp_{S'}(q)| \ge r_{best}\). A contradiction. \(\square \)
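For readers who want the shape of the search that Theorem 3 reasons about, here is a schematic depth-first traversal of the set-enumeration tree with the two pruning tests placed as the proof uses them. This is a sketch, not the authors' OAMiner code: `rank_in(S, q)` and `num_competitors(S, q)` are hypothetical callables standing in for the paper's density-rank and competitor computations.

```python
def oaminer_dfs(S, start, dims, max_len, q, rank_in, num_competitors, state):
    """Depth-first walk of the set-enumeration tree with both pruning rules."""
    for i in range(start, len(dims)):
        child = S + [dims[i]]
        # Pruning Rule 1 (minimality): a proper superset of a subspace already
        # in the answer set cannot be a minimal outlying subspace.
        if any(set(a) <= set(child) for a in state["ans"]):
            continue
        r = rank_in(child, q)  # outlyingness rank of q in subspace `child`
        if r < state["r_best"]:
            state["r_best"], state["ans"] = r, [child]
        elif r == state["r_best"]:
            state["ans"].append(child)
        # Pruning Rule 2: Comp_{S'}(q) is contained in Comp_S(q) for every
        # descendant S of S', so once |Comp_{S'}(q)| >= r_best no descendant
        # of `child` can achieve a rank better than r_best.
        if len(child) < max_len and num_competitors(child, q) < state["r_best"]:
            oaminer_dfs(child, i + 1, dims, max_len, q,
                        rank_in, num_competitors, state)

# Typical call (with user-supplied rank_in / num_competitors):
#   state = {"r_best": n + 1, "ans": []}
#   oaminer_dfs([], 0, list(range(d)), ell, q, rank_in, num_competitors, state)
```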
Cite this article
Duan, L., Tang, G., Pei, J. et al. Mining outlying aspects on numeric data. Data Min Knowl Disc 29, 1116–1151 (2015). https://doi.org/10.1007/s10618-014-0398-2