Abstract
When we are investigating an object in a data set, which itself may or may not be an outlier, can we identify unusual (i.e., outlying) aspects of the object? In this paper, we identify the novel problem of mining outlying aspects on numeric data. Given a query object \(o\) in a multidimensional numeric data set \(O\), in which subspace is \(o\) most outlying? Technically, we use the rank of the probability density of an object in a subspace to measure the outlyingness of the object in the subspace. A minimal subspace where the query object is ranked the best is an outlying aspect. Computing the outlying aspects of a query object is far from trivial. A naïve method has to calculate the probability densities of all objects and rank them in every subspace, which is very costly when the dimensionality is high. We systematically develop a heuristic method that is capable of searching data sets with tens of dimensions efficiently. Our empirical study using both real data and synthetic data demonstrates that our method is effective and efficient.
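To make the problem statement concrete, here is a minimal Python sketch (ours, not the authors' implementation) of the naïve method described above: it computes an unnormalized product-Gaussian kernel density for every object in every subspace of up to `max_dim` dimensions, ranks the query object, and keeps the minimal subspaces achieving the best rank. The names `quasi_density` and `outlying_aspects` are illustrative; the Silverman-style bandwidth matches the rule used in Appendix 1.

```python
from itertools import combinations

import numpy as np

def quasi_density(X, q, dims, h):
    """Unnormalized product-Gaussian kernel density of point q in subspace dims."""
    d2 = sum(((X[:, i] - q[i]) / h[i]) ** 2 for i in dims)
    return np.exp(-d2 / 2).sum()

def outlying_aspects(X, q_idx, max_dim):
    """Best (smallest) density rank of object q_idx over all subspaces with at
    most max_dim dimensions, plus the minimal subspaces achieving that rank."""
    n, d = X.shape
    iqr = np.subtract(*np.percentile(X, [75, 25], axis=0))
    h = 1.06 * np.minimum(X.std(axis=0), iqr / 1.34) * n ** (-0.2)  # Silverman
    best_rank, best = n + 1, []
    for k in range(1, max_dim + 1):
        for dims in combinations(range(d), k):
            dens = [quasi_density(X, X[i], dims, h) for i in range(n)]
            # Rank 1 = lowest density = most outlying.
            rank = 1 + sum(f < dens[q_idx] for f in dens)
            if rank < best_rank:
                best_rank, best = rank, [list(dims)]
            elif rank == best_rank and not any(set(s) <= set(dims) for s in best):
                best.append(list(dims))  # keep only minimal subspaces
    return best_rank, best
```

The nested loops visit all \(\sum _{k \le \ell }\binom{d}{k}\) subspaces and compute \(n\) densities in each, which is exactly the cost the heuristic search in the paper is designed to avoid.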







Acknowledgments
The authors thank the editor and the anonymous reviewers for their invaluable comments, which helped improve this paper. Lei Duan's research is supported in part by the Natural Science Foundation of China (Grant No. 61103042) and the China Postdoctoral Science Foundation (Grant No. 2014M552371). Work by Lei Duan at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. Jian Pei's and Guanting Tang's research is supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey's work is supported by an ARC Future Fellowship (FT110100112). All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
Additional information
Responsible editors: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.
Appendices
Appendix 1: Proof of Proposition 1
Proof
For any dimension \(D_i \in S\,(1 \le i \le d)\), the mean value of \(\{o.D_i \mid o \in O\}\), denoted by \(\mu _i\), is \(\frac{1}{|O|}\sum \limits _{o \in O}o.D_i\); the standard deviation of \(\{o.D_i \mid o \in O\}\), denoted by \(\sigma _i\), is \(\sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(o.D_i - \mu _i)^2}\); and the bandwidth of \(D_i\), denoted by \(h_i\), is \(1.06\min \{\sigma _i, \frac{R}{1.34}\}|O|^{-\frac{1}{5}}\), where \(R\) is the difference between the third and the first quartiles of \(O\) in \(D_i\).
We perform the linear transformation \(g(o).D_i = a_io.D_i + b_i\,(a_i > 0)\) for any \(o \in O\). Then, the mean value of \(\{g(o).D_i \mid o \in O\}\) is \(\frac{1}{|O|}\sum \limits _{o \in O}(a_i o.D_i + b_i) = a_i \mu _i + b_i\), and the standard deviation of \(\{g(o).D_i \mid o \in O\}\) is \(\sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(a_i o.D_i + b_i - a_i \mu _i - b_i)^2} = a_i \sqrt{\frac{1}{|O|}\sum \limits _{o \in O}(o.D_i - \mu _i)^2} = a_i \sigma _i\).
Correspondingly, after the linear transformation, the bandwidth of \(D_i\) is \(1.06\min \{a_i\sigma _i, \frac{a_iR}{1.34}\}|O|^{-\frac{1}{5}} = a_i h_i\). As the distance between two objects in \(D_i\) is also scaled by \(a_i\), the quasi-density calculated by Eq. 7 remains unchanged. Thus, the ranking is invariant under linear transformation. \(\square \)
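As a quick numerical sanity check of Proposition 1 (a sketch under the same assumptions as above, not part of the original paper), the following snippet verifies that the quasi-density ranks are unchanged under a positive per-dimension linear transformation; each bandwidth \(h_i\) scales by the same factor \(a_i\) as the distances, so the kernel sums coincide. The helper `density_ranks` is our own name.

```python
import numpy as np

def density_ranks(X):
    """Rank positions of all objects by quasi-density (Eq. 7 style kernel sum)."""
    n = X.shape[0]
    iqr = np.subtract(*np.percentile(X, [75, 25], axis=0))
    h = 1.06 * np.minimum(X.std(axis=0), iqr / 1.34) * n ** (-0.2)
    # Pairwise squared bandwidth-scaled distances, then kernel sums per object.
    d2 = (((X[:, None, :] - X[None, :, :]) / h) ** 2).sum(axis=2)
    dens = np.exp(-d2 / 2).sum(axis=1)
    return dens.argsort().argsort()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
a, b = np.array([2.0, 0.5, 3.0]), np.array([1.0, -4.0, 0.0])  # a_i > 0
assert (density_ranks(X) == density_ranks(a * X + b)).all()   # ranks invariant
```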
Appendix 2: Proof of Theorem 1
Proof
(i) Given an object \(o' \in TN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\min \limits _{o'' \in O}\{|o.D_i - o''.D_i|\} \le |o.D_i - o'.D_i| \le \epsilon _{D_i}\). Thus,
$$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}}. \end{aligned}$$
That is, \(dc_S^\epsilon \le dc_S(o, o') \le dc^{max}_S(o)\).
(ii) Given an object \(o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\min \limits _{o'' \in O}\{|o.D_i - o''.D_i|\} \le |o.D_i - o'.D_i| \le \max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}\). Thus,
$$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{\min \limits _{o'' \in O}\left\{ |o.D_i - o''.D_i|\right\} ^2}{2h_{D_i}^2}}. \end{aligned}$$
That is, \(dc^{min}_S(o) \le dc_S(o, o') \le dc^{max}_S(o)\).
(iii) Given an object \(o' \in O \setminus LN_S^{\epsilon ,o}\), for any dimension \(D_i \in S\), \(\epsilon _{D_i} < |o.D_i - o'.D_i| \le \max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}\). Thus,
$$\begin{aligned} e^{- \sum \limits _{D_i \in S} \frac{\max \limits _{o'' \in O}\{|o.D_i - o''.D_i|\}^2}{2h_{D_i}^2}} \le e^{- \sum \limits _{D_i \in S} \frac{|o.D_i - o'.D_i|^2}{2h_{D_i}^2}} < e^{- \sum \limits _{D_i \in S} \frac{\epsilon _{D_i}^2}{2h_{D_i}^2}}. \end{aligned}$$
That is, \(dc^{min}_S(o) \le dc_S(o, o') < dc_S^{\epsilon }\).
\(\square \)
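These bounds are easy to check numerically. The sketch below is ours and uses illustrative names; it assumes \(dc_S(o, o') = e^{-\sum _{D_i \in S} |o.D_i - o'.D_i|^2 / 2h_{D_i}^2}\) as in the displays above, with \(TN_S^{\epsilon ,o}\) taken as the objects within \(\epsilon \) of \(o\) in every dimension of \(S\) and \(LN_S^{\epsilon ,o}\) as those within \(\epsilon \) in at least one dimension, which is consistent with the three cases of the proof.

```python
import numpy as np

def check_theorem1(X, o, h, eps):
    """Check cases (i)-(iii) of Theorem 1 for query point o against data set X."""
    diff = np.abs(X - o)                             # |o.D_i - o'.D_i| per dimension
    dc = np.exp(-((diff / h) ** 2).sum(axis=1) / 2)  # dc_S(o, o') for every o' in O
    dc_eps = np.exp(-((eps / h) ** 2).sum() / 2)                # dc_S^eps
    dc_max = np.exp(-((diff.min(axis=0) / h) ** 2).sum() / 2)   # dc_S^max(o)
    dc_min = np.exp(-((diff.max(axis=0) / h) ** 2).sum() / 2)   # dc_S^min(o)
    tn = (diff <= eps).all(axis=1)   # TN: within eps in every dimension of S
    ln = (diff <= eps).any(axis=1)   # LN: within eps in at least one dimension
    assert (dc_eps <= dc[tn]).all() and (dc[tn] <= dc_max).all()              # (i)
    assert (dc_min <= dc[ln & ~tn]).all() and (dc[ln & ~tn] <= dc_max).all()  # (ii)
    assert (dc_min <= dc[~ln]).all() and (dc[~ln] < dc_eps).all()             # (iii)

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
h = 1.06 * X.std(axis=0) * len(X) ** (-0.2)  # simplified Silverman bandwidths
check_theorem1(X, X[0], h, eps=np.full(4, 0.5))
```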
Appendix 3: Proof of Corollary 1
Proof
We divide \(O\) into three disjoint subsets \(TN_S^{\epsilon ,o}\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). By Theorem 1, for objects belonging to \(TN_S^{\epsilon ,o}\), we have
$$\begin{aligned} |TN_S^{\epsilon ,o}| \, dc_S^\epsilon \le \sum \limits _{o' \in TN_S^{\epsilon ,o}} dc_S(o, o') \le |TN_S^{\epsilon ,o}| \, dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |O| - |LN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
As
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in TN_S^{\epsilon ,o}} dc_S(o, o') + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o'), \end{aligned}$$
we have
$$\begin{aligned} |TN_S^{\epsilon ,o}| \, dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le |LN_S^{\epsilon ,o}| \, dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then the upper bound is strict:
$$\begin{aligned} \tilde{f}_S(o) < |LN_S^{\epsilon ,o}| \, dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
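Continuing the sketch after Theorem 1, the combined bound above can be checked in a few lines. The names are again illustrative: `dc`, `tn`, `ln`, `dc_eps`, `dc_min` and `dc_max` are the arrays computed inside `check_theorem1`, and the bracket is the one reconstructed in this proof, so treat it as a sketch of the idea rather than the paper's exact bound.

```python
def corollary1_bounds(dc, tn, ln, dc_eps, dc_min, dc_max):
    """Bracket the quasi-density f = dc.sum() using only neighborhood counts."""
    n, n_tn, n_ln = len(dc), int(tn.sum()), int(ln.sum())
    lower = n_tn * dc_eps + (n - n_tn) * dc_min
    upper = n_ln * dc_max + (n - n_ln) * dc_eps
    assert lower <= dc.sum() <= upper
    return lower, upper
```

Bounds of this form are what allow a rank comparison with the query object to be decided, in favorable cases, without computing an object's exact quasi-density.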
Appendix 4: Proof of Corollary 2
Proof
Since \(O' \subseteq TN_S^{\epsilon ,o}\), we divide the objects in \(O \setminus O'\) into \(TN_S^{\epsilon ,o} \setminus O'\), \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\) and \(O \setminus LN_S^{\epsilon ,o}\). Then
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in O'} dc_S(o, o') + \sum \limits _{o' \in TN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o'). \end{aligned}$$
By Theorem 1, for objects belonging to \(TN_S^{\epsilon ,o} \setminus O'\), we have
$$\begin{aligned} \left( |TN_S^{\epsilon ,o}| - |O'|\right) dc_S^\epsilon \le \sum \limits _{o' \in TN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') \le \left( |TN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus TN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |LN_S^{\epsilon ,o}| - |TN_S^{\epsilon ,o}|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |O| - |LN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Thus,
$$\begin{aligned} \sum \limits _{o' \in O'} dc_S(o, o') + \left( |TN_S^{\epsilon ,o}| - |O'|\right) dc_S^\epsilon + \left( |O| - |TN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then the upper bound is strict:
$$\begin{aligned} \tilde{f}_S(o) < \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
Appendix 5: Proof of Corollary 3
Proof
Since \(TN_S^{\epsilon ,o} \subset O' \subseteq LN_S^{\epsilon ,o}\), we divide the objects in \(O \setminus O'\) into \(LN_S^{\epsilon ,o} \setminus O'\) and \(O \setminus LN_S^{\epsilon ,o}\). Then
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in O'} dc_S(o, o') + \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') + \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o'). \end{aligned}$$
By Theorem 1, for objects belonging to \(LN_S^{\epsilon ,o} \setminus O'\), we have
$$\begin{aligned} \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{min}_S(o) \le \sum \limits _{o' \in LN_S^{\epsilon ,o} \setminus O'} dc_S(o, o') \le \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o). \end{aligned}$$
For objects belonging to \(O \setminus LN_S^{\epsilon ,o}\), we have
$$\begin{aligned} \left( |O| - |LN_S^{\epsilon ,o}|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus LN_S^{\epsilon ,o}} dc_S(o, o') \le \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Thus,
$$\begin{aligned} \sum \limits _{o' \in O'} dc_S(o, o') + \left( |O| - |O'|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
Moreover, if \(LN_S^{\epsilon ,o} \subset O\), i.e. \(O \setminus LN_S^{\epsilon ,o} \ne \emptyset \), then the upper bound is strict:
$$\begin{aligned} \tilde{f}_S(o) < \sum \limits _{o' \in O'} dc_S(o, o') + \left( |LN_S^{\epsilon ,o}| - |O'|\right) dc^{max}_S(o) + \left( |O| - |LN_S^{\epsilon ,o}|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
Appendix 6: Proof of Corollary 4
Proof
Since \(LN_S^{\epsilon ,o} \subset O' \subseteq O\), we have \(O \setminus O' \subseteq O \setminus LN_S^{\epsilon ,o}\). Then
$$\begin{aligned} \tilde{f}_S(o) = \sum \limits _{o' \in O'} dc_S(o, o') + \sum \limits _{o' \in O \setminus O'} dc_S(o, o'). \end{aligned}$$
By Theorem 1, for objects belonging to \(O \setminus O'\), we have
$$\begin{aligned} \left( |O| - |O'|\right) dc^{min}_S(o) \le \sum \limits _{o' \in O \setminus O'} dc_S(o, o') \le \left( |O| - |O'|\right) dc_S^\epsilon . \end{aligned}$$
Thus,
$$\begin{aligned} \sum \limits _{o' \in O'} dc_S(o, o') + \left( |O| - |O'|\right) dc^{min}_S(o) \le \tilde{f}_S(o) \le \sum \limits _{o' \in O'} dc_S(o, o') + \left( |O| - |O'|\right) dc_S^\epsilon . \end{aligned}$$
\(\square \)
Appendix 7: Proof of Theorem 2
Proof
We prove by contradiction.
Consider a set of objects \(O\), a subspace \(S\), and two neighborhood distances \(\epsilon _1\) and \(\epsilon _2\). Let \(q \in O\) be the query object. For an object \(o \in O\), denote by \(L_{\epsilon _1}\) the lower bound of \(\tilde{f}_S(o)\) estimated with \(\epsilon _1\), and by \(U_{\epsilon _2}\) the upper bound of \(\tilde{f}_S(o)\) estimated with \(\epsilon _2\).
Assume that \(\tilde{f}_S(q) < L_{\epsilon _1}\) and \(\tilde{f}_S(q) > U_{\epsilon _2}\).
As \(L_{\epsilon _1}\) is a lower bound of \(\tilde{f}_S(o)\) and \(U_{\epsilon _2}\) is an upper bound of \(\tilde{f}_S(o)\), we have \(L_{\epsilon _1} \le \tilde{f}_S(o) \le U_{\epsilon _2}\). Then, \(\tilde{f}_S(q) < L_{\epsilon _1} \le \tilde{f}_S(o)\) and \(\tilde{f}_S(o) \le U_{\epsilon _2} < \tilde{f}_S(q)\). Consequently, \(\tilde{f}_S(o) < \tilde{f}_S(q) < \tilde{f}_S(o)\), a contradiction.
Thus, \(rank^{\epsilon _1}_S(q) = |\{o \in O \mid \tilde{f}_S(o) < \tilde{f}_S(q)\}|+1 =rank^{\epsilon _2}_S(q)\). \(\square \)
Appendix 8: Proof of Theorem 3
Proof
We prove by contradiction.
Let \(Ans\) be the set of minimal outlying subspaces of \(q\) found by OAMiner, and \(r_{best}\) the best rank. Assume that some subspace \(S \notin Ans\) with \(S \subseteq D\) and \(0 < |S| \le \ell \) is a minimal outlying subspace of \(q\).
Recall that OAMiner searches subspaces by traversing the subspace enumeration tree in a depth-first manner. As \(S \notin Ans\), \(S\) is pruned by Pruning Rule 1 or Pruning Rule 2.
In the case that \(S\) is pruned by Pruning Rule 1, \(S\) is not minimal. A contradiction.
In the case that \(S\) is pruned by Pruning Rule 2, there exists a subspace \(S'\) such that \(S'\) is an ancestor of \(S\) in the subspace enumeration tree and \(|Comp_{S'}(q)| \ge r_{best}\). By the property of competitors, we have \(Comp_{S'}(q) \subseteq Comp_S(q)\). Correspondingly, \(rank_S(q) \ge |Comp_S(q)| \ge |Comp_{S'}(q)| \ge r_{best}\). A contradiction. \(\square \)
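For readers who want the shape of the search that Theorem 3 reasons about, here is a schematic depth-first traversal of the set-enumeration tree with the two pruning tests placed as the proof uses them. This is a sketch, not the authors' OAMiner code: `rank_in(S, q)` and `num_competitors(S, q)` are hypothetical callables standing in for the paper's density-rank and competitor computations.

```python
def oaminer_dfs(S, start, dims, max_len, q, rank_in, num_competitors, state):
    """Depth-first walk of the set-enumeration tree with both pruning rules."""
    for i in range(start, len(dims)):
        child = S + [dims[i]]
        # Pruning Rule 1 (minimality): a proper superset of a subspace already
        # in the answer set cannot be a minimal outlying subspace.
        if any(set(a) <= set(child) for a in state["ans"]):
            continue
        r = rank_in(child, q)  # outlyingness rank of q in subspace `child`
        if r < state["r_best"]:
            state["r_best"], state["ans"] = r, [child]
        elif r == state["r_best"]:
            state["ans"].append(child)
        # Pruning Rule 2: Comp_{S'}(q) is contained in Comp_S(q) for every
        # descendant S of S', so once |Comp_{S'}(q)| >= r_best no descendant
        # of `child` can achieve a rank better than r_best.
        if len(child) < max_len and num_competitors(child, q) < state["r_best"]:
            oaminer_dfs(child, i + 1, dims, max_len, q,
                        rank_in, num_competitors, state)

# Typical call (with user-supplied rank_in / num_competitors):
#   state = {"r_best": n + 1, "ans": []}
#   oaminer_dfs([], 0, list(range(d)), ell, q, rank_in, num_competitors, state)
```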
Cite this article
Duan, L., Tang, G., Pei, J. et al. Mining outlying aspects on numeric data. Data Min Knowl Disc 29, 1116–1151 (2015). https://doi.org/10.1007/s10618-014-0398-2