Discovering outlying aspects in large datasets

Vinh, Nguyen Xuan; Chan, Jeffrey; Romano, Simone; Bailey, James; Leckie, Christopher; Ramamohanarao, Kotagiri; Pei, Jian

doi:10.1007/s10618-016-0453-2

Published: 09 February 2016

Volume 30, pages 1520–1555 (2016)

Data Mining and Knowledge Discovery

Nguyen Xuan Vinh¹, Jeffrey Chan¹, Simone Romano¹, James Bailey¹, Christopher Leckie¹, Kotagiri Ramamohanarao¹ & Jian Pei²

Abstract

We address the problem of outlying aspects mining: given a query object and a reference multidimensional data set, how can we discover which aspects (i.e., subsets of features, or subspaces) make the query object most outlying? Outlying aspects mining can be used to explain any data point of interest, which may itself be an inlier or an outlier. In this paper, we investigate several open challenges faced by existing outlying aspects mining techniques and propose novel solutions, including (a) how to design effective scoring functions that are unbiased with respect to dimensionality yet computationally efficient, and (b) how to efficiently search the exponentially large space of all possible subspaces. We formalize the concept of dimensionality unbiasedness, a desirable property of outlyingness measures. We then characterize existing scoring measures, as well as our newly proposed ones, in terms of efficiency, dimensionality unbiasedness and interpretability. Finally, we evaluate the effectiveness of different methods for outlying aspects discovery and demonstrate the utility of our proposed approach on large real and synthetic data sets.
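
As a rough illustration of the problem statement above (not the scoring functions or search strategy proposed in the paper), the following sketch enumerates all subspaces up to a small size by brute force and ranks them for one query point. The k-th nearest neighbour distance used as the raw score, the Z-score standardisation within each subspace as a simple guard against dimensionality bias, and the names `kth_nn_distance` and `outlying_aspects` are all assumptions made for this example.

```python
from itertools import combinations

import numpy as np


def kth_nn_distance(points, k):
    """Distance from each row of `points` to its k-th nearest other row."""
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero self-distance,
    # so column k holds the k-th nearest neighbour (requires n > k).
    return np.sort(dists, axis=1)[:, k]


def outlying_aspects(data, query_index, max_dim=2, k=5, top=3):
    """Rank feature subsets by how outlying the query row is in each.

    Assumption for illustration: the raw score is the k-NN distance, and it
    is standardised (Z-score) against the scores of all rows in the same
    subspace so that subspaces of different sizes are comparable.
    """
    n_features = data.shape[1]
    results = []
    for size in range(1, max_dim + 1):
        for subspace in combinations(range(n_features), size):
            raw = kth_nn_distance(data[:, list(subspace)], k)
            spread = raw.std()
            z = (raw[query_index] - raw.mean()) / spread if spread > 0 else 0.0
            results.append((float(z), subspace))
    # Highest standardised score first: the most outlying aspects.
    results.sort(key=lambda item: item[0], reverse=True)
    return results[:top]
```

Brute-force enumeration is exponential in the number of features, which is the search-space challenge (b) named in the abstract, so this sketch is only practical for small values of `max_dim`.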

Acknowledgments

This work is supported by the Australian Research Council via Grant Numbers FT110100112 and DP140101969.

Author information

Authors and Affiliations

The University of Melbourne, Melbourne, Australia
Nguyen Xuan Vinh, Jeffrey Chan, Simone Romano, James Bailey, Christopher Leckie & Kotagiri Ramamohanarao
Simon Fraser University, Burnaby, Canada
Jian Pei

Corresponding author

Correspondence to Nguyen Xuan Vinh.

Additional information

Responsible editor: Charu Aggarwal.

About this article

Cite this article

Vinh, N.X., Chan, J., Romano, S. et al. Discovering outlying aspects in large datasets. Data Min Knowl Disc 30, 1520–1555 (2016). https://doi.org/10.1007/s10618-016-0453-2

Received: 03 May 2015
Accepted: 25 January 2016
Published: 09 February 2016
Issue Date: November 2016
DOI: https://doi.org/10.1007/s10618-016-0453-2
