Efficient discovery of contrast subspaces for object explanation and characterization

Duan, Lei; Tang, Guanting; Pei, Jian; Bailey, James; Dong, Guozhu; Nguyen, Vinh; Campbell, Akiko; Tang, Changjie

doi:10.1007/s10115-015-0835-6

Regular Paper
Published: 26 April 2015

Volume 47, pages 99–129 (2016)
Knowledge and Information Systems

Lei Duan¹, Guanting Tang², Jian Pei², James Bailey³, Guozhu Dong⁴, Vinh Nguyen³, Akiko Campbell⁵ & Changjie Tang¹

Abstract

We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes $C_+$ and $C_-$ and a query object $o$, we want to find the top-$k$ subspaces that maximize the ratio of likelihood of $o$ in $C_+$ against that in $C_-$. Such subspaces are very useful for characterizing an object and explaining how it differs between two classes. We demonstrate that this problem has important applications, and, at the same time, is very challenging, being MAX SNP-hard. We present CSMiner, a mining method that uses kernel density estimation in conjunction with various pruning techniques. We experimentally investigate the performance of CSMiner on a range of data sets, evaluating its efficiency, effectiveness, and stability and demonstrating it is substantially faster than a baseline method.
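
The problem statement above can be illustrated with a minimal brute-force sketch. It is not CSMiner: it uses no pruning and simply enumerates every non-empty subspace, so it is only practical for a handful of dimensions. The function names `kde_likelihood` and `top_k_contrast_subspaces`, the fixed `bandwidth`, the Gaussian product kernel, and the small `eps` guard against division by zero are all assumptions made for illustration and are not taken from the paper.

```python
import heapq
from itertools import combinations

import numpy as np


def kde_likelihood(query, data, dims, bandwidth):
    """Gaussian-kernel density estimate of `query` within the subspace `dims`.

    Assumption: one fixed bandwidth shared by all dimensions.
    """
    dims = list(dims)
    diffs = data[:, dims] - query[dims]        # shape (n_objects, len(dims))
    sq_dist = np.sum(diffs ** 2, axis=1)       # squared distance to the query
    norm = (np.sqrt(2.0 * np.pi) * bandwidth) ** len(dims)
    return float(np.mean(np.exp(-sq_dist / (2.0 * bandwidth ** 2))) / norm)


def top_k_contrast_subspaces(query, pos, neg, k=3, bandwidth=0.5, eps=1e-12):
    """Return the k subspaces with the largest likelihood ratio C+ / C-.

    Exhaustive enumeration over all 2^d - 1 non-empty subspaces.
    """
    n_dims = query.shape[0]
    scored = []
    for size in range(1, n_dims + 1):
        for dims in combinations(range(n_dims), size):
            l_pos = kde_likelihood(query, pos, dims, bandwidth)
            l_neg = kde_likelihood(query, neg, dims, bandwidth)
            # eps is an illustrative guard, not part of the problem definition
            scored.append((l_pos / (l_neg + eps), dims))
    return heapq.nlargest(k, scored)


# Synthetic data: the two classes differ only in dimension 0.
rng = np.random.default_rng(0)
pos = np.column_stack([rng.normal(0.0, 0.5, 200),
                       rng.normal(0.0, 1.0, 200),
                       rng.normal(0.0, 1.0, 200)])
neg = np.column_stack([rng.normal(3.0, 0.5, 200),
                       rng.normal(0.0, 1.0, 200),
                       rng.normal(0.0, 1.0, 200)])
query = np.zeros(3)

for ratio, dims in top_k_contrast_subspaces(query, pos, neg, k=3):
    print(dims, ratio)
```

On this synthetic data the highest-ranked subspaces all contain dimension 0, the only dimension in which the query is typical of $C_+$ and atypical of $C_-$. The exponential number of candidate subspaces in this sketch is what makes pruning necessary in practice.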

Notes

While [8] presented a contrast-pattern-length-based algorithm to detect global outliers, their problem setting is different from ours.
Generally, given a set of observations $Q$, the plausibility of two models $M_1$ and $M_2$ can be assessed by the Bayes factor $K=\frac{Pr(Q\mid M_1)}{Pr(Q \mid M_2)}$.
If it is not unimodal, then there could be multiple peaks at different distances from the query, which is counter to intuition. Similarly, we have no basis for preferring any direction over another, so symmetry is natural.
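
The Bayes factor in the second note specializes directly to the contrast measure described in the abstract. As a sketch, take the observation set to be the single query object, $Q=\{o\}$, and let $M_1$ and $M_2$ be the density estimates of $C_+$ and $C_-$ restricted to a subspace $S$. The symbols $L_S(\cdot \mid \cdot)$ and $LC_S$ are notation assumed here for illustration.

```latex
\[
LC_S(o) \;=\; \frac{L_S(o \mid C_+)}{L_S(o \mid C_-)}
\;=\; \frac{Pr(\{o\} \mid M_1)}{Pr(\{o\} \mid M_2)} \;=\; K
\]
```

Under this reading, a value above 1 means that $o$ is more plausible under $C_+$ than under $C_-$ in $S$, and the top-$k$ subspaces are those with the largest such ratio.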

References

1. Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. ACM Sigmod Rec 30:37–46
2. Bache K, Lichman M (2013) UCI machine learning repository
3. Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246
4. Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proc. of the 7th Int’l Conf on Database Theory, pp 217–235
5. Böhm K, Keller F, Müller E, Nguyen HV, Vreeken J (2013) CMI: An information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: Proc. of the 13th SIAM Int’l Conf on Data Min, pp 198–206
6. Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: Identifying density-based local outliers. In: Proc. of the 2000 ACM SIGMOD Int’l Conf on Manag of data, pp 93–104
7. Cai Y, Zhao HK, Han H, Lau RYK, Leung HF, Min H (2012) Answering typicality query based on automatically prototype construction. In: Proc. of the 2012 IEEE/WIC/ACM Int’l Joint Conf Web Intell Intell Agent Technol, 01:362–366
8. Chen L, Dong G (2006) Masquerader detection using OCLEP: one class classification using length statistics of emerging patterns. In: Proc. of Int’l workshop on information Processing over Evolving Networks (WINPEN), p 5
9. Dong G, Bailey J (eds) (2013) Contrast data mining: concepts, algorithms, and applications. CRC Press, Boca Raton
10. Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proc. of the 5th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pp 43–52
11. Duan L, Tang G, Pei J, Bailey J, Dong G, Campbell A, Tang C (2014) Mining contrast subspaces. In: Proc. of the 18th Pacific-Asia Conf on Knowledge Discovery and Data Mining, pp 249–260
12. Fagin R, Kumar R, Sivakumar D (2003) Comparing top k lists. In: Proc. of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 28–36
13. He Z, Xu X, Huang ZJ, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118
14. Hua M, Pei J, Fu AW, Lin X, Leung HF (2009) Top-k typicality queries and efficient query answering methods on large databases. VLDB J 18(3):809–835
15. Jeffreys H (1961) The theory of probability, 3rd edn. Oxford
16. Keller F, Müller E, Böhm K (2012) HiCS: high contrast subspaces for density-based outlier ranking. In: Proc. of the IEEE 28th Int’l Conf on Data Engineering, pp 1037–1048
17. Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proc. of the 14th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pp 444–452
18. Kriegel HP, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Proc. of the 13th Pacific-Asia Conf on Knowledge Discovery and Data Mining, pp 831–838
19. Novak PK, Lavrac N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403
20. Papadimitriou CH, Yannakakis M (1991) Optimization, approximation, and complexity classes. J Comput Syst Sci 43(3):425–440
21. Rymon R (1992) Search through systematic set enumeration. In: Proc. of the 3rd Int’l Conf on Principles of Knowledge Representation and Reasoning, pp 539–550
22. Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall/CRC, London
23. Wang L, Zhao H, Dong G, Li J (2005) On the complexity of finding emerging patterns. Theor Comput Sci 335(1):15–27
24. Webber W, Moffat A, Zobel J (2010) A similarity measure for indefinite rankings. ACM Trans Inf Syst 28(4):20:1–20:38
25. Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proc. of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, pp 78–87
26. Wu S, Crestani F (2003) Methods for ranking information retrieval systems without relevance judgments. In: Proc. of the 2003 ACM Symposium on Applied Computing. ACM, New York, NY, USA, pp 811–816

Acknowledgments

The authors are grateful to the editor and the anonymous reviewers for their constructive comments, which helped to improve this paper. Lei Duan’s research was supported in part by the National Natural Science Foundation of China (Grant No. 61103042), the China Postdoctoral Science Foundation (Grant No. 2014M552371), and SRFDP 20100181120029. Jian Pei’s and Guanting Tang’s research was supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey’s work was supported by an ARC Future Fellowship (FT110100112). Work by Lei Duan and Guozhu Dong at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

Author information

Authors and Affiliations

School of Computer Science, Sichuan University, Chengdu, Sichuan, China
Lei Duan & Changjie Tang
School of Computing Science, Simon Fraser University, Burnaby, BC, Canada
Guanting Tang & Jian Pei
Department of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, Australia
James Bailey & Vinh Nguyen
Department of Computer Science and Engineering, Wright State University, Dayton, OH, USA
Guozhu Dong
Pacific Blue Cross, Burnaby, BC, Canada
Akiko Campbell

Corresponding author

Correspondence to Lei Duan.

About this article

Cite this article

Duan, L., Tang, G., Pei, J. et al. Efficient discovery of contrast subspaces for object explanation and characterization. Knowl Inf Syst 47, 99–129 (2016). https://doi.org/10.1007/s10115-015-0835-6

Received: 30 November 2014
Accepted: 07 April 2015
Published: 26 April 2015
Issue Date: April 2016
DOI: https://doi.org/10.1007/s10115-015-0835-6
