Abstract
We tackle the novel problem of mining contrast subspaces. Given a set of multidimensional objects in two classes \(C_+\) and \(C_-\) and a query object \(o\), we want to find the top-\(k\) subspaces that maximize the ratio of the likelihood of \(o\) in \(C_+\) to that in \(C_-\). Such subspaces are very useful for characterizing an object and explaining how it differs between two classes. We demonstrate that this problem has important applications and, at the same time, is very challenging, being MAX SNP-hard. We present CSMiner, a mining method that uses kernel density estimation in conjunction with various pruning techniques. We experimentally investigate the performance of CSMiner on a range of data sets, evaluating its efficiency, effectiveness, and stability, and demonstrating that it is substantially faster than a baseline method.
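The problem statement above can be illustrated with a minimal brute-force sketch. This is not CSMiner: it applies no pruning and simply enumerates every non-empty subspace, so it is only practical for a handful of dimensions. The function names, the fixed `bandwidth`, and the small `eps` guard against division by zero are assumptions made for illustration and are not taken from the paper.

```python
import itertools
import numpy as np

def gaussian_kde_likelihood(o, data, bandwidth):
    """Gaussian-kernel density estimate at point o from the rows of data."""
    d = data.shape[1]
    sq_dist = np.sum((data - o) ** 2, axis=1)
    norm = (2.0 * np.pi) ** (d / 2.0) * bandwidth ** d
    return float(np.mean(np.exp(-sq_dist / (2.0 * bandwidth ** 2))) / norm)

def top_k_contrast_subspaces(o, pos, neg, k, bandwidth=1.0, eps=1e-12):
    """Return the k (ratio, dims) pairs with the largest likelihood ratio.

    Exhaustive enumeration of all non-empty subspaces (assumed baseline);
    bandwidth is kept fixed across subspaces, which is a simplification.
    """
    n_dims = len(o)
    scored = []
    for size in range(1, n_dims + 1):
        for dims in itertools.combinations(range(n_dims), size):
            idx = list(dims)
            l_pos = gaussian_kde_likelihood(o[idx], pos[:, idx], bandwidth)
            l_neg = gaussian_kde_likelihood(o[idx], neg[:, idx], bandwidth)
            scored.append((l_pos / (l_neg + eps), dims))
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Toy data (hypothetical): the classes differ only in dimension 0.
pos = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
neg = np.array([[5.0, 0.0], [5.0, 1.0], [5.0, -1.0]])
o = np.array([0.0, 0.0])
result = top_k_contrast_subspaces(o, pos, neg, k=3)
```

In this toy setting, the subspaces that include dimension 0 rank above the subspace containing only dimension 1, where the two classes are indistinguishable and the ratio is close to 1.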










Notes
While [8] presented a contrast-pattern-length-based algorithm to detect global outliers, their problem setting is different from ours.
Generally, given a set of observations \(Q\), the plausibility of two models \(M_1\) and \(M_2\) can be assessed by the Bayes factor \(K=\frac{\Pr(Q\mid M_1)}{\Pr(Q \mid M_2)}\).
If it is not unimodal, then there could be multiple peaks at different distances from the query, which is counter to intuition. Similarly, we have no basis for preferring any direction over another, so symmetry is natural.
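As a small worked example of the Bayes factor \(K\) from the notes above, the sketch below compares two fully specified Bernoulli models on a short sequence of 0/1 observations. The Bernoulli setting, the function names, and the parameter values are assumptions chosen purely for illustration; with point hypotheses like these, the Bayes factor reduces to a plain likelihood ratio.

```python
def bernoulli_likelihood(observations, p):
    """Pr(Q | M) for independent 0/1 observations under a Bernoulli(p) model."""
    ones = sum(observations)
    zeros = len(observations) - ones
    return p ** ones * (1.0 - p) ** zeros

def bayes_factor(observations, p1, p2):
    """K = Pr(Q | M1) / Pr(Q | M2) for two point-hypothesis Bernoulli models."""
    return bernoulli_likelihood(observations, p1) / bernoulli_likelihood(observations, p2)

# Hypothetical observations Q and models M1 (p = 0.75) and M2 (p = 0.5).
q = [1, 1, 1, 0]
k_value = bayes_factor(q, 0.75, 0.5)
```

A value of \(K\) above 1 indicates that the observations are more plausible under \(M_1\) than under \(M_2\).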
References
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. ACM Sigmod Rec 30:37–46
Bache K, Lichman M (2013) UCI machine learning repository
Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Discov 5(3):213–246
Beyer K, Goldstein J, Ramakrishnan R, Shaft U (1999) When is “nearest neighbor” meaningful? In: Proc. of the 7th Int’l Conf on Database Theory, pp 217–235
Böhm K, Keller F, Müller E, Nguyen HV, Vreeken J (2013) CMI: An information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: Proc. of the 13th SIAM Int’l Conf on Data Min, pp 198–206
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: Identifying density-based local outliers. In: Proc. of the 2000 ACM SIGMOD Int’l Conf on Manag of data, pp 93–104
Cai Y, Zhao HK, Han H, Lau RYK, Leung HF, Min H (2012) Answering typicality query based on automatically prototype construction. In: Proc. of the 2012 IEEE/WIC/ACM Int’l Joint Conf Web Intell Intell Agent Technol, 01:362–366
Chen L, Dong G (2006) Masquerader detection using OCLEP: one class classification using length statistics of emerging patterns. In: Proc. of Int’l Workshop on Information Processing over Evolving Networks (WINPEN), p 5
Dong G, Bailey J (eds) (2013) Contrast data mining: concepts, algorithms, and applications. CRC Press, Boca Raton
Dong G, Li J (1999) Efficient mining of emerging patterns: discovering trends and differences. In: Proc. of the 5th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pp 43–52
Duan L, Tang G, Pei J, Bailey J, Dong G, Campbell A, Tang C (2014) Mining contrast subspaces. In: Proc. of the 18th Pacific-Asia Conf on Knowledge Discovery and Data Mining, pp 249–260
Fagin R, Kumar R, Sivakumar D (2003) Comparing top k lists. In: Proc. of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 28–36
He Z, Xu X, Huang ZJ, Deng S (2005) FP-outlier: frequent pattern based outlier detection. Comput Sci Inf Syst 2(1):103–118
Hua M, Pei J, Fu AW, Lin X, Leung HF (2009) Top-k typicality queries and efficient query answering methods on large databases. VLDB J 18(3):809–835
Jeffreys H (1961) The theory of probability, 3rd edn. Oxford
Keller F, Müller E, Böhm K (2012) HiCS: high contrast subspaces for density-based outlier ranking. In: Proc. of the IEEE 28th Int’l Conf on Data Engineering, pp 1037–1048
Kriegel HP, Schubert M, Zimek A (2008) Angle-based outlier detection in high-dimensional data. In: Proc. of the 14th ACM SIGKDD Int’l Conf on Knowledge Discovery and Data Mining, pp 444–452
Kriegel HP, Kröger P, Schubert E, Zimek A (2009) Outlier detection in axis-parallel subspaces of high dimensional data. In: Proc. of the 13th Pacific-Asia Conf on Knowledge Discovery and Data Mining, pp 831–838
Novak PK, Lavrac N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403
Papadimitriou CH, Yannakakis M (1991) Optimization, approximation, and complexity classes. J Comput Syst Sci 43(3):425–440
Rymon R (1992) Search through systematic set enumeration. In: Proc. of the 3rd Int’l Conf on Principles of Knowledge Representation and Reasoning, pp 539–550
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall/CRC, London
Wang L, Zhao H, Dong G, Li J (2005) On the complexity of finding emerging patterns. Theor Comput Sci 335(1):15–27
Webber W, Moffat A, Zobel J (2010) A similarity measure for indefinite rankings. ACM Trans Inf Syst 28(4):20:1–20:38
Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Proc. of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, pp 78–87
Wu S, Crestani F (2003) Methods for ranking information retrieval systems without relevance judgments. In: Proc. of the 2003 ACM Symposium on Applied Computing. ACM, New York, NY, USA, pp 811–816
Acknowledgments
The authors are grateful to the editor and the anonymous reviewers for their constructive comments, which helped to improve this paper. Lei Duan’s research was supported in part by the National Natural Science Foundation of China (Grant No. 61103042), the China Postdoctoral Science Foundation (Grant No. 2014M552371), and SRFDP 20100181120029. Jian Pei’s and Guanting Tang’s research was supported in part by an NSERC Discovery grant and a BCIC NRAS Team Project. James Bailey’s work was supported by an ARC Future Fellowship (FT110100112). Work by Lei Duan and Guozhu Dong at Simon Fraser University was supported in part by an Ebco/Eppich visiting professorship. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
Cite this article
Duan, L., Tang, G., Pei, J. et al. Efficient discovery of contrast subspaces for object explanation and characterization. Knowl Inf Syst 47, 99–129 (2016). https://doi.org/10.1007/s10115-015-0835-6
DOI: https://doi.org/10.1007/s10115-015-0835-6