Abstract
Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much of the information in the continuous original as possible. In particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) the merge reduces the MDL encoding cost. To assess the similarity, we propose \( ID \), a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast, well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
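The per-dimension merge loop the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: `id_dist`, `mdl_cost`, and the threshold `alpha` are placeholders standing in for the paper's interaction distance \( ID \), its MDL encoding cost, and its statistical similarity test.

```python
import numpy as np

def greedy_merge(bin_data, id_dist, mdl_cost, alpha=0.05):
    """Illustrative sketch of a greedy consecutive-bin merge (not the paper's code).

    bin_data : list of arrays; each holds the rows (over all other
               dimensions) falling into one initial micro bin.
    id_dist  : callable, distance between the distributions of two bins.
    mdl_cost : callable, MDL-style encoding cost of a binning.
    """
    bins = list(bin_data)
    improved = True
    while improved and len(bins) > 1:
        improved = False
        # find the pair of consecutive bins with the smallest distance
        dists = [id_dist(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(dists))
        merged = bins[:i] + [np.vstack([bins[i], bins[i + 1]])] + bins[i + 2:]
        # merge only if (a) the multivariate distributions are similar
        # and (b) the merge lowers the encoding cost
        if dists[i] < alpha and mdl_cost(merged) < mdl_cost(bins):
            bins = merged
            improved = True
    return bins
```

With a toy distance and a cost that simply counts bins, two statistically close bins are merged while a distant one is kept separate.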
References
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46.
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105.
Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424.
Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843
Allen JF, Ferguson G (1994) Actions and events in interval temporal logic. J Log Comput 4(5):531–579
Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087
Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512
Bay SD, Pazzani MJ (1999) Detecting change in categorical data: mining contrast sets. In: KDD, p 302–306.
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104.
Bu S, Lakshmanan LVS, Ng RT (2005) MDL summarization with holes. In: VLDB, p 433–444.
Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93.
Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029.
Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264.
Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226
Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474.
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86
Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562.
Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128.
Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. In: AISTATS, JMLR Proceedings 2:219–226.
Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777.
Lee J, Verleysen M (2007) Nonlinear dimensionality reduction. Springer, New York
Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303.
Liu R, Yang L (2008) Kernel estimation of multivariate cumulative distribution function. J Nonparametr Stat 20(8):661–677
Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44
Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185
Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541.
Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184.
Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206.
Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742
Preuß P, Puchstein R, Dette H (2013) Detection of multiple structural breaks in multivariate time series. arXiv:1309.1309v1.
Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524
Rissanen J (1978) Modeling by shortest data description. Automatica 14(1):465–471
Rissanen J (1983) A universal prior for integers and estimation by minimum description length. Ann Stat 11(2):416–431
Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. VI. Bayesian block representations. Astrophys J 764(2)
Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, London
Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597.
Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863
Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214
Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638
Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645
Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910
Acknowledgments
We thank the anonymous reviewers for their insightful comments. Hoang-Vu Nguyen is supported by the German Research Foundation (DFG) within GRK 1194. Emmanuel Müller is supported by the YIG program of KIT as part of the German Excellence Initiative. Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government. Emmanuel Müller and Jilles Vreeken are supported by Post-Doctoral Fellowships of the Research Foundation—Flanders (FWO).
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.
Appendix
1.1 Proof of Theorem 2
Proof
(Theorem 2) Let \(H(\mathbf {A}) = P(\mathbf {A}) - R(\mathbf {A})\) and \(G(\mathbf {A}) = R(\mathbf {A}) - Q(\mathbf {A})\). The inequality becomes
which in turn is equivalent to
which is also known as Hölder’s inequality. \(\square \)
1.2 Proof of Theorem 3
Proof
(Theorem 3) Let \( ind (\alpha )\) be an indicator function with value 1 if \(\alpha \) is true and 0 otherwise. It holds
Using empirical data, we hence have
and therefore \([ ID (p(\mathbf {A})\; ||\; q(\mathbf {A}))]^2 \) equals
Expanding the above term and bringing the integrals inside the sums, we have
by which we arrive at the final result. \(\square \)
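The expansion above, with the integrals brought inside the sums, yields a closed-form plug-in estimate. As an illustration of this kind of computation (not necessarily the paper's exact \( ID \) formula), the squared \(L_2\) distance between the empirical CDFs of two samples on \([0,1]^d\) expands into pairwise product sums, since the integral of a product of two indicator functions \( ind (\mathbf{x} \le \mathbf{a})\, ind (\mathbf{y} \le \mathbf{a})\) over the unit cube is \(\prod _k (1 - \max (x_k, y_k))\):

```python
import numpy as np

def cdf_distance_sq(X, Y):
    """Squared L2 distance between the empirical CDFs of samples X and Y
    on [0, 1]^d, computed in closed form. Illustrative sketch only; the
    paper's ID may differ in detail.
    """
    def kernel_sum(A, B):
        # sum over all pairs of prod_k (1 - max(a_k, b_k)): the integral
        # of the product of the two indicator functions over [0, 1]^d
        K = np.prod(1.0 - np.maximum(A[:, None, :], B[None, :, :]), axis=2)
        return K.sum()
    n, m = len(X), len(Y)
    return (kernel_sum(X, X) / n**2
            - 2.0 * kernel_sum(X, Y) / (n * m)
            + kernel_sum(Y, Y) / m**2)
```

Like the distance in the paper, this estimate assumes no parametric distribution and costs only pairwise sums over the two samples.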
1.3 Proof of Theorem 5
Proof
(Theorem 5) Consider a discretization \(dsc_i\) on dimension \(X_i\) with \(k_i\) macro bins \(\{B_i^1, \ldots , B_i^{k_i}\}\). We have
Let \(dsc_i^{T_i}\) be the discretization that puts each micro bin into a separate macro bin. We have
Let \(dsc_i^{ opt }\) and \(dsc_i^{ gr }\) be the discretization yielded by IPD \(_{ opt }\) and IPD \(_{ gr }\), respectively.
Let \(dsc_i\) be a discretization that merges two micro bins with a low interaction distance into the same macro bin and places each of the other micro bins into a separate macro bin. It holds that
Thus, \(L(\mathbf {M}_i, dsc_i) < L(\mathbf {M}_i, dsc_i^{T_i})\), i.e., merging two consecutive micro bins with a low interaction distance as its first step yields an encoding cost lower than that of \(dsc_i^{T_i}\). Hence, IPD \(_{ gr }\) will proceed beyond this step, and so \(L(\mathbf {M}_i, dsc_i^{ gr }) \le L(\mathbf {M}_i, dsc_i)\). We have \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le \frac{L(\mathbf {M}_i, dsc_i)}{T_i \log T_i}\). This leads to
Let \( RHS \) be the right hand side of (20). It holds that \(\lim \limits _{T_i \rightarrow \infty } RHS = 2\) as \(\lim \limits _{T_i \rightarrow \infty } \frac{L_{\mathbb {N}}(T_i - 1)}{T_i \log T_i} = 0\) (Grünwald 2007). In other words, as \(\displaystyle T_i \rightarrow \infty \), \(\frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le 2\). Therefore, asymptotically IPD \(_{ gr }\) is a \(2\)-approximation algorithm of IPD \(_{ opt }\). \(\square \)
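IPD \(_{ opt }\), against which the greedy variant is compared above, can be realized by dynamic programming over the micro-bin sequence. The following is a hedged sketch under the simplifying assumption that the total cost decomposes as a sum of per-macro-bin costs `bin_cost(j, i)` (micro bins \(j\) through \(i\) inclusive); it is an illustration of the optimal-merge idea, not the paper's exact algorithm or cost.

```python
def optimal_merge(T, bin_cost):
    """DP sketch: optimal grouping of T consecutive micro bins into macro
    bins, assuming the total cost is additive over macro bins. Returns the
    optimal cost and the macro-bin boundaries. Illustration only.
    """
    INF = float("inf")
    best = [0.0] + [INF] * T          # best[i]: optimal cost of the first i micro bins
    cut = [0] * (T + 1)               # cut[i]: start of the last macro bin
    for i in range(1, T + 1):
        for j in range(i):
            c = best[j] + bin_cost(j, i - 1)
            if c < best[i]:
                best[i], cut[i] = c, j
    # recover the macro-bin boundaries by walking the cut points backwards
    bounds, i = [], T
    while i > 0:
        bounds.append((cut[i], i - 1))
        i = cut[i]
    return best[T], bounds[::-1]
```

The double loop makes this \(O(T_i^2)\) evaluations of the per-bin cost, which is the typical price of exact optimality that the greedy heuristic avoids.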
1.4 Proof of Theorem 6
Proof
(Theorem 6) We assume that there are \((T_i - 1) \epsilon \) pairs of consecutive micro bins of \(X_i\) that have low interaction distance (\(0 \le \epsilon \le 1\)), i.e., \((T_i - 1) (1 - \epsilon )\) pairs have a large interaction distance. We have
This means \(\frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le \)
Let \( RHS \) be the right hand side of (22). Note that \(\lim \limits _{T_i \rightarrow \infty } RHS = 2 - \epsilon \). In other words, as \(T_i \rightarrow \infty \), \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le 2 - \epsilon \). \(\square \)
Rights and permissions
About this article
Cite this article
Nguyen, HV., Müller, E., Vreeken, J. et al. Unsupervised interaction-preserving discretization of multivariate data. Data Min Knowl Disc 28, 1366–1397 (2014). https://doi.org/10.1007/s10618-014-0350-5