Abstract
Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much of the information in the continuous original as possible. In particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) the merge reduces the MDL encoding cost. To assess the similarity, we propose \( ID \), a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast, well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
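The per-dimension merge loop the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: `id_dist`, `mdl_cost`, and the threshold `alpha` are placeholders standing in for the paper's interaction distance \( ID \), its MDL encoding cost, and its statistical similarity test.

```python
import numpy as np

def greedy_merge(bin_data, id_dist, mdl_cost, alpha=0.05):
    """Illustrative sketch of a greedy consecutive-bin merge (not the paper's code).

    bin_data : list of arrays; each holds the rows (over all other
               dimensions) falling into one initial micro bin.
    id_dist  : callable, distance between the distributions of two bins.
    mdl_cost : callable, MDL-style encoding cost of a binning.
    """
    bins = list(bin_data)
    improved = True
    while improved and len(bins) > 1:
        improved = False
        # find the pair of consecutive bins with the smallest distance
        dists = [id_dist(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = int(np.argmin(dists))
        merged = bins[:i] + [np.vstack([bins[i], bins[i + 1]])] + bins[i + 2:]
        # merge only if (a) the multivariate distributions are similar
        # and (b) the merge lowers the encoding cost
        if dists[i] < alpha and mdl_cost(merged) < mdl_cost(bins):
            bins = merged
            improved = True
    return bins
```

With a toy distance and a cost that simply counts bins, two statistically close bins are merged while a distant one is kept separate.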
References
Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46.
Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105.
Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424.
Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843
Allen JF, Ferguson G (1994) Actions and events in interval temporal logic. J Log Comput 4(5):531–579
Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087
Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512
Bay SD, Pazzani MJ (1999) Detecting change in categorical data: mining contrast sets. In: KDD, p 302–306.
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598
Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104.
Bu S, Lakshmanan LVS, Ng RT (2005) MDL summarization with holes. In: VLDB, p 433–444.
Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93.
Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York
Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029.
Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264.
Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226
Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474.
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86
Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562.
Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128.
Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. In: AISTATS, JMLR Proceedings 2:219–226.
Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777.
Lee J, Verleysen M (2007) Nonlinear dimensionality reduction. Springer, New York
Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303.
Liu R, Yang L (2008) Kernel estimation of multivariate cumulative distribution function. J Nonparametr Stat 20(8):661–677
Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44
Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185
Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541.
Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184.
Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206.
Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742
Preuß P, Puchstein R, Dette H (2013) Detection of multiple structural breaks in multivariate time series. arXiv:1309.1309v1.
Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524
Rissanen J (1978) Modeling by shortest data description. Automatica 14(1):465–471
Rissanen J (1983) A universal prior for integers and estimation by minimum description length. Ann Stat 11(2):416–431
Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. VI. Bayesian block representations. Astrophys J 764(2)
Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635
Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, London
Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597.
Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863
Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290
Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214
Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638
Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645
Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910
Acknowledgments
We thank the anonymous reviewers for their insightful comments. Hoang-Vu Nguyen is supported by the German Research Foundation (DFG) within GRK 1194. Emmanuel Müller is supported by the YIG program of KIT as part of the German Excellence Initiative. Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government. Emmanuel Müller and Jilles Vreeken are supported by Post-Doctoral Fellowships of the Research Foundation—Flanders (FWO).
Author information
Authors and Affiliations
Corresponding author
Additional information
Responsible editor: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.
Appendix
1.1 Proof of Theorem 2
Proof
(Theorem 2) Let \(H(\mathbf {A}) = P(\mathbf {A}) - R(\mathbf {A})\) and \(G(\mathbf {A}) = R(\mathbf {A}) - Q(\mathbf {A})\). The inequality becomes
which in turn is equivalent to
which is also known as Hölder’s inequality. \(\square \)
1.2 Proof of Theorem 3
Proof
(Theorem 3) Let \( ind (\alpha )\) be an indicator function with value 1 if \(\alpha \) is true and 0 otherwise. It holds
Using empirical data, we hence have
and therefore \([ ID (p(\mathbf {A})\; ||\; q(\mathbf {A}))]^2 \) equals
Expanding the above term and bringing the integrals inside the sums, we have
by which we arrive at the final result. \(\square \)
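The expansion above, with the integrals brought inside the sums, yields a closed-form plug-in estimate. As an illustration of this kind of computation (not necessarily the paper's exact \( ID \) formula), the squared \(L_2\) distance between the empirical CDFs of two samples on \([0,1]^d\) expands into pairwise product sums, since the integral of a product of two indicator functions \( ind (\mathbf{x} \le \mathbf{a})\, ind (\mathbf{y} \le \mathbf{a})\) over the unit cube is \(\prod _k (1 - \max (x_k, y_k))\):

```python
import numpy as np

def cdf_distance_sq(X, Y):
    """Squared L2 distance between the empirical CDFs of samples X and Y
    on [0, 1]^d, computed in closed form. Illustrative sketch only; the
    paper's ID may differ in detail.
    """
    def kernel_sum(A, B):
        # sum over all pairs of prod_k (1 - max(a_k, b_k)): the integral
        # of the product of the two indicator functions over [0, 1]^d
        K = np.prod(1.0 - np.maximum(A[:, None, :], B[None, :, :]), axis=2)
        return K.sum()
    n, m = len(X), len(Y)
    return (kernel_sum(X, X) / n**2
            - 2.0 * kernel_sum(X, Y) / (n * m)
            + kernel_sum(Y, Y) / m**2)
```

Like the distance in the paper, this estimate assumes no parametric distribution and costs only pairwise sums over the two samples.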
1.3 Proof of Theorem 5
Proof
(Theorem 5) Consider a discretization \(dsc_i\) on dimension \(X_i\) with \(k_i\) macro bins \(\{B_i^1, \ldots , B_i^{k_i}\}\). We have
Let \(dsc_i^{T_i}\) be the discretization that puts each micro bin into a separate macro bin. We have
Let \(dsc_i^{ opt }\) and \(dsc_i^{ gr }\) be the discretization yielded by IPD \(_{ opt }\) and IPD \(_{ gr }\), respectively.
Let \(dsc_i\) be a discretization that merges two micro bins with a low interaction distance into the same macro bin and places each of the other micro bins into a separate macro bin. It holds that
Thus, \(L(\mathbf {M}_i, dsc_i) < L(\mathbf {M}_i, dsc_i^{T_i})\), i.e., merging two consecutive micro bins with a low interaction distance as its first step yields an encoding cost lower than that of \(dsc_i^{T_i}\). Hence, IPD \(_{ gr }\) will proceed beyond this step, and so \(L(\mathbf {M}_i, dsc_i^{ gr }) \le L(\mathbf {M}_i, dsc_i)\). We have \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le \frac{L(\mathbf {M}_i, dsc_i)}{T_i \log T_i}\). This leads to
Let \( RHS \) be the right hand side of (20). It holds that \(\lim \limits _{T_i \rightarrow \infty } RHS = 2\) as \(\lim \limits _{T_i \rightarrow \infty } \frac{L_{\mathbb {N}}(T_i - 1)}{T_i \log T_i} = 0\) (Grünwald 2007). In other words, as \(\displaystyle T_i \rightarrow \infty \), \(\frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le 2\). Therefore, asymptotically IPD \(_{ gr }\) is a \(2\)-approximation algorithm of IPD \(_{ opt }\). \(\square \)
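IPD \(_{ opt }\), against which the greedy variant is compared above, can be realized by dynamic programming over the micro-bin sequence. The following is a hedged sketch under the simplifying assumption that the total cost decomposes as a sum of per-macro-bin costs `bin_cost(j, i)` (micro bins \(j\) through \(i\) inclusive); it is an illustration of the optimal-merge idea, not the paper's exact algorithm or cost.

```python
def optimal_merge(T, bin_cost):
    """DP sketch: optimal grouping of T consecutive micro bins into macro
    bins, assuming the total cost is additive over macro bins. Returns the
    optimal cost and the macro-bin boundaries. Illustration only.
    """
    INF = float("inf")
    best = [0.0] + [INF] * T          # best[i]: optimal cost of the first i micro bins
    cut = [0] * (T + 1)               # cut[i]: start of the last macro bin
    for i in range(1, T + 1):
        for j in range(i):
            c = best[j] + bin_cost(j, i - 1)
            if c < best[i]:
                best[i], cut[i] = c, j
    # recover the macro-bin boundaries by walking the cut points backwards
    bounds, i = [], T
    while i > 0:
        bounds.append((cut[i], i - 1))
        i = cut[i]
    return best[T], bounds[::-1]
```

The double loop makes this \(O(T_i^2)\) evaluations of the per-bin cost, which is the typical price of exact optimality that the greedy heuristic avoids.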
1.4 Proof of Theorem 6
Proof
(Theorem 6) We assume that there are \((T_i - 1) \epsilon \) pairs of consecutive micro bins of \(X_i\) that have low interaction distance (\(0 \le \epsilon \le 1\)), i.e., \((T_i - 1) (1 - \epsilon )\) pairs have a large interaction distance. We have
This means \(\frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le \)
Let \( RHS \) be the right hand side of (22). Note that \(\lim \limits _{T_i \rightarrow \infty } RHS = 2 - \epsilon \). In other words, as \(T_i \rightarrow \infty \), \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le 2 - \epsilon \). \(\square \)
Rights and permissions
About this article
Cite this article
Nguyen, HV., Müller, E., Vreeken, J. et al. Unsupervised interaction-preserving discretization of multivariate data. Data Min Knowl Disc 28, 1366–1397 (2014). https://doi.org/10.1007/s10618-014-0350-5