
Unsupervised interaction-preserving discretization of multivariate data


An Erratum to this article was published on 27 May 2014

Abstract

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much of the information in the original continuous data as possible. A key open question, in general and for exploratory tasks in particular, is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) the merge reduces the MDL encoding cost. To assess the similarity, we propose \( ID \), a novel interaction distance that requires no distributional assumptions and can be computed in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast and well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.
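To make the merge procedure concrete, the Python sketch below illustrates the bottom-up merging idea in the spirit of the greedy heuristic. It is our own illustration, not the authors' implementation: `id_dist`, `mdl_cost`, and `id_threshold` are hypothetical stand-ins for the interaction distance \( ID \), the MDL encoding cost, and the similarity criterion, and `data` is assumed to be an (n, m) NumPy array.

```python
def greedy_merge(edges, data, dim, id_dist, mdl_cost, id_threshold):
    """Bottom-up merging of consecutive bins along dimension `dim`.
    A pair of neighboring bins is a merge candidate if the multivariate
    distributions of their regions are close under the interaction
    distance; a merge is accepted only if it lowers the MDL cost."""
    edges = list(edges)                      # bin boundaries on dimension dim
    while len(edges) > 2:
        best_i, best_cost = None, mdl_cost(edges)
        for i in range(1, len(edges) - 1):   # every inner boundary
            # rows falling into the two bins adjacent to boundary i
            left = data[(data[:, dim] >= edges[i - 1]) & (data[:, dim] < edges[i])]
            right = data[(data[:, dim] >= edges[i]) & (data[:, dim] < edges[i + 1])]
            if id_dist(left, right) > id_threshold:
                continue                     # distributions differ: keep the cut
            cost = mdl_cost(edges[:i] + edges[i + 1:])
            if cost < best_cost:
                best_i, best_cost = i, cost
        if best_i is None:
            break                            # no merge reduces the encoding cost
        del edges[best_i]                    # apply the cheapest admissible merge
    return edges
```

An actual implementation would operate on the paper's micro bins, using the closed form of \( ID \) from Theorem 3 and the encoding costs \(L(\mathbf {M}_i, dsc_i)\) analyzed in the appendix; the loop above only shows the control flow.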


Notes

  1. http://www.ipd.kit.edu/~nguyenh/ipd/.

  2. http://www.ipd.kit.edu/~nguyenh/ipd/.

  3. http://www.pamap.org/demo.html.

References

  • Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, pp 37–46

  • Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, pp 94–105

  • Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, pp 415–424

  • Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843

  • Allen JF, Ferguson G (1994) Actions and events in interval temporal logic. J Log Comput 4(5):531–579

  • Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087

  • Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512

  • Bay SD, Pazzani MJ (1999) Detecting change in categorical data: mining contrast sets. In: KDD, pp 302–306

  • Breiman L (2001) Random forests. Mach Learn 45(1):5–32

  • Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598

  • Breunig MM, Kriegel HP, Ng RT, Sander J (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, pp 93–104

  • Bu S, Lakshmanan LVS, Ng RT (2005) MDL summarization with holes. In: VLDB, pp 433–444

  • Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, pp 84–93

  • Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New York

  • Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30

  • Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, pp 1022–1029

  • Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, pp 253–264

  • Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226

  • Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge

  • Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, pp 463–474

  • Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86

  • Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, pp 556–562

  • Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, pp 123–128

  • Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. In: AISTATS, pp 219–226

  • Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, pp 766–777

  • Lee J, Verleysen M (2007) Nonlinear dimensionality reduction. Springer, New York

  • Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), pp 288–303

  • Liu R, Yang L (2008) Kernel estimation of multivariate cumulative distribution function. J Nonparametr Stat 20(8):661–677

  • Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44

  • Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185

  • Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, pp 533–541

  • Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, pp 173–184

  • Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, pp 198–206

  • Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742

  • Preuß P, Puchstein R, Dette H (2013) Detection of multiple structural breaks in multivariate time series. arXiv:1309.1309v1

  • Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27

  • Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524

  • Rissanen J (1978) Modeling by shortest data description. Automatica 14:465–471

  • Rissanen J (1983) A universal prior for integers and estimation by minimum description length. Ann Stat 11(2):416–431

  • Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. VI. Bayesian block representations. Astrophys J 764(2)

  • Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635

  • Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, London

  • Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, pp 588–597

  • Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863

  • Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290

  • Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214

  • Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the German demonstration program ‘Energy Optimized Building (EnOB)’. Energy Build 68:634–638

  • Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645

  • Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910


Acknowledgments

We thank the anonymous reviewers for their insightful comments. Hoang-Vu Nguyen is supported by the German Research Foundation (DFG) within GRK 1194. Emmanuel Müller is supported by the YIG program of KIT as part of the German Excellence Initiative. Jilles Vreeken is supported by the Cluster of Excellence “Multimodal Computing and Interaction” within the Excellence Initiative of the German Federal Government. Emmanuel Müller and Jilles Vreeken are supported by Post-Doctoral Fellowships of the Research Foundation—Flanders (FWO).

Author information

Corresponding author

Correspondence to Hoang-Vu Nguyen.

Additional information

Responsible editor: Toon Calders, Floriana Esposito, Eyke Hüllermeier, Rosa Meo.

Appendix

1.1 Proof of Theorem 2

Proof

(Theorem 2) Let \(H(\mathbf {A}) = P(\mathbf {A}) - R(\mathbf {A})\) and \(G(\mathbf {A}) = R(\mathbf {A}) - Q(\mathbf {A})\). The inequality becomes

$$\begin{aligned} \sqrt{\mathop \int \limits _{\varOmega } H^2(\mathbf {a}) d\mathbf {a}} + \sqrt{\mathop \int \limits _{\varOmega } G^2(\mathbf {a}) d\mathbf {a}} \ge \sqrt{\mathop \int \limits _{\varOmega } \left( H(\mathbf {a}) + G(\mathbf {a})\right) ^2 d\mathbf {a}} , \end{aligned}$$
(9)

which, after squaring both sides and cancelling the common terms \(\mathop \int \limits _{\varOmega } H^2(\mathbf {a}) d\mathbf {a}\) and \(\mathop \int \limits _{\varOmega } G^2(\mathbf {a}) d\mathbf {a}\), is equivalent to

$$\begin{aligned} \sqrt{\mathop \int \limits _{\varOmega } H^2(\mathbf {a}) d\mathbf {a}\cdot \mathop \int \limits _{\varOmega } G^2(\mathbf {a}) d\mathbf {a}} \ge \mathop \int \limits _{\varOmega } H(\mathbf {a}) G(\mathbf {a}) d\mathbf {a}, \end{aligned}$$
(10)

which is the Cauchy–Schwarz inequality (the case \(p = q = 2\) of Hölder’s inequality). \(\square \)
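As a quick numeric sanity check of the two inequalities above, the following Python snippet (ours, purely illustrative) approximates the integrals by means over a uniform grid; both assertions hold for any inputs.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=10_000)  # values of H on a uniform grid over the domain
G = rng.normal(size=10_000)  # values of G on the same grid

l2 = lambda f: np.sqrt(np.mean(f ** 2))  # grid approximation of the L2 norm

# Triangle (Minkowski) inequality, as in (9):
assert l2(H) + l2(G) >= l2(H + G)
# Cauchy-Schwarz inequality, as in (10):
assert np.sqrt(np.mean(H ** 2) * np.mean(G ** 2)) >= np.mean(H * G)
```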

1.2 Proof of Theorem 3

Proof

(Theorem 3) Let \( ind (\alpha )\) be an indicator function with value 1 if \(\alpha \) is true and 0 otherwise. It holds

$$\begin{aligned} P(\mathbf {a}) = \mathop \int \limits _{ min _1}^{ max _1} \ldots \mathop \int \limits _{ min _m}^{ max _m} ind (x_1 \le a_1) \cdots ind (x_m\le a_m) p(x_1, \ldots , x_m) dx_1 \cdots dx_m\quad \end{aligned}$$
(11)

Using empirical data, we hence have

$$\begin{aligned} P(\mathbf {a}) = \frac{1}{k} \sum _{j=1}^k \prod _{i=1}^{m} ind (R_j^i \le a_i) , \quad \mathrm{and } \quad Q(\mathbf {a}) = \frac{1}{l} \sum _{j=1}^l \prod _{i=1}^{m} ind (S_j^i \le a_i) , \end{aligned}$$

and therefore \([ ID (p(\mathbf {A})\; ||\; q(\mathbf {A}))]^2 \) equals

$$\begin{aligned} \mathop \int \limits _{ min _1}^{ max _1} \ldots \mathop \int \limits _{ min _m}^{ max _m} \left( \frac{1}{k} \sum _{j=1}^k \prod _{i=1}^{m} ind (R_j^i \le a_i) - \frac{1}{l} \sum _{j=1}^l \prod _{i=1}^{m} ind (S_j^i \le a_i)\right) ^2 da_1 \cdots da_m\end{aligned}$$
(12)

Expanding the squared term, using the identity \( ind (x \le a) \cdot ind (y \le a) = ind (\max (x, y) \le a)\), and bringing the integrals inside the sums, we have

$$\begin{aligned}&\frac{1}{k^2} \sum _{j_1=1}^k \sum _{j_2=1}^k \prod _{i=1}^{m} \mathop \int \limits _{ min _i}^{ max _i} ind \left( \max \left( R_{j_1}^i, R_{j_2}^i\right) \le a_i\right) da_i \nonumber \\&\qquad - \frac{2}{kl} \sum _{j_1=1}^k \sum _{j_2=1}^l \prod _{i=1}^{m} \mathop \int \limits _{ min _i}^{ max _i} ind \left( \max \left( R_{j_1}^i, S_{j_2}^i\right) \le a_i\right) da_i\\&\qquad + \frac{1}{l^2} \sum _{j_1=1}^{l} \sum _{j_2=1}^{l} \prod _{i=1}^{m} \mathop \int \limits _{ min _i}^{ max _i} ind \left( \max \left( S_{j_1}^i, S_{j_2}^i\right) \le a_i\right) da_i , \nonumber \end{aligned}$$
(13)

Since each inner integral evaluates to \( max _i - \max (\cdot , \cdot )\) for samples within the domain, we arrive at the final result. \(\square \)
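The closed form in (13) translates directly into code. Below is a minimal NumPy sketch of the empirical \( ID \) computation (our transcription of (13), with our own variable names); it assumes all samples lie within the domain bounds, so that each inner integral equals \( max _i - \max (\cdot , \cdot )\).

```python
import numpy as np

def id_squared(R, S, maxs):
    """Squared interaction distance via the closed form (13).
    R: (k, m) sample from p; S: (l, m) sample from q;
    maxs: (m,) upper domain bounds max_i."""
    def term(A, B):
        # pairwise per-dimension maxima, shape (|A|, |B|, m)
        pm = np.maximum(A[:, None, :], B[None, :, :])
        # each inner integral in (13) equals max_i - max(., .)
        return np.prod(maxs - pm, axis=2).sum() / (len(A) * len(B))
    return term(R, R) - 2 * term(R, S) + term(S, S)

rng = np.random.default_rng(0)
R = rng.uniform(size=(50, 3))  # 50 samples from p
S = rng.uniform(size=(60, 3))  # 60 samples from q
print(id_squared(R, S, maxs=np.ones(3)))  # near 0: both samples are uniform
```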

1.3 Proof of Theorem 5

Proof

(Theorem 5) Consider a discretization \(dsc_i\) on dimension \(X_i\) with \(k_i\) macro bins \(\{B_i^1, \ldots , B_i^{k_i}\}\). We have

$$\begin{aligned} L(\mathbf {M}_i, dsc_i)&\ge L_{ bid }(dsc_i(\mathbf {M}_i)) + L(\mathbf {M}_i \ominus dsc_i(\mathbf {M}_i)) \end{aligned}$$
(14)
$$\begin{aligned}&\ge \left( \sum _{j=1}^{k_i} L_{\mathbb {N}}\left( |B_i^j|\right) + \left( |B_i^j| + 1\right) \log \frac{T_i}{|B_i^j|}\right) + \sum _{j=1}^{k_i} |B_i^j| \log |B_i^j|\end{aligned}$$
(15)
$$\begin{aligned}&\ge (T_i + k_i) \log T_i - \sum _{j=1}^{k_i} \log |B_i^j|\end{aligned}$$
(16)
$$\begin{aligned}&\ge T_i \log T_i. \end{aligned}$$
(17)

Let \(dsc_i^{T_i}\) be the discretization that puts each micro bin into a separate macro bin. We have

$$\begin{aligned} L\left( \mathbf {M}_i, dsc_i^{T_i}\right) = L_{\mathbb {N}}(T_i) + T_i \log c_0 + 2T_i \log T_i. \end{aligned}$$
(18)

Let \(dsc_i^{ opt }\) and \(dsc_i^{ gr }\) be the discretizations yielded by IPD \(_{ opt }\) and IPD \(_{ gr }\), respectively.

Let \(dsc_i\) be a discretization that merges two micro bins with a low interaction distance into the same macro bin and places each of the other micro bins into a separate macro bin. It holds that

$$\begin{aligned} L(\mathbf {M}_i, dsc_i) = L_{\mathbb {N}}(T_i - 1) + \log (T_i - 1) + (T_i - 1) \log c_0 + 2 T_i \log T_i - \log T_i. \end{aligned}$$
(19)

Thus, \(L(\mathbf {M}_i, dsc_i) < L(\mathbf {M}_i, dsc_i^{T_i})\); that is, merging two consecutive micro bins with a low interaction distance as a first step already yields an encoding cost lower than that of \(dsc_i^{T_i}\), so IPD \(_{ gr }\) will proceed beyond this step. Hence, \(L(\mathbf {M}_i, dsc_i^{ gr }) \le L(\mathbf {M}_i, dsc_i)\). Combining this with the lower bound (17) on \(L(\mathbf {M}_i, dsc_i^{ opt })\), we obtain \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le \frac{L(\mathbf {M}_i, dsc_i)}{T_i \log T_i}\). This leads to

$$\begin{aligned} \frac{L\left( \mathbf {M}_i, dsc_i^{ gr }\right) }{L\left( \mathbf {M}_i, dsc_i^{ opt }\right) } \le \frac{L_{\mathbb {N}}(T_i {-} 1) + \log (T_i {-} 1) + (T_i {-} 1) \log c_0 + 2 T_i \log T_i {-} \log T_i}{T_i \log T_i}. \end{aligned}$$
(20)

Let \( RHS \) be the right hand side of (20). Since \(\lim \limits _{T_i \rightarrow \infty } \frac{L_{\mathbb {N}}(T_i - 1)}{T_i \log T_i} = 0\) (Grünwald 2007), it holds that \(\lim \limits _{T_i \rightarrow \infty } RHS = 2\). In other words, as \(T_i \rightarrow \infty \), \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le 2\). Therefore, IPD \(_{ gr }\) is asymptotically a \(2\)-approximation of IPD \(_{ opt }\). \(\square \)
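To illustrate the rate of this convergence, the snippet below (ours) evaluates the right hand side of (20) for growing \(T_i\). It uses the standard universal code length for integers, \(L_{\mathbb {N}}(n) = \log c + \log n + \log \log n + \cdots \) over the positive terms with \(c \approx 2.865064\) (Grünwald 2007), and treats \(c_0\), whose value is not given in this excerpt, as a free constant; any constant choice leaves the limit at 2.

```python
import math

def L_N(n):
    """Universal code length for integers (Grünwald 2007), in bits."""
    total, x = math.log2(2.865064), math.log2(n)
    while x > 0:  # sum the positive terms log n + log log n + ...
        total += x
        x = math.log2(x)
    return total

c0 = 2.0  # placeholder for the paper's encoding constant c_0
for T in (10, 10**3, 10**6, 10**9):
    rhs = (L_N(T - 1) + math.log2(T - 1) + (T - 1) * math.log2(c0)
           + 2 * T * math.log2(T) - math.log2(T)) / (T * math.log2(T))
    print(T, round(rhs, 3))  # decreases toward 2 as T grows
```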

1.4 Proof of Theorem 6

Proof

(Theorem 6) We assume that there are \((T_i - 1) \epsilon \) pairs of consecutive micro bins of \(X_i\) that have low interaction distance (\(0 \le \epsilon \le 1\)), i.e., \((T_i - 1) (1 - \epsilon )\) pairs have a large interaction distance. We have

$$\begin{aligned} L(\mathbf {M}_i, dsc_i^1)&= \log c_0 + L_{\mathbb {N}}(T_i) + L_{\mathbb {N}}\left( (T_i - 1)(1 - \epsilon )\right) \nonumber \\&+\, (T_i - 1)(1 - \epsilon ) \log (T_i - 1) + T_i \log T_i. \end{aligned}$$
(21)

This means that \(\frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })}\) is at most

$$\begin{aligned} \frac{\log c_0 + L_{\mathbb {N}}(T_i) + L_{\mathbb {N}}\left( (T_i - 1)(1 - \epsilon )\right) + (T_i - 1)(1 - \epsilon ) \log (T_i - 1) + T_i \log T_i}{T_i \log T_i}. \end{aligned}$$
(22)

Let \( RHS \) be the right hand side of (22). Note that \(\lim \limits _{T_i \rightarrow \infty } RHS = 2 - \epsilon \). In other words, as \(T_i \rightarrow \infty \), \(\displaystyle \frac{L(\mathbf {M}_i, dsc_i^{ gr })}{L(\mathbf {M}_i, dsc_i^{ opt })} \le 2 - \epsilon \). \(\square \)
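The same style of numeric check applies to (22); the snippet below (ours, with \(L_{\mathbb {N}}\) and the placeholder \(c_0\) as in the previous sketch, and an assumed fraction \(\epsilon \)) shows the ratio approaching \(2 - \epsilon \).

```python
import math

def L_N(n):
    """Universal code length for integers (Grünwald 2007), in bits."""
    total, x = math.log2(2.865064), math.log2(n)
    while x > 0:
        total += x
        x = math.log2(x)
    return total

c0, eps = 2.0, 0.5  # c0: placeholder constant; eps: assumed fraction, 0 < eps < 1
for T in (10**3, 10**6, 10**9):
    rhs = (math.log2(c0) + L_N(T) + L_N(round((T - 1) * (1 - eps)))
           + (T - 1) * (1 - eps) * math.log2(T - 1)
           + T * math.log2(T)) / (T * math.log2(T))
    print(T, round(rhs, 3))  # approaches 2 - eps = 1.5
```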


Cite this article

Nguyen, HV., Müller, E., Vreeken, J. et al. Unsupervised interaction-preserving discretization of multivariate data. Data Min Knowl Disc 28, 1366–1397 (2014). https://doi.org/10.1007/s10618-014-0350-5

