Abstract
Many data mining tasks rely on pattern mining. To identify the patterns of interest in a dataset, an analyst may define several measures that score, in different ways, the relevance of a pattern. Until recently, most algorithms have only handled constraints efficiently, i.e., every measure had to be associated with a user-defined threshold, which can be tricky to determine. Skypatterns were introduced to allow analysts to simply define the measures of interest and to obtain a set of globally optimal and semantically relevant patterns. Skypatterns are Pareto-optimal patterns: no other pattern scores better on one of the chosen measures and at least as well on every remaining measure. This article tackles the search for skypatterns in a more general context than the 0/1 (aka Boolean) matrix: the fuzzy tensor. The proposed solution supports a large class of measures. After explaining why and how their common mathematical property enables a safe pruning of the search space, an algorithm is presented. It builds upon multidupehack, a generalist pattern mining framework, which is now able to efficiently list skypatterns in addition to enforcing constraints on them. Experiments on two real-world fuzzy tensors illustrate the versatility of the proposal. Other experiments show it is typically more than one order of magnitude faster than the state-of-the-art algorithms, which can only mine 0/1 matrices.
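The Pareto-optimality criterion stated in the abstract can be illustrated with a naive quadratic filter. This is only a sketch of the definition, not the pruning-based algorithm of the article: the names `dominates`, `skyline` and `scores` are hypothetical, every measure is assumed to be maximized, and the score vectors are made up.

```python
def dominates(a, b):
    """True if vector a scores at least as well as b on every measure
    and strictly better on at least one (all measures are maximized)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(scored):
    """Names of the patterns whose score vector no other pattern dominates."""
    return {
        name
        for name, vector in scored.items()
        if not any(
            dominates(other_vector, vector)
            for other, other_vector in scored.items()
            if other != name
        )
    }


# Hypothetical patterns scored on two measures.
scores = {"p1": (5, 1), "p2": (3, 3), "p3": (2, 2), "p4": (1, 4), "p5": (3, 3)}
pareto_optimal = skyline(scores)
```

In this toy input, only `p3` is dominated (by `p2` and by `p5`). Patterns with identical score vectors, such as `p2` and `p5`, do not dominate each other and are both kept.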
Notes
ET-n-set stands for Error-Tolerant n-set.
Acknowledgements
We would like to thank Willy Ugarte, Bruno Crémilleux, Chedy Raïssi and Benjamin Négrevergne for providing the source codes of their algorithms and for their valuable comments.
Additional information
Responsible editor: Po-ling Loh, Evimaria Terzi, Antti Ukkonen, Karsten Borgwardt and Katharina Heinrich.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The work has been partially funded by the FAPEMIG under Grant No. APQ-04224-16 (Multilateral Cooperation FAPEMIG-CNRS) and by the ERC Starting Grant No. 679515.
A Piecewise (Anti-)Monotonicity of the Slope Measure
To simplify the proof that the slope is piecewise (anti-)monotone, all the outputs of the x and y data-access functions, i.e., the abscissas and the ordinates of the points, are assumed to be non-negative. If that is not the case, \(\min _{t \in \prod _{i \in I} X_i} x(t)\) is subtracted from every abscissa and \(\min _{t \in \prod _{i \in I} X_i} y(t)\) is subtracted from every ordinate, which moves all the points to the non-negative quadrant of the Cartesian coordinate system. The slope of the fitting line being invariant under translation, \(x \ge 0\) and \(y \ge 0\) are assumed without loss of generality.
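The invariance under translation can be checked on a small example. The sketch below assumes that the fitting line is the ordinary least-squares line; the function names and the four points are hypothetical, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction


def ols_slope(points):
    """Slope of the least-squares line through (x, y) points, as an exact fraction."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    sum_xx = sum(x * x for x, _ in points)
    return Fraction(n * sum_xy - sum_x * sum_y, n * sum_xx - sum_x * sum_x)


def shift_to_non_negative_quadrant(points):
    """Subtract the minimal abscissa and the minimal ordinate from every point."""
    min_x = min(x for x, _ in points)
    min_y = min(y for _, y in points)
    return [(x - min_x, y - min_y) for x, y in points]


# Hypothetical points, some with negative coordinates.
points = [(-3, 4), (-1, -2), (2, 1), (5, -6)]
shifted = shift_to_non_negative_quadrant(points)
slope_before = ols_slope(points)
slope_after = ols_slope(shifted)
```

Both calls return the same fraction, since subtracting a constant from every abscissa and another from every ordinate changes neither the covariance of x and y nor the variance of x.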
A rewriting \(m'_{\text {slope}}\) of the slope \(m_{\text {slope}}\) maps \(({L}, {U}) \in \left( \prod _{i = 1}^n 2^{D_i}\right) ^2\) to:
- case 1. if \(\text {denom}({U}, {L}) > 0\) then
  - (a) \(\displaystyle \frac{\text {num}({L}, {U})}{\text {denom}({U}, {L})}\) if \(\text {num}({L}, {U}) > 0\)
  - (b) \(\displaystyle \frac{\text {num}({L}, {U})}{\text {denom}({L}, {U})}\) otherwise
- case 2. if \(\text {denom}({L}, {U}) < 0\) then
  - (a) \(\displaystyle \frac{\text {num}({U}, {L})}{\text {denom}({L}, {U})}\) if \(\text {num}({U}, {L}) < 0\)
  - (b) \(\displaystyle \frac{\text {num}({U}, {L})}{\text {denom}({U}, {L})}\) otherwise
- case 3. otherwise \(+\infty \)
where, \(\forall (X^1, X^2) = (X_1^1, \dots , X_n^1, X_1^2, \dots , X_n^2) \in \left( \prod _{i = 1}^n 2^{D_i}\right) ^2\):
- \(\text {num}(X^1, X^2) = \displaystyle \sum _{t \in \prod _{i \in I} X_i^2} x(t) \sum _{t \in \prod _{i \in I} X_i^2} y(t) - \left| \prod _{i \in I} X_i^1\right| \sum _{t \in \prod _{i \in I} X_i^1} x(t)y(t)\);
- \(\text {denom}(X^1, X^2) = \displaystyle \left( \sum _{t \in \prod _{i \in I} X_i^2} x(t)\right) ^2 - \left| \prod _{i \in I} X_i^1\right| \sum _{t \in \prod _{i \in I} X_i^1} x(t)^2\).
The equality \(m'_{\text {slope}}(X, X) = m_{\text {slope}}(X)\), for any pattern \(X \in \prod _{i = 1}^n 2^{D_i}\), derives from the equality \(\frac{\text {num}(X, X)}{\text {denom}(X, X)} = m_{\text {slope}}(X)\), for cases 1 and 2 in the definition of \(m'_{\text {slope}}\), and from the nullity of denom(X, X) in case 3.
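For reference, and assuming that \(m_{\text {slope}}\) is the slope of the ordinary least-squares fitting line, the ratio of num(X, X) to denom(X, X) is the textbook formula with both its numerator and its denominator negated. The shorthand N and the sums over t, which all range over \(\prod _{i \in I} X_i\), are introduced here only for illustration:

```latex
\[
\frac{\text{num}(X, X)}{\text{denom}(X, X)}
= \frac{\sum_t x(t) \sum_t y(t) - N \sum_t x(t) y(t)}
       {\left(\sum_t x(t)\right)^2 - N \sum_t x(t)^2}
= \frac{N \sum_t x(t) y(t) - \sum_t x(t) \sum_t y(t)}
       {N \sum_t x(t)^2 - \left(\sum_t x(t)\right)^2},
\qquad N = \left|\prod_{i \in I} X_i\right|
\]
```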
The rewriting \(m'_{\text {slope}}\) actually proves that \(m_{\text {slope}}\) is piecewise (anti-)monotone. To show it, following Definition 8, let us take \(U \in \prod _{i = 1}^n 2^{D_i}\), \(X \in \prod _{i = 1}^n 2^{U_i}\) and \(L \in \prod _{i = 1}^n 2^{X_i}\). L being a sub-pattern of X, \(\forall i \in I\), \(L_i \subseteq X_i\). That implies \(\prod _{i \in I} L_i \subseteq \prod _{i \in I} X_i\), which in turn implies both \(\left| \prod _{i \in I} L_i\right| \le \left| \prod _{i \in I} X_i\right| \) and \(\sum _{t \in \prod _{i \in I} L_i} x(t)^2 \le \sum _{t \in \prod _{i \in I} X_i} x(t)^2\). As a consequence, the (non-negative) quantity subtracted in the expression of denom is no greater if L, rather than X, is input as the first argument. U being a super-pattern of X, the first sum, in the expression of denom, involves at least as many terms when U, rather than X, is input as the second argument. Because \(x \ge 0\), that sum is at least as large and so is its square. Combining the results on both parts of the expression of denom, \(\text {denom}(X, X) \le \text {denom}(L, U)\) holds. It entails \(\text {denom}(X, X) > 0 \Rightarrow \text {denom}(L, U) > 0\), i.e., if (X, X) triggers case 1 of \(m'_{\text {slope}}\) then (L, U) cannot trigger case 2.
The same steps as in the previous paragraph, but with X or its super-pattern U as the first input of denom and X or its sub-pattern L as the second input, prove \(\text {denom}(U, L) \le \text {denom}(X, X)\). That inequality entails \(\text {denom}(X, X) < 0 \Rightarrow \text {denom}(U, L) < 0\), i.e., if (X, X) triggers case 2 of \(m'_{\text {slope}}\) then (L, U) cannot trigger case 1. Also, \(\text {denom}(X, X) = 0\) implies both \(\text {denom}(U, L) \le 0\) and \(\text {denom}(L, U) \ge 0\), i.e., if (X, X) triggers case 3 then (L, U) triggers neither case 1 nor case 2. Given all the impossibilities proven so far, if (X, X) triggers case \(k \in \{1, 2, 3\}\) in the definition of \(m'_{\text {slope}}\) then (L, U) triggers either case k or case 3.
If (L, U) triggers case 3, \(m_{\text {slope}}(X) = m'_{\text {slope}}(X, X) \le m'_{\text {slope}}(L, U) = +\infty \). It remains to prove \(m_{\text {slope}}(X) \le m'_{\text {slope}}(L, U)\) when (X, X) and (L, U) both trigger case 1 or when they both trigger case 2. An analysis of the expression of num, which is analogous to the earlier analysis of denom and uses both \(x \ge 0\) and \(y \ge 0\), proves \(\text {num}(U, L) \le \text {num}(X, X) \le \text {num}(L, U)\) and, in turn, the impossibility for (L, U) to trigger a sub-case (b) if (X, X) triggers the related sub-case (a). If, on the contrary, (X, X) triggers a sub-case (b) and (L, U) triggers the related sub-case (a) then \(m_{\text {slope}}(X) = m'_{\text {slope}}(X, X) \le m'_{\text {slope}}(L, U)\). Indeed, given the tests in \(m'_{\text {slope}}\) and the inequalities \(\text {denom}(U, L) \le \text {denom}(X, X) \le \text {denom}(L, U)\) that were proven above, the sub-cases (a) always provide positive outputs, whereas the sub-cases (b) always provide non-positive (hence smaller) outputs.
Finally, when (X, X) and (L, U) trigger, in the definition of \(m'_{\text {slope}}\), not only the same case but also the same sub-case, \(m_{\text {slope}}(X) \le m'_{\text {slope}}(L, U)\) still holds. Indeed, the inequality \(\text {num}(U, L) \le \text {num}(X, X) \le \text {num}(L, U)\) and the inequality \(\text {denom}(U, L) \le \text {denom}(X, X) \le \text {denom}(L, U)\) together entail:
- \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(L, U)}{\text {denom}(U, L)}\) if the two numerators and the two denominators are positive, i.e., in case 1a;
- \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(L, U)}{\text {denom}(L, U)}\) if the two numerators are non-positive and the two denominators are positive, i.e., in case 1b;
- \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(U, L)}{\text {denom}(L, U)}\) if the two numerators and the two denominators are negative, i.e., in case 2a;
- \(m_{\text {slope}}(X) = \frac{\text {num}(X, X)}{\text {denom}(X, X)} \le \frac{\text {num}(U, L)}{\text {denom}(U, L)}\) if the two numerators are non-negative and the two denominators are negative, i.e., in case 2b.
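The bound \(m_{\text {slope}}(X) \le m'_{\text {slope}}(L, U)\) can be checked exhaustively on a toy instance. The sketch below is a sanity check, not part of the proof: it assumes a single dimension in I, so that a pattern is simply a set of elements, and the dictionaries `X_VAL` and `Y_VAL` hold made-up non-negative values of x(t) and y(t). All Python names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
import math

# Hypothetical non-negative data: x(t) and y(t) for four elements.
X_VAL = {"a": 0, "b": 1, "c": 2, "d": 4}
Y_VAL = {"a": 3, "b": 0, "c": 5, "d": 1}


def num(p1, p2):
    """num(X^1, X^2): sums over p2 minus |p1| times the sum of x*y over p1."""
    sum_x = sum(X_VAL[t] for t in p2)
    sum_y = sum(Y_VAL[t] for t in p2)
    return sum_x * sum_y - len(p1) * sum(X_VAL[t] * Y_VAL[t] for t in p1)


def denom(p1, p2):
    """denom(X^1, X^2): squared sum of x over p2 minus |p1| times the sum of x^2 over p1."""
    return sum(X_VAL[t] for t in p2) ** 2 - len(p1) * sum(X_VAL[t] ** 2 for t in p1)


def m_prime(low, up):
    """The rewriting m'_slope, assuming low is a subset of up."""
    if denom(up, low) > 0:  # case 1
        if num(low, up) > 0:
            return Fraction(num(low, up), denom(up, low))  # sub-case (a)
        return Fraction(num(low, up), denom(low, up))  # sub-case (b)
    if denom(low, up) < 0:  # case 2
        if num(up, low) < 0:
            return Fraction(num(up, low), denom(low, up))  # sub-case (a)
        return Fraction(num(up, low), denom(up, low))  # sub-case (b)
    return math.inf  # case 3


def subsets(elements):
    """All subsets of a collection, as frozensets."""
    items = sorted(elements)
    return [frozenset(c) for r in range(len(items) + 1) for c in combinations(items, r)]


def bound_holds_everywhere(domain):
    """Check m'(X, X) <= m'(L, U) for every chain L subset of X subset of U."""
    return all(
        m_prime(x, x) <= m_prime(low, up)
        for up in subsets(domain)
        for x in subsets(up)
        for low in subsets(x)
    )


all_bounds_hold = bound_holds_everywhere(X_VAL)
```

The slope of a pattern X is obtained as `m_prime(X, X)`. Exact fractions keep the comparisons free of rounding errors, and `math.inf` stands for the \(+\infty \) of case 3.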
Cite this article
Nadisic, N., Coussat, A. & Cerf, L. Mining skypatterns in fuzzy tensors. Data Min Knowl Disc 33, 1298–1322 (2019). https://doi.org/10.1007/s10618-019-00640-4
DOI: https://doi.org/10.1007/s10618-019-00640-4