Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

Saccá, Domenico; Serra, Edoardo; Rullo, Antonino

doi:10.1007/s10618-019-00643-1

Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

Published: 20 July 2019

Volume 33, pages 1736–1774, (2019)
Cite this article

Data Mining and Knowledge Discovery Aims and scope Submit manuscript

417 Accesses
4 Citations
Explore all metrics

Abstract

The development of novel platforms and techniques for emerging “Big Data” applications requires the availability of real-life datasets for data-driven experiments, which are however not accessible in most cases for various reasons, e.g., confidentiality, privacy or simply insufficient availability. An interesting solution to ensure high quality experimental findings is to synthesize datasets that reflect patterns of real ones using a two-step approach: first a real dataset X is analyzed to derive relevant patterns Z (latent variables) and, then, such patterns are used to reconstruct a new dataset $X'$ that is like X but not exactly the same. The approach can be implemented using inverse mining techniques such as inverse frequent itemset mining ($\texttt {IFM}$), which consists of generating a transactional dataset satisfying given support constraints on the itemsets of an input set, that are typically the frequent ones. This paper introduces various extensions of $\texttt {IFM}$ within a uniform framework with the aim to generate artificial datasets that reflect more elaborated patterns (in particular infrequency and duplicate constraints) of real ones. Furthermore, in order to further enlarge the application domain of $\texttt {IFM}$, an additional extension is introduced that considers more structured schemes for the datasets to be generated, as required in emerging big data applications, e.g., social network analytics.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Trends and Future Perspective Challenges in Big Data

Big Data Analytics: A Literature Review Paper

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

Article 12 April 2024

Rashmin Gajera, Suresh Patel, … Ayush Solanki

References

Aggarwal CC, Yu PS (2008) A general survey of privacy-preserving data mining models and algorithms. In: Aggarwal CC, Yu PS (eds) Privacy-preserving data mining—models and algorithms, volume 34 of advances in database systems. Springer, Berlin, pp 11–52
Chapter Google Scholar
Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD international conference on management of data, SIGMOD ’93, New York, NY, USA. ACM, pp 207–216
Agrawal R, Srikant R (2000) Privacy-preserving data mining. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, SIGMOD ’00, New York, NY, USA. ACM, pp 439–450
Beheshti AK, Hejazi SR (2015) A novel hybrid column generation-metaheuristic approach for the vehicle routing problem with general soft time window. Inf Sci 316:598–615
Article Google Scholar
Bengio Y (2009) Learning deep architectures for AI. Found Trends Mach Learn 2(1):1–127
Article MATH Google Scholar
Bertsimas D, Tsitsiklis JN (1997) Introduction to linear optimization. Athena Scientific, Belmont
Google Scholar
Bykowski A, Rigotti C (2001) A condensed representation to find frequent patterns. In: Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’01, New York, NY, USA. ACM, pp 267–273
Cagliero L, Garza P (2013) Itemset generalization with cardinality-based constraints. Inf Sci 244:161–174
Article MathSciNet MATH Google Scholar
Calders T (2004) Computational complexity of itemset frequency satisfiability. In: Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’04, New York, NY, USA. ACM, pp 143–154
Calders T (2007) The complexity of satisfying constraints on databases of transactions. Acta Inf 44(7–8):591–624
Article MathSciNet MATH Google Scholar
Chen CP, Zhang C-Y (2014) Data-intensive applications, challenges, techniques and technologies: a survey on big data. Inf Sci 275:314–347
Article Google Scholar
Evfimievski A, Gehrke J, Srikant R (2003) Limiting privacy breaches in privacy preserving data mining. In: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’03, New York, NY, USA. ACM, pp 211–222
Gilmore PC, Gomory RE (1961) A linear programming approach to the cutting-stock problem. Oper Res 9(6):849–859
Article MathSciNet MATH Google Scholar
Gunopulos D, Khardon R, Mannila H, Toivonen H (1997) Data mining, hypergraph transversals, and machine learning. In: Mendelzon AO, Özsoyoglu ZM (eds) Proceedings of the 16-th ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’97, ACM Press, pp 209–216
Guns T, Nijssen S, Raedt LD (2011) Itemset mining: a constraint programming perspective. Artif Intell 175(12):1951–1983
Article MathSciNet MATH Google Scholar
Guzzo A, Moccia L, Saccà D, Serra E (2013) Solving inverse frequent itemset mining with infrequency constraints via large-scale linear programs. ACM Trans Knowl Discov Data 7(4):18:1–18:39
Article Google Scholar
Guzzo A, Saccà D, Serra E (2009) An effective approach to inverse frequent set mining. In: Proceedings of the 2009 ninth IEEE international conference on data mining, ICDM ’09, Washington, DC, USA. IEEE Computer Society, pp 806–811
Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Discov 15(1):55–86
Article MathSciNet Google Scholar
Han J, Kamber M (2005) Data mining: concepts and techniques. Kaufmann, San Francisco
MATH Google Scholar
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507
Article MathSciNet MATH Google Scholar
Hu T, Sung SY, Xiong H, Fu Q (2008) Discovery of maximum length frequent itemsets. Inf Sci 178(1):69–87
Article MathSciNet Google Scholar
Jindal R, Malaya DB (2016) A novel approach for mining frequent patterns from incremental data. IJDMMM 8(3):244–264
Article Google Scholar
KDDCUP2000 (2000). https://www.kdd.org/kdd-cup/view/kdd-cup-2000. Accessed 4 May 2018
Liu L, Kantarcioglu M, Thuraisingham B (2008) The applicability of the perturbation based privacy preserving data mining for real-world data. Data Knowl Eng 65(1):5–21
Article Google Scholar
Luenberger DG (2003) Linear and nonlinear programming, 2nd edn. Springer, Berlin
MATH Google Scholar
Mendes R, Vilela JP (2017) Privacy-preserving data mining: methods, metrics, and applications. IEEE Access 5:10562–10582
Article Google Scholar
Michael K, Miller KW (2013) Big data: new opportunities and new challenges [guest editors’ introduction]. Computer 46(6):22–24
Article Google Scholar
Mielikainen T (2003) On inverse frequent set mining. In: Proceedings of 2nd workshop on privacy preserving data mining, PPDM ’03, Washington, DC, USA. IEEE Computer Society, pp 18–23
ms-IFM code (2018). Datasets and codes used by paper’s experiments for ms-IFM ans stored in GitHub repository. https://github.com/ninorullo/NoSQL-IFM. Accessed 18 Dec 2018
ms-IFM dataset (2017). Yelp challenge. https://www.yelp.com/dataset. Accessed 18 Dec 2018
Narayanan A, Shmatikov V(2009) De-anonymizing social networks. In: Proceedings—-IEEE symposium on security and privacy 2009 30th IEEE symposium on security and privacy, pp 173–187
Oliveira S RM, Zaïane OR (2003) Protecting sensitive knowledge by data sanitization. In: Proceedings of the third IEEE international conference on data mining, ICDM ’03, Washington, DC, USA. IEEE Computer Society, pp 613–616
Papadimitriou CH (1994) Computational complexity. Addison-Wesley, Boston
MATH Google Scholar
Pasquier N, Bastide Y, Taouil R, Lakhal L (1999) Discovering frequent closed itemsets for association rules. In: Proceedings of the 7th international conference on database theory, ICDT ’99, London, UK. Springer-Verlag, pp 398–416
Patki N, Wedge R, Veeramachaneni K (2016) The synthetic data vault. In: 2016 IEEE international conference on data science and advanced analytics, DSAA 2016, Montreal, QC, Canada, October 17–19, 2016, IEEE, pp 399–410
Ramesh G, Maniatty W, Zaki MJ (2003) Feasible itemset distributions in data mining: theory and application. In Neven F, Beeri C, Milo T (eds) PODS, ACM, pp 284–295
Ramesh G, Zaki MJ, Maniatty W (2005) Distribution-based synthetic database generation techniques for itemset mining. In: IDEAS, IEEE Computer Society, pp 307–316
Saccà D, Serra E (2013) Number of minimal hypergraph transversals and complexity of IFM with infrequency: high in theory, but often not so much in practice!. Online Preliminary Paper from http://sacca.deis.unical.it/#view=object&format=object&id=1490/gid=160. Accessed 4 May 2018
Shah A, Gulati R (2016) Article: Privacy preserving data mining: techniques, classification and implications—a survey. International Journal of Computer Applications, 137(12):40–46. Published by Foundation of Computer Science (FCS), NY, USA
Stavropoulos EC, Verykios VS, Kagklis V (2016) A transversal hypergraph approach for the frequent itemset hiding problem. Knowl Inf Syst 47(3):625–645
Article Google Scholar
Sweeney L (2002) K-anonymity: a model for protecting privacy. Int J Uncertain Fuzziness Knowl-Based Syst 10(5):557–570
Article MathSciNet MATH Google Scholar
Weikum G (2013) Where’s the data in the big data wave? ACM Sigmod Blog http://wp.sigmod.org/?p=786. Accessed 4 May 2018
Wu H, Ning Y, Chakraborty P, Vreeken J, Tatti N, Ramakrishnan N (2018) Generating realistic synthetic population datasets. ACM Trans Knowl Discov Data 12(4):45:1–45:22
Article Google Scholar
Wu X, Wu Y, Wang Y, Li Y (2005) Privacy aware market basket data set generation: A feasible approach for inverse frequent set mining. In: Proceedings of SIAM international conference on data mining, SDM’ 05, Philadelphia, PA, USA. SIAM, pp 103–114
Zheng Z, Kohavi R, Mason L (2001) Real world performance of association rule algorithms. In: Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, KDD ’01, New York, NY, USA. ACM, pp 401–406
Zhong S (2007) Privacy-preserving algorithms for distributed mining of frequent itemsets. Inf Sci 177(2):490–503
Article MATH Google Scholar
Zhou B, Pei J, Luk W (2008) A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explor Newsl 10(2):12–22
Article Google Scholar

Download references

Funding

The funding was supported by MISE, Italian Ministry for Industry (Grant No. PON ID Service and Protect ID).

Author information

Authors and Affiliations

DIMES Department, University of Calabria, Rende, Italy
Domenico Saccá & Antonino Rullo
CS Department, Boise State University, Boise, ID, USA
Edoardo Serra

Authors

Domenico Saccá
View author publications
You can also search for this author in PubMed Google Scholar
Edoardo Serra
View author publications
You can also search for this author in PubMed Google Scholar
Antonino Rullo
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Domenico Saccá.

Additional information

Responsible editor: Toon Calders.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix: Proofs

Theorem 1

(1) Decision ${\texttt {IFM}} _{\texttt {G} }$ and ${\texttt {IFM}} _{\texttt {I} }$ are ${\texttt {NEXP}} $-complete; (2) decision ${\texttt {IFM}} _{\texttt {D} }$ is in ${\texttt {PSPACE}} $ and ${\texttt {PP}} $-hard; (3) decision ${\texttt {IFM}} $ is in ${\texttt {PSPACE}} $ and ${\texttt {NP}} $-hard; (4) decision ${\texttt {IFM}} _{\texttt {S} }$ is ${\texttt {NP}} $-complete.

Proof

The complexity results of $\texttt {IFM}_\texttt {I}$ and $\texttt {IFM}$ have been proved respectively in Guzzo et al. (2013); Saccà and Serra (2013) and in Mielikainen (2003).

Let us now prove the ${\texttt {PSPACE}}$-membership of $\texttt {IFM}_\texttt {D}$. A polynomial space nondeterministic algorithm can set at the beginning $size=0$ and $\forall I \in S$, $\sigma _I=0$. Then, for each possible transaction $I \in {\mathcal {U}}_{\mathcal {I}}$, it guesses a duplicate value $\delta _I$ such that $\delta _I \le {\texttt {size}}$ if there exists $J\in S$ such that $J \subseteq I$ or $\delta _J \le \delta '$ otherwise (i.e., $J \in S'$). After each guessing, the algorithm updates ${\texttt {size}}$ and $\sigma _J$ for each $J \in S: \, J \subseteq I$. Once all guesses are completed, it remains to verify whether $size = {\texttt {size}}$ and for each $I \in S: \, I \subseteq J$, $\sigma _{\min }^I \le \sigma _I \le \sigma _{\max }^I$. It follows that the problem is in ${\texttt {PSPACE}}$.

The ${\texttt {PP}}$-hardness of $\texttt {IFM}_\texttt {D}$ can be easily proved by reduction from ${\texttt {FREQSAT}}$$\small {\{\texttt {NDUP}, \texttt {NTRANS}\}}$ that has been shown to be ${\texttt {PP}}$-hard in Calders (2007) (Theorem 13). The ${\texttt {NEXP}}$-hardness of $\texttt {IFM}_\texttt {G}$ immediately derives from the ${\texttt {NEXP}}$-hardness of $\texttt {IFM}_\texttt {I}$, which is a sub-problem of $\texttt {IFM}_\texttt {G}$.

Membership of $\texttt {IFM}_\texttt {G}$ to ${\texttt {NEXP}}$ immediately follows from the fact that, given any instance x of $\texttt {IFM}_\texttt {G}$, x is a yes-instance if and only if both $x_I$ (obtained from x by removing the constraint 3) is a yes-instance of $\texttt {IFM}_\texttt {I}$ and $x_D$ (obtained from x by removing the constraint 2) is a yes-instance of $\texttt {IFM}_\texttt {D}$.

We conclude by proving the ${\texttt {NP}}$-completeness of $\texttt {IFM}_S$ – this result was first presented without a proof in Guzzo et al. (2009). As for the membership to ${\texttt {NP}}$, we observe that the size of the output database ${\mathcal {D}}$ is certainly bounded by |S| and by the largest support required in the specification. Hence, in polynomial time a non-deterministic Turing machine may first guess such database (basically, the support for each itemset I in S) and then verify whether it satisfies the constraints $\sigma _{min}^i$ and $\sigma _{max}^i$, by simply computing the support $\sigma ^{{\mathcal {D}}}(I_i)$ on ${\mathcal {D}}$.

To prove that $\texttt {IFM}_{S}$ is ${\texttt {NP}}$-hard, we exhibit a reduction from the graph 3-colorability problem of deciding whether, given a graph $G=(V,E)$, there is a 3-coloring $c:V\rightarrow \{r,g,b\}$ such that $c(i)\ne c(j)$ for each pair of edges $(i,j)\in E$.

Based on the input graph $G=(V,E)$, we construct an instance of the $\texttt {IFM}_{S}({\mathcal {I}},\varSigma _S)$ problem such that: the set ${\mathcal {I}}$ of items is $\{r,g,b,l_1,l_2,l_3\}\cup \{v_{x} | x \in V \}\cup \{e_{z,y}| (z,y) \in E\}$, where conceptually the item r, g, b are the colors in G, $l_1, l_2, l_3$ are labels implementing an encoding of the three colors, $v_{x}$ is an item for each node in G and $e_{z,y}$ is an item for each edge in G. The encoding of colors by label is such that any two colors share exactly one label; in the proof we shall use the encoding $r=\{l_1, l_2\}$, $g=\{l_1, l_3\}$ and $b=\{l_2, l_3\}$.

The set $\varSigma _S$ contains two groups of constraints:

Group (I) these constraints are repeated for each node $x \in V$ and enforces that x must be colored with exactly one color. There are 7 itemsets associated to x organized on 3 levels:

3 itemsets at the highest level 2: there is an itemset for each possible color c for x, containing the items corresponding to the node x, to all the arcs leaving x, to the color c and to the encoding of the color — the support for such itemsets can be either 0 or 1;
3 itemsets at level 1: there is an itemset for each of the 3 encoding labels, containing the item corresponding to the label and the item corresponding to the node x – the support must be exactly 1;
1 itemset at level 0 containg the item corresponding to the node x – its support must be exactly 2;

We explicit the constraints below:

$(I_{x,r},0,1)$, $(I_{x,g},0,1)$, $(b_{x,r},0,1)$, where $I_{x,r}=$$\{v_x, r,l_1, l_2\} \cup I$,

$I_{x,g}=$$\{v_x, g,l_1, l_3\}$$\cup I$, $I_{x,b}=$$\{v_x, b,l_2, l_3\} \cup I$, and $I = \{e_{x,y}|(x,y)\in E\}$;
$(\{l_1,v_{x}\},1,1)$, $(\{l_2,v_{x}\},1,1)$ and $(\{l_3,v_{x}\},1,1)$;
$(\{v_{x}\},2,2)$.

Because of the support constraints for the itemsets at level 1, $\{l_1,v_{x}\}$ cannot occur as transaction in D as it inherits support from those itemsets; in addition, as the support of $\{l_1,v_{x}\}$ is 2 and there are 3 itemsets at level 1 with obligatory support 1, exactly one of the itemsets at level 1 must occur as transaction, whereas the other two inherit support from a same itemset at level 2. It turns out that exactly one itemset at level 2 can occur as transaction whereas all others must have support 0 - the itemset selected as transaction will then fix the unique color for the node x.

Group (II) these constraints are repeated for each edge $(x,y) \in E$ and enforce that two end nodes of the edge have different color. There are 3 itemsets, one for each possible color; the constraints are: $(\{r,e_{z,y},0,1)\}$, $(\{g,e_{z,y},0,1)\}$, $(\{b,e_{z,y},0,1)\}$. These itemsets inherit support from two itemsets at level 2: one for x and the other for y. The constraints of Group (II) enforces that the two itemsets at level 2 cannot be of the same color, thus any two adjacent nodes cannot share the same color. It follows that the existence of a solution to $\texttt {IFM}_{S}$ witnesses the fact that the graph G admits a 3-coloring; on the other hand, if $\texttt {IFM}_{S}$ has no solution then the graph cannot be 3-colored. Hence, $\texttt {IFM}_{S}$ is ${\texttt {NP}}$-hard in general. $\square $

Proposition 1

Decision k-${\texttt {IFM}} _{\texttt {G} }$ is in ${\texttt {PSPACE}} $ and ${\texttt {PP}} $-hard and decision k-${\texttt {IFM}} _{\texttt {I} }$ is in ${\texttt {PSPACE}} $ and ${\texttt {NP}} $-hard.

Proof

The complexity results for k-$\texttt {IFM}_\texttt {I}$ have been proved in (Guzzo et al. 2013; Saccà and Serra 2013). The ${\texttt {PP}}$-hardness of k-$\texttt {IFM}_\texttt {G}$ immediately derives from the ${\texttt {PP}}$-hardness of $\texttt {IFM}_\texttt {D}$, which is a sub-problem of k-$\texttt {IFM}_\texttt {G}$. Finally ${\texttt {PSPACE}}$ membership of k-$\texttt {IFM}_\texttt {G}$ follows from the fact that, given any instance x of k-$\texttt {IFM}_\texttt {G}$, x is a yes-instance if and only if both $x_I$ is a yes-instance of k-$\texttt {IFM}_\texttt {I}$ and $x_D$ is a yes-instance of $\texttt {IFM}_\texttt {D}$.

Proposition 2

(1) Relaxed k-${\texttt {IFM}} _{\texttt {I} }$ and relaxed ${\texttt {IFM}} $ are ${\texttt {NP}} $-complete; (2) relaxed k-${\texttt {IFM}} _{\texttt {G} }$ and relaxed ${\texttt {IFM}} _{\texttt {D} }$ are in ${\texttt {PSPACE}} $ and ${\texttt {PP}} $-hard; (3) relaxed ${\texttt {IFM}} _{{S} }$ is in ${\texttt {P}}$.

Proof

The ${\texttt {NP}}$-completeness of relaxed k-$\texttt {IFM}_\texttt {I}$ has been proved in Guzzo et al. (2013); Saccà and Serra (2013). The ${\texttt {NP}}$-membership of $\texttt {IFM}$ derives from the fact that $\texttt {IFM}$ is a sub-problem of k-$\texttt {IFM}_\texttt {I}$; the ${\texttt {NP}}$-hardness can be easily proved by reduction from ${\texttt {FREQSAT}}$ that has been shown to be ${\texttt {NP}}$-complete in Calders (2004, 2007). The ${\texttt {PSPACE}}$-memberships of relaxed k-$\texttt {IFM}_\texttt {G}$ and relaxed $\texttt {IFM}_\texttt {D}$ are obvious. The ${\texttt {PP}}$-hardness can be easily proved by reduction from FREQSAT$ \small {\{\texttt {NDUP}\}}$ that has been shown to be ${\texttt {PP}}$-hard in Calders (2007). It follows that k-$\texttt {IFM}_\texttt {G}$ is ${\texttt {PP}}$-hard as well. Finally, to prove that relaxed $\texttt {IFM}_S$ is in ${\texttt {P}}$, we encode the problem as a system of linear equations over |S| rational variables $\delta ^I$, $\forall I \in S$, and $2\, |S|$ inequalities implementing the minimal and maximum support constraint for each itemset in S. The result then follows, since it is well-known that deciding whether a system of linear equations admits a solution is feasible in polynomial time.

Proposition 3

Each column $j \in U$ is an infeasible solution for the ILP price formulation.

Proof

By constraints (14), a generic column $j\in U$ represented by the itemset $I_j$ can be uniquely identified by the itemsets in $S\cup B_{S'}$ that are included in $I_j$. Then, in order to prove that the itemset $I^*$ returned by PLI formulation is different from each other itemset representing a column j in U, the following condition must hold: either the itemset in $S\cup B_{S'}$ contained in $I^*$ is a subset of some itemset in $S\cup B_{S'}$ contained in $I_j$ or there exists an itemset in $S\cup B_{S'}$ that is contained in $I^*$ but not in $I_j$. The previous condition can be formulate by the following disjunction of integer linear inequalities:

$$\begin{aligned} \sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\subset I_j\end{array} \end{array}}\ y_i\le k_j-1 \end{aligned}$$

(32)

or

$$\begin{aligned} \sum _{{\begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\not \subset I_j\end{array}}}\ y_i \ge 1 \end{aligned}$$

(33)

Now we prove that the above disjunction is equivalent to the inequality (15).

Suppose that the integer linear inequality (32) is not satisfied. Then

$$\begin{aligned} \sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\subset I_j\end{array} \end{array}} \ y_i =k_j \end{aligned}$$

and the integer linear inequality (15) becomes

$$\begin{aligned} -k_j\sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\not \subset I_j\end{array} \end{array}} \ y_i \le -1 \end{aligned}$$

that is equivalent to integer linear inequality (33). Instead if the integer linear inequality (33) is unsatisfied then

$$\begin{aligned} \sum _{\begin{array}{c} \begin{array}{c}1 \le i \le m+m',\\ I_{s_i}\not \subset I_j\end{array} \end{array}}\ y_i=0 \end{aligned}$$

and the integer linear inequality (15) is equivalent to (32). It follows that each column $j \in U$ is an infeasible solution for the PLI price formulation.

Proposition 4

The decision version of ms-${\texttt {IFM}} $ is in ${\texttt {PSPACE}} $ and ${\texttt {NP}} $-hard and the decision version of relaxed ms-${\texttt {IFM}} $ is ${\texttt {NP}} $-complete.

Proof

The NP-hardness of decision ${ms\texttt {-}{} \texttt {IFM}}$ derives from the fact that an instance of it without SV attributes and infrequency constraints and with exactly one MV attribute is an instance of the classical $\texttt {IFM}$ decision problem, that is known to be NP-hard (see Mielikainen 2003). To prove membership to ${\texttt {PSPACE}}$, we use the following non-deterministic algorithm to solve any instance of the problem. We use $m+{\widehat{m}}+1$ variables $\ddot{s}$, $\ddot{b}^f_1, \dots , \ddot{b}^f_m$, $\ddot{b}^i_1, \dots , \ddot{b}^i_m$ and set all of them to 0. Then we perform the following steps. For each possible transaction t: (1) guess a value $n_t$ between 0 and ${\texttt {size}}$, (2) if $n_t>0$ then: (3) set $\ddot{s} = \ddot{s}+n_t$, (4) for all $I_i \in S$ s.t. $I_i$ subsumes t then set $\ddot{b}^f_i$ := $\ddot{b}^f_i+n_t$ and (5) for all $I_i \in S'$ s.t. $I_i$ subsumes t then set $\ddot{b}^i_i$ := $\ddot{b}^i_i+n_t$. At the end of all iterations we check that: (1) $\ddot{s} = {\texttt {size}}$, (2) for each $I_i \in S$, $\ddot{b}^f_i$ is included into the support constraint range for the frequent itemset $I_i$ and (3) for each $I_i \in S'$, $\ddot{b}^i_i$ is less than or equal to the support constraint upper bound for the infrequent itemset $I_i$. It is easy to see that the algorithm is correct and runs in a nondeterministic polynomial space. So the problem is in $\mathrm {NPSPACE}$; hence, by Savitch’s theorem, it is in ${\texttt {PSPACE}}$ as well.

The NP-hardness of relaxed ${ms\texttt {-}{} \texttt {IFM}}$ derives from the fact that an instance of it without SV attributes and infrequency constraints and with exactly one MV attribute is an instance of decision ${\texttt {FREQSAT}}$, that is known to be NP-hard (see Calders 2004, 2007). To prove membership to ${\texttt {NP}}$, observe that the problem can be formulated as linear program LP with $m+{\widehat{m}}+1$ constraints. It is well known that LP has a solution if and only if there exists a basic feasible solution x (i.e., with at most $m+{\widehat{m}}+1$ values in x different from 0). Therefore we can guess a $(m+{\widehat{m}}+1)$-vector S of variable indices and non-deterministically assign values $v \le {\texttt {size}}$ to them. At this point we are able to verify that this basic solution is admissible (i.e. satisfy the linear constraints).

Rights and permissions

Reprints and permissions

About this article

Cite this article

Saccá, D., Serra, E. & Rullo, A. Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications. Data Min Knowl Disc 33, 1736–1774 (2019). https://doi.org/10.1007/s10618-019-00643-1

Download citation

Received: 03 August 2017
Accepted: 05 July 2019
Published: 20 July 2019
Issue Date: November 2019
DOI: https://doi.org/10.1007/s10618-019-00643-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

Big Data Analytics: A Literature Review Paper

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

References

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Proofs

Theorem 1

Proof

Proposition 1

Proof

Proposition 2

Proof

Proposition 3

Proof

Proposition 4

Proof

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Extending inverse frequent itemsets mining to generate realistic datasets: complexity, accuracy and emerging applications

Abstract

Access this article

Similar content being viewed by others

Trends and Future Perspective Challenges in Big Data

Big Data Analytics: A Literature Review Paper

An efficient join operations for utility list-based high-utility mining approaches using hybrid search technique

References

Funding

Author information

Authors and Affiliations

Corresponding author

Additional information

Publisher's Note

Appendix: Proofs

Appendix: Proofs

Theorem 1

Proof

Proposition 1

Proof

Proposition 2

Proof

Proposition 3

Proof

Proposition 4

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation