1 Introduction and motivation

Itemset mining deals with searching for a compact set of itemsets that summarizes a given transaction dataset in the most effective and efficient way. For example, in market basket analysis, a dataset contains a number of items and transactions, where each transaction is the list of items a customer has purchased. We can examine which items are sold together to analyze user behavior, increase sales, and make predictions. Early works in this domain focus on finding frequent itemsets that satisfy a minimum support threshold. Apriori is the first algorithm introduced for finding frequent itemsets whose support exceeds a user-specified minimum threshold [2], and it has been applied extensively in numerous applications since then. Apriori and many similar algorithms, e.g., Eclat [40] and FPGrowth [11], suffer from pattern explosion: high minimum support thresholds return only a small number of well-known patterns, while small thresholds return an overwhelmingly large number of patterns, many of which are only variations of the same theme. For example, if we learn from the transactions that bread and butter are often purchased together and that many people buy milk, then reporting that these three items are purchased together is largely redundant [30]. A few other works tried to address this issue [4, 9, 22], but they do not fully resolve the problem of pattern explosion [1].

The advanced itemset mining community has introduced this field as ‘Interesting Itemset Mining’, which focuses on finding a non-redundant and self-sufficient summary of the data [7, 8, 13, 17, 21, 26, 28, 30]. These works produce considerably more interesting and less redundant patterns than frequent itemset mining, which helps to intelligently analyze data-driven problems in finance, graph search [20, 24, 44], recommendation systems [23, 36, 38, 46], data engineering [34, 39, 45, 47], etc. However, existing works do not consider sparsity constraints on the encoding. In some applications, e.g., compression of transaction databases, a sparsity constraint may be preferred to limit the size of the encoding of each transaction. Additionally, these works do not learn a convolutional hierarchical representation of the data.

In this paper, we propose a convolutional sparse coding-based approach to interesting itemset mining, a task that is essentially different from sparse coding of real-valued data in the image processing domain. We propose a matching pursuit style greedy approach that learns a dictionary from transaction data to reduce the compression loss under a sparsity constraint. To further enhance its performance, we embed our sparse coding algorithm into a convolutional, neural-network-like architecture in which each layer learns a more complex discrete representation from the transformed database. This resembles state-of-the-art convolutional sparse coding in the image processing domain [14, 42]. Adding sparse representations of images and signals to training instances is known to improve classification accuracy [3]. Nevertheless, leveraging the sparse representation of itemsets to enhance the performance of classifiers (e.g., Naive Bayes, Decision Trees, Random Forest, etc.) is still an open question. To summarize, we make the following contributions:

  • Sparse coding of itemsets is addressed for the first time and formulated as an optimization problem. We prove it NP-hard by a reduction from the set cover problem. We propose an approximation-based sparse coding algorithm, Dictionary Learning for Sparse Coding of Itemsets (DSI), that efficiently learns non-redundant dictionary elements for compression with minimal loss. It provides a bottom-up mapping from transactions to dictionary itemsets, efficiently yielding a reconstruction close to the original transactions.

  • We propose a new approach, Layered Convolutional Dictionary Learning for Sparse Coding of Itemsets (CDSI), which deploys sparse coding within a convolution-like layered model to learn a grouped representation at each level. The dictionary itemsets are convolved into the database so that each layer learns an increasingly meaningful representation.

  • An extensive empirical validation on thirteen datasets shows the superiority of our proposed methods compared to recent works. A text dataset (JMLR) is used to evaluate pattern meaningfulness by visual inspection. Transactions of nine UCI [6] and three SIPO [21] datasets (Section 6.3) are sparse coded to determine the impact on the prediction accuracy of different classifiers.

Section 2 discusses other important related works. Our targeted problem is formally defined and proved NP-hard in Section 3. A greedy approach for sparsely representing itemsets is presented in Section 4. In Section 5, we explain a layered convolutional process for transforming the database and learning dictionaries. Section 6 describes our extensive empirical validation in detail. We conclude our work with future directions in Section 7.

2 Related work

Our model draws inspiration from the fields of sparse dictionary learning, convolutional sparse coding, and itemset mining on transaction data. In what follows, we provide a brief overview of related work in these fields.

2.1 Sparse dictionary learning

Sparse coding is an unsupervised technique widely used in signal and image processing to compress images or signals using a compact set of basis functions learned from data. The learned basis, called a dictionary, is adapted to the specific data, an approach that has recently proven very effective for signal reconstruction and classification in the audio and image processing domains [15, 16]. A dictionary consisting of image edges, for instance, can give a better representation of images than raw pixel intensity values. The dictionary is typically overcomplete, and a sparsity constraint restricts the number of basis elements used to encode each image. Sparse dictionary learning mainly deals with continuous data, while in practice many datasets are discrete. Continuing this highly promising line of work, we explore how to learn a dictionary and represent itemsets under a sparsity constraint. Although the idea of coding binary data is not new, handling discrete data in the sparse coding problem remains challenging, and sparse coding techniques for discrete data are in high demand.

2.2 Layered convolutional sparse dictionary learning

A sparse feature vector is computed to reconstruct the original input vector by minimizing an energy function. If image patches are processed independently, the resulting representation is highly redundant, as features of neighboring patches are correlated; the sparse coding algorithm alone cannot capture these dependencies. To address this problem, a variety of convolutional sparse coding methods have been introduced in the image processing domain [12, 42]. These techniques are based on a convolutional decomposition of the input data to learn a dictionary under a sparsity constraint. It is a top-down approach that seeks to generate the input signal by summing the convolutions of feature maps with learned filters. Sparsity limits the representation by imposing a size restriction at each layer, which facilitates assembling parsimonious features into more complex structures. A convolutional sparse coded dictionary contains rich information that many existing feature detectors cannot capture.

2.3 Interesting itemset mining

In this section, we summarize a few recent works that mine small, high-quality, and non-redundant sets of patterns yielding the best lossless compression of the database; this problem has already been proved NP-hard [27]. A few clustering-based approaches create frequent feature-value pairs belonging to a specific cluster; their compression ratios depend on the number of clusters, and outlier detection and compaction gain are bottlenecks of this work [5]. MTV [18] uses the Minimum Description Length (MDL) principle together with the maximum entropy distribution to directly calculate the expected frequencies of itemsets and identify interesting contents. KRIMP [28] applies the MDL principle to create a simple two-column translation code table that optimally describes the data. Candidate itemsets are selected w.r.t. the standard candidate order, and a cover algorithm selects the encoding with the smaller compressed size. KRIMP's candidate generation requires high running time, and selecting the right threshold values for larger databases or candidate collections is challenging. SLIM [25] addresses this issue by directly mining descriptive patterns from the data; it uses MDL along with an accurate heuristic to greedily construct patterns in a bottom-up fashion. OPUS Miner [31] is a branch-and-bound approach that deploys two pruning mechanisms based on itemset values and statistical significance levels. It finds the top-k productive and non-redundant itemsets to identify small sets of key associations, ultimately leading to self-sufficient itemsets. Interesting Itemset Miner (IIM) [7] uses a generative model over itemsets in the form of Bayesian networks, and a greedy approximation-based weighted set cover approach infers the interesting itemsets.

3 Problem definition and proof of NP-hardness

For ease of presentation, we first introduce some preliminary concepts and notations. Let D = {T1,T2,⋯ ,Tn} be a database of n transactions, where each transaction is a subset of a set of items I = {i1,i2,⋯ ,ip}. The cardinality of a transaction is the number of items in it. A set of items is called an itemset; an itemset containing p items is referred to as a p-itemset. We aim to learn a dictionary B = {I1,I2,⋯ ,Im} of m basis itemsets, from which a discrete sparse code of the database can be inferred. A sparse code of a transaction T is a set b of at most k itemsets from B, with union \(U(b) = \cup_{I_{i} \in b} I_{i}\), such that U(b) ⊆ T and k is less than the cardinality of T. With these notations, we formulate the following research problems:

Problem 1

[Finding the sparse representation of a transaction T] Given a dictionary B and a sparsity constraint k (the maximum number of basis itemsets to choose from B), a sparse code of T, denoted B(T), is defined as:

$$ B(T) = \arg\min_{b \subseteq B,\, |b| \leq k} |T - U(b)| $$
(1)

where U(b) denotes the set of items of T covered by the basis itemsets in b.

Example 1

Given T = qrvwx and B = {qr, vw, vy, yz}: when k is set to 1, the sparse code is B(T) = {qr}; when k = 2, B(T) = {qr, vw} and U(B(T)) = qrvw.

Sparse coding the whole database D with the basis B incurs a loss defined as \(L_{B}(D) = {\Sigma }^{n}_{j = 1} |T_{j} - U(B(T_{j}))|\). In Example 1, the loss of encoding T = qrvwx with B(T) = {qr} is 3, while the loss with B(T) = {qr, vw} is 1. Since vy is not a subset of qrvwx, it cannot be included in B(T). To better preserve the original information contained in a transaction database, we aim to learn a beneficial dictionary with a small encoding loss.
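For concreteness, the following minimal Python sketch (our own illustration, not part of the proposed algorithms) computes sparse codes and encoding losses for the toy data of Example 1 by brute force over subsets of B; this is feasible only because the dictionary is tiny, and the helper names are ours.

```python
from itertools import combinations

def cover(basis):
    """U(b): the union of the chosen basis itemsets."""
    return set().union(*basis) if basis else set()

def sparse_code(T, B, k):
    """Brute-force solution of (1): choose at most k basis itemsets
    that are subsets of T and minimize the loss |T - U(b)|."""
    usable = [b for b in B if b <= T]            # only subsets of T can be used
    best, best_loss = [], len(T)
    for r in range(1, k + 1):
        for combo in combinations(usable, r):
            loss = len(T - cover(combo))
            if loss < best_loss:
                best, best_loss = list(combo), loss
    return best, best_loss

T = set("qrvwx")
B = [frozenset("qr"), frozenset("vw"), frozenset("vy"), frozenset("yz")]
print(sparse_code(T, B, k=1))   # B(T) = [{q, r}], loss 3
print(sparse_code(T, B, k=2))   # B(T) = [{q, r}, {v, w}], loss 1
```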

Problem 2

[Dictionary learning from candidates] Given a database of transactions D = {T1,T2,⋯ ,Tn}, the sparsity constraint k (the maximum number of basis itemsets allowed in a sparse code), and a set of candidate itemsets C, find a dictionary B ⊆ C with at most m basis itemsets such that B = arg minB⊆C LB(D).

To solve Problem 2, one may first need to solve the following problem:

Problem 3

[Candidate set construction from the database] The encoding loss function in Problem 2 requires a candidate set C from which the dictionary is selected. How to construct a high-quality candidate set C from the database D is another challenging and important problem, as the contents of C largely determine the quality of the learned dictionary and the encoding loss of the database.

Theorem 1

Problem 1 is NP-hard.

Proof

We prove NP-hardness by a reduction from the set cover problem. Let S = {1,2,⋯ ,n} and H = {s1,s2,⋯ ,sm}, where si ⊆ S. The decision version of the set cover problem asks whether we can construct a collection x ⊆ H such that |x| ≤ k and \(\cup_{s_{i} \in x} s_{i} = S\). Let T = S and B = H; then solving (1) yields a sparse representation of T, that is, b ⊆ B with |b| ≤ k such that |T − U(b)| is minimized. Let b be the solution to Problem 1. If |T − U(b)| = 0, then b is a set cover of S of size at most k; otherwise, if the loss is greater than zero, no set cover of size at most k exists. Hence solving (1) solves the set cover problem. The set cover problem has thus been reduced to Problem 1, and this reduction is polynomial in the input size. Hence, the theorem is proved. □

[Algorithm 1 (figure a): Dictionary Learning for Sparse Coding of Itemsets (DSI) — pseudocode]

4 Dictionary Learning for Sparse Coding Itemsets (DSI)

In this section, we present our proposed algorithmic framework (DSI) for learning a sparse coding dictionary in detail; the pseudocode is given in Algorithm 1. The algorithm iteratively selects m basis itemsets from a set of candidate itemsets C. In each iteration, a single itemset I from C is chosen to form a transitory dictionary B+ together with the already selected itemsets, and the encoding loss of the database under B+ is computed (lines 7–9). In addition, it calculates the number of overlapping items between the newly considered itemset I and the learned dictionary B. The itemset I is added to the dictionary if its loss and its overlap with the selected basis are smaller than those of the other candidates examined so far (lines 10–14). We present Example 2 for a better understanding of DSI:
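The sketch below reflects our reading of the DSI loop described above; the pseudocode in Algorithm 1 remains the authoritative description. The max_set_cover routine it calls is the greedy encoder sketched alongside Algorithm 2 below, and the variable names and the lexicographic (loss, overlap) tie-breaking are our own assumptions.

```python
def dsi(D, C, m, k):
    """Sketch of Algorithm 1 (DSI).
    D: list of transactions (sets of items), C: candidate itemsets (frozensets),
    m: dictionary size, k: sparsity constraint per transaction."""
    B = []
    C = list(C)
    while len(B) < m and C:
        best_I, best_loss, best_overlap = None, None, None
        for I in C:                                              # try every remaining candidate
            B_plus = B + [I]
            loss = sum(max_set_cover(T, B_plus, k) for T in D)   # lines 7-9
            overlap = sum(len(I & J) for J in B)                 # overlap with selected basis
            if best_I is None or (loss, overlap) < (best_loss, best_overlap):
                best_I, best_loss, best_overlap = I, loss, overlap   # lines 10-14
        B.append(best_I)                                         # commit the best candidate
        C.remove(best_I)
    return B
```

Running this sketch on the toy database of Example 2 with m = 3 and k = 2 reproduces the dictionary {qr, vw, yz} derived in Table 1.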

Example 2

Assume that we have a database D = {T1 = qrvwx, T2 = qrvyz, T3 = qrvwyz}, C = {qr, vw, vy, yz}, m = 3 and k = 2. We explain Table 1 to show how Algorithm 1 works:

  • Step 1: Initially B is empty. In each iteration (lines 6–14), we look for an itemset I such that, when added to B, the database D can be encoded with minimum loss and overlap. Step 1 of Table 1 shows the loss of encoding each transaction in D using each candidate I from C. As can be observed, the overall loss is smallest when I = qr, with a loss equal to 10. Therefore qr is added to B, giving B = {qr}, and deleted from C.

  • Step 2: The next itemset that works together with B to minimize the overall loss is I = vw with the overall loss equal to 6. We update B to {qr, vw} and remove vw from C accordingly.

  • Step 3: We calculate the encoding loss of each remaining candidate in C together with the learned dictionary B. We can see that {vy} and {yz} lead to the same loss value of 4. Nonetheless, the item v in vy intersects with the dictionary element vw, making the overlap Ovy equal to 1, whereas yz has no overlap with the itemsets in dictionary B, i.e., Oyz = 0. Ultimately, we update B to {qr, vw, yz} and stop the algorithm after selecting m = 3 basis itemsets.

Table 1 The illustration of running Algorithm 1 in Example 2. Selected items are emphasized in bold

DSI uses a greedy method (MaxSetCover) to calculate the encoding loss; its pseudocode is given in Algorithm 2. The loss calculation greedily encodes every transaction Ti ∈ D with the basis itemsets of B+. Algorithm 2 follows the standard greedy procedure for the max set cover problem, which guarantees an approximation factor of \(1-\frac {1}{e}\) with respect to the optimal solution [27]. The algorithm takes as input a transaction T, a set of potential basis itemsets, and the sparsity parameter k. It greedily selects up to k basis itemsets that reduce the encoding loss, and returns the encoding loss, i.e., the number of items in T not covered by the selected basis itemsets from B+. Example 3 illustrates the matching pursuit greedy approach given in Algorithm 2.
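A minimal sketch of the greedy MaxSetCover encoder follows, assuming ties are broken by candidate order; it returns the encoding loss and is the routine assumed by the DSI sketch above. Algorithm 2 remains the authoritative description.

```python
def max_set_cover(T, B_plus, k):
    """Sketch of Algorithm 2 (MaxSetCover): greedily pick up to k basis itemsets
    from B_plus that are subsets of T and maximize coverage of T;
    return the encoding loss |T - U(G)|."""
    uncovered = set(T)
    candidates = [b for b in B_plus if b <= T]    # only subsets of T may be used
    G = []
    while len(G) < k and candidates:
        best = max(candidates, key=lambda b: len(b & uncovered))
        if not best & uncovered:                  # no candidate adds coverage
            break
        G.append(best)
        uncovered -= best
        candidates.remove(best)
    return len(uncovered)

# Example 3: T = qrvwyz, C = {qr, vw, vy, yz}, k = 2  ->  loss 2
C = [frozenset(s) for s in ("qr", "vw", "vy", "yz")]
print(max_set_cover(set("qrvwyz"), C, k=2))       # 2
```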

Example 3

Assume that T = qrvwyz, C = {qr, vw, vy, yz}, and k = 2. Algorithm 2 performs the following steps to calculate the encoding loss used in Table 1:

  • Step 1: Initially G is empty.

  • Step 2: The itemset I ∈ C that maximizes the coverage of T is qr, so qr is added to G and deleted from C (G = {qr}, C = {vw, vy, yz}).

  • Step 3: The next itemset I ∈ C that, together with the selected itemsets in G, maximizes the overall coverage of T is vw, so G = {qr, vw}. The algorithm stops when the sparsity limit k = 2 is reached. The encoding loss is two (|T − U(G)| = 6 − 4 = 2), as two items of T are not covered by G.

[Algorithm 2 (figure b): MaxSetCover — greedy encoding loss computation, pseudocode]

5 Layered Convolutional Dictionary Learning for Sparse Coding of Itemsets (CDSI)

In this section, we introduce a novel convolutional sparse coding mechanism (CDSI) that learns statistically dependent sparse dictionaries in a hierarchical fashion. At each layer, the learned dictionary itemsets are convolved into the database to transform it, allowing the next layer to learn more complicated patterns. This is similar in spirit to the deep learning technique of Convolutional Neural Networks (CNNs) [14], where learned filters are convolved with the input image and the next layer of convolutional filters works on the output of the previous layer, allowing the CNN to capture features at different levels of abstraction [41]. The convolution process has the advantage that itemsets are learned hierarchically, so that dictionaries of different granularity can be obtained for different applications. We provide an overview of our layered convolutional dictionary learning algorithm below and outline how it works (a code sketch follows the list):

  1. Construct a candidate set C using the chi-square test (see Section 5.1 for a discussion of how to construct a meaningful candidate set).

  2. Run Algorithm 1 to learn a dictionary from C that sparse-codes the database D well.

  3. Run Algorithm 3 to transform the database D using the dictionary learned in the second step (see Section 5.2).

  4. To learn patterns in the next layer, return to step 1.
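The layered loop can be summarized by the following sketch, which assumes the chi_square_candidates, dsi, and transform routines sketched elsewhere in this section; the number of layers is a parameter, and stopping criteria other than an empty candidate set are not shown.

```python
def cdsi(D, m, k, layers):
    """Sketch of the layered CDSI procedure; returns one dictionary per layer."""
    dictionaries = []
    for _ in range(layers):
        C = chi_square_candidates(D)     # step 1: statistically dependent item pairs
        if not C:                        # nothing dependent left to learn
            break
        B = dsi(D, C, m, k)              # step 2: Algorithm 1 on the current database
        D = transform(D, B)              # step 3: Algorithm 3, convolve B into the database
        dictionaries.append(B)           # step 4: the next layer works on the transformed D
    return dictionaries
```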

[Algorithm 3 (figure c): Database transformation using the learned dictionary — pseudocode]

5.1 Candidate set construction

The quality of sparse dictionary learning (Algorithm 1) depends heavily on the contents of the candidate set C. A possible way to build C is to use a frequent pattern mining algorithm such as Apriori [2], which however is subject to pattern explosion (see Chapter 2 of [1]). In this section, we propose a refined approach to find statistically dependent itemsets. Intuitively, a pattern is only admissible if its items exhibit a strong dependency and correlation. Therefore, in order to compose the candidate set C, we use the chi-square test [32]. Let q and r be two items, and let N denote the total number of transactions in D. We define:

  • Fqr = |{Ti ∈ D | qr ⊆ Ti}|, i.e., the frequency of the itemset qr.

  • \(F_{q\bar {r}} \,=\, |\{T_{i} \!\in \! D| q \!\in \! T_{i}, r \!\notin \! T_{i}\}|\), i.e., the number of transactions that contain q but not r.

  • \(F_{\bar {q}r} \,=\, |\{T_{i} \in D| q \notin T_{i}, r \in T_{i}\}|\), i.e., the number of transactions that contain r but not q.

  • \(F_{\bar {q}\bar {r}} \,=\, |\{T_{i} \in D| q \notin T_{i}, r \notin T_{i}\}|\), i.e., the number of transactions that neither contain q nor r.

  • \(E_{qr} = \frac {(F_{qr} + F_{q\bar {r}})(F_{qr} + F_{\bar {q}r})}{N}\), i.e., the expected frequency of qr under the assumption that q is independent of r.

  • \(E_{q\bar {r}} = \frac {(F_{qr} + F_{q\bar {r}})(F_{q\bar {r}} + F_{\bar {q}\bar {r}})}{N}\), i.e., the expected number of transactions that contain q but not r.

  • \(E_{\bar {q}r} = \frac {(F_{\bar {q}r} + F_{\bar {q}\bar {r}})(F_{qr} + F_{\bar {q}r})}{N}\), i.e., the expected number of transactions that contain r but not q.

  • \(E_{\bar {q}\bar {r}} = \frac {(F_{\bar {q}r} + F_{\bar {q}\bar {r}})(F_{q\bar {r}} + F_{\bar {q}\bar {r}})}{N}\), i.e., the expected number of transactions that contain neither q nor r.

The chi-square statistics is defined as follows:

$$\begin{array}{@{}rcl@{}} \chi^{2} = \frac{(F_{qr} - E_{qr})^{2}}{E_{qr}} + \frac{(F_{\bar{q}r} - E_{\bar{q}r})^{2}}{E_{\bar{q}r}} + \frac{(F_{q\bar{r}} - E_{q\bar{r}})^{2}}{E_{q\bar{r}}} + \frac{(F_{\bar{q}\bar{r}} - E_{\bar{q}\bar{r}})^{2}}{E_{\bar{q}\bar{r}}} \end{array} $$
(2)

If q and r are statistically independent, this statistic follows a chi-square distribution with one degree of freedom. Based on this observation, we test the null hypothesis that q and r are statistically independent. The test can be performed for any pair of items in the database, and only pairs that pass the test (i.e., for which the null hypothesis is rejected at a significance level of 0.05) are retained as potential itemsets in the candidate set C. The dictionary is then learned from these statistically dependent item pairs by running Algorithm 1.
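A possible implementation of this candidate construction is sketched below; the expected counts are the standard marginal products of the chi-square test of independence, 3.841 is the critical value for one degree of freedom at the 0.05 level, and the function and variable names are illustrative.

```python
from itertools import combinations

def chi_square_candidates(D, critical_value=3.841):
    """Sketch of the candidate construction of Section 5.1: keep item pairs whose
    chi-square statistic rejects independence at the 0.05 level (df = 1)."""
    N = len(D)
    items = sorted(set().union(*D)) if D else []
    C = []
    for q, r in combinations(items, 2):
        f_qr = sum(1 for T in D if q in T and r in T)
        f_qn = sum(1 for T in D if q in T and r not in T)
        f_nr = sum(1 for T in D if q not in T and r in T)
        f_nn = N - f_qr - f_qn - f_nr
        chi2 = 0.0
        # observed count, row marginal, column marginal for each cell of the 2x2 table
        for observed, row, col in [(f_qr, f_qr + f_qn, f_qr + f_nr),
                                   (f_qn, f_qr + f_qn, f_qn + f_nn),
                                   (f_nr, f_nr + f_nn, f_qr + f_nr),
                                   (f_nn, f_nr + f_nn, f_qn + f_nn)]:
            expected = row * col / N
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
        if chi2 > critical_value:
            C.append(frozenset((q, r)))
    return C
```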

5.2 Database transformation and convolution

We elucidate the database transformation process with the toy database of Example 2. Given a dictionary B = {qr, vw, yz}, Algorithm 3 transforms the database D into a new database over refined items, where each new item corresponds to an itemset in the dictionary B. Let us rewrite the basis itemsets in B as B = {α = qr, β = vw, γ = yz}, where each basis itemset is now represented by a new symbol that is not present in the current alphabet. Algorithm 3 transforms the database D = {T1 = qrvwx, T2 = qrvyz, T3 = qrvwyz} into D′ = {T1 = αβx, T2 = αvγ, T3 = αβγ}. The new database D′ contains transactions over the items {α, β, γ, v, x}, while the original item set was {q, r, v, w, x, y, z}. Table 2 shows the process of dictionary learning on the transformed database D′ at the second layer. Note that the candidate set C in this example is constructed by randomly selecting item pairs from the transaction database, as there are only three transactions, making it impossible for the chi-square test to find any dependent pairs.
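The transformation can be sketched as follows, under the assumption that each dictionary itemset occurring in a transaction is greedily replaced by a fresh symbol; the symbol-naming scheme is our own and Algorithm 3 remains the authoritative description.

```python
def transform(D, B):
    """Sketch of Algorithm 3: replace every dictionary itemset occurring
    in a transaction by a single new symbol; uncovered items are kept."""
    symbols = {I: "<" + "".join(sorted(I)) + ">" for I in B}   # e.g. qr -> "<qr>"
    new_D = []
    for T in D:
        new_T, remaining = set(), set(T)
        for I in B:
            if I <= remaining:            # itemset fully present and not yet consumed
                new_T.add(symbols[I])
                remaining -= I
        new_D.append(new_T | remaining)   # keep the items not covered by any basis
    return new_D

# Toy database of Example 2 with B = {qr, vw, yz}:
D = [set("qrvwx"), set("qrvyz"), set("qrvwyz")]
B = [frozenset("qr"), frozenset("vw"), frozenset("yz")]
print(transform(D, B))   # [{'<qr>', '<vw>', 'x'}, {'<qr>', 'v', '<yz>'}, {'<qr>', '<vw>', '<yz>'}]
```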

Table 2 CDSI: Dictionary learning from convolved and transformed database at second layer. Items placed in B = {αβ, γx, αv} are highlighted in bold

6 Experiments

In interesting itemset mining, a powerful representation of the data yields higher (i) pattern interpretability and (ii) classification accuracy. Our extensive empirical validation considers both criteria to evaluate the effectiveness of the proposed algorithmic framework. We compare our proposed sparse coding techniques with IIM [7] and MTV [17], because they represent the state-of-the-art techniques for itemset mining and significantly outperform the existing approaches developed in [21, 26, 28] on standard datasets similar to those adopted in our experiments.

6.1 Dataset description

We use the discretized version of the Semi Interval Partial Order (SIPO) datasets (introduced in [21]) and UCI datasets [6] for classification. Table 3 summarizes the characteristics of the datasets used. Since it is always challenging to measure the meaningfulness of discovered patterns, text datasets are used to informally evaluate quality by comparing pattern interpretability and relevance. We use the JMLR abstract text dataset from the Journal of Machine Learning Research website (Footnote 1), which is easy to interpret.

Table 3 Summary of datasets

6.2 Interpretability of Sparse Representation

Table 4 shows that MTV returns interrelated and less diverse frequent patterns, e.g., “synthetic real”, “real datasets”, “train classifi”, “classifi class”, etc. IIM derives relevant patterns (e.g., “anomali detect” and “semi supervised”), but a few patterns (e.g., “parameter”/“parameters” and “sequenc”/“sequential”) would require stemming to remove redundancy. Patterns extracted by CDSI at the 4th layer of the convolutional dictionary with parameters (m = 10, k = 5) are also given. We observe that CDSI generates more revealing, diverse, and comprehensive patterns, e.g., “machine learning”, “graphic variable”, “probabl distribut”, etc., and they do not require stemming. In summary, CDSI generates comparatively more interpretable, heterogeneous, and less redundant patterns.

Table 4 Top 10 non-singleton patterns selected from the JMLR abstracts dataset to compare pattern interpretability for CDSI (Section 5), IIM [7] and MTV [17]

6.3 Classification accuracy

Classification accuracy improves when sparse representation techniques or interesting itemset mining algorithms are applied to the data [13, 33]. Table 5 presents a fictitious scenario to explain our experimental setup with a database D containing 5 transactions, D = {T1, T2, ⋯ , T5}, and two class labels {A, L}. These transactions describe the purchase of items {q, r, v, w, x} with the corresponding binary input vectors, e.g., (T1) = {1,1,1,0,0}. Since 0 and 1 indicate whether a specific item has been purchased, the third element of T1 is set to 1 to denote the purchase of v. These labeled transactions are fed to various classifiers. To evaluate whether the mined patterns boost classification accuracy, we append them to the transactions as additional binary features, i.e., we extend each transaction vector by one element per discovered pattern. For example, if CDSI discovers two patterns (r, v) and (r, x), then a 6th and a 7th element are added to each transaction to indicate the presence of each distinct pattern (shown in the extended transaction column of Table 5). The vector representation of T1 then becomes {1,1,1,0,0,1,0}, while preserving the record of the purchased items q, r, v.
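The data preparation of Table 5 can be sketched as follows; the two patterns used here are the hypothetical (r, v) and (r, x) from the example above, and the function name is ours.

```python
def to_extended_vectors(D, items, patterns):
    """Build the extended binary input vectors of Table 5: one feature per
    original item plus one feature per mined pattern (1 iff all of the
    pattern's items occur in the transaction)."""
    vectors = []
    for T in D:
        item_part = [1 if i in T else 0 for i in items]
        pattern_part = [1 if p <= T else 0 for p in patterns]
        vectors.append(item_part + pattern_part)
    return vectors

items = ["q", "r", "v", "w", "x"]
patterns = [frozenset("rv"), frozenset("rx")]               # the two hypothetical mined patterns
print(to_extended_vectors([set("qrv")], items, patterns))   # [[1, 1, 1, 0, 0, 1, 0]]
```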

Table 5 Data preparation for classification using CDSI (Section 5), IIM [7] and MTV [17] mined patterns as binary features

Table 6 presents the accuracy of different classifiers (Naive Bayes, J48, Random Forest, and IBk) on the SIPO and UCI datasets described in Table 3. To be unbiased, the number of mined patterns is set to the minimum number of patterns returned by any of the interesting itemset mining algorithms. These patterns are incorporated into the transactions (singletons) in the same way the extended input vectors are created in Table 5. We run our experiments in WEKA [10] with 5-fold cross-validation and parameters set to their default values. Patterns are extracted using CDSI (with parameters layers = 10, k = 10), IIM [7], and MTV [17] (for the existing approaches, we use the default parameter values of the publicly available code). Each cell of Table 6 shows the accuracy of each method for the respective classifier. The highest prediction accuracy for each input vector type is emphasized in bold. The last column (Best) shows the highest accuracy over all types of input data and highlights the topmost value in bold. For all datasets, the prediction accuracy increases when the extended transactions are used compared to training the classifier on the transactions alone (singletons). In general, CDSI significantly improves the prediction accuracy, supporting our assumption that the convolutional sparse coded dictionary carries influential information for prediction.

Table 6 CDSI (Section 5) improves the prediction accuracy over IIM [7] and MTV [17]. For a fair comparison, an identical number of patterns returned by each method is used

7 Conclusions and future work

Although convolutional sparse dictionary learning has been used before in the image processing domain [12, 42], it has not been studied for itemset mining so far. In this paper, we present approximation-based algorithms to find sparse representations of itemsets, which are discrete in nature. We propose an optimization technique to learn a dictionary from a transaction dataset under a sparsity constraint. Based on this mechanism, a convolutional dictionary learning method is presented that extracts dictionaries at different levels of abstraction. A chi-square test is performed to extract statistically dependent patterns from the transaction data and feed them to the layered dictionary learning algorithm, generating increasingly complex and statistically dependent patterns at each layer. We conduct extensive experiments on various datasets showing that the sparse representation forms a succinct input representation and, when combined with different classifiers, increases their efficacy. In the future, we plan to extend layered convolutional sparse dictionary learning techniques to sequential, streaming, and uncertain data mining problems [13, 19, 29, 35, 37, 43].