An efficient algorithm for unique class association rule mining

https://doi.org/10.1016/j.eswa.2020.113978

Highlights

  • Unique patterns extraction of datasets.

  • Efficient and complete search for Class-based association rules (CARs).

  • Performance of extracting CARs based on the Subsumption and Nonsense hypotheses.

  • Building rule-spaces and Ranking of datasets.

Abstract

Association rule mining is one of the main tools of knowledge discovery and machine learning. Association rules capture knowledge about the interrelations among items in a dataset. Class Association Rules (CARs) are a subset of association rules that are always mined from labeled datasets. Put simply, a typical CAR associates an itemset with a class label. Mining CARs is vital for constructing pattern-based or rule-based classification models and has recently received increasing research interest. In this work, a complete and efficient, but not exhaustive, CAR mining algorithm (UniqAR) is introduced. UniqAR generates only 100% accurate CARs, called unique association rules, using two rule-search hypotheses, Subsumption and Nonsense, to find the unique itemsets from which the unique CARs are generated. Unlike alternative CAR mining algorithms, the association rules mined by UniqAR are not based on itemset frequency or item selectivity, so it can generate both frequent and rare association rules. No preferences regarding support, coverage, or item participation in itemsets need to be provided for the proposed mining process. The main contribution of this work to the state of the art in CARs is the description of unique itemsets and unique class association rules, together with an efficient mining process for them. Unlike the other unique rule mining alternatives in the literature, the proposed mining process relies on a complete but not exhaustive search that exploits rule inter-relations. UniqAR has been modeled with a computational analysis and an extended evaluation. It is shown that UniqAR can extract all unique itemsets for unique association mining without any user preferences, templates, or constraints. Moreover, the analysis accurately describes the effects of dataset criteria such as the number of attributes/features, feature values, cases, and class labels on the UniqAR unique itemset extraction process, which avoids a huge number of itemset/case comparisons. The results show that the proposed UniqAR algorithm is feasible and promising.

Introduction

Class Association Rules (CARs) represent an interesting subset of association rules. A CAR always implies a class label as the consequent of an itemset combination. CARs are vital in several applications and domains (Nguyen, Nguyen, Vo, & Pedrycz, 2016). In general, according to the literature, the objective of most CAR mining processes is to mine high-accuracy CARs that satisfy several constraints on frequency and accuracy. Higher accuracy of the generated CARs ensures better application performance. Mining only the most accurate (100%) CARs, the unique CARs, is the main motivation of this work.

Datasets consist of many cases (instances, objects, or records), which can be seen as combinations of feature values (items) in itemsets. These itemsets are mostly examined individually and in combination in order to generate expressive, interesting patterns, as in frequent and rare pattern mining and in CARs (Han & Micheline Kamber, 2012). Itemsets that contain unique patterns, which always occur with only one specific class label, are unique itemsets. Unique itemsets are the raw material for generating unique CARs, which play an important role in several further analytics tasks (e.g. classification, clustering, data profiling, and semantic annotation). Moreover, the availability of such rules can measure how difficult or challenging a specific dataset is for supervised mining processes such as classification: the more unique itemsets, and hence unique CARs, are available, the easier the training process is expected to be. Since any unique association rule always leads to one specific class label, its accuracy is always 100%.

Mining unique itemsets in unlabeled datasets, where a unique itemset occurs only once, has received significant research interest (Papenbrock & Naumann, 2017). In labeled datasets, CARs are mostly generated based on frequency (minimum support) and on itemset participation or selectivity constraints. Unique CARs represent unique itemsets that always occur with one specific class label. A higher itemset support ensures better confidence and coverage of the generated association rule as well. However, even with high confidence and coverage, rule accuracy cannot be guaranteed when mining class-labeled association rules. Generating all possible subsets of each object in a dataset and then comparing these subsets to all other objects in the dataset ensures the extraction of all unique itemsets, and then of the unique association rules, but with exponential complexity. The efficient generation of unique itemsets (and then unique CARs) is considered to be an NP-hard problem. Therefore, approximation (Nguyen et al., 2016), heuristic, and stochastic (Wei, Leck, & Link, 2018) search methods represent an alternative, although full search methods for unique pattern mining have also been proposed.
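The exhaustive baseline described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the toy cases, the `(feature index, value)` item encoding, and the function names are assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy dataset: each case is (feature values, class label).
CASES = [
    (("sunny", "hot", "high"), "no"),
    (("sunny", "mild", "high"), "no"),
    (("rainy", "mild", "high"), "yes"),
    (("rainy", "cool", "normal"), "yes"),
    (("sunny", "cool", "normal"), "yes"),
]


def items_of(values):
    """Encode a case as a set of (feature index, value) items (assumed encoding)."""
    return frozenset(enumerate(values))


def brute_force_unique_itemsets(cases):
    """Enumerate every non-empty subset of every case and keep the
    itemsets that are only ever observed with a single class label."""
    labels_seen = {}  # itemset -> set of class labels it occurs with
    for values, label in cases:
        items = sorted(items_of(values))
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                labels_seen.setdefault(frozenset(combo), set()).add(label)
    return {
        itemset: next(iter(labels))
        for itemset, labels in labels_seen.items()
        if len(labels) == 1
    }


unique = brute_force_unique_itemsets(CASES)
```

Each case with n features contributes 2^n - 1 subsets to the loop, which is the exponential cost that motivates a non-exhaustive search.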

This work introduces a novel, efficient algorithm for Unique Class Association Rule (itemset) Mining (UniqAR) in labeled datasets. UniqAR avoids being exhaustive by basing its mining process on the inter-relations among itemsets and rules. Such a mining process is an elementary step in the generation of unique association rules. Rules have several inter-relations: they may contradict, complement, overlap, and/or subsume each other (Vo, Le, Coenen, & Hong, 2014).

Rule Subsumption and Nonsense itemset filtration are item and rule inter-relationship properties that greatly reduce the search computations for unique itemset patterns. Subsumption, from a rule representation point of view, means that shorter rules, in terms of the number of items, can represent or subsume longer itemsets. There is no need to use any subsumed itemset in generating longer itemsets. The Subsumption property leads to minimal rule representations from a semantics perspective (Borgida & Patel-Schneider, 1994). In other words, unique rules with a smaller number of conditions represent simpler and more interpretable implications. Therefore, the proposed search is oriented toward tracing shorter, or minimal, itemsets by taking the Subsumption criterion into account. This leads to finding a minimal subset of the possible unique rules in a dataset. This subset can efficiently represent and describe the whole unique rule domain of a dataset from the perspective of CARs.
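To make the Subsumption idea concrete, the sketch below filters a collection of unique itemsets down to the minimal ones. It is only an illustration under stated assumptions: itemsets are modeled as `frozenset`s of hypothetical `(feature, value)` pairs, and the helper names are not taken from the paper.

```python
def is_subsumed(candidate, kept):
    """True if an already-kept itemset is a subset of `candidate`."""
    return any(shorter <= candidate for shorter in kept)


def keep_minimal(unique_itemsets):
    """Visit itemsets from shortest to longest and drop every itemset
    that contains an itemset already kept."""
    minimal = []
    for itemset in sorted(unique_itemsets, key=len):
        if not is_subsumed(itemset, minimal):
            minimal.append(itemset)
    return minimal


# Hypothetical unique itemsets, e.g. produced by an earlier mining step.
rainy = frozenset({("outlook", "rainy")})
rainy_mild = frozenset({("outlook", "rainy"), ("temp", "mild")})
sunny_high = frozenset({("outlook", "sunny"), ("humidity", "high")})

minimal = keep_minimal([rainy_mild, rainy, sunny_high])
```

Here `rainy_mild` is dropped because the shorter `rainy` already implies the same class label, so the longer rule adds no information.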

The Nonsense property refers to the ability of a shorter itemset to filter (be contained in) a number of cases in a dataset. If a longer itemset (one that contains the shorter itemset plus additional items) is observed in the same number of cases as the shorter itemset, then the longer one is a nonsense itemset, and there is no need to use any nonsense itemset in generating longer unique itemsets. One of the main contributions of this work is introducing the Nonsense property. This property keeps the proposed search focused on finding a minimal unique rule subset of a specific dataset. Since the mined rules are presented in an implication form of item combinations (conditions) and a class label, adding more conditions should have a specification effect: it should decrease the number of cases in which a rule can be observed. If this specification effect cannot be achieved, the last added condition has a nonsense effect. In that case, the conditions (items) in the unique rule before adding the nonsense condition are sufficient to specify the same cases. The proposed novel algorithm uses these properties to present a complete but not exhaustive search for the unique patterns based on the rule properties.
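The Nonsense check can be illustrated with a small coverage comparison. The sketch assumes that the condition means exactly equal case counts; the toy cases, item encoding, and function names are hypothetical and are not the paper's formal definition.

```python
def case(outlook, temp, humidity, label):
    """Build a hypothetical case as (set of (feature, value) items, label)."""
    items = frozenset(
        {("outlook", outlook), ("temp", temp), ("humidity", humidity)}
    )
    return items, label


CASES = [
    case("sunny", "hot", "high", "no"),
    case("sunny", "mild", "high", "no"),
    case("rainy", "mild", "high", "yes"),
    case("rainy", "cool", "normal", "yes"),
    case("sunny", "cool", "normal", "yes"),
]


def coverage(itemset, cases):
    """Number of cases that contain every item of `itemset`."""
    return sum(1 for items, _label in cases if itemset <= items)


def is_nonsense_extension(shorter, extra_item, cases):
    """True if adding `extra_item` to `shorter` does not reduce the number
    of covered cases, i.e. the new condition has no specification effect."""
    longer = shorter | {extra_item}
    return coverage(longer, cases) == coverage(shorter, cases)


cool = frozenset({("temp", "cool")})
```

In this toy data every `cool` case also has `normal` humidity, so adding that condition covers the same two cases and would count as nonsense, whereas adding `("outlook", "rainy")` narrows the coverage to one case.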

This paper is organized as follows. Section 2 briefly reviews the related work and positions the contribution of this work relative to the state of the art in CARs. Section 3 introduces the terminology and concepts frequently used in this work. The two main hypotheses, Subsumption and Nonsense, are presented and discussed in Section 4. UniqAR is fully described in Section 5. A computational analysis of UniqAR is developed in Section 6. The criteria of UniqAR are provided and discussed in Section 7. Section 8 presents an extensive evaluation and discussion of UniqAR using 12 different datasets. Finally, conclusions are drawn in Section 9.

Section snippets

Related work

Association rule mining can be categorized according to the itemsets used as the main input to rule induction. Finding association rules based on interesting (e.g. frequent and rare) itemsets in unlabeled datasets is one of the classical unsupervised machine learning approaches in data mining. Well-known algorithms such as Apriori (Agrawal et al., 1994), FP-growth (Han, Pei, Yin, & Mao, 2004), and ECLAT (Zaki, Parthasarathy, Ogihara, & Li, 1997) and their derivatives have introduced efficient

Dataset, itemset and unique class association rules

The terminology and concepts frequently used in this work are introduced here. A dataset (DS) is a collection of related discrete cases of data that may be accessed individually or in combination, and it contains cases (objects or records) with different attributes. These attributes represent features. Each feature has a set of mutual feature values which form the different itemsets in the different cases. For instance in the Weather dataset1

Subsumption and nonsense hypotheses

This work rests on two key hypotheses that underpin UniqAR as a minimal unique CAR mining algorithm. The following subsections introduce the proposed Subsumption and Nonsense hypotheses.

The proposed efficient algorithm for unique class association rule mining (UniqAR)

The proposed algorithm, UniqAR, introduces the idea of skipping or pruning both subsumed and nonsense itemsets in order to build minimal unique itemsets for class association. In a sequential bottom-up approach that starts from single feature values (items) and builds all the possible feature value combinations (itemsets), UniqAR tests the generated itemsets for uniqueness.
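A bottom-up search with both kinds of pruning might look like the sketch below. It is an interpretation of the description above, not the published UniqAR procedure: the candidate generation, the treatment of a Nonsense itemset as one whose coverage equals that of any of its immediate sub-itemsets, and all names and toy data are assumptions.

```python
def case(outlook, temp, humidity, label):
    """Build a hypothetical case as (set of (feature, value) items, label)."""
    items = frozenset(
        {("outlook", outlook), ("temp", temp), ("humidity", humidity)}
    )
    return items, label


CASES = [
    case("sunny", "hot", "high", "no"),
    case("sunny", "mild", "high", "no"),
    case("rainy", "mild", "high", "yes"),
    case("rainy", "cool", "normal", "yes"),
    case("sunny", "cool", "normal", "yes"),
]


def mine_minimal_unique(cases):
    """Level-wise search returning {minimal unique itemset: class label}."""

    def labels_of(itemset):
        return [label for items, label in cases if itemset <= items]

    all_items = sorted({item for items, _ in cases for item in items})
    unique = {}    # minimal unique itemsets found so far
    frontier = {}  # non-unique itemsets still worth extending -> coverage

    # Level 1: single items.
    for item in all_items:
        single = frozenset([item])
        labels = labels_of(single)
        if len(set(labels)) == 1:
            unique[single] = labels[0]
        else:
            frontier[single] = len(labels)

    # Higher levels: extend surviving itemsets by one item at a time.
    while frontier:
        next_frontier, seen = {}, set()
        for parent in frontier:
            for item in all_items:
                candidate = parent | {item}
                if item in parent or candidate in seen:
                    continue
                seen.add(candidate)
                subsets = [candidate - {i} for i in candidate]
                # Skip if any immediate sub-itemset is not extendable: it is
                # unique (subsumption), was pruned, or covers no case.
                if any(s not in frontier for s in subsets):
                    continue
                labels = labels_of(candidate)
                # Skip if no case matches or if coverage did not shrink
                # relative to some sub-itemset (nonsense extension).
                if not labels or any(len(labels) == frontier[s] for s in subsets):
                    continue
                if len(set(labels)) == 1:
                    unique[candidate] = labels[0]
                else:
                    next_frontier[candidate] = len(labels)
        frontier = next_frontier
    return unique


rules = mine_minimal_unique(CASES)
```

On this toy data the search stops after the second level, because the only non-unique pair left, `mild` with `high`, covers the same two cases as `mild` alone and is therefore never extended.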

So, UniqAR itemset generation is a sequential forward generation process. The itemsets are generated with respect to the length in

Computational analysis

The upper bound of finding all itemset combinations of a given object/case ($x$), excluding the empty set, considering $n$ features/items is $2^n - 1$. Per each of these formed itemset combinations, UniqAR compares it to all of the other objects/cases ($M - 1$), so the maximum number of subsets/combinations (itemsets) of all objects/cases $C_X$ is: $C_X = M \times (2^n - 1)$

Where $X$ represents the set of all cases in a dataset ($x \in X$). $C_X$ can be decomposed according to the level ($L$) of the itemset combinations as follows: $C_X = M$
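As a worked instance of the upper bound $C_X = M \times (2^n - 1)$ stated above, consider a hypothetical dataset with $n = 4$ features and $M = 14$ cases. These values are chosen only for illustration.

```latex
\begin{align*}
2^n - 1 &= 2^4 - 1 = 15 && \text{non-empty itemsets per case} \\
C_X &= M \times (2^n - 1) = 14 \times 15 = 210 && \text{itemsets over all cases}
\end{align*}
```

Adding one more feature roughly doubles the bound ($14 \times 31 = 434$ for $n = 5$), while adding cases only scales it linearly.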

Criteria of the proposed algorithm

This section sheds light on the criteria of the proposed algorithm: complexity, efficiency, Rule Space, Dataset Ranking, and relations to Outliers.

Evaluation

In this work, UniqAR has been applied to several datasets. Most of these datasets are published online and are frequently used in the literature. Other datasets were collected from different sources.

All of the datasets used, their sources, the applied preprocessing (e.g. discretization), and detailed descriptions are available online 2. The following table, Table 2, gives brief information about these datasets. As it shows a variety of

Conclusions

This work adds a new contribution to the state of the art in Class Association Rule induction: an efficient algorithm for mining the minimal unique CARs. The proposed algorithm, UniqAR, can discover all of the unique CARs in reasonable time based on the two hypotheses of Subsumption and Nonsense. A formal description of the proposed algorithm has been introduced. Results of an extensive evaluation have shown that: There are always plenty of unique itemsets in different datasets in

CRediT authorship contribution statement

Mahmoud Nasr: Conceptualization, Investigation, Software. Mohamed Hamdy: Conceptualization, Software, Visualization, Writing - original draft. Doaa Hegazy: Conceptualization, Validation, Writing - review & editing. Khaled Bahnasy: Conceptualization, Validation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (34)

  • P.S. Bala et al.

    Q-genesis: Question generation system based on semantic relationships

  • J.L. Balcázar et al.

    Evaluation of association rule quality measures through feature extraction

  • S. Baset et al.

    Object-oriented modeling with ontologies around: A survey of existing approaches

    International Journal of Software Engineering and Knowledge Engineering

    (2018)
  • Bay, V., & Bac, L. (2008). A novel classification algorithm based on association rules mining. In Pacific Rim Know....
  • A. Borgida et al.

    A semantics and complete algorithm for subsumption in the classic description logic

    Journal of Artificial Intelligence Research

    (1994)
  • H. Cheng et al.

    Approximate frequent itemset mining in the presence of random noise

  • J. Han et al.

    Data mining concepts and techniques

    (2012)