Constrained frequent pattern mining on univariate uncertain data

https://doi.org/10.1016/j.jss.2012.11.020

Abstract

In this paper, we propose a new algorithm called CUP-Miner (Constrained Univariate Uncertain Data Pattern Miner) for mining frequent patterns from univariate uncertain data under user-specified constraints. The discovered frequent patterns are called constrained frequent U2 patterns (where “U2” represents “univariate uncertain”). In univariate uncertain data, each attribute in a transaction is associated with a quantitative interval and a probability density function. The CUP-Miner algorithm is implemented in two phases. In the first phase, a U2P-tree (Univariate Uncertain Pattern tree) is constructed by compressing the target database transactions into a compact tree structure. In the second phase, the constrained frequent U2 patterns are enumerated by traversing the U2P-tree with different strategies that correspond to different types of constraints. The algorithm speeds up the mining process by exploiting five constraint properties: succinctness, anti-monotonicity, monotonicity, convertible anti-monotonicity, and convertible monotonicity. Our experimental results demonstrate that CUP-Miner outperforms the modified CAP algorithm, the modified FIC algorithm, the modified U2P-Miner algorithm, and the modified Apriori algorithm.

Highlights

► The CUP-Miner algorithm is proposed for constrained mining on univariate uncertain data.
► The CUP-Miner algorithm utilizes five well-known constraint properties.
► The CUP-Miner algorithm pushes constraint verification into the mining process.
► The CUP-Miner algorithm outperforms the compared methods in terms of runtime.

Introduction

Mining frequent patterns is an important aspect of data mining, which tries to identify patterns that appear frequently in data. Agrawal and Srikant (1994) introduced the concept of mining frequent itemsets, and since then many important topics have been studied, such as mining partial periodic patterns (Chen et al., 2011), summary queries (Zhang et al., 2010), and mining frequent patterns in images (Lee et al., 2009). Most studies deal with precise data, where items are either present in, or absent from, a transaction (Boolean data mining); or each attribute of a transaction is associated with a quantitative value (quantitative data mining). Mining frequent patterns from uncertain data is an emerging area of inquiry (Aggarwal et al., 2009, Bernecker et al., 2009, Chui et al., 2007, Chui and Kao, 2008, Gullo et al., 2008, Leung et al., 2007, Leung et al., 2008b, Leung et al., 2010a, Leung et al., 2010b, Leung and Brajczuk, 2009a, Leung and Brajczuk, 2009b, Leung and Brajczuk, 2010, Leung and Hao, 2009, Liu, 2012, Zhang et al., 2008). In contrast to precise data, uncertain data do not record the precise value of an attribute or the definite existence of an item. One type of uncertain data has a quantitative interval for each attribute in each transaction, and each interval is accompanied by a probability density function that indicates the probability of the values appearing in the interval. For example, a low-sensitivity sensor used to record atmospheric pollution may record a quantitative interval, instead of a precise value, to indicate the amount of suspended particulates at 6 am every day. Then, a probability density function is explicitly or implicitly assigned to the interval to indicate the possibility that each value exists in the quantitative interval.
Another example is a network monitoring system that records a quantitative interval for network traffic flow every hour whereby a probability density function indicates the possibility that each value exists in the interval. Fig. 1(a) shows an example of a database containing three transactions. A quantitative interval is recorded for each of the two attributes in a transaction, namely, suspended particulates and ozone, denoted as A1 and A2 respectively. The probability density function for each quantitative interval can be defined according to the observed status, or simply as a uniform or normal distribution. This kind of uncertain data is referred to as univariate uncertain data in the literature (Gullo et al., 2008). An attribute whose value is represented by this format is called a univariate uncertain attribute.
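The data model described above can be illustrated with a minimal Python sketch. It assumes a uniform probability density function over each recorded interval, which is only one of the options the text mentions; the class name UncertainValue, the method prob_in, and the sample reading are hypothetical and are not taken from Fig. 1(a).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainValue:
    """A recorded quantitative interval [low, high].

    Assumption: the pdf over the interval is uniform.
    """
    low: float
    high: float

    def prob_in(self, lb: float, ub: float) -> float:
        """Probability that the true value lies in the query interval [lb, ub]."""
        if self.high == self.low:
            # Degenerate interval: the reading is effectively a precise value.
            return 1.0 if lb <= self.low <= ub else 0.0
        # Under a uniform pdf, the probability is the overlapping length
        # divided by the length of the recorded interval.
        overlap = min(self.high, ub) - max(self.low, lb)
        return max(0.0, overlap) / (self.high - self.low)


# Hypothetical reading of suspended particulates (attribute A1).
reading = UncertainValue(80.0, 120.0)
p_high = reading.prob_in(100.0, 150.0)  # half of the interval lies at or above 100
p_none = reading.prob_in(0.0, 50.0)     # no overlap with the recorded interval
```

With a non-uniform density such as a normal distribution, prob_in would integrate the density over the overlap instead of taking a ratio of lengths.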

A U2 pattern is composed of univariate uncertain attributes, and a frequent U2 pattern is a U2 pattern that appears frequently. Unlike in precise data mining, the support of a U2 pattern cannot be derived by counting the number of transactions that contain the pattern. Liu (2012) proposed the U2P-Miner algorithm for identifying frequent U2 patterns, but user participation is limited because the only way a user can guide the mining process is to set a minimum support threshold. We will review the U2P-Miner algorithm in Section 2.3. In practice, a user may wish to set constraints on the mining process. For example, given the database in Fig. 1(a), users may want to find frequent U2 patterns that only contain suspended particulates, or frequent U2 patterns in which the amount of suspended particulates is greater than or equal to 100. The latter constraint can lead to the discovery of frequent U2 patterns of bad air quality with high levels of suspended particulates, such as what one would expect in a sand storm. Similarly, when mining a database of atmospheric attributes about daily weather conditions, adding a constraint that the relative humidity should be higher than or equal to 70 finds frequent U2 patterns of wet weather. Therefore, a method that identifies frequent U2 patterns under user-specified constraints is required. Hereafter, a frequent U2 pattern that satisfies a constraint is called a constrained frequent U2 pattern.
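To see why support cannot be obtained by simple counting, the following sketch computes an expected support. This is an illustration under stated assumptions and not necessarily the definition used by U2P-Miner or CUP-Miner: it assumes uniform densities, independent attributes, and a support equal to the sum over transactions of the product of per-attribute probabilities. The function names and the three-transaction database are hypothetical.

```python
from math import prod


def interval_prob(value, query):
    """Probability that a uniformly distributed value in `value` falls in `query`.

    Both arguments are (low, high) tuples; `value` must have positive length.
    """
    (low, high), (lb, ub) = value, query
    overlap = min(high, ub) - max(low, lb)
    return max(0.0, overlap) / (high - low)


def expected_support(database, pattern):
    """Sum, over all transactions, of the probability that the transaction
    matches every interval of the pattern (attributes assumed independent)."""
    return sum(
        prod(interval_prob(transaction[attr], query)
             for attr, query in pattern.items())
        for transaction in database
    )


# Hypothetical database: A1 = suspended particulates, A2 = ozone.
database = [
    {"A1": (80.0, 120.0), "A2": (10.0, 30.0)},
    {"A1": (100.0, 140.0), "A2": (20.0, 40.0)},
    {"A1": (60.0, 100.0), "A2": (0.0, 20.0)},
]
pattern = {"A1": (100.0, 140.0), "A2": (20.0, 30.0)}
support = expected_support(database, pattern)
```

Each transaction contributes a fractional amount (here 0.25, 0.5, and 0), so the support is a real number rather than a transaction count.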

As mentioned earlier, we propose the CUP-Miner (Constrained Univariate Uncertain Data Pattern Miner) algorithm for mining constrained frequent U2 patterns. Five types of constraints are defined in the literature: succinct and anti-monotone, succinct and monotone, non-succinct and anti-monotone, non-succinct and monotone, and not succinct, anti-monotone, or monotone (Ng et al., 1998, Pei and Han, 2000). The CUP-Miner algorithm, which is implemented in two phases, provides a solution to each type of constraint. In the first phase, the algorithm constructs a U2P-tree (Univariate Uncertain Pattern tree), which is a compact tree structure compressed from the transactions of the target database. In the second phase, the constrained frequent U2 patterns are enumerated by traversing the U2P-tree with different strategies that correspond to different types of constraints. Depending on the type of constraint, the number of constraint verification operations can be reduced, and the pattern search space, i.e., the set of U2 patterns that must be examined, can be pruned during the mining process.
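The pruning effect of an anti-monotone constraint can be sketched on ordinary itemsets. The example below is a generic illustration and is not the U2P-tree traversal used by CUP-Miner: the function enumerate_with_pruning, the item weights, and the threshold are all hypothetical. A constraint is anti-monotone if, whenever a pattern violates it, every superset of the pattern violates it as well; "the sum of non-negative weights is at most 8" is such a constraint.

```python
def enumerate_with_pruning(items, violates):
    """Depth-first enumeration of non-empty item subsets.

    `violates` must be the negation of an anti-monotone constraint: once it
    returns True for a candidate, no extension of that candidate is explored.
    """
    results = []

    def grow(prefix, start):
        for i in range(start, len(items)):
            candidate = prefix + [items[i]]
            if violates(candidate):
                # Anti-monotonicity: every superset also violates, so prune.
                continue
            results.append(tuple(candidate))
            grow(candidate, i + 1)

    grow([], 0)
    return results


# Hypothetical non-negative weights; constraint: total weight <= 8.
weights = {"a": 3, "b": 4, "c": 6}
found = enumerate_with_pruning(
    list(weights),
    lambda subset: sum(weights[x] for x in subset) > 8,
)
```

A monotone constraint behaves in the opposite way: once a pattern satisfies it, all of its supersets do too, so the check can be skipped for every extension rather than used to cut the search.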

This study makes a number of key contributions to the literature. First, to the best of our knowledge, it proposes the first approach designed to retrieve constrained frequent U2 patterns from univariate uncertain data. Second, it pushes constraint verification operations into the mining process. As a result, many constraint verification operations and pattern evaluation tasks, i.e., computing the pattern support, can be eliminated. Third, the proposed method utilizes the following properties of the constraints: succinctness, anti-monotonicity, monotonicity, convertible anti-monotonicity, and convertible monotonicity. Fourth, the proposed approach orders the attributes to speed up the mining process. Fifth, we conduct a comprehensive set of experiments on synthetic and real-world datasets to compare the performance of the CUP-Miner algorithm with that of the modified U2P-Miner algorithm, the modified CAP algorithm (Ng et al., 1998), the modified FIC algorithm (Pei et al., 2001, Pei and Han, 2000), and the modified Apriori algorithm (Agrawal and Srikant, 1994). The results demonstrate that the proposed method outperforms the compared methods.

The remainder of this paper is organized as follows. Section 2 provides a review of related works in the literature. In Section 3, we define the key terms used throughout the paper. In Section 4, we describe the proposed method in detail; and in Section 5, we present and discuss the experimental results. Section 6 contains some concluding remarks.

Section snippets

Literature review

In Section 2.1, we review important related studies on constrained mining of precise data, i.e., frequent pattern mining under users’ constraints on precise data; and in Section 2.2, we discuss the methods for mining uncertain data. Then, in Section 2.3, we consider the U2P-Miner algorithm.

Preliminaries

Table 2 lists the constraints studied in this paper. S represents a set of attributes extracted from a U2 pattern Pat, and S.A represents the value of each attribute. S does not necessarily contain all the attributes of Pat. If S only contains a subset of the attributes, they are sufficient to satisfy the constraint. Atts represents a set of attributes derived from all possible attributes. Let a U2 pattern with n attributes be denoted as [[LB1, UB1], [LB2, UB2], ⋯, [LBn, UBn]], where LBi and UBi

The CUP-Miner algorithm

The CUP-Miner algorithm exploits different constraint properties to perform constrained mining as efficiently as possible. The constraints are classified into five types: (1) succinct and anti-monotone, (2) succinct and monotone, (3) non-succinct and anti-monotone, (4) non-succinct and monotone, and (5) not succinct, anti-monotone, or monotone. For each type, we propose a solution that uses the constraint's characteristics in order to facilitate and accelerate the mining process. We describe

The experiments

We evaluated the performance of the CUP-Miner algorithm on both synthetic and real-world datasets. In addition, we modified the CAP algorithm and the FIC algorithm and compared their performance with that of the CUP-Miner algorithm. The modified CAP and FIC algorithms take every possible combination of base intervals in each attribute as an item. Thus, an item may contain more than one base interval. Calculating the existential probability of an item in a transaction involves computing the

Conclusion and future work

We have proposed a novel algorithm called CUP-Miner for retrieving frequent U2 patterns under user-specified constraints. The proposed method pushes constraint verification into the mining process. This strategy avoids a number of constraint verification operations and prunes the pattern search space.

The CUP-Miner algorithm proposes a solution to each type of constraint: (1) for a succinct and anti-monotone constraint, only base intervals from BIsC are considered in the mining process; (2) for

Acknowledgements

The authors are grateful to the anonymous referees for their helpful comments and suggestions. This research was supported in part by the National Science Council of the Republic of China under grant no. NSC 100-2221-E-259-035.

References (37)

  • T. Bernecker et al.

    Probabilistic frequent itemset mining in uncertain databases

  • C. Chui et al.

    A decremental approach for mining frequent itemsets from uncertain data

  • C. Chui et al.

    Mining frequent itemsets from uncertain data

  • DBAR website, 2010....
  • EPA website, 2010....
  • G. Grahne et al.

    Efficient mining of constrained correlated sets

  • F. Gullo et al.

    Clustering uncertain data via k-medoids

    Lecture Notes in Artificial Intelligence

    (2008)
  • J. Han et al.

    Mining frequent patterns without candidate generation


Ying-Ho Liu received the BS, MS, and PhD degrees in information management from National Taiwan University, Taiwan, in 2000, 2003, and 2009, respectively. In August 2009, he joined the Department of Information Management at National Dong Hwa University and he is now an assistant professor. His papers have appeared in Data and Knowledge Engineering, Journal of Systems and Software, Pattern Recognition, Intelligent Data Analysis, and Computer Vision and Image Understanding. His research interests include data mining, multimedia content analysis and indexing, machine learning, and pattern recognition.

    Chun-Sheng Wang received an MBA and a PhD degree in Information Management from National Taiwan University, Taiwan, ROC, in 1996 and 2007, respectively. He is currently an assistant professor of Information Management at the Jinwen University of Science and Technology, Taiwan, ROC. His papers have appeared in Journal of Systems and Software, Information Sciences, Data and Knowledge Engineering, Expert Systems with Applications, Journal of Network and Systems Management, etc. His current research interests include data mining, information systems, and network management.
