Elsevier

Information Sciences

Volume 213, 5 December 2012, Pages 50-67

Learning very fast decision tree from uncertain data streams with positive and unlabeled samples

https://doi.org/10.1016/j.ins.2012.05.023

Abstract

Most data stream classification algorithms require a large amount of precisely labeled data as input. However, in many data stream applications, streaming data contains inherent uncertainty, and labeled samples are difficult to collect while unlabeled data are abundant. In this paper, we focus on classifying uncertain data streams when only positive and unlabeled samples are available. Based on the concept-adapting very fast decision tree (CVFDT) algorithm, we propose an algorithm named puuCVFDT (CVFDT for positive and unlabeled uncertain data). Experimental results on both synthetic and real-life datasets demonstrate the strong ability and efficiency of puuCVFDT in handling concept drift with uncertainty under the positive and unlabeled learning scenario. Even when 90% of the samples in the stream are unlabeled, the classification performance of the proposed algorithm remains comparable to that of CVFDT learned from fully labeled data without uncertainty.

Highlights

► We propose uncertain information gain for positive and unlabeled samples (puuIG).
► We give methods to summarize imprecise values into distributions.
► These distributions can be used to calculate puuIG efficiently.
► We propose a probabilistic Hoeffding bound to build the very fast decision trees.
► Our algorithm can learn well from positive and unlabeled samples with uncertainty.

Introduction

Data stream classification has been widely used in many applications, e.g., credit fraud detection, network intrusion detection, and environmental monitoring. In these applications, a tremendous amount of streaming data is collected at an unprecedented rate. In addition, streaming data is normally characterized by drifting concepts [15], [32]. Thus, the major obstacles in data stream classification lie in memory, time, space, and evolving concepts. To address these obstacles, many algorithms have been proposed, mainly based on the ensemble approach [30], [32] and on decision trees: [8], [15], [20] proposed algorithms to learn very fast decision trees, while [29] presented a fuzzy pattern tree algorithm for binary classification on data streams. Among these algorithms, the concept-adapting very fast decision tree (CVFDT) [15] is a well-known algorithm for data stream classification. By keeping its model consistent with a sliding window of samples and using the Hoeffding bound [14], CVFDT learns decision trees incrementally with bounded memory usage and high processing speed, while detecting evolving concepts.
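The Hoeffding bound that underlies CVFDT can be stated concretely. As a minimal sketch (our own toy code, not the paper's; function and variable names are ours), the bound ε = sqrt(R² ln(1/δ) / 2n) guarantees that, with probability 1 − δ, the true mean of a random variable with range R lies within ε of the mean observed over n independent samples; CVFDT splits a node once the observed gap between the best and second-best splitting attribute exceeds ε:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    random variable with range `value_range` is within epsilon of the
    mean observed over n independent samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# The bound shrinks as more samples arrive, so split decisions made on
# a larger window are tighter.
eps_small = hoeffding_bound(1.0, 1e-7, 200)     # few samples: loose bound
eps_large = hoeffding_bound(1.0, 1e-7, 20000)   # many samples: tight bound
```

This is why a Hoeffding tree can defer a split: it simply waits until n is large enough that ε falls below the observed gain gap.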

To learn a classifier, the CVFDT algorithm requires precise and fully labeled samples as input, which is impractical in many real-world applications. Streaming data often contains uncertainty for various reasons, such as imprecise measurement, missing values, and privacy protection. In addition, labeled samples are expensive to collect while unlabeled data may be abundant. Meanwhile, for some binary classification problems, elements of the target concept (which we call positive samples in this paper) are easy to obtain [6]. These situations are common in data stream applications. For example, in credit fraud detection, first, private information of customers such as age, address, and vocation may be masked with imprecise values when published for mining purposes. Second, if a customer's behavior has not yet caused any harm, a thorough investigation is needed to decide whether it is fraudulent; labeling a huge volume of such customers would be extremely expensive, so it is better to use them as unlabeled data. Finally, the application can be modeled as a binary classification problem in which behaviors that do cause harm are labeled as positive samples. The same scenario can be observed in environmental monitoring, network intrusion detection, and so on. Thus, it is helpful to design novel classification algorithms for uncertain data streams with positive and unlabeled samples.

To the best of our knowledge, the problem of classifying uncertain data streams with only positive and unlabeled samples has not yet been studied by the research community. In this paper, building on recent work on uncertain decision trees [23], [31] and on learning from positive and unlabeled samples [6], we address the problem of learning very fast decision trees using only positive and unlabeled samples with both certain and uncertain attributes. We extend CVFDT [15] to cope with both numerical and categorical data with uncertainty under the positive and unlabeled learning (PU learning) scenario, and propose a novel algorithm named puuCVFDT (CVFDT for Positive and Unlabeled samples with Uncertainty). A series of experiments on both synthetic and real-life datasets shows that puuCVFDT has a strong capability to learn from uncertain data streams with positive and unlabeled samples and to tackle concept drift. Even when only 10% of the samples in the stream are positive, the classification accuracy of puuCVFDT is still very close to that of CVFDT trained on a fully labeled data stream without uncertainty.

In the rest of this paper, we first discuss related work in Section 2. The problem definition is given in Section 3. Our method for coping with uncertain data streams under the PU learning scenario is discussed in Sections 4 and 5. Algorithm details are given in Section 6. The experimental study is presented in Section 7, and we conclude in Section 8.

Section snippets

Positive and unlabeled learning

The goal of PU learning is to learn a classifier from only positive and unlabeled data. Many data mining algorithms, such as decision tree and naive Bayes algorithms, can be considered statistical query (SQ-like) learning algorithms, since they use samples only to evaluate statistical queries [6]. In [6], Denis et al. presented a scheme to convert any SQ-like learning algorithm into a positive and unlabeled learning algorithm. They also showed that it is possible to learn
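Denis et al.'s conversion rests on a standard PU identity: since the unlabeled set follows the overall distribution, any statistic over it decomposes into a prior-weighted mixture of the positive-class and negative-class statistics, so the negative-class statistic can be recovered without negative labels. A minimal sketch of that identity (our own toy code, not the paper's; the positive class prior is assumed known):

```python
def negative_stat(stat_unlabeled, stat_positive, pos_prior):
    """Estimate a statistic over the (unseen) negative class from the
    same statistic measured on unlabeled and positive samples, using
        E_U[f] = p * E_P[f] + (1 - p) * E_N[f]
    where p is the positive class prior."""
    return (stat_unlabeled - pos_prior * stat_positive) / (1.0 - pos_prior)

# Toy check: fraction of samples with some binary attribute X = 1.
p = 0.3                  # assumed positive class prior
frac_pos = 0.8           # Pr(X = 1 | positive), measured on positives
frac_neg_true = 0.1      # true Pr(X = 1 | negative), hidden from the learner
frac_unl = p * frac_pos + (1 - p) * frac_neg_true  # what unlabeled data shows
est = negative_stat(frac_unl, frac_pos, p)         # recovers ~0.1
```

Any SQ-like learner that only asks for such statistics (counts, frequencies) can therefore run on positive and unlabeled data alone.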

Data model

Uncertainty can arise in both numerical and categorical data. An attribute with imprecise data is called an uncertain attribute. We write X^u for the set of uncertain attributes, and X_i^u for the ith attribute in X^u. An attribute X_i^u can be an uncertain numerical attribute (UNA) or an uncertain categorical attribute (UCA). Based on [23], [31], we define UNA and UCA as follows.

Definition 1

[23], [31]

We write X_i^un for the ith UNA, and X_it^un for the value of X_i^un on the tth sample s_t. Then X_it^un is defined as a
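Although the definition is truncated in this snippet, a UNA value is, per [23], [31], characterized by a probability distribution over the attribute's domain rather than by a single number. As a hedged illustration (our own toy representation, not the paper's), a UNA value can be modeled as a discrete distribution that a split test queries for the probability mass on either side of a split point:

```python
# Toy model of an uncertain numerical attribute (UNA) value: a discrete
# probability distribution over candidate values (names are ours).
def uniform_una(lo, hi, n_bins):
    """A value known only to lie in [lo, hi], modeled as n_bins
    equally likely bin centers."""
    width = (hi - lo) / n_bins
    return [(lo + (i + 0.5) * width, 1.0 / n_bins) for i in range(n_bins)]

def prob_leq(dist, threshold):
    """Probability mass at or below a split threshold -- the quantity
    a split test on a UNA needs to evaluate."""
    return sum(p for v, p in dist if v <= threshold)

una = uniform_una(0.0, 10.0, 10)   # value known only to lie in [0, 10]
left_mass = prob_leq(una, 4.0)     # mass that would follow a "<= 4" branch
```

A UCA is analogous, with the distribution defined over a finite set of categories instead of bin centers.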

Fractional sample

We adopt the idea of fractional samples [26], [31] to assign an uncertain sample to tree nodes. For a sample s_t = (X^u, y, l) observed by an internal node N, the following sections introduce how to split it into fractional samples and how to assign these fractional samples to the children of node N.
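As an illustration of the fractional-sample idea (a sketch with made-up numbers, not the paper's implementation): when a node tests an uncertain attribute, the sample's weight is divided among the children in proportion to the probability mass its value distribution places on each branch:

```python
def split_fractional(weight, dist, threshold):
    """Split one uncertain sample into two fractional samples at a node
    testing `value <= threshold`: each child receives the sample with
    weight proportional to the probability mass on its side."""
    left_mass = sum(p for v, p in dist if v <= threshold)
    return weight * left_mass, weight * (1.0 - left_mass)

# Value uncertain over {20: 0.25, 30: 0.5, 40: 0.25}; node tests "<= 32".
w_left, w_right = split_fractional(1.0, [(20, 0.25), (30, 0.5), (40, 0.25)], 32)
# The left child sees a fractional sample of weight 0.75, the right 0.25.
```

Fractional weights propagate recursively, so a single uncertain sample can contribute to the statistics of several leaves at once.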

Choosing the splitting attribute

Choosing the splitting attribute is the key issue in building a very fast decision tree. At each step of tree growth, splitting measures are evaluated to select the test attribute. In this section, we first define the uncertain information gain for positive and unlabeled samples (puuIG) as the splitting measure. We then show how to maintain the sufficient statistics incrementally for both UNA and UCA, and how to use these statistics to evaluate puuIG.
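The exact puuIG formula is not reproduced in this snippet. As a hedged sketch of the general shape such a measure takes, the following computes a standard binary information gain from fractional positive/negative weights; in the PU setting, the negative weights would themselves be estimates derived from positive and unlabeled counts rather than observed labels:

```python
import math

def entropy(p_pos):
    """Binary entropy (in bits) of positive-class probability p_pos."""
    if p_pos <= 0.0 or p_pos >= 1.0:
        return 0.0
    return -p_pos * math.log2(p_pos) - (1.0 - p_pos) * math.log2(1.0 - p_pos)

def info_gain(parent, children):
    """Information gain from fractional counts. `parent` and each child
    are (pos_weight, neg_weight) pairs, where the weights are sums of
    fractional sample weights rather than integer counts."""
    pos, neg = parent
    n = pos + neg
    h_children = sum((cp + cn) / n * entropy(cp / (cp + cn))
                     for cp, cn in children if cp + cn > 0)
    return entropy(pos / n) - h_children

# A pure split of an evenly mixed parent yields the maximum gain of 1 bit.
gain = info_gain((5.0, 5.0), [(5.0, 0.0), (0.0, 5.0)])
```

Because the inputs are fractional weights, the same routine works unchanged whether the counts come from certain samples or from uncertain samples split across branches.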

The puuCVFDT algorithm

In this section, we give the details of the puuCVFDT algorithm, which builds a very fast decision tree from uncertain data streams with only positive and unlabeled samples available.

Algorithm 1

Building uncertain very fast decision tree for positive and unlabeled samples

Input:
 S a sequence of positive and unlabeled samples with uncertainty,
 Xu a set of uncertain attributes,
 G(.) uncertain information gain for positive and unlabeled samples,
 δ one minus the desired probability of choosing the correct attribute at any given

Experimental study

In this section, we conduct experiments to evaluate the performance of the proposed puuCVFDT algorithm by comparing it with the following baseline methods:

  • DTU and UDT: decision tree algorithms for static datasets with uncertainty.

  • POSC4.5: a PU learning algorithm for static datasets with certain attributes.

  • OcVFDT: a PU learning algorithm for data streams with certain categorical attributes.

  • UCVFDT: a supervised learner for data streams with uncertain categorical attributes.

  • CVFDT: a supervised learner for data stream

Conclusions

In this paper, based on CVFDT, we have proposed puuCVFDT, a novel very fast decision tree algorithm, to cope with uncertain data streams with positive and unlabeled samples. The main contributions of this paper include: we propose puuIG, an uncertain information gain for positive and unlabeled samples; we give methods to summarize continuously arriving data with imprecise numerical and/or categorical values into distributions, and we then use these distributions to calculate puuIG

Acknowledgements

This research is supported by the National Natural Science Foundation of China (60873196) and the Chinese Universities Scientific Fund (QN2009092). The research is also supported by the high-performance computing platform of Northwest A&F University.

References (37)

  • C.C. Aggarwal et al., A survey of uncertain data algorithms and applications, IEEE Transactions on Knowledge and Data Engineering (2009)
  • P. Domingos, G. Hulten, Mining high-speed data streams, in: Proceedings of the Sixth ACM SIGKDD International...
  • C. Elkan, K. Noto, Learning classifiers from only positive and unlabeled data, in: Proceedings of the 14th ACM SIGKDD...
  • G. Fung et al., Text classification without negative examples revisit, IEEE Transactions on Knowledge and Data Engineering (2006)
  • B. Gao, T. Liu, W. Wei, T. Wang, H. Li, Semi-supervised ranking on very large graphs with rich metadata...
  • C. Gao, J. Wang, Direct mining of discriminative patterns for classifying uncertain data, in: Proceedings of the 16th...
  • J. He, Y. Zhang, X. Li, Y. Wang, Naive Bayes classifier for positive unlabeled learning with uncertainty, in:...