Bayesian belief network for positive unlabeled learning with uncertainty
Introduction
Data from real-world applications contains uncertainty for various reasons, such as sampling error, imprecise measurement, outdated sources, and privacy protection. Current research on uncertain data classification focuses on supervised learning, which requires a large number of labeled examples for training. However, labeled examples are expensive to collect, while unlabeled data may be abundant. Moreover, for some binary classification tasks only positive examples, i.e., elements of the target concept, are available [5]. The same scenario, in which only a subset of the positive examples together with unlabeled data is available, also arises in real-life applications such as environmental monitoring and network intrusion detection. It is therefore helpful to design classification algorithms specifically for uncertain data under the positive-unlabeled (PU) learning scenario.
To the best of our knowledge, only He et al. [8] have addressed PU learning for uncertain static data. He et al. [8] proposed UPNB, which extends PNB (Positive Naive Bayes) [1] to handle uncertain data. However, UPNB relies on the naive Bayesian assumption that, given the class label of an example, the values of the attributes are conditionally independent of one another [12]. Since class-conditional independence rarely holds in real-world applications, this assumption degrades the classification performance of naive Bayesian classifiers.
In the TAN (Tree Augmented Naive Bayes) model of Bayesian network, each attribute may have at most one other attribute as an additional parent besides the class node. Because some of the dependence information among attributes is encoded in the network structure, TAN outperforms naive Bayes in classification tasks.
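To make the contrast with naive Bayes concrete, the following sketch shows TAN-style prediction, where each attribute is conditioned on the class and on at most one attribute parent. All function and variable names here are illustrative, not the UPTAN implementation:

```python
def tan_log_score(example, cls, log_prior, log_cpt, parent):
    """Log-score of class `cls` for one example under a TAN model.

    log_prior[cls]                 : log P(c)
    log_cpt[(i, x_i, pa_val, cls)] : log P(X_i = x_i | parent value, c)
    parent[i]                      : index of attribute i's attribute parent,
                                     or None for the tree root (whose only
                                     parent is the class node).
    """
    score = log_prior[cls]
    for i, x_i in enumerate(example):
        pa = parent[i]
        pa_val = example[pa] if pa is not None else None
        score += log_cpt[(i, x_i, pa_val, cls)]
    return score


def tan_predict(example, classes, log_prior, log_cpt, parent):
    # Pick the class maximizing P(c) * prod_i P(x_i | parent(x_i), c).
    return max(classes,
               key=lambda c: tan_log_score(example, c, log_prior, log_cpt, parent))
```

In naive Bayes, `parent[i]` would be `None` for every attribute; TAN's single extra parent per attribute is what lets it capture pairwise dependence at low cost.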
In this paper, we propose the UPTAN (Uncertain Positive Tree Augmented Naive Bayes) algorithm, a Bayesian network [3] for uncertain data under the PU learning scenario that exploits the dependence among attributes to improve classification performance. Two challenges are identified for this task: Bayesian network structure learning and parameter estimation.
In [13], the puuVFDT algorithm is proposed to classify uncertain data streams under the PU learning scenario. Uncertain information gain is introduced for measuring the importance of uncertain attributes and is plugged into the concept-adapting very fast decision tree (CVFDT) framework to cope with the concept drift problem of data streams.
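The idea behind uncertain information gain can be illustrated as follows: when each attribute value is a probability distribution over categories, hard counts are replaced by expected (fractional) counts. The sketch below conveys the idea only; it is not the exact formulation used in puuVFDT [13]:

```python
import math


def entropy(counts):
    """Shannon entropy (bits) of a dict of (possibly fractional) counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def uncertain_info_gain(examples, labels, attr_values):
    """Expected-count information gain for one uncertain attribute.

    Each example supplies a probability distribution over attr_values,
    e.g. {'sunny': 0.7, 'rainy': 0.3}; fractional probability mass is
    accumulated instead of hard counts.
    """
    # Class entropy before the split.
    class_counts = {}
    for y in labels:
        class_counts[y] = class_counts.get(y, 0) + 1
    h_before = entropy(class_counts)

    # Expected class counts within each attribute value.
    branch = {v: {} for v in attr_values}
    for dist, y in zip(examples, labels):
        for v, p in dist.items():
            branch[v][y] = branch[v].get(y, 0.0) + p

    n = len(labels)
    h_after = 0.0
    for v in attr_values:
        mass = sum(branch[v].values())
        if mass > 0:
            h_after += (mass / n) * entropy(branch[v])
    return h_before - h_after
```

An attribute whose distributions separate the classes perfectly yields the full class entropy as gain, while a uniformly uncertain attribute yields zero.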
We borrow the idea of uncertain information gain from [13] to propose UCMI (Uncertain Conditional Mutual Information) for measuring the mutual information between uncertain attributes. With the help of UCMI, an algorithm for learning the tree structure of the Bayesian network is proposed. Furthermore, we propose an approach for parameter estimation from uncertain data without negative training examples.
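Once pairwise scores such as UCMI are available, tree structure learning follows the classical Chow-Liu recipe: build a maximum-weight spanning tree over the attributes. A minimal sketch, assuming the pairwise scores are precomputed (`build_tree` is an illustrative helper, not the paper's code):

```python
def build_tree(n_attrs, cmi):
    """Maximum-weight spanning tree over attributes (Prim's algorithm),
    as in Chow-Liu / TAN structure learning.

    cmi[(i, j)] : conditional mutual information I(X_i; X_j | C);
                  for uncertain data, UCMI would supply these scores.
    Returns parent[i] for each attribute; attribute 0 is taken as root.
    """
    weight = lambda i, j: cmi.get((i, j), cmi.get((j, i), 0.0))
    in_tree = {0}
    parent = {0: None}
    while len(in_tree) < n_attrs:
        # Attach the out-of-tree attribute with the strongest link to the tree.
        i, j = max(
            ((a, b) for a in in_tree for b in range(n_attrs) if b not in in_tree),
            key=lambda edge: weight(*edge),
        )
        parent[j] = i
        in_tree.add(j)
    return parent
```

The resulting `parent` map is exactly what a TAN-style predictor needs: each attribute is conditioned on the class plus its single tree parent.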
Our experiments on 20 UCI datasets show that UPTAN achieves excellent classification performance, with an average F1 of 0.8257, outperforming UPNB, a state-of-the-art algorithm for classification under the same scenario, by 3.73%.
This paper is organized as follows. In the next section, we briefly discuss related work. In Section 3 we define the problem of PU learning for uncertain data. Section 4 illustrates the proposed algorithm in detail. The experimental results are shown in Section 5. Finally, Section 6 concludes the paper.
Related work
Here, we briefly review related work on uncertain data classification, PU learning, and the TAN model [7].
Problem definition
Here, we formally define the problem of uncertain data classification under the PU learning scenario. In this paper, we give methods to build a UPTAN Bayesian network classifier from positive and unlabeled examples with uncertain discrete attributes. Uncertain examples with continuous attributes can be discretized first [8].
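As a concrete illustration of that preprocessing step, a simple equal-width discretization could look like the following (the exact discretization scheme used in [8] may differ):

```python
def equal_width_bins(values, n_bins=5):
    """Discretize a continuous attribute into equal-width bins.

    Returns the bin index (0 .. n_bins-1) for each value; a constant
    column falls back to bin width 1.0 so every value lands in bin 0.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```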
Suppose P is a set of positive examples and U is a set of unlabeled examples; the training dataset D in the PU learning scenario is D = P ∪ U. Each example can be
Uncertain positive tree augmented Naive Bayes
Here, we present the UPTAN (Uncertain Positive Tree Augmented Naive Bayes) algorithm, a Bayesian network classifier for uncertain data under the PU learning scenario. In the classical naive Bayes algorithm, parameters are estimated from the data by maximum likelihood estimation, but in the positive-unlabeled learning scenario the absence of negative examples makes it infeasible to estimate any probabilities associated with the negative class [1]. Another problem here is data in
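One standard way around the missing negatives, used by PNB [1], is to express the negative class-conditional probability through the law of total probability, estimating P(x) from the unlabeled set and assuming the positive prior P(+) is known. A minimal sketch of that estimate (illustrative, not the paper's exact estimator for uncertain data):

```python
def negative_conditional(p_x_unlabeled, p_x_pos, prior_pos):
    """Estimate P(x | negative) without any negative examples.

    By total probability, P(x) = P(x|+)P(+) + P(x|-)P(-), hence
        P(x|-) = (P(x) - P(x|+) * P(+)) / (1 - P(+)).
    p_x_unlabeled approximates P(x) from the unlabeled set, p_x_pos is
    estimated from the positives, and prior_pos = P(+) is assumed known.
    Clamping at 0 guards against sampling noise.
    """
    est = (p_x_unlabeled - p_x_pos * prior_pos) / (1.0 - prior_pos)
    return max(est, 0.0)
```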
Experiment
Due to the lack of real uncertain datasets, we validate our algorithm on UCI1 datasets; this experimental setting is widely used by the uncertain data classification research community [8], [11]. 20 UCI datasets are used in our experiment. Missing values in the datasets were handled by ReplaceMissingValues in WEKA2, a filter that replaces missing values with the most frequent value. Since we focus on classification
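For readers without WEKA, the effect of ReplaceMissingValues on a nominal attribute can be reproduced in a few lines of Python (a minimal equivalent for a single column, not WEKA itself):

```python
from collections import Counter


def replace_missing(column, missing=None):
    """Replace missing entries with the most frequent observed value,
    mirroring WEKA's ReplaceMissingValues filter for nominal attributes."""
    observed = [v for v in column if v != missing]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v == missing else v for v in column]
```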
Conclusion and future work
In this paper, we tackle the problem of uncertain data classification under the PU learning scenario in a Bayesian belief network framework. We propose UCMI (Uncertain Conditional Mutual Information), which underlies both the algorithm for learning the tree structure of the Bayesian network and the approach for learning the CPT parameters from uncertain positive and unlabeled data, so as to build the UPTAN classifier. Experiments on 20 UCI datasets show that UPTAN outperforms UPNB, a state-of-the-art
References (24)
- et al., Learning Bayesian classifiers from positive and unlabeled examples, Pattern Recognit. Lett. (2007)
- et al., Learning from positive and unlabeled examples, Theor. Comput. Sci. (2005)
- et al., Positive and unlabeled learning in categorical data, Neurocomputing (2016)
- et al., Learning very fast decision tree from uncertain data streams with positive and unlabeled samples, Inf. Sci. (2012)
- et al., Approximating discrete probability distributions with dependence trees, IEEE Trans. Inf. Theory (1968)
- et al., Network-Based Heuristics for Constraint-Satisfaction Problems (1988)
- et al., Maximum likelihood from incomplete data via the EM algorithm, J. Royal Stat. Soc. (1977)
- et al., Learning classifiers from only positive and unlabeled data, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008)
- et al., Bayesian network classifiers, Mach. Learn. (1997)
- et al., Naive Bayes classifier for positive unlabeled learning with uncertainty, in: SDM (2010)