Pattern Recognition Letters

Volume 90, 15 April 2017, Pages 28-35
Bayesian belief network for positive unlabeled learning with uncertainty

https://doi.org/10.1016/j.patrec.2017.03.007

Highlights

  • UPTAN, a Bayesian network classifier for uncertain data under the PU learning scenario, is given.

  • Uncertain Conditional Mutual Information (UCMI) is proposed.

  • The algorithm for learning the structure of the Bayesian network is given.

  • The approach for estimating parameters of the Bayesian network is given.

  • UPTAN outperforms UPNB, a state-of-the-art algorithm, in our experiments.

Abstract

The current state of the art for classifying static uncertain data under the PU learning (Positive Unlabeled learning) scenario is UPNB. It rests on the naive Bayes assumption of class-conditional attribute independence, which rarely holds in real-life applications and may therefore degrade UPNB's classification performance. In this paper, we propose UPTAN (Uncertain Positive Tree Augmented Naive Bayes), a Bayesian network algorithm that exploits the dependence information among uncertain attributes for classification. We propose uncertain conditional mutual information (UCMI) for measuring the mutual information between uncertain attributes, and use it to learn the tree structure of the Bayesian network. Furthermore, we give an approach for estimating the parameters of the Bayesian network from uncertain data without negative training examples. Our experiments on 20 UCI datasets show that UPTAN achieves excellent classification performance, with an average F1 of 0.8257, outperforming UPNB by 3.73%.

Introduction

Data from real-world applications contains uncertainty for various reasons, such as sampling error, imprecise measurement, outdated sources and privacy protection. Current research on uncertain data classification focuses on supervised learning, which requires a large number of labeled examples for training. However, labeled examples are expensive to collect, while unlabeled data may be abundant. Moreover, for some binary classification tasks only positive examples, i.e., elements of the target concept, are available [5]. The same scenario, in which only a subset of positive examples together with unlabeled ones is available, also arises in other real-life applications such as environmental monitoring and network intrusion detection. It is therefore helpful to design classification algorithms specifically for uncertain data under the PU learning scenario.

To the best of our knowledge, only He et al. [8] have addressed PU learning for static uncertain data. They proposed UPNB, which extends PNB (Positive Naive Bayes) [1] to handle uncertain data. However, UPNB relies on the naive Bayes assumption that, given the class label of an example, the values of its attributes are conditionally independent of one another [12]. Since class-conditional independence does not hold in real-world applications, it degrades the classification performance of naive Bayesian classifiers.

In the TAN (Tree Augmented Naive Bayes) model of Bayesian network, each attribute may have one other attribute as a parent in addition to the class node. Because some of the dependence information among attributes is encoded in the network structure, TAN outperforms naive Bayes in classification tasks.
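For ordinary (certain) data, the TAN tree is typically learned Chow-Liu style: compute the class-conditional mutual information I(Xi; Xj | C) for every attribute pair, then keep a maximum-weight spanning tree over those weights. A minimal sketch of that classical procedure follows; function names are ours, and fully labeled data is assumed, unlike the PU setting of this paper:

```python
from collections import Counter
from itertools import combinations
import math

def cond_mutual_info(rows, i, j):
    """I(Xi; Xj | C) from labeled rows; each row = (attr_values, label)."""
    n = len(rows)
    c_xyc, c_xc, c_yc, c_c = Counter(), Counter(), Counter(), Counter()
    for vals, c in rows:
        c_xyc[(vals[i], vals[j], c)] += 1
        c_xc[(vals[i], c)] += 1
        c_yc[(vals[j], c)] += 1
        c_c[c] += 1
    # Sum p(x,y,c) * log[ p(x,y|c) / (p(x|c) p(y|c)) ] over observed triples.
    return sum((n_xyc / n) * math.log(n_xyc * c_c[c] / (c_xc[(x, c)] * c_yc[(y, c)]))
               for (x, y, c), n_xyc in c_xyc.items())

def tan_tree(rows, n_attrs):
    """Maximum-weight spanning tree (Prim's algorithm) over pairwise CMI."""
    w = {(i, j): cond_mutual_info(rows, i, j)
         for i, j in combinations(range(n_attrs), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_attrs:
        # Pick the heaviest edge crossing the cut between tree and non-tree nodes.
        best = max(((i, j) for i, j in w if (i in in_tree) != (j in in_tree)),
                   key=lambda e: w[e])
        edges.append(best)
        in_tree.update(best)
    return edges
```

The resulting edges, oriented away from an arbitrary root, give each attribute its single attribute parent in the TAN structure.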

In this paper, we propose the UPTAN (Uncertain Positive Tree Augmented Naive Bayes) algorithm, a Bayesian network [3] for uncertain data under the PU learning scenario that exploits the dependence among attributes to improve classification performance. Two challenges arise in this task: learning the Bayesian network structure and estimating its parameters.

In [13], the puuVFDT algorithm is proposed to classify uncertain data streams under the PU learning scenario. Uncertain information gain is introduced to measure the importance of uncertain attributes, and it is plugged into the concept-adapting very fast decision tree (CVFDT) framework to cope with concept drift in data streams.

We borrow the idea of uncertain information gain from [13] to propose UCMI (Uncertain Conditional Mutual Information) for measuring the mutual information between uncertain attributes. With the help of UCMI, we give an algorithm for learning the tree structure of the Bayesian network. Furthermore, we propose an approach to parameter estimation for uncertain data without negative training examples.
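As a rough illustration of the expected-count idea behind such uncertain measures (our reading of the approach, not the paper's exact estimator): if each uncertain attribute value is a probability distribution over discrete values, the crisp counts of ordinary conditional mutual information can be replaced by expected (fractional) counts, here assuming the per-tuple distributions of different attributes are independent:

```python
from collections import defaultdict
import math

def ucmi(examples, i, j):
    """Sketch of uncertain CMI between attributes i and j.

    Each example is (attrs, label), where attrs[k] maps each possible
    value of attribute k to its probability. Expected (fractional)
    counts stand in for the crisp counts of ordinary CMI.
    """
    n = len(examples)
    c_xyc, c_xc, c_yc, c_c = (defaultdict(float) for _ in range(4))
    for attrs, c in examples:
        c_c[c] += 1.0
        for x, px in attrs[i].items():
            c_xc[(x, c)] += px
            for y, py in attrs[j].items():
                # Assumed independence of uncertainty within one tuple.
                c_xyc[(x, y, c)] += px * py
        for y, py in attrs[j].items():
            c_yc[(y, c)] += py
    return sum((w / n) * math.log(w * c_c[c] / (c_xc[(x, c)] * c_yc[(y, c)]))
               for (x, y, c), w in c_xyc.items() if w > 0)
```

When every distribution is degenerate (probability 1 on a single value), this reduces to the crisp conditional mutual information, which is a useful sanity check.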

Our experiments on 20 UCI datasets show that UPTAN achieves excellent classification performance, with an average F1 of 0.8257, outperforming UPNB, a state-of-the-art algorithm for classification under the same scenario, by 3.73%.

This paper is organized as follows. In the next section, we briefly discuss related work. In Section 3 we define the problem of PU learning for uncertain data. Section 4 presents the proposed algorithm in detail. The experimental results are reported in Section 5. Finally, Section 6 concludes this paper.

Section snippets

Related work

Here, we briefly review related work on uncertain data classification, PU learning and the TAN model [7].

Problem definition

Here, we formally define the problem of uncertain data classification under the PU learning scenario. In this paper, we give methods to build a UPTAN Bayesian network classifier from positive and unlabeled examples with uncertain discrete attributes. Uncertain examples with continuous attributes can be discretized first [8].

Suppose P is a set of positive examples and U is a set of unlabeled examples; the training dataset D in the PU learning scenario is D = P ∪ U. Each example can be
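One way to represent such a dataset in code, with each uncertain discrete attribute stored as a probability distribution over its values (names and field choices are ours, purely illustrative):

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# An uncertain discrete attribute value is a distribution over values.
UncertainValue = Dict[str, float]

@dataclass
class UncertainExample:
    attrs: List[UncertainValue]    # one distribution per attribute
    label: Optional[str] = None    # 'pos' for examples in P, None for U

# Training data in the PU scenario: D = P ∪ U.
P = [UncertainExample([{'low': 0.7, 'high': 0.3}], label='pos')]
U = [UncertainExample([{'low': 0.2, 'high': 0.8}])]
D = P + U
```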

Uncertain positive tree augmented Naive Bayes

Here, we present the UPTAN (Uncertain Positive Tree Augmented Naive Bayes) algorithm, a Bayesian network, to cope with the classification task for uncertain data under the PU learning scenario. In the classical naive Bayes algorithm, parameters are estimated from the data by maximum likelihood estimators, but in the positive unlabeled learning scenario the absence of negative examples makes it infeasible to estimate probabilities associated with negative examples directly [1]. Another problem here is data in
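A standard PNB-style workaround for the missing negatives is the decomposition P(x) = p·P(x|+) + (1−p)·P(x|−): P(x) is estimated from the unlabeled set, P(x|+) from the positive set, and the positive class prior p is assumed known, so P(x|−) can be solved for. A hedged sketch of that idea; the clipping and renormalization choices here are ours, not necessarily the paper's exact smoothed estimator:

```python
def negative_distribution(px_unlabeled, px_positive, prior_pos):
    """Estimate P(x | -) from unlabeled and positive statistics.

    Solves P(x|-) = (P(x) - p * P(x|+)) / (1 - p) per value, clipping
    negative estimates (a finite-sample artifact) to zero, then
    renormalizing so the result is a proper distribution.
    """
    raw = {v: max((pu - prior_pos * px_positive.get(v, 0.0)) / (1.0 - prior_pos), 0.0)
           for v, pu in px_unlabeled.items()}
    z = sum(raw.values())
    return {v: r / z for v, r in raw.items()} if z > 0 else raw
```

For example, if a value is far more common among positives than in the unlabeled pool, its estimated probability under the negative class shrinks accordingly.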

Experiment

Due to the lack of real uncertain datasets, we validate our algorithm on UCI datasets; this experimental setting is widely used by the uncertain data classification research community [8], [11]. 20 UCI datasets are used in our experiments. Missing values were handled by the ReplaceMissingValues filter in WEKA, which replaces missing values with the most frequent value. Since we focus on classification

Conclusion and future work

In this paper, we tackle the problem of uncertain data classification under the PU learning scenario in a Bayesian belief network framework. We propose UCMI (Uncertain Conditional Mutual Information), which is essential both for learning the tree structure of the Bayesian network and for learning the CPT parameters from uncertain positive and unlabeled data, so as to build the UPTAN classifier. Experiments on 20 UCI datasets show that UPTAN outperforms UPNB, a state-of-the-art

References (24)

  • J. He et al., Bayesian classifiers for positive unlabeled learning, Web-Age Information Management (2011)

  • J. Hernández-González et al., A novel weakly supervised problem: learning from positive-unlabeled proportions, Conference of the Spanish Association for Artificial Intelligence (2015)