Elsevier

Knowledge-Based Systems

Volume 24, Issue 8, December 2011, Pages 1151-1158

A novel Bayesian classification for uncertain data

https://doi.org/10.1016/j.knosys.2011.04.011

Abstract

Data uncertainty can be caused by numerous factors such as measurement precision limitations, network latency, data staleness, and sampling errors. When mining knowledge from emerging applications such as sensor networks or location-based services, data uncertainty should be handled carefully to avoid erroneous results. In this paper, we apply probability and statistical theory to uncertain data and develop a novel method for calculating the conditional probabilities in Bayes' theorem. Based on that, we propose a novel Bayesian classification algorithm for uncertain data. The experimental results show that the proposed method classifies uncertain data with potentially higher accuracy than the Naive Bayesian approach. It also performs more stably than the existing extended Naive Bayesian method.

Introduction

In many applications, data contain inherent uncertainty. A number of factors contribute to this uncertainty, such as the random nature of the physical data generation and collection process, measurement and decision errors, and unreliable data transmission. For example, in location-based services, moving objects of interest are equipped with locators whose readings are periodically updated and streamed to a control center. However, these location data are typically inaccurate due to locator energy and precision limitations, network bandwidth constraints, and latency. Sensor networks likewise produce massive amounts of uncertain data, such as temperature, humidity, and pressure readings.

When mining knowledge from these applications, data uncertainty needs to be handled with caution; otherwise, unreliable or even wrong mining results may be obtained. In this paper, we focus on Naive Bayesian classification for uncertain data. Naive Bayesian classification is tremendously appealing because of its simplicity, elegance, and robustness. It is one of the oldest formal classification methods, and it is often surprisingly effective. A large number of modifications have been introduced by the statistical, data mining, machine learning, and pattern recognition communities in an attempt to make it more flexible [32], and it is widely used in areas such as text classification and spam filtering. Building on Naive Bayesian classification, we propose a novel method to directly classify and predict uncertain data. The main contributions of this paper are:

  • Based on a new method for calculating the conditional probabilities in Bayes' theorem, we extend Naive Bayesian classification so that it can process uncertain data.

  • We show through extensive experiments that the proposed classifier can be generated efficiently and classifies uncertain data with potentially higher accuracy than the Naive Bayesian classifier. Furthermore, the proposed classifier is better suited to mining uncertain data than previous work [23].

This paper is organized as follows. In the next section, we discuss related work. Section 3 introduces basic concepts of Naive Bayesian classification. Section 4 describes the techniques to calculate conditional probabilities for uncertain numerical data sets. Section 5 describes the Bayesian algorithm for uncertain data and its prediction. The experimental results are shown in Section 6. Section 7 concludes the paper.

Section snippets

Related work

Uncertain data, also called symbolic data [4], [13], have been studied for many years. Many works focus on clustering [5], [7], [11], [16], [20]. The key idea is that, when computing the distance between two uncertain objects, the probability distributions of the objects are used to calculate an expected distance. In [11], Cormode and McGregor showed that clustering problems on uncertain data reduce to their corresponding weighted versions. In [33], Xia and Xi introduced a new conceptual clustering algorithm

Background

The Naive Bayesian classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent given the class label C_k. Suppose that there are n classes, C_1, C_2, …, C_n; the conditional independence assumption [26] can be formally stated as follows:

P(X | C_k) = ∏_{i=1}^{m} P(X_i | C_k)

where the attribute set X = {X_1, X_2, …, X_m} consists of m attributes.

With the conditional independence assumption, instead of computing the class-conditional probability for every combination
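The assumption above can be illustrated for ordinary (certain) numeric attributes. The sketch below assumes Gaussian class-conditional densities and made-up parameter values; it is not the paper's exact setup, only a minimal baseline Naive Bayes prediction.

```python
import math

def gaussian_pdf(x, mean, var):
    """Gaussian density N(x; mean, var)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def naive_bayes_predict(x, classes):
    """Pick the class C_k maximizing P(C_k) * prod_i P(X_i | C_k),
    i.e. the prior times the per-attribute likelihoods under
    the conditional independence assumption."""
    best, best_score = None, -1.0
    for label, (prior, params) in classes.items():
        score = prior
        for xi, (mean, var) in zip(x, params):
            score *= gaussian_pdf(xi, mean, var)
        if score > best_score:
            best, best_score = label, score
    return best

# Two classes, one numeric attribute; parameters are illustrative only.
classes = {
    "C1": (0.5, [(0.0, 1.0)]),
    "C2": (0.5, [(5.0, 1.0)]),
}
print(naive_bayes_predict([0.3], classes))  # closer to C1's mean -> "C1"
```

With uncertain attributes, the per-attribute factors P(X_i | C_k) can no longer be read off a density at a point, which is the problem the next section addresses.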

Conditional probabilities for uncertain numerical attributes

In this section, we describe the uncertain data model and a new approach to calculating conditional probabilities for uncertain numerical data. In this paper, we focus on uncertainty in the attributes and assume that the class label is certain.
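The paper's actual formulas (Theorems 1 and 2) are not shown in this snippet. As a hedged illustration of the general idea, one common treatment models an uncertain numeric attribute as an interval [a, b] and obtains a conditional probability by integrating an assumed class-conditional Gaussian density over that interval, via the Gaussian CDF:

```python
import math

def gaussian_cdf(x, mean, std):
    """CDF of N(mean, std^2), computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def interval_conditional_prob(a, b, mean, std):
    """P(a <= X_i <= b | C_k) when the class-conditional density of
    attribute X_i is N(mean, std^2): the probability mass over [a, b]."""
    return gaussian_cdf(b, mean, std) - gaussian_cdf(a, mean, std)

# Mass of a standard normal within one standard deviation of its mean.
p = interval_conditional_prob(-1.0, 1.0, 0.0, 1.0)
print(round(p, 3))  # 0.683
```

The Gaussian form and the interval model are assumptions for this sketch; the paper may use other distributions over the uncertain region.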

Uncertain Bayesian classification and prediction

Based on Theorems 1 and 2, this section discusses the techniques used to construct the classifier for uncertain data and to predict the class label of previously unseen data. When the classification is based on Theorem 1, we call it Naive Bayesian classification one and denote it NBU1; the other is called Naive Bayesian classification two and denoted NBU2 [23].

Experiments

Using Java, we implemented the proposed Bayesian classification to classify uncertain data sets. When NBU1 and NBU2 are applied to certain data, they reduce to standard Naive Bayesian classification (NB), which has been implemented in Weka [31]. In the following experiments, we use ten runs of ten-fold cross-validation. For each ten-fold cross-validation, the data is split into 10 approximately equal partitions; each partition is used in turn for testing while the rest are used for training, that is, 9/10 of data
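The fold construction described above can be sketched as follows. This is a generic illustration of ten-fold splitting, not the paper's Java/Weka code; the seed and data size are arbitrary.

```python
import random

def ten_fold_indices(n, seed=0):
    """Shuffle n example indices and deal them into 10 roughly
    equal folds (round-robin after shuffling)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(100)

# Each fold serves once as the test set; the other nine (9/10 of the
# data) form the training set. Here, fold 0 is the held-out test fold.
test = folds[0]
train = [i for j, fold in enumerate(folds) if j != 0 for i in fold]
print(len(train), len(test))  # 90 10
```

Repeating this whole procedure ten times with different shuffles gives the "ten times ten-fold" estimate used in the experiments.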

Conclusions

In this paper, we propose a novel Bayesian classification method for classifying and predicting uncertain data sets. Uncertain data are pervasive in modern applications such as sensor databases and biometric information systems. Instead of trying to eliminate uncertainty and noise from data sets, this paper follows the new paradigm of directly mining uncertain data. We integrate the uncertain data model with Bayes' theorem and propose new techniques to calculate conditional probabilities.

Acknowledgments

This work is supported by the Fundamental Research Funds for the Central Universities and the Research Funds of Renmin University of China under Grant No. 10XNJ048.

References (36)

  • N. Cesa-Bianchi, C. Gentile, L. Zaniboni, Hierarchical classification: combining Bayes with SVM, in: Proceedings of...
  • M. Chavent et al., New clustering methods for interval data, Computational Statistics (2006)
  • R. Cheng, D. Kalashnikov, S. Prabhakar, Evaluating probabilistic queries over imprecise data, in: Proceedings of the...
  • C. Chui, B. Kao, E. Hung, Mining frequent itemsets from uncertain data, in: Proceedings of the PAKDD, 2007, pp....
  • W. Cohen, Fast effective rule induction, in: Proceedings of ICML, 1995,...
  • G. Cormode, A. McGregor, Approximation algorithm for clustering uncertain data, in: Proceedings of the ACM PODS, 2008,...
  • E. Diday et al., Symbolic Data Analysis and the Sodas Software (2008)
  • C. Gao, J. Wang, Direct mining of discriminative patterns for classifying uncertain data, in: Proceedings of the...