A novel Bayesian classification for uncertain data
Introduction
In many applications, data contains inherent uncertainty. A number of factors contribute to this uncertainty, such as the random nature of the physical data generation and collection process, measurement and decision errors, and unreliable data transmission. For example, in location-based services, moving objects of interest are equipped with locators, and their location information is periodically updated and streamed to the control center. However, these location data are typically inaccurate due to locator energy and precision limitations, network bandwidth constraints, and latency. Sensor networks likewise produce massive amounts of uncertain data, such as temperature, humidity, and pressure readings.
When mining knowledge from these applications, data uncertainty needs to be handled with caution; otherwise, unreliable or even wrong mining results may be obtained. In this paper, we focus on Naive Bayesian classification for uncertain data. Naive Bayesian classification is tremendously appealing because of its simplicity, elegance, and robustness. It is one of the oldest formal classification methods, and it is often surprisingly effective. A large number of modifications have been introduced by the statistical, data mining, machine learning, and pattern recognition communities in an attempt to make it more flexible [32], and it is widely used in areas such as text classification and spam filtering. Building on Naive Bayesian classification, we propose a novel method to directly classify and predict uncertain data. The main contributions of this paper are:
- Based on a new method for calculating the conditional probabilities in Bayes' theory, we extend Naive Bayesian classification so that it can process uncertain data.
- We show through extensive experiments that the proposed classifier can be generated efficiently and can classify uncertain data with potentially higher accuracy than the Naive Bayesian classifier. Furthermore, the proposed classifier is more suitable for mining uncertain data than the previous work [23].
This paper is organized as follows. In the next section, we discuss related work. Section 3 introduces basic concepts of Naive Bayesian classification. Section 4 describes the techniques to calculate conditional probabilities for uncertain numerical data sets. Section 5 describes the Bayesian algorithm for uncertain data and its prediction. The experimental results are shown in Section 6. Section 7 concludes the paper.
Section snippets
Related work
Uncertain data, also called symbolic data [4], [13], has been studied for many years. Many works focus on clustering [5], [7], [11], [16], [20]. The key idea is that when computing the distance between two uncertain objects, the probability distributions of the objects are used to calculate an expected distance. In [11], Cormode and McGregor showed that several clustering problems on uncertain data reduce to their corresponding weighted versions on certain data. In [33], Xia and Xi introduced a new conceptual clustering algorithm
Background
The Naive Bayesian classifier estimates the class-conditional probability by assuming that the attributes are conditionally independent given the class label Ck. Suppose that there are n classes C1, C2, …, Cn; the conditional independence assumption [26] can be formally stated as follows:

P(X | Ck) = P(X1 | Ck) × P(X2 | Ck) × ⋯ × P(Xm | Ck),

where the attribute set X = {X1, X2, …, Xm} consists of m attributes.
With the conditional independence assumption, instead of computing the class-conditional probability for every combination
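Under this assumption, classification reduces to multiplying the class prior by one conditional probability per attribute and picking the class with the largest product. The following is a minimal sketch only: the Gaussian per-attribute models and all parameter values are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def naive_bayes_scores(x, classes):
    """Unnormalised posteriors P(Ck) * prod_i P(xi | Ck) under the
    conditional independence assumption.

    `classes` maps a label to (prior, [(mu, sigma), ...]),
    one (mu, sigma) pair per attribute."""
    scores = {}
    for label, (prior, params) in classes.items():
        score = prior
        for xi, (mu, sigma) in zip(x, params):
            score *= gaussian_pdf(xi, mu, sigma)
        scores[label] = score
    return scores

# Two classes with two numerical attributes; parameters are illustrative.
classes = {
    "A": (0.5, [(0.0, 1.0), (0.0, 1.0)]),
    "B": (0.5, [(3.0, 1.0), (3.0, 1.0)]),
}
scores = naive_bayes_scores([0.2, -0.1], classes)
prediction = max(scores, key=scores.get)  # the tuple lies near class A's means
```

Normalising the scores by their sum recovers the actual posterior probabilities, but for prediction only the argmax matters.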
Conditional probabilities for uncertain numerical attributes
In this section, we describe the uncertain data model and the new approach for calculating conditional probabilities for uncertain numerical data. In this paper, we focus on the uncertainty in attributes and assume the class type is certain.
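The paper's exact formulas for uncertain attributes (Theorems 1 and 2) are not shown in this excerpt. A common way to handle an uncertain numerical attribute, sketched below under the assumptions that the uncertainty is uniform over an interval and the class model is Gaussian, is to replace the point density p(x | Ck) with its expectation over the uncertainty region:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def expected_likelihood(lo, hi, mu, sigma, steps=1000):
    """Approximate E[p(X | Ck)] when the uncertain attribute X is
    uniform on [lo, hi] and the class model is N(mu, sigma^2),
    using the midpoint rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        total += gaussian_pdf(x, mu, sigma)
    return total / steps

# An exact reading x = 1.0 versus an uncertain reading on [0.5, 1.5]:
exact = gaussian_pdf(1.0, 0.0, 1.0)                   # ~0.2420
uncertain = expected_likelihood(0.5, 1.5, 0.0, 1.0)   # ~0.2417
```

For a uniform interval the expectation is just the Gaussian CDF difference divided by the interval width, so the numeric integration here is only for transparency of the idea.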
Uncertain Bayesian classification and prediction
Based on Theorem 1 and Theorem 2, this section discusses the techniques to construct the classifier for uncertain data and to predict the class type of previously unseen data. If the classification is based on Theorem 1, we call it Naive Bayesian classification one and denote it by NBU1. The other classification is called Naive Bayesian classification two and is denoted by NBU2 [23].
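The prediction step for an uncertain tuple has the same argmax structure as ordinary Naive Bayes, with each point likelihood replaced by a likelihood computed over the attribute's uncertainty region. This sketch is not NBU1 or NBU2 themselves (their formulas are not in this excerpt); it assumes uniform interval uncertainty, Gaussian class models, and illustrative parameters:

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def interval_likelihood(lo, hi, mu, sigma, steps=200):
    """Average the class-conditional density over an uncertainty interval."""
    width = (hi - lo) / steps
    return sum(gaussian_pdf(lo + (i + 0.5) * width, mu, sigma)
               for i in range(steps)) / steps

def classify_uncertain(intervals, classes):
    """Pick argmax_k P(Ck) * prod_i E[p(Xi | Ck)] for a tuple whose
    attributes are given as uncertainty intervals."""
    best_label, best_score = None, -1.0
    for label, (prior, params) in classes.items():
        score = prior
        for (lo, hi), (mu, sigma) in zip(intervals, params):
            score *= interval_likelihood(lo, hi, mu, sigma)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# One interval-valued attribute; class parameters are illustrative.
classes = {"A": (0.5, [(0.0, 1.0)]), "B": (0.5, [(4.0, 1.0)])}
label = classify_uncertain([(0.5, 1.5)], classes)  # interval lies nearer class A's mean
```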
Experiments
Using Java, we implemented the proposed Bayesian classification to classify uncertain data sets. When NBU1 and NBU2 are applied to certain data, they work as the Naive Bayesian classification (NB), which has been implemented in Weka [31]. In the following experiments, we use ten times ten-fold cross validation. For every ten-fold cross validation, data is split into 10 approximately equal partitions; each one is used in turn for testing while the rest are used for training, that is, 9/10 of data
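The ten times ten-fold protocol above can be sketched as index generation: ten independent shuffles, each split into ten folds, yielding 100 train/test splits in total. The function name and seeding below are illustrative, not from the paper's Java implementation:

```python
import random

def ten_by_tenfold(n, seed=0):
    """Yield (train, test) index lists for ten repetitions of
    ten-fold cross validation over n examples."""
    rng = random.Random(seed)
    for _ in range(10):
        idx = list(range(n))
        rng.shuffle(idx)                       # fresh shuffle per repetition
        folds = [idx[i::10] for i in range(10)]  # 10 roughly equal folds
        for k in range(10):
            test = folds[k]
            train = [j for f, fold in enumerate(folds) if f != k for j in fold]
            yield train, test

splits = list(ten_by_tenfold(50))  # 100 splits; each test fold holds ~1/10 of the data
```

Accuracy is then averaged over all 100 splits, which reduces the variance introduced by any single partitioning.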
Conclusions
In this paper, we propose a novel Bayesian classification for classifying and predicting uncertain data sets. Uncertain data are pervasive in modern applications such as sensor databases and biometric information systems. Instead of trying to eliminate uncertainty and noise from data sets, this paper follows the new paradigm of directly mining uncertain data. We integrate the uncertain data model with Bayes' theorem and propose new techniques to calculate conditional probabilities.
Acknowledgments
This work is supported by the Fundamental Research Funds for the Central Universities, and the Research Funds of Renmin University of China under Grant No. 10XNJ048.
References (36)
- Feature interval learning algorithms for classification, Knowledge-Based Systems (2010)
- Research on the multi-attribute decision-making under risk with interval probability based on prospect theory and the uncertain linguistic variables, Knowledge-Based Systems (2011)
- Gaussian kernel optimization for pattern classification, Pattern Recognition (2009)
- An extended TOPSIS for determining weights of decision makers with interval numbers, Knowledge-Based Systems (2011)
- Entropy of interval-valued fuzzy sets based on distance and its relationship with similarity measure, Knowledge-Based Systems (2009)
- C. Aggarwal, Y. Li, J. Wang, J. Wang, Frequent pattern mining with uncertain data, in: Proceedings of SIGKDD, 2009, ...
- T. Bernecker, H. Kriegel, M. Renz, F. Verhein, A. Züfle, Probabilistic frequent itemset mining in uncertain databases, ...
- Analysis of symbolic data, in: Exploratory Methods for Extracting Statistical Information from Complex Data (2000)
- Dynamic clustering for interval data based on L2 distance, Computational Statistics (2006)
- New clustering methods for interval data, Computational Statistics
- Symbolic Data Analysis and the Sodas Software
Cited by (49)
- Evidential reasoning based ensemble classifier for uncertain imbalanced data, Information Sciences (2021)
- Dengue models based on machine learning techniques: A systematic literature review, Artificial Intelligence in Medicine (2021)
- An integrated approach for modelling and quantifying housing infrastructure resilience against flood hazard, Journal of Cleaner Production (2021)
- An interval fault diagnosis method for rotor cracks, Computers and Electrical Engineering (2020)
- PwAdaBoost: Possible world based AdaBoost algorithm for classifying uncertain data, Knowledge-Based Systems (2019)