Not always simple classification: Learning SuperParent for class probability estimation

https://doi.org/10.1016/j.eswa.2015.02.049

Highlights

  • Accurate class probability estimation is also desirable in many applications.

  • We investigated the class probability estimation performance of SuperParent.

  • We proposed an improved SuperParent algorithm based on conditional log likelihood.

  • Experimental results on a large number of datasets validate its effectiveness.

Abstract

Of the numerous proposals to improve naive Bayes (NB) by weakening its attribute independence assumption, SuperParent (SP) has demonstrated remarkable classification performance. In many real-world applications, however, accurate class probability estimation of instances is more desirable than simple classification. For example, we often need to recommend to customers the commodities with the highest likelihood (class probability) of purchase. Conditional log likelihood (CLL) is currently a well-accepted measure of the quality of class probability estimation. Inspired by this, in this paper, we first investigate the class probability estimation performance of SP in terms of CLL and find that it almost ties the original distribution-based tree augmented naive Bayes (TAN). To scale up its class probability estimation performance, we then propose an improved CLL-based SuperParent algorithm (CLL-SP). In CLL-SP, a CLL-based approach, instead of a classification-based approach, is used to find the augmenting arcs. The experimental results on a large suite of benchmark datasets show that our CLL-based approach (CLL-SP) significantly outperforms the classification-based approach (SP) and the original distribution-based approach (TAN) in terms of CLL, while at the same time maintaining the high classification accuracy that characterizes the classification-based approach (SP).

Introduction

Classification is a basic task in data mining and machine learning. The goal of learning algorithms for classification is to construct a classifier from instances with class labels. The classifier then predicts the class label for instances described by a set of attribute values. The predictive ability of a classifier is typically measured by its classification accuracy or error rate on test instances. Broadly, these classifiers can be divided into two main categories: probability-based classifiers and margin-based classifiers. In this paper, we focus our attention on probability-based classifiers, which can also produce probability estimates and thus offer a "confidence" in the class prediction, that is, information about how "far off" (be it 0.99 or 0.01) the prediction for each instance is from its true class label. Yet this information is often ignored when these probability-based classifiers are used for simple classification.

In probability-based classification, accurate class probability estimation plays the most important role. In fact, accurate class probability estimation is also widely used in other paradigms of machine learning, such as probability-based ranking (Provost & Domingos, 2003) and cost-sensitive learning (Margineantu, 2005, Wang et al., 2012), and in many other areas, such as information retrieval (Gupta, Saini, & Saxena, 2015), distance learning (Li & Li, 2013), expert and intelligent systems (Bohanec and Rajkovic, 1988, Kurgan et al., 2001), and recommendation systems (Bobadilla, Serradilla, & Bernal, 2010). For example (Saar-Tsechansky & Provost, 2004), in target marketing, the estimated probability that a customer will respond to an offer is combined with the estimated profit to evaluate various offer propositions. As another example, in cost-sensitive decision-making, class probability estimates are used to minimize the conditional risk (Domingos, 1999, Elkan, 2001). Besides, estimated class membership probabilities are often used to rank cases (Saar-Tsechansky and Provost, 2004, Provost and Domingos, 2003), for instance to improve response rates or to provide different service strategies for different users. Such estimates can help us recommend to a customer the products with the highest purchasing probability, which are more likely to interest the customer and to be profitable for the seller. These examples raise the following question: Can we directly learn a probability-based classifier that optimizes its class probability estimation performance?

To answer this question, we first need a proper measure of a classifier's class probability estimation performance. Such a measure is the conditional log likelihood (Grossman and Domingos, 2004, Guo and Greiner, 2005, Jiang et al., 2009, Jiang et al., 2012), denoted by CLL, which is currently a well-accepted measure of the quality of class probability estimation. Given a classifier $G$ and a set of test instances $T=\{e_1,e_2,\ldots,e_t\}$, where $t$ is the number of test instances, let $c_i$ be the true class label of $e_i$. Then the conditional log likelihood $CLL(G|T)$ of the classifier $G$ on the test instance set $T$ is defined as:
$$CLL(G|T)=\sum_{i=1}^{t}\log P_G(c_i|e_i),\qquad(1)$$
where $P_G(c_i|e_i)$ is the probability that the classifier $G$ assigns the test instance $e_i$ to its true class $c_i$.
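As a concrete illustration of Eq. (1), the following minimal Python sketch computes CLL(G|T) from a matrix of predicted class probabilities. The function and variable names (conditional_log_likelihood, proba, y_true) are ours, not the paper's, and the small epsilon floor is an implementation detail added to avoid log(0) for overconfident wrong predictions.

```python
import numpy as np

def conditional_log_likelihood(proba, y_true, eps=1e-12):
    """CLL(G|T) = sum_i log P_G(c_i | e_i)  -- Eq. (1).

    proba  : (t, n_classes) array of predicted class probabilities.
    y_true : (t,) array of true class indices c_i.
    eps    : floor to avoid log(0).
    """
    p_true = proba[np.arange(len(y_true)), y_true]   # P_G(c_i | e_i) per instance
    return np.sum(np.log(np.maximum(p_true, eps)))

# Example: two test instances; the second assigns low probability to its true class
proba  = np.array([[0.9, 0.1],
                   [0.3, 0.7]])
y_true = np.array([0, 0])
print(conditional_log_likelihood(proba, y_true))     # log(0.9) + log(0.3) ≈ -1.309
```

Because CLL rewards assigning high probability to the true class of every instance, it penalizes a classifier that gets the label right but with a poorly calibrated probability, which is exactly the behaviour accuracy alone cannot distinguish.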

Let $e$, represented by an attribute value vector $\langle a_1,a_2,\ldots,a_m\rangle$, be an arbitrary test instance and let its true class label be $c$; we can then use the built classifier $G$ to estimate the probability that $e$ belongs to $c$. The only remaining question is how to estimate the class membership probability $P(c|e)$ using the constructed probability-based classifiers. Bayesian network classifiers are typical probability-based classifiers, which estimate $P(c|e)$ using Eq. (2):
$$P(c|e)=P(c|a_1,a_2,\ldots,a_m)=\frac{P(c)P(a_1,a_2,\ldots,a_m|c)}{P(a_1,a_2,\ldots,a_m)}=\frac{P(c)P(a_1,a_2,\ldots,a_m|c)}{\sum_{c'}P(c')P(a_1,a_2,\ldots,a_m|c')}.\qquad(2)$$

Assume that all attributes are fully independent given the class. The resulting Bayesian network classifier is then called naive Bayes (NB). NB estimates $P(c|e)$ using Eq. (3):
$$P(c|e)=\frac{P(c)\prod_{j=1}^{m}P(a_j|c)}{\sum_{c'}P(c')\prod_{j=1}^{m}P(a_j|c')},\qquad(3)$$
where the prior probability $P(c)$ with Laplace correction is defined by Eq. (4) and the conditional probability $P(a_j|c)$ with Laplace correction is defined by Eq. (5):
$$P(c)=\frac{\sum_{i=1}^{n}\delta(c_i,c)+1}{n+n_c},\qquad(4)$$
$$P(a_j|c)=\frac{\sum_{i=1}^{n}\delta(a_{ij},a_j)\delta(c_i,c)+1}{\sum_{i=1}^{n}\delta(c_i,c)+n_j},\qquad(5)$$
where $n$ is the number of training instances, $n_c$ is the number of classes, $n_j$ is the number of values of the $j$th attribute, $c_i$ is the class label of the $i$th training instance, $a_{ij}$ is the $j$th attribute value of the $i$th training instance, $a_j$ is the $j$th attribute value of the test instance, and $\delta(\cdot)$ is a binary function that is one if its two arguments are identical and zero otherwise. Thus, $\sum_{i=1}^{n}\delta(c_i,c)$ is the frequency with which the class label $c$ occurs in the training data, and $\sum_{i=1}^{n}\delta(a_{ij},a_j)\delta(c_i,c)$ is the frequency with which the class label $c$ and the attribute value $a_j$ occur together in the training data.
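To make Eqs. (3)-(5) concrete, here is a minimal sketch of an NB learner with Laplace correction for discrete attributes. The class name and the integer encodings (attribute values coded as 0..n_j-1, classes as 0..n_c-1) are our assumptions for illustration, not the authors' implementation.

```python
import numpy as np

class NaiveBayesLaplace:
    """Naive Bayes with the Laplace corrections of Eqs. (4) and (5)."""

    def fit(self, X, y, n_values, n_classes):
        n, m = X.shape
        class_counts = np.bincount(y, minlength=n_classes)
        # Eq. (4): P(c) = (count(c) + 1) / (n + n_c)
        self.prior = (class_counts + 1) / (n + n_classes)
        # Eq. (5): P(a_j | c) = (count(a_j, c) + 1) / (count(c) + n_j)
        self.cond = []
        for j in range(m):
            table = np.ones((n_classes, n_values[j]))          # +1 in the numerator
            for xij, ci in zip(X[:, j], y):
                table[ci, xij] += 1
            table /= (class_counts + n_values[j])[:, None]     # count(c) + n_j
            self.cond.append(table)
        return self

    def predict_proba(self, x):
        # Eq. (3): P(c|e) ∝ P(c) * prod_j P(a_j | c), normalized over the classes
        scores = np.log(self.prior)
        for j, aj in enumerate(x):
            scores = scores + np.log(self.cond[j][:, aj])
        scores = np.exp(scores - scores.max())
        return scores / scores.sum()
```

Working in log space and normalizing at the end mirrors the denominator of Eq. (3) while avoiding numerical underflow when the number of attributes $m$ is large.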

Fig. 1 graphically shows an example of naive Bayes (NB). In NB, each attribute node has the class node as its parent but does not have any parent among the other attribute nodes. Because the values of P(c) and P(a_j|c) can easily be estimated from training instances, NB is easy to construct. The structure of NB is the simplest form of Bayesian network. It is obvious that the conditional independence assumption in NB is rarely true in reality, which harms its performance in applications with complex attribute dependencies. Since attribute dependencies can be explicitly represented by arcs, extending the structure of naive Bayes is a direct way to overcome its limitation. For example, tree augmented naive Bayes (TAN) (Friedman, Geiger, & Goldszmidt, 1997) is an extended tree-like naive Bayes, in which the class node directly points to all attribute nodes and an attribute node has at most one parent from another attribute node. Fig. 2 graphically shows an example of TAN.
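For comparison with the NB computation above, the sketch below shows only the inference step of a TAN: given an already learned parent map (each attribute has either no attribute parent, like the root of the attribute tree, or exactly one) and the corresponding conditional probability tables, it evaluates P(c|e) ∝ P(c) Π_j P(a_j | c, a_pa(j)). The data structures (prior, cpt, parent) are illustrative assumptions; how the parent map itself is learned is exactly the structure-learning problem discussed in the rest of the paper.

```python
import numpy as np

def tan_predict_proba(x, prior, cpt, parent):
    """Inference step of a TAN classifier: P(c|e) ∝ P(c) * Π_j P(a_j | c, a_pa(j)).

    x      : tuple of discrete attribute values (a_1, ..., a_m)
    prior  : (n_classes,) array holding P(c)
    cpt    : list of tables; cpt[j][c][x_parent][a_j] if attribute j has an
             attribute parent, otherwise cpt[j][c][a_j] (the root falls back to NB)
    parent : list where parent[j] is the index of j's attribute parent, or None
    """
    scores = np.log(prior)
    for j, aj in enumerate(x):
        for c in range(len(prior)):
            if parent[j] is None:
                scores[c] += np.log(cpt[j][c][aj])                 # root: P(a_j | c)
            else:
                scores[c] += np.log(cpt[j][c][x[parent[j]]][aj])   # P(a_j | c, a_pa(j))
    scores = np.exp(scores - scores.max())                         # normalize as in Eq. (2)
    return scores / scores.sum()
```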

To learn the structure of TAN, a distribution-based approach (Friedman et al., 1997) was originally proposed and has demonstrated remarkable classification performance. To scale up its classification performance, a classification-based approach, simply called SuperParent (SP), was proposed by Keogh and Pazzani (1999). Although SP has already proved to be an effective classification algorithm, its class probability estimation performance, in terms of conditional log likelihood (CLL), is unknown. Inspired by this, in this paper, we first investigate the class probability estimation performance of SP in terms of CLL and find that it almost ties the original distribution-based tree augmented naive Bayes (TAN). To scale up its class probability estimation performance, we then propose an improved CLL-based SuperParent algorithm (CLL-SP). In CLL-SP, a CLL-based approach, instead of a classification-based approach, is used to find the augmenting arcs; a schematic sketch of this idea is given below.
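The sketch below conveys only the general idea of using CLL to select augmenting arcs greedily, starting from the NB structure and keeping an arc only when it improves CLL. It is not the paper's exact CLL-SP procedure (which is defined in Section 3); the helpers fit_tan and cll, as well as scoring on the training data itself, are our assumptions.

```python
def creates_cycle(parent, i, j):
    """Return True if making i the parent of j would close a directed cycle
    among the attribute nodes."""
    k = i
    while k is not None:
        if k == j:
            return True
        k = parent[k]
    return False

def greedy_cll_arcs(X, y, fit_tan, cll):
    """Schematic greedy search over augmenting arcs, scored by CLL.

    fit_tan(X, y, parent) is assumed to return a classifier for the given
    parent map, and cll(model, X, y) is assumed to compute Eq. (1); both are
    hypothetical helpers, not part of the paper.
    """
    m = X.shape[1]
    parent = [None] * m                        # start from NB: no augmenting arcs
    best = cll(fit_tan(X, y, parent), X, y)
    improved = True
    while improved:
        improved = False
        for j in range(m):
            if parent[j] is not None:
                continue                       # each attribute gets at most one parent
            for i in range(m):
                if i == j or creates_cycle(parent, i, j):
                    continue
                candidate = list(parent)
                candidate[j] = i               # try the augmenting arc A_i -> A_j
                score = cll(fit_tan(X, y, candidate), X, y)
                if score > best:               # keep the arc only if CLL improves
                    best, parent, improved = score, candidate, True
    return parent
```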

The rest of this paper is organized as follows. In Section 2, we briefly introduce related work on improving naive Bayes and, in particular, revisit the SuperParent algorithm (SP). In Section 3, we propose an improved CLL-based SuperParent algorithm (CLL-SP) for class probability estimation. In Section 4, we conduct a series of experiments on a large suite of benchmark datasets to validate the proposed algorithm. In Section 5, we draw conclusions and outline the main directions for our future work.

Section snippets

Related work

Naive Bayes (NB) has emerged as a simple and effective classification algorithm for data mining and machine learning, but its conditional independence assumption is violated in many real-world applications. Intuitively, Bayesian networks can provide a powerful model for arbitrary attribute dependencies (Pearl, 1988). Unfortunately, it has been proved that learning an optimal Bayesian network is NP-hard (Non-deterministic Polynomial-time hard) (Chickering, 1996). In order to avoid the

CLL-SuperParent

Numerous algorithms have been proposed to improve naive Bayes (NB) by weakening its conditional attribute independence assumption, among which tree augmented naive Bayes (TAN) has demonstrated remarkable classification performance in terms of classification accuracy or error rate. A key step in learning a TAN is to find the parent node of each attribute node; in other words, how to learn its structure is crucial. A sketch of the original distribution-based way of doing this is given below.
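For reference, the sketch below shows the standard distribution-based way of choosing attribute parents, in the spirit of Friedman et al. (1997): compute the conditional mutual information I(A_i; A_j | C) for every attribute pair and take a maximum spanning tree over those weights. Details such as the root choice (attribute 0) and the use of Prim's algorithm are our simplifications, not necessarily those of the paper.

```python
import numpy as np
from itertools import combinations

def conditional_mutual_information(X, y, i, j, eps=1e-12):
    """Estimate I(A_i; A_j | C) from counts."""
    cmi = 0.0
    for c in np.unique(y):
        mask = (y == c)
        p_c = mask.mean()                                   # P(c)
        xi, xj = X[mask, i], X[mask, j]
        for vi in np.unique(xi):
            for vj in np.unique(xj):
                p_ij = np.mean((xi == vi) & (xj == vj))     # P(a_i, a_j | c)
                p_i, p_j = np.mean(xi == vi), np.mean(xj == vj)
                if p_ij > 0:
                    cmi += p_c * p_ij * np.log(p_ij / (p_i * p_j + eps))
    return cmi

def distribution_based_parents(X, y):
    """Parent map via a maximum spanning tree over pairwise CMI weights
    (Prim's algorithm), rooted at attribute 0."""
    m = X.shape[1]
    w = np.zeros((m, m))
    for i, j in combinations(range(m), 2):
        w[i, j] = w[j, i] = conditional_mutual_information(X, y, i, j)
    parent = [None] * m
    in_tree, out = {0}, set(range(1, m))
    while out:
        i, j = max(((i, j) for i in in_tree for j in out), key=lambda e: w[e])
        parent[j] = i                                       # direct edges away from the root
        in_tree.add(j)
        out.remove(j)
    return parent
```

The classification-based SP and the CLL-based CLL-SP differ from this baseline only in how candidate augmenting arcs are scored, not in the tree-shaped structure they ultimately produce.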

Although SP achieves significant improvement on classification, its class probability

Experiments and results

In this section, we design a group of experiments to empirically investigate the class probability estimation performance of SuperParent (SP) and to validate the effectiveness of our proposed CLL-based SuperParent (CLL-SP). So, we compare related algorithms such as CLL-SP, NB, TAN, SP, and AUC-SP in terms of conditional log likelihood (CLL) defined by Eq. (1). Please note that, in our implementations, the Laplace correction is used to smooth the related probability estimates in all of the

Conclusion and future work

In many real-world applications, such as intelligent medical diagnostic systems, recommendation systems and cost-sensitive decision-making systems, accurate class probability estimation of instances is more desirable than simple classification. In this paper, we first investigate the class probability estimation performance of the state-of-the-art SuperParent (SP) and then propose an improved CLL-based SuperParent algorithm (CLL-SP). In CLL-SP, a CLL-based approach, instead of a

Acknowledgments

The work was partially supported by the National Natural Science Foundation of China (61203287), the Program for New Century Excellent Talents in University (NCET-12-0953), the Chenguang Program of Science and Technology of Wuhan (201550431073), and the Fundamental Research Funds for the Central Universities (CUG130504, CUG130414).

References (30)

  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, The Journal of Machine Learning Research (2006)

  • P. Domingos, MetaCost: A general method for making classifiers cost-sensitive (1999)

  • C. Elkan, The foundations of cost-sensitive learning (2001)

  • E. Frank et al., Locally weighted naive Bayes

  • N. Friedman et al., Bayesian network classifiers, Machine Learning (1997)