Adaptively entropy-based weighting classifiers in combination using Dempster–Shafer theory for word sense disambiguation

https://doi.org/10.1016/j.csl.2009.06.003

Abstract

In this paper we introduce an evidential reasoning based framework for the weighted combination of classifiers for word sense disambiguation (WSD). Within this framework, we propose a new way of adaptively defining the weights of individual classifiers based on ambiguity measures associated with their decisions with respect to each particular pattern under classification, where the ambiguity measure is defined by Shannon's entropy. We then apply the discounting-and-combination scheme of the Dempster–Shafer theory of evidence to derive a consensus decision for the classification task at hand. Experimentally, we test two scenarios of combining classifiers with the proposed method of weighting. In the first scenario, each individual classifier corresponds to a well-known learning algorithm and all of them use the same representation of the context surrounding the target word to be disambiguated, while in the second scenario the same learning algorithm is applied to all individual classifiers but each of them uses a distinct representation of the target word. Both scenarios are tested on the English lexical sample tasks of Senseval-2 and Senseval-3, resulting in an improvement in overall accuracy.

Introduction

Polysemous words, which have multiple senses or meanings, appear pervasively in many natural languages. While it seems not especially difficult for human beings to recognize the correct meaning of a polysemous word among its possible senses, given the context or discourse in which the word occurs, the automatic disambiguation of word senses remains one of the most challenging tasks in natural language processing (NLP) (Montoyo et al., 2005), even though it has received much interest and attention from the research community since the 1950s (see Ide and Véronis (1998) for an overview of WSD from then to the late 1990s). Roughly speaking, WSD is the task of associating a given word in a text or discourse with an appropriate sense among the numerous possible senses of that word. It is only an "intermediate task", but one that is necessary for accomplishing most NLP tasks, such as grammatical analysis and lexicography in linguistic studies, or machine translation, man–machine communication, and message understanding in language understanding applications (Ide and Véronis, 1998). Besides these directly language oriented applications, WSD also has potential uses in other applications involving knowledge engineering, such as information retrieval, information extraction, and text mining, and has recently begun to be applied to named-entity classification, co-reference determination, and acronym expansion (cf. Agirre and Edmonds, 2006, Bloehdorn and Andreas, 2004, Clough and Stevenson, 2004, Dill et al., 2003, Sanderson, 1994, Vossen et al., 2006).

So far, many approaches to WSD have been proposed in the literature. From a machine learning point of view, WSD is basically a classification problem and can therefore benefit directly from recent achievements in the machine learning community. As witnessed during the last two decades, many machine learning techniques and algorithms have been applied to WSD, including the Naive Bayesian (NB) model, decision trees, exemplar-based models, support vector machines (SVM), maximum entropy models (MEM), etc. (Agirre and Edmonds, 2006, Lee and Ng, 2002, Leroy and Rindflesch, 2005, Mooney, 1996). On the other hand, as observed in studies of classification systems, the sets of patterns misclassified by different learning algorithms or techniques do not necessarily overlap (Kittler et al., 1998). This means that different classifiers may offer complementary information about the patterns to be classified; in other words, features and classifiers of different types complement one another in classification performance. This observation has strongly motivated the interest in combining classifiers to build an ensemble classifier that improves on the performance of the individual classifiers. In particular, classifier combination for WSD has recently received considerable attention from the community as well (e.g. Escudero et al., 2000, Florian and Yarowsky, 2002, Hoste et al., 2002, Kilgarriff and Rosenzweig, 2000, Klein et al., 2002, Le et al., 2005, Le et al., 2007, Pedersen, 2000, Wang and Matsumoto, 2004).

Typically, two scenarios for combining classifiers are used in the literature (Kittler et al., 1998). In the first, different learning algorithms are used to build different classifiers operating on the same representation of the input pattern or on the same single data set; in the second, all classifiers use a single learning algorithm but operate on different representations of the input pattern or on different subsets of instances of the training data. In the context of WSD, the work by Klein et al. (2002), Florian and Yarowsky (2002), and Escudero et al. (2000) can be grouped into the first scenario, while the studies in Le et al., 2005, Le et al., 2007, Pedersen, 2000 belong to the second. In addition, Wang and Matsumoto (2004) used sets of features similar to those in Pedersen (2000) and proposed a new voting strategy based on the kNN method.

In addition, an important research issue in combining classifiers is which combination strategy should be used to derive an ensemble classifier. In Kittler et al. (1998), the authors proposed a common theoretical framework for combining classifiers which leads to many decision rules commonly used in practice. Their framework is essentially based on Bayesian theory together with well-known mathematical approximations that are used to derive further decision rules from the two basic combination schemes. On the other hand, when the classifier outputs are interpreted as evidence or belief values for making the classification decision, Dempster's rule of combination in the Dempster–Shafer theory of evidence (D–S theory, for short) offers a powerful tool for combining evidence from multiple sources of information for decision making (Al-Ani and Deriche, 2002, Bell et al., 2005, Denoeux, 1995, Denoeux, 2000, Le et al., 2007, Rogova, 1994, Xu et al., 1992). Despite their differences in approach and interpretation, almost all D–S theory based methods of classifier combination assume that the individual classifiers involved provide fully reliable sources of information for identifying the label of a particular input pattern. In other words, the issue of weighting individual classifiers in D–S theory based classifier combination has been ignored in previous studies. However, the individual classifiers involved in a combination scenario do not always agree on the classification decision, so none of them by itself provides fully certain evidence for identifying the label of the input pattern; each should therefore be weighted somehow before building a consensus decision. Fortunately, this weighting process can be modeled in D–S theory by the so-called discounting operator.
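
To make the two evidential operations concrete, here is a minimal sketch in Python, under the common representation of a mass function as a mapping from focal elements (sets of class labels) to masses; this is an illustrative reading of standard D–S discounting and Dempster's rule, not the paper's own code.

```python
# Mass functions are dicts mapping frozensets of class labels to masses.

def discount(mass, alpha, frame):
    """Discounting: keep a fraction alpha of each focal mass and move
    the remaining (1 - alpha) onto the whole frame (total ignorance)."""
    out = {focal: alpha * m for focal, m in mass.items()}
    theta = frozenset(frame)
    out[theta] = out.get(theta, 0.0) + (1.0 - alpha)
    return out

def dempster_combine(m1, m2):
    """Dempster's rule: conjunctive pooling of two mass functions,
    followed by normalization by the total conflict."""
    out, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                out[inter] = out.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {focal: m / (1.0 - conflict) for focal, m in out.items()}

# Discounting a classifier's verdict at weight 0.8 reserves mass 0.2
# for "this classifier may be wrong about everything":
frame = ["sense_1", "sense_2"]
m = {frozenset(["sense_1"]): 0.7, frozenset(["sense_2"]): 0.3}
print(discount(m, 0.8, frame))
# {frozenset({'sense_1'}): 0.56, frozenset({'sense_2'}): 0.24,
#  frozenset({'sense_1', 'sense_2'}): 0.2}
```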

In this paper, we present a new method of weighting individual classifiers in which the weight associated with each classifier is defined adaptively, depending on the input pattern under classification, by making use of Shannon entropy. Intuitively, the more ambiguous the output of a classifier is, the lower the weight it is assigned and hence the less important the role it plays in the combination. Then, by considering the problem of classifier combination as one of weighted combination of evidence for decision making, we develop a combination algorithm based on the discounting-and-combination scheme in the D–S theory of evidence to derive a consensus decision for WSD. As for experimental results, we conduct the two typical combination scenarios briefly mentioned above: in the first scenario, different learning methods are used for different classifiers operating on the same representation of the context surrounding a given polysemous word; in the second, all classifiers use the same learning algorithm, namely NB, but operate on different representations of the context, as considered in Le et al. (2007). These combination scenarios are experimentally tested on the English lexical samples of Senseval-2 and Senseval-3, resulting in an improvement in overall accuracy.
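
As a hedged illustration of such an entropy-based weight, the sketch below maps the normalized Shannon entropy of a classifier's output into [0, 1]; the paper's exact definition is given in Section 3, so the particular mapping 1 − H(p)/log M used here is an assumption.

```python
import numpy as np

def entropy_weight(probs):
    """Adaptive weight from Shannon entropy: a near-uniform (ambiguous)
    output gets a weight near 0, a near-certain output a weight near 1.
    The mapping 1 - H(p)/log(M) is an assumed normalization, not
    necessarily the paper's exact formula."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()              # ensure a proper distribution
    nz = p[p > 0.0]              # convention: 0 * log 0 = 0
    h = -np.sum(nz * np.log(nz))
    return 1.0 - h / np.log(p.size)

print(entropy_weight([0.90, 0.05, 0.05]))  # ~0.64: confident, high weight
print(entropy_weight([0.34, 0.33, 0.33]))  # ~0.00: ambiguous, low weight
```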

The rest of this paper is organized as follows. Section 2 begins with a brief introduction to basic notions of the D–S theory of evidence, followed by a short review of related studies of classifier combination using D–S theory. Section 3 is devoted to the D–S theory based framework for the weighted combination of classifiers in WSD. The experimental results are presented and analyzed in Section 4. Finally, Section 5 presents some concluding remarks.

Section snippets

Background and related work

In this section we briefly review basic notions of the D–S theory of evidence and previously studied applications of it in ensemble learning.

Weighted combination of classifiers in D–S formalism

Let us return to the classification problem with $M$ classes $C = \{c_1, \ldots, c_M\}$. Also assume that we have $R$ classifiers $\psi_i$ $(i = 1, \ldots, R)$, built using $R$ different learning algorithms or $R$ different representations of patterns. For each input pattern $x$, let us denote by
$$\psi_i(x) = [s_{i1}(x), \ldots, s_{iM}(x)]$$
the soft decision or output given by $\psi_i$ for the task of assigning $x$ to one of the $M$ classes $c_j$. If the output $\psi_i(x)$ is not a posterior probability distribution on $C$, it can be normalized to obtain an associated probability
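
Although the snippet breaks off here, the setup it describes maps naturally onto code. The following hedged sketch reuses the discount, dempster_combine, and entropy_weight helpers sketched earlier: each classifier's soft output is normalized, turned into a mass function on singletons, discounted by its entropy-based weight, and fused with Dempster's rule. Deciding by the largest combined singleton mass is one plausible decision rule, not necessarily the paper's exact rule.

```python
import numpy as np

def combine_classifiers(outputs, classes):
    """Entropy-weighted D-S combination of R soft outputs over M classes.
    `outputs` is an R x M array-like of non-negative scores s_ij(x)."""
    singleton = [frozenset([c]) for c in classes]
    combined = None
    for row in np.asarray(outputs, dtype=float):
        p = row / row.sum()      # normalize the soft output if needed
        mass = {singleton[j]: p[j] for j in range(len(classes)) if p[j] > 0}
        mass = discount(mass, entropy_weight(p), classes)
        combined = mass if combined is None else dempster_combine(combined, mass)
    # Decide by the singleton with the largest combined mass.
    scores = {next(iter(f)): m for f, m in combined.items() if len(f) == 1}
    return max(scores, key=scores.get)

senses = ["sense_1", "sense_2"]
outputs = [[0.80, 0.20],   # confident classifier: discounted little
           [0.55, 0.45]]   # ambiguous classifier: discounted heavily
print(combine_classifiers(outputs, senses))  # -> "sense_1"
```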

Individual classifiers in combination

In the first scenario of combination, we used three well-known statistical learning methods: the Naive Bayes (NB) model, the maximum entropy model (MEM), and support vector machines (SVM). The selection of individual classifiers in this scenario is guided mainly by the direct use of their output results for defining mass functions in the present work. Clearly, the first two classifiers produce outputs which are probabilistic in nature. Although a standard SVM classifier does not provide
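
The snippet is cut off here, but the point being made is that NB and MEM yield posterior probabilities directly, whereas a standard SVM outputs uncalibrated margins that must first be converted into probability estimates (commonly via Platt scaling). The paper does not name its implementations, so the sketch below uses scikit-learn stand-ins on hypothetical toy data, with logistic regression playing the role of the maximum entropy model; the resulting rows of soft outputs are exactly what the combination sketch above consumes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical toy data: bag-of-words context vectors and sense labels.
X = np.array([[2, 0, 1], [1, 0, 2], [3, 1, 0], [2, 1, 1],
              [0, 3, 1], [0, 2, 2], [1, 3, 0], [0, 2, 1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classifiers = [
    MultinomialNB(),                    # NB: probabilistic by design
    LogisticRegression(max_iter=1000),  # a maximum entropy model
    SVC(probability=True),              # SVM with Platt scaling
]
# One row of soft outputs per classifier, one entry per class.
outputs = [clf.fit(X, y).predict_proba(X[:1])[0] for clf in classifiers]
```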

Conclusions

In this paper a Dempster–Shafer theory based framework for the weighted combination of classifiers for WSD has been introduced. Within this framework, we have proposed a new method for adaptively defining the weights of individual classifiers using entropy measures of the ambiguity associated with their classified outputs. We have also discussed two combination strategies using evidential operations in Dempster–Shafer theory, which consequently resulted in two corresponding rules for deriving

Acknowledgements

The authors would like to thank the anonymous referees for their constructive comments and helpful suggestions, which have helped to improve the presentation of the paper.

References

  • Dempster, A.P., 1967. Upper and lower probabilities induced by a multi-valued mapping. Annals of Mathematical Statistics.
  • Denoeux, T., 1995. A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Transactions on Systems, Man and Cybernetics.
  • Denoeux, T., 2000. A neural network classifier based on Dempster–Shafer theory. IEEE Transactions on Systems, Man and Cybernetics A.
  • Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin,...
  • Escudero, G., Màrquez, L., Rigau, G., 2000. Boosting applied to word sense disambiguation. In: Proceedings of the 11th...
  • Florian, R., Yarowsky, D., 2002. Modeling consensus: classifier combination for word sense disambiguation. In:...
  • Grozea, C., 2004. Finding optimal parameter settings for high performance word sense disambiguation. In: Proceedings of...
  • Hoste, V., et al., 2002. Parameter optimization for machine-learning of word sense disambiguation. Natural Language Engineering.
  • Ide, N., Véronis, J., 1998. Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics.
  • Kilgarriff, A., 2001. English lexical sample task description. In: Proceedings of Senseval-2: Second International...
  • Kilgarriff, A., Rosenzweig, J., 2000. Framework and results for English SENSEVAL. Computers and the Humanities.

This work was partially supported by a Grant-in-Aid for Scientific Research (No. 20500202) from the Japan Society for the Promotion of Science (JSPS) and an FY-2008 JAIST International Joint Research Grant.
