Adaptively entropy-based weighting classifiers in combination using Dempster–Shafer theory for word sense disambiguation

https://doi.org/10.1016/j.csl.2009.06.003

Abstract

In this paper we introduce an evidential reasoning based framework for the weighted combination of classifiers for word sense disambiguation (WSD). Within this framework, we propose a new way of adaptively defining the weights of individual classifiers based on ambiguity measures associated with their decisions with respect to each particular pattern under classification, where the ambiguity measure is defined by Shannon's entropy. We then apply the discounting-and-combination scheme of the Dempster–Shafer theory of evidence to derive a consensus decision for the classification task at hand. Experimentally, we test two scenarios of combining classifiers with the proposed method of weighting. In the first scenario, each individual classifier corresponds to a well-known learning algorithm and all of them use the same representation of the context surrounding the target word to be disambiguated, while in the second scenario the same learning algorithm is applied to all individual classifiers but each of them uses a distinct representation of the target word. Both scenarios are tested on the English lexical sample tasks of Senseval-2 and Senseval-3, resulting in an improvement in overall accuracy.

Introduction

Polysemous words, which have multiple senses or meanings, appear pervasively in many natural languages. While it seems not especially difficult for human beings to recognize the correct meaning of a polysemous word among its possible senses, given the context or discourse in which the word occurs, the automatic disambiguation of word senses remains one of the most challenging tasks in natural language processing (NLP) (Montoyo et al., 2005), even though it has received much interest and attention from the research community since the 1950s (see Ide and Véronis (1998) for an overview of WSD from then to the late 1990s). Roughly speaking, WSD is the task of associating a given word in a text or discourse with an appropriate sense among the numerous possible senses of that word. It is only an "intermediate task", but one that is necessary for accomplishing most NLP tasks, such as grammatical analysis and lexicography in linguistic studies, or machine translation, man–machine communication, and message understanding in language understanding applications (Ide and Véronis, 1998). Besides these directly language oriented applications, WSD also has potential uses in other applications involving knowledge engineering, such as information retrieval, information extraction, and text mining, and has recently begun to be applied to named-entity classification, co-reference determination, and acronym expansion (cf. Agirre and Edmonds, 2006, Bloehdorn and Andreas, 2004, Clough and Stevenson, 2004, Dill et al., 2003, Sanderson, 1994, Vossen et al., 2006).

So far, many approaches to WSD have been proposed in the literature. From a machine learning point of view, WSD is basically a classification problem and can therefore benefit directly from recent achievements in the machine learning community. As witnessed during the last two decades, many machine learning techniques and algorithms have been applied to WSD, including the Naive Bayesian (NB) model, decision trees, exemplar-based models, support vector machines (SVM), maximum entropy models (MEM), etc. (Agirre and Edmonds, 2006, Lee and Ng, 2002, Leroy and Rindflesch, 2005, Mooney, 1996). On the other hand, as observed in studies of classification systems, the sets of patterns misclassified by different learning algorithms or techniques do not necessarily overlap (Kittler et al., 1998). This means that different classifiers may offer complementary information about the patterns to be classified; in other words, features and classifiers of different types complement one another in classification performance. This observation has strongly motivated the interest in combining classifiers to build an ensemble classifier that improves on the performance of the individual classifiers. In particular, classifier combination for WSD has recently received considerable attention from the community as well (e.g. Escudero et al., 2000, Florian and Yarowsky, 2002, Hoste et al., 2002, Kilgarriff and Rosenzweig, 2000, Klein et al., 2002, Le et al., 2005, Le et al., 2007, Pedersen, 2000, Wang and Matsumoto, 2004).

Typically, two scenarios for combining classifiers are used in the literature (Kittler et al., 1998). In the first, different learning algorithms are used to build different classifiers operating on the same representation of the input pattern or on the same single data set; in the second, all classifiers use a single learning algorithm but operate on different representations of the input pattern or on different subsets of instances of the training data. In the context of WSD, the work by Klein et al. (2002), Florian and Yarowsky (2002), and Escudero et al. (2000) can be grouped into the first scenario, while the studies in Le et al., 2005, Le et al., 2007, Pedersen, 2000 belong to the second. In addition, Wang and Matsumoto (2004) used sets of features similar to those in Pedersen (2000) and proposed a new voting strategy based on the kNN method.

In addition, an important research issue in combining classifiers is which combination strategy should be used to derive an ensemble classifier. In Kittler et al. (1998), the authors proposed a common theoretical framework for combining classifiers which leads to many decision rules commonly used in practice. Their framework is essentially based on Bayesian theory together with well-known mathematical approximations that are used to derive further decision rules from the two basic combination schemes. On the other hand, when the classifier outputs are interpreted as evidence or belief values for making the classification decision, Dempster's rule of combination in the Dempster–Shafer theory of evidence (D–S theory, for short) offers a powerful tool for combining evidence from multiple sources of information for decision making (Al-Ani and Deriche, 2002, Bell et al., 2005, Denoeux, 1995, Denoeux, 2000, Le et al., 2007, Rogova, 1994, Xu et al., 1992). Despite their differences in approach and interpretation, almost all D–S theory based methods of classifier combination assume that the individual classifiers involved provide fully reliable sources of information for identifying the label of a particular input pattern. In other words, the issue of weighting individual classifiers in D–S theory based classifier combination has been ignored in previous studies. However, the individual classifiers involved in a combination scenario do not always agree on the classification decision, so none of them by itself provides fully certain evidence for identifying the label of the input pattern; each should therefore be weighted somehow before building a consensus decision. Fortunately, this weighting process can be modeled in D–S theory by the so-called discounting operator.
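
To make the two evidential operations concrete, here is a minimal sketch in Python, under the common representation of a mass function as a mapping from focal elements (sets of class labels) to masses; this is an illustrative reading of standard D–S discounting and Dempster's rule, not the paper's own code.

```python
# Mass functions are dicts mapping frozensets of class labels to masses.

def discount(mass, alpha, frame):
    """Discounting: keep a fraction alpha of each focal mass and move
    the remaining (1 - alpha) onto the whole frame (total ignorance)."""
    out = {focal: alpha * m for focal, m in mass.items()}
    theta = frozenset(frame)
    out[theta] = out.get(theta, 0.0) + (1.0 - alpha)
    return out

def dempster_combine(m1, m2):
    """Dempster's rule: conjunctive pooling of two mass functions,
    followed by normalization by the total conflict."""
    out, conflict = {}, 0.0
    for a, ma in m1.items():
        for b, mb in m2.items():
            inter = a & b
            if inter:
                out[inter] = out.get(inter, 0.0) + ma * mb
            else:
                conflict += ma * mb
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {focal: m / (1.0 - conflict) for focal, m in out.items()}

# Discounting a classifier's verdict at weight 0.8 reserves mass 0.2
# for "this classifier may be wrong about everything":
frame = ["sense_1", "sense_2"]
m = {frozenset(["sense_1"]): 0.7, frozenset(["sense_2"]): 0.3}
print(discount(m, 0.8, frame))
# {frozenset({'sense_1'}): 0.56, frozenset({'sense_2'}): 0.24,
#  frozenset({'sense_1', 'sense_2'}): 0.2}
```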

In this paper, we present a new method of weighting individual classifiers in which the weight associated with each classifier is defined adaptively, depending on the input pattern under classification, by making use of Shannon entropy. Intuitively, the more ambiguous the output of a classifier is, the lower the weight it is assigned and hence the less important the role it plays in the combination. Then, by considering the problem of classifier combination as one of weighted combination of evidence for decision making, we develop a combination algorithm based on the discounting-and-combination scheme in the D–S theory of evidence to derive a consensus decision for WSD. As for experimental results, we conduct the two typical combination scenarios briefly mentioned above: in the first scenario, different learning methods are used for different classifiers operating on the same representation of the context surrounding a given polysemous word; in the second, all classifiers use the same learning algorithm, namely NB, but operate on different representations of the context, as considered in Le et al. (2007). These combination scenarios are experimentally tested on the English lexical samples of Senseval-2 and Senseval-3, resulting in an improvement in overall accuracy.
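
As a hedged illustration of such an entropy-based weight, the sketch below maps the normalized Shannon entropy of a classifier's output into [0, 1]; the paper's exact definition is given in Section 3, so the particular mapping 1 − H(p)/log M used here is an assumption.

```python
import numpy as np

def entropy_weight(probs):
    """Adaptive weight from Shannon entropy: a near-uniform (ambiguous)
    output gets a weight near 0, a near-certain output a weight near 1.
    The mapping 1 - H(p)/log(M) is an assumed normalization, not
    necessarily the paper's exact formula."""
    p = np.asarray(probs, dtype=float)
    p = p / p.sum()              # ensure a proper distribution
    nz = p[p > 0.0]              # convention: 0 * log 0 = 0
    h = -np.sum(nz * np.log(nz))
    return 1.0 - h / np.log(p.size)

print(entropy_weight([0.90, 0.05, 0.05]))  # ~0.64: confident, high weight
print(entropy_weight([0.34, 0.33, 0.33]))  # ~0.00: ambiguous, low weight
```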

The rest of this paper is organized as follows. Section 2 begins with a brief introduction to basic notions of the D–S theory of evidence, followed by a short review of related studies of classifier combination using D–S theory. Section 3 is devoted to the D–S theory based framework for the weighted combination of classifiers in WSD. The experimental results are presented and analyzed in Section 4. Finally, Section 5 presents some concluding remarks.

Section snippets

Background and related work

In this section we briefly review basic notions of the D–S theory of evidence and previously studied applications of it in ensemble learning.

Weighted combination of classifiers in D–S formalism

Let us return to the classification problem with $M$ classes $C = \{c_1, \ldots, c_M\}$. Also assume that we have $R$ classifiers $\psi_i$ $(i = 1, \ldots, R)$, built using $R$ different learning algorithms or $R$ different representations of patterns. For each input pattern $x$, let us denote by
$$\psi_i(x) = [s_{i1}(x), \ldots, s_{iM}(x)]$$
the soft decision or output given by $\psi_i$ for the task of assigning $x$ to one of the $M$ classes $c_j$. If the output $\psi_i(x)$ is not a posterior probability distribution on $C$, it can be normalized to obtain an associated probability
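
Although the snippet breaks off here, the setup it describes maps naturally onto code. The following hedged sketch reuses the discount, dempster_combine, and entropy_weight helpers sketched earlier: each classifier's soft output is normalized, turned into a mass function on singletons, discounted by its entropy-based weight, and fused with Dempster's rule. Deciding by the largest combined singleton mass is one plausible decision rule, not necessarily the paper's exact rule.

```python
import numpy as np

def combine_classifiers(outputs, classes):
    """Entropy-weighted D-S combination of R soft outputs over M classes.
    `outputs` is an R x M array-like of non-negative scores s_ij(x)."""
    singleton = [frozenset([c]) for c in classes]
    combined = None
    for row in np.asarray(outputs, dtype=float):
        p = row / row.sum()      # normalize the soft output if needed
        mass = {singleton[j]: p[j] for j in range(len(classes)) if p[j] > 0}
        mass = discount(mass, entropy_weight(p), classes)
        combined = mass if combined is None else dempster_combine(combined, mass)
    # Decide by the singleton with the largest combined mass.
    scores = {next(iter(f)): m for f, m in combined.items() if len(f) == 1}
    return max(scores, key=scores.get)

senses = ["sense_1", "sense_2"]
outputs = [[0.80, 0.20],   # confident classifier: discounted little
           [0.55, 0.45]]   # ambiguous classifier: discounted heavily
print(combine_classifiers(outputs, senses))  # -> "sense_1"
```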

Individual classifiers in combination

In the first scenario of combination, we used three well-known statistical learning methods: the Naive Bayes (NB) model, the maximum entropy model (MEM), and support vector machines (SVM). The selection of individual classifiers in this scenario is guided mainly by the direct use of their output results for defining mass functions in the present work. Clearly, the first two classifiers produce outputs which are probabilistic in nature. Although a standard SVM classifier does not provide
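
The snippet is cut off here, but the point being made is that NB and MEM yield posterior probabilities directly, whereas a standard SVM outputs uncalibrated margins that must first be converted into probability estimates (commonly via Platt scaling). The paper does not name its implementations, so the sketch below uses scikit-learn stand-ins on hypothetical toy data, with logistic regression playing the role of the maximum entropy model; the resulting rows of soft outputs are exactly what the combination sketch above consumes.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical toy data: bag-of-words context vectors and sense labels.
X = np.array([[2, 0, 1], [1, 0, 2], [3, 1, 0], [2, 1, 1],
              [0, 3, 1], [0, 2, 2], [1, 3, 0], [0, 2, 1]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

classifiers = [
    MultinomialNB(),                    # NB: probabilistic by design
    LogisticRegression(max_iter=1000),  # a maximum entropy model
    SVC(probability=True),              # SVM with Platt scaling
]
# One row of soft outputs per classifier, one entry per class.
outputs = [clf.fit(X, y).predict_proba(X[:1])[0] for clf in classifiers]
```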

Conclusions

In this paper a Dempster–Shafer theory based framework for the weighted combination of classifiers for WSD has been introduced. Within this framework, we have proposed a new method for adaptively defining the weights of individual classifiers using entropy measures of the ambiguity associated with their classified outputs. We have also discussed two combination strategies using evidential operations in Dempster–Shafer theory, which consequently resulted in two corresponding rules for deriving

Acknowledgements

The authors would like to thank the anonymous referees for their constructive comments and helpful suggestions, which have helped to improve the presentation of the paper.

References

  • Dempster, A.P., 1967. Upper and lower probabilities induced by a multi-valued mapping. Annals of Mathematical Statistics.
  • Denoeux, T., 1995. A k-nearest neighbor classification rule based on Dempster–Shafer theory. IEEE Transactions on Systems, Man and Cybernetics.
  • Denoeux, T., 2000. A neural network classifier based on Dempster–Shafer theory. IEEE Transactions on Systems, Man and Cybernetics A.
  • Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin,...
  • Escudero, G., Màrquez, L., Rigau, G., 2000. Boosting applied to word sense disambiguation. In: Proceedings of the 11th...
  • Florian, R., Yarowsky, D., 2002. Modeling consensus: classifier combination for word sense disambiguation. In:...
  • Grozea, C., 2004. Finding optimal parameter settings for high performance word sense disambiguation. In: Proceedings of...
  • Hoste, V., et al., 2002. Parameter optimization for machine-learning of word sense disambiguation. Natural Language Engineering.
  • Ide, N., Véronis, J., 1998. Introduction to the special issue on word sense disambiguation: the state of the art. Computational Linguistics.
  • Kilgarriff, A., 2001. English lexical sample task description. In: Proceedings of Senseval-2: Second International...
  • Kilgarriff, A., Rosenzweig, J., 2000. Framework and results for English SENSEVAL. Computers and the Humanities.

This work was partially supported by a Grant-in-Aid for Scientific Research (No. 20500202) from the Japan Society for the Promotion of Science (JSPS) and an FY-2008 JAIST International Joint Research Grant.
