Sprinkled semantic diffusion kernel for word sense disambiguation

https://doi.org/10.1016/j.engappai.2017.05.010

Abstract

Word sense disambiguation (WSD), the task of identifying the intended meanings (senses) of words in context, has been a long-standing research objective for natural language processing (NLP). In this paper, we are concerned with kernel methods for automatic WSD. Under this framework, the main difficulty is to design an appropriate kernel function to represent the sense distinction knowledge. The semantic diffusion kernel, which models semantic similarity by means of a diffusion process on a graph defined by lexicon and co-occurrence information to smooth the typical “Bag of Words” (BOW) representation, has been successfully applied to WSD. However, the diffusion is an unsupervised process, which fails to exploit the class information in a supervised classification scenario. To address this limitation, we present a sprinkled semantic diffusion kernel that makes use of the class knowledge of training documents in addition to the co-occurrence knowledge. The basic idea is to construct an augmented term-document matrix by encoding class information as additional terms and appending them to the training documents. Diffusion is then performed on the augmented term-document matrix. In this way, words belonging to the same class are indirectly drawn closer to each other, and hence the class-specific word correlations are strengthened. We evaluate our method on several Senseval/Semeval benchmark examples with a support vector machine (SVM), and show that the proposed kernel significantly improves disambiguation performance over the semantic diffusion kernel in terms of different measures and yields results competitive with state-of-the-art kernel methods for WSD.

Introduction

Ambiguity is inherent to human language. In particular, word sense ambiguity is prevalent in all natural languages, with a large number of words having more than one meaning. For instance, the English noun bank can mean “sloping raised land, especially along the sides of a river” or “an organization where people and businesses can invest or borrow money, convert to foreign money, etc. or a building where these services are offered”. The correct sense of an ambiguous word can be determined from the context in which it occurs, and correspondingly the problem of word sense disambiguation (WSD) is defined as the task of automatically assigning the most appropriate meaning to a polysemous word in a given context (Navigli, 2009). As a fundamental semantic understanding task at the lexical level in natural language processing (NLP), WSD can benefit many applications such as machine translation, information retrieval, parsing, and question answering. WSD is considered a key step toward language understanding beyond keyword matching (Agirre et al., 2014). Although WSD is essentially a subconscious process for humans and presents no difficulty, it is very difficult to formalize the computational process of disambiguation, since it is classified among the “AI-complete” problems (Turdakov, 2010), that is, tasks whose solution is at least as hard as the most difficult problems in artificial intelligence.

Generally, WSD methods can be classified into two types: knowledge-based and machine learning (Navigli, 2009; Raviv and Markovitch, 2012). Knowledge-based WSD systems exploit the information in a lexical knowledge base, such as WordNet and Wikipedia, to perform WSD. These approaches usually pick the sense whose definition is most similar to the context of the ambiguous word, by means of textual overlap or graph-based measures (Abualhaija and Zimmermann, 2016; Agirre et al., 2009; Navigli and Lapata, 2010). Machine learning approaches, also called corpus-based approaches, do not make use of any knowledge resources for disambiguation. These approaches range from supervised learning, in which a classifier is trained for each distinct word on a corpus of manually sense-annotated examples, to completely unsupervised methods that cluster occurrences of words, thereby inducing senses. Recent advances in WSD have benefited greatly from the availability of corpora annotated with word senses. The most accurate WSD systems to date exploit supervised methods which automatically learn cues useful for disambiguation from manually sense-annotated data.

For machine learning-based WSD systems, commonly used algorithms include the Naïve Bayesian model, decision trees, maximum entropy, support vector machine (SVM), and so on. Among these, kernel methods (Hofmann et al., 2008; Shawe-Taylor and Cristianini, 2004), such as SVM, regularized least-squares classification (RLSC) and kernel principal component analysis (KPCA), have demonstrated excellent performance in terms of accuracy and robustness (Giuliano et al., 2009; Gliozzo et al., 2005; Jin et al., 2008; Joshi et al., 2006; Lee and Ng, 2002; Lee et al., 2004; Pahikkala et al., 2009; Popescu, 2004; Wang et al., 2014; Wang et al.; Wang et al., 2015; Wu et al., 2004). Recently, Li et al. (2016) presented an extensive survey of the state of the art in kernel methods for WSD, concentrating on issues such as data representation, kernel selection and learning algorithms. Basically, kernel methods work by mapping the data from the input space into a high-dimensional (possibly infinite-dimensional) feature space, usually chosen to be a reproducing kernel Hilbert space (RKHS), and then building linear algorithms in the feature space to implement nonlinear counterparts in the input space. The mapping, rather than being given in an explicit form, is determined implicitly by specifying a kernel function, which computes the inner product between each pair of data points in the feature space. Several properties make kernel methods well suited to WSD and other NLP problems (Li et al., 2016; Wang et al., 2015). Firstly, instead of requiring the manual construction of a feature space for the learning task, kernel functions provide a way to design useful features in the feature space automatically, thereby ensuring the necessary representational power. Secondly, kernel methods offer a flexible and efficient way to define application-specific kernels for introducing background knowledge and explicitly modeling linguistic insights. This property allows general learning methods to be notably improved and easily adapted to a specific application. Finally, kernel methods can be naturally applied to non-vectorial types of data, taking the structure of the data into account and greatly reducing the need for careful feature engineering.
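To make the kernel trick concrete, here is a minimal sketch (our illustration, not from the paper): for the degree-2 homogeneous polynomial kernel on R^2, the kernel value equals the inner product under an explicit feature map, which the kernel computes without ever constructing that map.

```python
# Minimal illustration of the kernel trick (not the paper's code).
# For the degree-2 polynomial kernel on R^2, k(x, y) = (x . y)^2 equals the
# inner product <phi(x), phi(y)> under the explicit feature map
# phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2).
import numpy as np

def poly2_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map corresponding to poly2_kernel on R^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose(poly2_kernel(x, y), phi(x) @ phi(y))  # both give 16.0
```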

From the point of view of modularization, kernel methods consist of two main components, namely the kernel and the actual learning algorithm. The kernel can be considered an interface between the input data and the learning algorithm, and is the key component in ensuring the good performance of kernel methods (Shawe-Taylor and Cristianini, 2004; Wang et al., 2009). In fact, for real applications, the kernel is the only task-specific component of kernel methods. In the domain of WSD, the most widely used kernel is the “Bag of Words” (BOW) kernel (Shawe-Taylor and Cristianini, 2004), which is based on the BOW representation of the context in which an ambiguous word occurs. In this representation, each word or term constitutes a dimension in a vector space, independent of the other terms in the same context. Despite its ease of use, this kernel suffers from well-known limitations, mostly due to its inability to exploit semantic similarity between terms: contexts sharing terms that are different but semantically related will be considered unrelated. To address this problem, a number of attempts have been made to incorporate semantic knowledge into the BOW kernel, resulting in the so-called semantic kernels (Shawe-Taylor and Cristianini, 2004). For example, semantic kernels that use external semantic knowledge provided by word thesauri or ontologies have been proposed to improve kernel-based WSD systems (Jin et al., 2008; Joshi et al., 2006). In the absence of external semantic knowledge, Latent Semantic Indexing (LSI) has been applied to capture the semantic relations between terms (Giuliano et al., 2009; Gliozzo et al., 2005). More information about semantic kernels can be found in the text categorization literature (Altınel et al., 2015; Cristianini et al., 2002), a more general application domain than WSD.
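The contrast between the BOW kernel and a semantic kernel can be illustrated with a small sketch (ours; the semantic matrix S below is a toy example, not one learned from data). Inserting a term-similarity matrix lets two contexts with disjoint but related vocabularies receive a nonzero similarity:

```python
# Toy contrast between the BOW kernel and a semantic kernel (our sketch).
# Documents are term-frequency vectors over the vocabulary below; the
# semantic kernel k(d1, d2) = d1' S S' d2 smooths the BOW inner product.
import numpy as np

vocab = ["bank", "river", "money", "loan"]
d1 = np.array([1.0, 0.0, 1.0, 0.0])   # context mentioning "bank", "money"
d2 = np.array([0.0, 0.0, 0.0, 1.0])   # context mentioning only "loan"

print(d1 @ d2)                         # BOW kernel: 0.0, no shared terms

S = np.eye(4)                          # toy semantic matrix:
S[2, 3] = S[3, 2] = 0.5                # "money" and "loan" are related
print(d1 @ S @ S.T @ d2)               # semantic kernel: 1.0 > 0
```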

In our previous studies (Wang et al., 2014; Wang et al.), we proposed to apply the semantic diffusion kernel (Kandola et al., 2003) to improve the WSD system. The semantic diffusion kernel is obtained through a matrix exponentiation transformation of a given kernel matrix, and implicitly exploits higher order co-occurrences to infer semantic similarity between terms. Geometrically, this kernel models semantic similarities by means of a diffusion process on a graph defined by lexicon and co-occurrence information. However, the diffusion is an unsupervised process, which fails to exploit the class information in a classification scenario and may not be optimal for a supervised WSD system. Chakraborti et al. (2006, 2007) introduced a simple approach called “sprinkling” to incorporate class labels of documents into LSI. In sprinkling, a set of class-specific artificial terms is appended to the representations of documents of the corresponding class. LSI is then applied to the sprinkled term-document space, resulting in a concept space that better reflects the underlying class distribution of documents. Recently, this approach was also applied to sprinkle Latent Dirichlet Allocation (LDA) topics for weakly supervised text classification (Hingmire and Chakraborti, 2014). The underlying rationale is that the sprinkled terms inject the class information of the text documents into the classification procedure. Motivated by these works, in this paper we present a sprinkled semantic diffusion kernel with application to WSD. The basic idea is to construct an augmented term-document matrix by encoding class information as additional terms and appending them to the training documents. Diffusion is then performed on the augmented term-document matrix to learn the semantic matrix, which is the key component of semantic kernels. In this way, words belonging to the same class are indirectly drawn closer to each other, and hence the class-specific word correlations are strengthened. Although the idea behind the sprinkled semantic diffusion kernel is very similar to that of sprinkled LSI, to the best of our knowledge, our work is the first to simultaneously exploit higher order co-occurrences and class information in constructing a semantic smoothing kernel for the supervised WSD task.
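To make the construction concrete, the following is a minimal sketch under our own assumptions: a dense terms-by-documents matrix, one artificial term per class, and the exponential diffusion of Kandola et al. (2003) computed with scipy. The paper's exact parameterization (e.g., the decay factor and the sprinkling weight) may differ.

```python
# Hedged sketch of sprinkling + diffusion (our formulation; identifiers are
# illustrative). Class labels are appended to the term-document matrix as
# artificial "sprinkled" term rows, and the diffusion is performed on the
# augmented matrix, so same-class terms are drawn closer together.
import numpy as np
from scipy.linalg import expm

def sprinkled_diffusion_kernel(X, labels, n_classes, lam=0.1, weight=1.0):
    """X: (n_terms, n_docs) term-document matrix; labels: doc class ids."""
    n_docs = X.shape[1]
    sprinkle = np.zeros((n_classes, n_docs))
    sprinkle[labels, np.arange(n_docs)] = weight   # one row per class
    X_aug = np.vstack([X, sprinkle])               # augmented matrix

    G = X_aug @ X_aug.T         # term-term co-occurrence graph (generator)
    S = expm(lam * G)           # diffusion over the augmented term graph
    return X_aug.T @ S @ X_aug  # smoothed document-document kernel matrix

# Toy example: 4 terms, 3 training documents, 2 senses.
X = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]], dtype=float)
K = sprinkled_diffusion_kernel(X, labels=np.array([0, 1, 0]), n_classes=2)
print(K.shape)  # (3, 3); at test time the sprinkled rows would be zero
```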

The remainder of this article is organized as follows. Section 2 briefly introduces kernel methods in general and SVM in particular. Section 3 then details the proposed sprinkled semantic diffusion kernel and its application to WSD. The proposed kernel is evaluated on several Senseval/Semeval benchmark examples in Section 4, followed by conclusions and some directions for future work.


Kernel methods and SVM

Kernel methods have been highly successful in solving various problems in the machine learning and data mining community (Hofmann et al., 2008; Shawe-Taylor and Cristianini, 2004). These methods map data points from the input space to some feature space where even relatively simple algorithms such as linear methods can deliver very impressive performance. The most attractive feature of kernel methods is that they can be applied in high dimensional feature spaces without suffering from the high cost of working in such spaces explicitly, since all computations are carried out through the kernel function.
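As a usage sketch (ours, assuming scikit-learn rather than any specific implementation from the paper), any kernel matrix computed offline can be handed to an SVM as a precomputed kernel:

```python
# Plugging a precomputed kernel matrix into an SVM (illustrative sketch).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((20, 5))            # 20 contexts, 5 BOW features
y_train = rng.integers(0, 2, 20)         # binary sense labels

K_train = X_train @ X_train.T            # linear (BOW) kernel as a stand-in
clf = SVC(kernel="precomputed").fit(K_train, y_train)

X_test = rng.random((4, 5))
K_test = X_test @ X_train.T              # kernel between test and train points
print(clf.predict(K_test))
```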

Semantic diffusion kernel

In machine learning-based WSD systems, the features extracted from the contexts are usually in the BOW representation, which reduces a text to a histogram of word frequencies (Wang et al., 2014; Wang et al.; Wang et al., 2015). Let $t_0$ denote the word to be disambiguated and $x = (t_{-r}, \ldots, t_{-1}, t_1, \ldots, t_s)$ be the context of $t_0$, where $t_{-r}, \ldots, t_{-1}$ are the words, in the order they appear, preceding $t_0$, and correspondingly $t_1, \ldots, t_s$ are the words that follow $t_0$ in the text. We also define a context span, i.e., the window of words surrounding $t_0$ that is taken into account.
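A minimal sketch of this representation (our code; the function name and window handling are illustrative):

```python
# BOW histogram of the context window around the target word t0 (sketch).
from collections import Counter

def bow_context(tokens, target_index, r=3, s=3):
    """Histogram of the r words before and s words after the target."""
    left = tokens[max(0, target_index - r):target_index]
    right = tokens[target_index + 1:target_index + 1 + s]
    return Counter(left + right)

tokens = "he deposited the money in the bank yesterday".split()
print(bow_context(tokens, tokens.index("bank")))
# Counter({'money': 1, 'in': 1, 'the': 1, 'yesterday': 1})
```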

Experimental evaluation

This experiment evaluates the performance of the proposed method on several Senseval/Semeval benchmark examples. Senseval/Semeval is a series of international evaluation exercises devoted to the evaluation of WSD systems. We consider three kernels for comparison, i.e., the BOW kernel, the semantic diffusion kernel (exponential kernel) and the sprinkled semantic diffusion kernel. These kernels are embedded in the SVM classifier individually. In addition, we use the MFS (most frequent sense) baseline as a reference point.
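For reference, the MFS baseline simply predicts, for each target word, the sense observed most often in its training examples; a minimal sketch (ours):

```python
# MFS (most frequent sense) baseline: ignore the context entirely and always
# predict the majority sense from the training data (illustrative sketch).
from collections import Counter

def mfs_baseline(train_senses):
    """Return a predictor that always outputs the most frequent sense."""
    most_frequent = Counter(train_senses).most_common(1)[0][0]
    return lambda context: most_frequent

predict = mfs_baseline(["bank/finance", "bank/finance", "bank/river"])
print(predict("he sat on the bank"))  # bank/finance
```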

Conclusion and further study

We have presented a novel sprinkled semantic diffusion kernel which incorporates class knowledge into the diffusion process of mining higher order correlations between terms for WSD tasks. Since sprinkled terms are essentially class labels, their inclusion helps to artificially promote co-occurrences between existing terms and classes. As a result, our approach can be considered a supervised semantic smoothing kernel that makes use of class knowledge. Experimental evaluation shows that the proposed kernel significantly improves disambiguation performance over the semantic diffusion kernel and yields results competitive with state-of-the-art kernel methods for WSD.

Acknowledgments

The authors are grateful to the anonymous referees whose constructive and insightful comments helped to substantially improve the paper. This work is supported in part by the National Natural Science Foundation of China (No. 61562003), the Natural Science Foundation of Jiangxi Province of China (No. 20161BAB202070), the China Scholarship Council (No. 201508360144) and the “Bai Ren Yuan Hang” Project of Jiangxi Province of China in 2015.

References

  • Bruce, R.F., Wiebe, J., 1994. Word-sense disambiguation using decomposable models. In: Proceedings of the 32nd Annual...
  • Chakraborti, S., Lothian, R., Wiratunga, N., Watt, S., 2006. Sprinkling: supervised Latent Semantic Indexing. In:...
  • Chakraborti, S., Mukras, R., Lothian, R., Wiratunga, N., Watt, S., Harper, D., 2007. Supervised Latent Semantic...
  • Cristianini, N., et al., 2002. Latent semantic kernels. J. Intell. Inf. Syst.
  • Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res.
  • Fan, R.E., et al., 2008. LIBLINEAR: A library for large linear classification. J. Mach. Learn. Res.
  • Giuliano, C., et al., 2009. Kernel methods for minimally supervised WSD. Comput. Linguist.
  • Gliozzo, A., Giuliano, C., Strapparava, C., 2005. Domain kernels for word sense disambiguation. In: Proceedings of the...
  • Gönen, M., et al., 2011. Multiple kernel learning algorithms. J. Mach. Learn. Res.
  • Hingmire, S., Chakraborti, S., 2014. Sprinkling topics for weakly supervised text classification. In: Proceedings of...
  • Hofmann, T., et al., 2008. Kernel methods in machine learning. Ann. Statist.
  • Hsu, C.W., et al., 2002. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw.
  • Jin, P., Li, F., Zhu, D., Wu, Y., Yu, S., 2008. Exploiting external knowledge sources to improve kernel-based word...