Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary

https://doi.org/10.1016/j.ipm.2005.04.005

Abstract

Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD that uses only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns a similarity matrix between word pairs from the unlabeled corpus, and it uses vector representations of the sense definitions in the MRD, which are derived from the similarity matrix. To disambiguate all occurrences of polysemous words in a sentence, the system constructs a separate acyclic weighted digraph (AWD) for each occurrence of a polysemous word. The AWD is structured by considering the senses of the context words that occur with the target word in the sentence. After building the AWD for each polysemous word, we search for the optimal path through the AWD with the Viterbi algorithm and assign to the target word the sense lying on that optimal path. In experiments, our system achieves 76.4% accuracy on semantically ambiguous Korean words.

Introduction

The ambiguity of a word sense is a natural phenomenon in which a lexical form has two or more senses. WSD is meant to resolve this ambiguity, that is, to assign the appropriate sense to a polysemous word according to its context. Disambiguating polysemous words is difficult, but it is very important in natural language processing applications such as machine translation (MT) and information retrieval (IR): for MT we must choose the target-language word corresponding to a polysemous word in the source context, and for IR we must identify the appropriate sense of polysemous words in a query in order to retrieve accurate information from text. Despite the importance of WSD, WSD systems are rarely used in application fields, because such systems typically learn only a few target words and their performance is low. In addition, it takes a long time to implement a WSD system because of the need to create sense-labeled examples and construct a thesaurus: this is referred to as the knowledge acquisition bottleneck.

Therefore, we present a way to construct a WSD system that can be implemented easily by learning all polysemous words at once, covers all polysemous words listed in the MRD, escapes the knowledge acquisition bottleneck, and still achieves reasonable accuracy.

Section snippets

Related work

Several techniques for WSD have been reported. It has been common to use two kinds of resources: a dictionary and corpora. The first resource, a dictionary, is chosen on the premise that the headwords of a dictionary are closely associated with their corresponding sense definitions. Lesk (1986) used the number of words shared between the sense definition of a polysemous word and the sense definitions of its context words. Wilks et al. (1990) defined the related words as frequently
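As a rough illustration of the definition-overlap idea attributed to Lesk (1986) above, the sketch below scores each sense of a target word by how many words its dictionary definition shares with the definitions of its context words. The function names, the toy English definitions, and the simple whitespace tokenization are illustrative assumptions, not details from the paper.

```python
# A minimal sketch of a Lesk-style overlap count: for each sense of a target
# word, count how many words its dictionary definition shares with the
# definitions of the context words. Names and data here are illustrative.

def lesk_score(sense_definition: str, context_definitions: list[str]) -> int:
    """Count words shared between one sense definition and the context definitions."""
    sense_words = set(sense_definition.lower().split())
    overlap = 0
    for definition in context_definitions:
        overlap += len(sense_words & set(definition.lower().split()))
    return overlap

def pick_sense(sense_definitions: dict[str, str], context_definitions: list[str]) -> str:
    """Return the sense label whose definition overlaps most with the context."""
    return max(sense_definitions,
               key=lambda s: lesk_score(sense_definitions[s], context_definitions))

# Toy example with English stand-ins for readability:
senses = {
    "bank/finance": "an institution that accepts deposits and lends money",
    "bank/river": "the sloping land beside a body of water",
}
context = ["money kept in an account", "interest paid on deposits"]
print(pick_sense(senses, context))   # -> "bank/finance"
```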

Learning step

The goal of the learning step is to gain two resources: a similarity matrix from a corpus and vector representations of sense definitions from an MRD. Lee (1999a) constructed a useful knowledge base with an MRD, 'WooriMalKeunSajeon'.

Fig. 1 shows the architecture of the learning step.

The first step acquires co-occurrence probabilities from a POS-tagged corpus. The POS-tagged corpus is generated automatically by the Sogang POS tagger. The Sogang POS tagger assigns the most
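The snippet above only names the two learned resources, so the following sketch shows one plausible instantiation: a word-similarity matrix built from sentence-level co-occurrence counts (compared with cosine similarity), and a sense-definition vector taken as the mean of the similarity-matrix rows of the definition's words. The actual co-occurrence probabilities, similarity measure, and aggregation used in the paper may differ; all names below are illustrative.

```python
# A minimal sketch of the two learning-step resources, assuming (as one
# plausible instantiation) cosine similarity of co-occurrence count vectors
# and mean-of-rows definition vectors. Names are illustrative, not the paper's.
import numpy as np

def similarity_matrix(sentences: list[list[str]], vocab: list[str]) -> np.ndarray:
    """Build a |V| x |V| word-similarity matrix from sentence-level co-occurrence."""
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in sentences:
        words = [w for w in sent if w in index]
        for i, w1 in enumerate(words):
            for w2 in words[i + 1:]:
                counts[index[w1], index[w2]] += 1
                counts[index[w2], index[w1]] += 1
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    unit = counts / norms
    return unit @ unit.T          # cosine similarity between co-occurrence rows

def definition_vector(definition: list[str], vocab: list[str], sim: np.ndarray) -> np.ndarray:
    """Represent a sense definition as the mean similarity-matrix row of its words."""
    index = {w: i for i, w in enumerate(vocab)}
    rows = [sim[index[w]] for w in definition if w in index]
    return np.mean(rows, axis=0) if rows else np.zeros(len(vocab))
```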

Disambiguation step

The disambiguation step is composed of two phases: constructing the AWD and assigning the most appropriate sense to a polysemous word. In the disambiguation step, input sentences that were not used in the learning step are provided. We choose k context words from an input sentence in order to disambiguate a polysemous word efficiently. Using the k context words and the target word, the system processes an input sentence by building an acyclic weighted digraph over their word senses, with each
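Since the snippet is truncated before the search itself, the sketch below illustrates a Viterbi-style maximum-weight path search over a layered acyclic digraph in which each layer holds the candidate senses of one word (the k context words plus the target) and each sense is represented by a definition vector as in the learning step. The cosine edge weights and the function names are assumptions for illustration, not the paper's exact formulation; the sense assigned to the target word would be read off its layer's position on the returned path.

```python
# A minimal sketch of a Viterbi search over a layered acyclic weighted digraph:
# layers[i] holds the sense-definition vectors of word i, and edges between
# adjacent layers are weighted here by cosine similarity (an assumption).
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def viterbi_awd(layers: list[list[np.ndarray]]) -> list[int]:
    """Return one sense index per layer along the maximum-weight path."""
    score = [0.0] * len(layers[0])               # best path score ending at each node
    back = []                                    # back-pointers per layer transition
    for prev, curr in zip(layers, layers[1:]):
        new_score, pointers = [], []
        for v in curr:
            cand = [score[j] + cosine(u, v) for j, u in enumerate(prev)]
            best = int(np.argmax(cand))
            pointers.append(best)
            new_score.append(cand[best])
        back.append(pointers)
        score = new_score
    # Trace the optimal path backwards from the best final node.
    path = [int(np.argmax(score))]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```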

Data set

In order to test the performance of our system, we built test data sets with the most semantically ambiguous Korean words; the target words consist of four nouns (Bae, Gogae, Jeongi, Sagi) and two verbs (Bbajida, Tada). They have 3.83 senses per target word on average. For the test data sets, we randomly collected 869 POS error-free sentences from the Chosun Ilbo newspaper from 1996 to 1997; therefore, our data set might be biased toward one sense that is commonly used in

Conclusion and future work

We proposed a method for WSD using the acyclic weighted digraph. We built the similarity matrix for word pairs from a POS-tagged corpus, which was obtained from a raw text corpus by the Sogang POS tagger, and represented each sense definition from the MRD with its own vector using the word vectors of the similarity matrix. Based on these two resources, the similarity matrix and the vector representations of sense definitions, we structured the AWD and were able to assign the most appropriate sense to all polysemous words

References (14)

  • J. Allen. Natural language understanding (1994).
  • Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In...
  • Brown, P., Della Pietra, S., Della Pietra, V., & Mercer, R. (1991). Word-sense disambiguation using statistical...
  • Cho, J. M. (1998). Verb sense disambiguation using corpus and dictionary. Ph.D. thesis, Department of Computer Science...
  • H. Ellis et al. Fundamentals of data structures in C++ (1995).
  • W. Gale et al. A method for disambiguating word senses in a large corpus (1993).
  • Lee, H. (1999a). Construction of Korean lexical knowledge base using Korean machine readable dictionary. MS thesis,...
