Unsupervised word sense disambiguation for Korean through the acyclic weighted digraph using corpus and dictionary
Introduction
Word sense ambiguity is a natural phenomenon in which a single lexical form has two or more senses. Word sense disambiguation (WSD) aims to resolve this ambiguity, that is, to assign the appropriate sense to a polysemous word according to its context. Disambiguating polysemous words is difficult, but it matters greatly in natural language processing applications such as machine translation (MT) and information retrieval (IR). In MT, we must choose the target-language word that corresponds to a polysemous word in the source context; in IR, we must identify the appropriate sense of the polysemous words in a query in order to retrieve accurate information from a text. Despite this importance, WSD systems are rarely used in applications, because they typically learn only a few target words and perform poorly. In addition, implementing a WSD system takes a long time because sense-labeled examples must be created and a thesaurus constructed; this is referred to as the knowledge acquisition bottleneck.
Therefore, we present a way to construct a WSD system that is easy to implement because it learns all polysemous words at once, covers every polysemous word listed in a machine-readable dictionary (MRD), and escapes the knowledge acquisition bottleneck while still achieving acceptable accuracy.
Related work
Several techniques for WSD have been reported. Two kinds of resources are commonly used: a dictionary and corpora. The first resource, a dictionary, is chosen on the premise that the headwords of a dictionary are closely associated with their corresponding sense definitions. Lesk (1986) counted the words shared between the sense definitions of a polysemous word and the sense definitions of its context words. Wilks et al. (1990) defined the related words as frequently
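The definition-overlap idea attributed to Lesk (1986) above can be illustrated with a minimal sketch. This is a simplified variant written for illustration, not the exact procedure of any cited work: the function names `overlap` and `lesk_choose`, the whitespace tokenization, and the English toy definitions are all assumptions.

```python
def overlap(def_a, def_b):
    """Count the distinct words shared by two sense definitions."""
    return len(set(def_a.lower().split()) & set(def_b.lower().split()))


def lesk_choose(target_senses, context_definitions):
    """Score each target sense by its total overlap with the context definitions.

    target_senses: dict mapping a sense id to its definition string.
    context_definitions: definition strings of the senses of the context words.
    Returns the best sense id and the full score table.
    """
    scores = {
        sense_id: sum(overlap(definition, ctx) for ctx in context_definitions)
        for sense_id, definition in target_senses.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two senses of "cone" and two senses of the context word "pine".
cone_senses = {
    "cone_shape": "solid body which narrows to a point",
    "cone_fruit": "fruit of certain evergreen trees",
}
pine_definitions = [
    "kinds of evergreen tree with needle-shaped leaves",
    "waste away through sorrow or illness",
]

best, scores = lesk_choose(cone_senses, pine_definitions)
print(best, scores)
```

With these toy definitions, the fruit sense wins because it shares "of" and "evergreen" with the first definition of the context word. The match on "of" shows why practical variants usually filter function words before counting.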
Learning step
The goal of the learning step is to obtain two resources: a similarity matrix from a corpus and vector representations of sense definitions from an MRD. Lee (1999a) constructed a useful knowledge base from an MRD, ‘WooriMalKeunSajeon’.
Fig. 1 shows the architecture of the learning step.
The first step acquires co-occurrence probabilities from a POS-tagged corpus. The POS-tagged corpus is generated automatically by the Sogang POS tagger. The Sogang POS tagger assigns the most
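The two resources of the learning step can be sketched roughly as follows. The snippet does not give the exact estimator, so everything here is an assumption made for illustration: co-occurrence is counted at sentence level, the probability is estimated as a simple relative frequency, a sense vector is taken to be the sum of the rows of its definition words, and the names `cooccurrence_probs` and `sense_vector` as well as the toy corpus are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_probs(sentences):
    """Estimate P(v | w) as (#sentences containing w and v) / (#sentences containing w).

    sentences: list of token lists, assumed to be already POS-filtered.
    Returns a dict mapping (w, v) to the estimated probability.
    """
    word_count = Counter()
    pair_count = Counter()
    for sentence in sentences:
        words = sorted(set(sentence))  # count each word once per sentence
        word_count.update(words)
        for w, v in combinations(words, 2):
            pair_count[(w, v)] += 1
            pair_count[(v, w)] += 1
    return {(w, v): c / word_count[w] for (w, v), c in pair_count.items()}


def sense_vector(definition_words, probs, vocabulary):
    """Represent a sense definition as the sum of the matrix rows of its words."""
    return [
        sum(probs.get((d, v), 0.0) for d in definition_words)
        for v in vocabulary
    ]


# Hypothetical toy corpus of three tokenized sentences.
corpus = [
    ["boat", "river", "sail"],
    ["boat", "sea"],
    ["pear", "fruit", "eat"],
]
probs = cooccurrence_probs(corpus)
vector = sense_vector(["river", "sea"], probs, vocabulary=["boat", "fruit"])
print(probs[("boat", "river")], vector)
```

In this toy corpus, "boat" appears in two sentences and co-occurs with "river" in one of them, so the estimated P(river | boat) is 0.5. A definition made of "river" and "sea" gets all its weight on the "boat" dimension and none on "fruit".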
Disambiguation step
The disambiguation step is composed of two phases: constructing the acyclic weighted digraph (AWD) and assigning the most appropriate sense to a polysemous word. In the disambiguation step, the system is given input sentences that were not used in the learning step. To disambiguate a polysemous word efficiently, we choose k context words from the input sentence. Using the k context words and the target word, the system processes an input sentence by building an acyclic weighted digraph over its word senses with each
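One plausible reading of the two phases above is a layered graph with one layer per word and one node per sense, followed by a search for the heaviest path. The following is a minimal sketch under that reading only: the snippet does not specify how edges are weighted or how the final sense is selected, so the cosine edge weights, the dynamic-programming search, the function names `cosine` and `best_sense_path`, and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_sense_path(layers):
    """Find the maximum-weight path through a layered acyclic digraph.

    layers: non-empty list with one dict per word, mapping sense id to sense vector.
    Edges run from every sense of word i to every sense of word i + 1,
    weighted by the cosine similarity of the two sense vectors.
    Returns the sense ids along the heaviest path, one per word.
    """
    prev_layer = layers[0]
    # best[s] = (weight of the heaviest path ending at sense s, that path)
    best = {s: (0.0, [s]) for s in prev_layer}
    for layer in layers[1:]:
        new_best = {}
        for sense_id, vec in layer.items():
            score, path = max(
                ((best[p][0] + cosine(prev_layer[p], vec), best[p][1])
                 for p in prev_layer),
                key=lambda item: item[0],
            )
            new_best[sense_id] = (score, path + [sense_id])
        best, prev_layer = new_best, layer
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical toy sentence: context word, ambiguous target word, context word.
layers = [
    {"river": [1.0, 0.0]},
    {"bank_river": [1.0, 0.0], "bank_money": [0.0, 1.0]},
    {"sail": [1.0, 0.0]},
]
print(best_sense_path(layers))
```

Because every edge points from one word to the next, the graph is acyclic by construction, and the search visits each edge once. In the toy input, the path through `bank_river` accumulates weight 2.0 while the path through `bank_money` accumulates 0.0, so the former sense is selected.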
Data set
To test the performance of our system, we built test data sets from highly ambiguous Korean words; the target words consist of four nouns (Bae, Gogae, Jeongi, Sagi) and two verbs (Bbajida, Tada). They have 3.83 senses per target word on average. For the test data sets, we randomly collected 869 sentences free of POS errors from the Chosun-Illbo newspaper from 1996 to 1997; therefore, our data set might be biased toward one sense which is commonly used in
Conclusion and future work
We proposed a method for WSD using the acyclic weighted digraph (AWD). We built the similarity matrix for word pairs from a POS-tagged corpus, which was obtained from a raw text corpus with the Sogang POS tagger, and represented every sense definition in the MRD as its own vector using word vectors from the similarity matrix. Based on these two resources, the similarity matrix and the vector representations of sense definitions, we constructed the AWD and were able to assign the most appropriate sense to all polysemous words
References (14)
- Natural language understanding (1994)
- Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet. In...
- Brown, P., Della Pietra, S., Della Pietra, V., & Mercer, R. (1991). Word-sense disambiguation using statistical...
- Cho, J. M. (1998). Verb sense disambiguation using corpus and dictionary. Ph.D. thesis, Department of Computer Science...
- et al. Fundamentals of data structures in C++ (1995)
- et al. A method for disambiguation word sense in a large corpus (1993)
- Lee, H. (1999a). Construction of Korean lexical knowledge base using Korean machine readable dictionary. MS thesis,...