Introduction

Unlike English, Korean does not mark word boundaries within an Eojeol, which is both the Korean spacing unit and a Korean syntactic unit. Generally, an Eojeol is composed of simple or complex words followed by optional inflectional affixes. A complex word may consist of several simple or derived words. For example, the Eojeol 'a-phu-li-kha-ki-a-mwun-cey-lul' consists of a compound noun 'a-phu-li-kha-ki-a-mwun-cey' (African starvation problem) and an accusative case marker lul. The compound noun 'a-phu-li-kha-ki-a-mwun-cey' is further decomposed into three simple nouns, a-phu-li-kha (Africa), ki-a (starvation), and mwun-cey (problem). Such decomposition is crucial for identifying source-language words for transfer dictionary lookup in machine translation, or for obtaining index terms in information retrieval (IR). Thus, word segmentation (more precisely, Eojeol segmentation) is a key first step in Korean language processing.

In particular, decomposition of compound nouns is nontrivial in Korean (Kang, 1993, 1998; Choi, 1996; Jang and Myaeng, 1996; Park et al., 1996; Yun et al., 1997; Lee et al., 1999; Shim, 1999; Yoon, 2001; Min et al., 2003; Park et al., 2004), since compound nouns may contain unknown component words and exhibit segmentation ambiguity. Today, due to globalization, unknown words, especially nominals, are growing faster than ever before. To attack the compound noun segmentation problem, practical and popular approaches rely on dictionaries, sets of pre-segmented compound nouns, and heuristics such as longest matching and/or a preference for the fewest components. However, such supervised approaches typically suffer from the unknown word problem, and cannot distinguish domain-oriented or corpus-directed segmentation results from others.

The problems inherent to supervised segmentation methods can be solved by unsupervised statistical approaches, since they assume neither pre-existing dictionaries nor pre-segmented corpora, and segment a compound noun using segmentation probabilities between two characters that are estimated from a raw corpus. For example, in order to segment a character sequence $w = c_1 \cdots c_n$, first, $n-1$ segmentation probabilities between adjacent characters are calculated, one for each segmentation point. Next, w is segmented at a critical segmentation point that has the globally or locally maximum segmentation probability. Then, each of the two resulting sequences is segmented recursively at its own critical point.
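For concreteness, the following sketch illustrates this recursive critical-point strategy (it is an illustration of the approaches reviewed here, not the method proposed in this article). The function seg_prob is a hypothetical estimator of the probability of a word boundary between two adjacent characters, and threshold is the stopping value such methods must tune.

```python
def recursive_segment(chars, seg_prob, threshold):
    """Recursively split `chars` at the point with the highest boundary probability."""
    if len(chars) < 2:
        return [chars]
    # n-1 segmentation probabilities, one per adjacent character pair.
    scores = [seg_prob(chars[i], chars[i + 1]) for i in range(len(chars) - 1)]
    cut = max(range(len(scores)), key=lambda i: scores[i])
    # A threshold is needed to avoid over-segmentation (the tuning problem noted below).
    if scores[cut] < threshold:
        return [chars]
    left, right = chars[:cut + 1], chars[cut + 1:]
    return recursive_segment(left, seg_prob, threshold) + recursive_segment(right, seg_prob, threshold)
```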

However, many unsupervised approaches (Sproat and Shih, 1990; Lua and Gan, 1994; Chen et al., 1997; Maosong et al., 1998; Shim, 1999; Ando and Lee, 2003) do not consider all possible segmentation candidates, and simply rely on a one-dimensional character-based segmentation context that is in most cases limited to adjacent characters. They are therefore prone to falling into local optima. Moreover, since they do not exploit word-based context, threshold mechanisms are required to avoid under- or over-segmentation, and, as far as we know, no systematic method has been proposed for determining threshold values for segmentation decisions. Some other unsupervised methods (Ge et al., 1999; Huang et al., 2002; Peng et al., 2002) attempt to select the best candidate from all possible segmentations using the EM algorithm, with all-length n-gram statistics as segmentation clues. However, all-length n-grams may not provide sufficient evidence for word boundaries.

In order to deal with these problems of unsupervised statistical segmentation methods, this article proposes a collection-based compound noun segmentation model that searches for the most likely segmentation candidate among all possible candidates using a word-based segmentation context. A legal segment (or component) of a compound noun is identified as an entry in a collection dictionary, which is composed of stems gathered from the target corpus. Stems are obtained automatically from the Eojeols in a corpus by deleting inflectional suffixes. In our approach, 'collection-based' means that the model relies only on the given corpus, without any external resources: given a corpus, the segmentation model is created from that corpus alone and is then applied to segment the same corpus. This feature is useful for domain-specific applications where domain-specific dictionaries may be difficult to obtain. To demonstrate the usefulness of our segmentation algorithm, this study evaluates its application to Korean information retrieval.

The remainder of this article is organized as follows. Section 2 summarizes related work. Section 3 describes the collection-based segmentation method and the automatic construction of a collection dictionary, together with a segmentation example. Sections 4 and 5 report the results of segmentation evaluations and retrieval experiments. Finally, concluding remarks are given in Section 6. For representing Korean expressions, the Yale romanization is used.

Related work

Overview

While Korean lacks word boundaries within an Eojeol, Chinese and Japanese use no word delimiters at all in written text. Thus, much research has also been conducted on word segmentation of Chinese and Japanese. Previous studies can be classified into supervised and unsupervised methods, according to whether human supervision about word boundaries is required. Since this paper concerns a dictionary-less unsupervised approach, only unsupervised methods that use neither dictionaries nor pre-segmented corpora are reviewed here. For supervised methods, other studies (Wu and Tseng, 1995; Nie and Ren, 1999; Ogawa and Matsuda, 1999; Ando and Lee, 2003) summarize Chinese and Japanese segmentation well.

Most unsupervised segmentation methods for Chinese and Japanese depend on n-gram statistics from raw corpora, such as bi-grams (Sproat and Shih, 1990; Lua and Gan, 1994; Chen et al., 1997; Maosong et al., 1998) or all-length n-grams (Ge et al., 1999; Huang et al., 2002; Peng et al., 2002; Ando and Lee, 2003). Bi-gram statistics are normally collected to compute an association measure such as mutual information (MI) between two adjacent characters in text. All-length n-gram statistics are used in the EM algorithm (Ge et al., 1999; Huang et al., 2002; Peng et al., 2002) or in the TANGO algorithm (Ando and Lee, 2003). To resolve segmentation ambiguity, all these approaches except the EM-based ones rely on empirical parameters such as thresholds. EM-based segmentation depends on all-length n-gram statistics, which may be insufficient for identifying word boundaries. Unlike these previous methods, our algorithm uses stems and their frequencies obtained from a corpus as segmentation clues.
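For reference, the pointwise mutual information between two adjacent characters $c_i$ and $c_{i+1}$, as typically used as a character-based segmentation clue in the bi-gram approaches cited above, is defined as

$$ {\rm MI}(c_i, c_{i+1}) = \log \frac{P(c_i\, c_{i+1})}{P(c_i)\,P(c_{i+1})} $$

A high MI value suggests that the two characters belong to the same word, whereas a low value suggests a word boundary between them; the cut-off separating the two cases is exactly the kind of empirical threshold mentioned above.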

Korean compound noun segmentation

Most research on Korean compound noun segmentation basically uses a lexicon to identify segmentation candidates. To resolve segmentation ambiguity, it additionally uses heuristic rules (Kang, 1993, 1998; Choi, 1996; Yun et al., 1997) and/or statistics from segmented compound nouns (Yun et al., 1997; Lee et al., 1999; Yoon, 2001). One example of a heuristic rule is that a 4-character compound noun is more likely to be segmented into 2+2 (two 2-character words) than into 1+3 (a 1-character word and a 3-character word) or 3+1. As statistical information for resolving segmentation ambiguity, obtained from a set of manually segmented compound nouns, Yun et al. (1997) rely on the probabilities that a simple word is used as a specifier, an intermediate, or a head in a compound noun; Lee et al. (1999) estimate probabilities of semantic association between adjacent constituents in a compound noun; and Yoon (2001) uses lexical probabilities in a min-max disambiguation procedure. In summary, most segmentation approaches for Korean compound nouns are dictionary-based supervised approaches with statistical disambiguation devices, so they suffer severely from the unknown word problem.

Some researchers have proposed partially unsupervised segmentation methods for Korean (Jang and Myaeng, 1996; Park et al., 1996; Shim, 1999). Jang and Myaeng (1996) represent a list of nouns (including compound nouns), obtained by POS-tagging a corpus, in the form of forward and backward tries, where each node corresponds to one character and dummy nodes are inserted after prefixes common to several nouns while each noun is matched forward and backward. For example, the three compound nouns tay-hak-sayng-sen-kyo-hoy (a missionary party for university students), tay-hak-sayng (university students), and kyo-hoy (a church) would generate the following forward and backward tries.

  • Forward trie: tay-hak-sayng + dummy + sen-kyo-hoy,

  • Backward trie: tay-hak-sayng-sen + dummy + kyo-hoy.

Then, a new compound noun is matched forward or backward over the two tries where dummy nodes indicate segmentation points. However, Jang and Myaeng (1996) use heuristic rules to resolve segmentation ambiguity.

Park et al. (1996) extract nouns from a corpus with an HMM model to build a noun dictionary, from which all segmentation candidates of a compound noun w in document d are identified. They then select the best segmentation candidate according to the Kullback-Leibler divergence, calculated between the distributions of each segmentation candidate s and document d, where each distribution is defined from term distributions over the document collection. That is, they prefer the segmentation candidate whose decomposed terms are distributed most similarly to the terms of d over the document collection. Their method thus assumes that, for a compound noun w in document d, correctly decomposed terms are more likely to appear in d than incorrectly decomposed ones. Unfortunately, Park et al. (1996) did not evaluate segmentation accuracy; they only reported the effect of segmentation on retrieval effectiveness using a small Korean test collection.

For each adjacent character pair $(s_i, s_{i+1})$ in a compound noun, Shim (1999) calculates a segmentation strength as a linear combination of four kinds of mutual information (MI), which can be informally rephrased as (1) the probability that $s_i$ is followed by $s_{i+1}$ in a compound noun, (2) the probability that there is a word boundary between $s_i$ and $s_{i+1}$, (3) the probability that a compound noun ends with the character $s_i$, and (4) the probability that a compound noun starts with the character $s_{i+1}$. These probabilities can be collected from a raw corpus. Using a threshold, Shim (1999) segments between characters recursively. Unfortunately, he resorts to a dictionary to stop segmentation: for instance, if both segmented halves are in the dictionary, segmentation stops.

In summary, previous unsupervised segmentation methods for Korean partially rely on pre-existing dictionaries, heuristic rules, or threshold values. So, they require human labor for dictionary maintenance, rule adaptation, and threshold tuning.

Collection-based compound noun segmentation

Building a collection dictionary

In our dictionary-less unsupervised approach to compound noun segmentation, we use word-based context, whereas previous unsupervised methods use character-based context. Thus, a list of words must be acquired from a raw corpus C. One standard method for Korean is to run a morphological analyzer or a POS-tagger on C. However, considering that a raw corpus generally contains many unknown words such as proper nouns, domain jargon, technical terms, transliterated terms, and acronyms, morphological analysis is prone to generating many spurious words.

The other method uses a list of inflectional suffixes to simply delete word endings from an Eojeol, producing a stem. For example, the Eojeol 'se-wul-ey-se-pwu-the-nun' consists of a stem se-wul (Seoul) and a sequence of three inflectional suffixes, ey-se (in), pwu-the (from), and nun (a topic marker). From this Eojeol, the stem se-wul (Seoul) is obtained by deleting the longest word ending, ey-se-pwu-the-nun. Compared to the morphological analysis or tagging-based method, which depends on an open set of known words, this suffix-based method relies on a closed set of suffixes. In addition, this suffix-stripping technique is simple and fast. We adopt it to produce a collection dictionary, using a list of 7,434 Korean complex (inflectional) suffixes.
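A minimal sketch of this dictionary-building step is given below, assuming a pre-compiled suffix list (the 7,434-entry list itself is not reproduced here); the names and the example call are illustrative only.

```python
from collections import Counter

def build_collection_dictionary(eojeols, suffixes):
    """Collect stems and their frequencies by stripping the longest matching
    inflectional suffix from each Eojeol (suffix-stripping method of Section 3.1)."""
    # Sort suffixes by length so that the longest match is tried first.
    suffixes = sorted(suffixes, key=len, reverse=True)
    stems = Counter()
    for eojeol in eojeols:
        stem = eojeol
        for suf in suffixes:
            if eojeol.endswith(suf) and len(eojeol) > len(suf):
                stem = eojeol[:-len(suf)]
                break  # longest suffix stripped; keep the remaining stem
        stems[stem] += 1
    return stems  # acts as the collection dictionary D, with stem frequencies

# Illustrative use, with romanized syllables standing in for Hangul characters:
# build_collection_dictionary(["sewuleysepwuthenun"], ["eysepwuthenun", "nun"])
```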

Segmentation algorithm

Let us assume that we have a corpus C, which is viewed as a document collection in information retrieval. We want to segment an n-character compound noun $w = c_1 \cdots c_i \cdots c_n$ ($c_i$ is the i-th character), which occurs in C. As an alternative notation for $w = c_1 \cdots c_n$, we use $c_{1n}$. First, a collection dictionary D is created from C using the suffix-based method described in Section 3.1. Note that an entry e in D is a simple or complex word obtained from an Eojeol in C by deleting trailing function words.

In order to find the most likely segmentation candidate $S^*$ of w, we calculate Formula (1), where the k-th segmentation candidate of w is represented as $S_k = s_1 \cdots s_j \cdots s_{m(k)}$ ($s_j$ is the j-th segment, $m(k) \le n$, and m(k) is the number of segments of $S_k$). Note that a segment covers one or more contiguous characters in w. In Formula (1), we interpret $P(S_k \mid C)$ as the probability that w is decomposed into $s_1, s_2, \ldots, s_{m(k)}$. In addition, occurrences of the segments of a compound noun are assumed to be independent.

$$\displaylines{ S^* = {\mathop {{\rm arg}\;{\rm max}}\limits_{S_k = s_1 , \ldots ,s_{m(k)} } } \quad{P(S_k\,|\,C)} \cr \cr = {\mathop {{\rm arg}\;{\rm max}}\limits_{S_k = s_1 , \ldots ,s_{m(k)} } } \quad {\prod\limits_{i = 1}^{m(k)} {P(s_i\,|\,C)} } }$$
(1)

However, Formula (1) tends to produce the segmentation candidate with the smallest number of segments. In general, this reflects a heuristic of dictionary-based supervised segmentation methods: if other conditions are equal, the segmentation candidate with the fewest segments is preferred. In unsupervised segmentation, however, Formula (1) would divide the input string into a few large segments; that is, a naive application of Formula (1) may under-segment the input. We obviate this problem by pruning unhelpful segmentation candidates from the search space using the following two constraints.

  • Constraint 1: When the search space of a compound noun is enumerated in a top-down two-way decomposition manner, do not further decompose a segment whose length is smaller than K.

  • Constraint 2: When the search space of a compound noun is enumerated in a bottom-up two-way combination manner, if a segment (XY) can be obtained from the concatenation (X+Y) of two smaller segments (X and Y), do not generate the segment (XY) itself.

Figure 1 shows the pruned search space of the string 'abcd' for each of the two constraints. On the left side, bold-faced segmentation candidates survive constraint 1 with K = 4, since none of the segmentation candidates at level 2 contains a segment whose length is 4 or greater. On the right side, only a+b+cd survives constraint 2 when the collection dictionary contains a, b, ab, bc, cd, and abc. In cell (ii,1), the segment ab is ignored because the cell can be constructed from a+b even though ab is in the collection dictionary. Cell (ii,2), however, can hold bc, since no combination of smaller segments can create bc and bc is found in the collection dictionary. Thus, constraints 1 and 2 shape the feasible set of hypotheses by hindering over-segmentation and under-segmentation of a compound noun, respectively.

Fig. 1 Example of applying the two constraints to segmenting 'abcd'

In order to modify Formula (1) with the above constraints, we define a probability $\delta$ inductively as in Formula (2), where the basis step deals with constraint 1 and the check on the value of $\beta$ in the induction step reflects constraint 2. $\delta(c_{pq} \mid C)$ denotes the probability of the most likely segmentation candidate of $c_{pq}$, the substring of $c_{1n}$ covering the contiguous characters from the p-th to the q-th character. For example, for $c_{15}$ = kwuk-cey-wen-yu-ka (an international crude oil price), $c_{35}$ is wen-yu-ka (a crude oil price) and $c_{55}$ is ka (price). The parameter K is the minimum character length of a substring handled by the induction step. For example, K=4 means that, for a given input string, we do not try to segment substrings of 2 or 3 characters into smaller constituents, but substrings of 4 or more characters are decomposed if possible.

$$\displaylines{
{\bf Basis\ Step}:\ (q - p + 1) < K \cr
\delta(c_{pq}\,|\,C) = P(c_{pq}\,|\,C) \cr
\sigma(c_{pq}) = c_{pq} \cr
{\bf Induction\ Step}:\ (q - p + 1) \ge K \cr
r^* = \mathop{{\rm arg}\;{\rm max}}\limits_{p \le r \le q - 1}\; \beta(c_{pq}, r, C) \cr
\delta(c_{pq}\,|\,C) = \left\{ \begin{array}{ll} P(c_{pq}\,|\,C), & {\rm if}\ \beta(c_{pq}, r^*, C) = 0 \\ \beta(c_{pq}, r^*, C), & {\rm otherwise} \end{array} \right. \cr
\sigma(c_{pq}) = \left\{ \begin{array}{ll} c_{pq}, & {\rm if}\ \beta(c_{pq}, r^*, C) = 0 \\ \sigma(c_{pr^*}) + \sigma(c_{(r^*+1)q}), & {\rm otherwise} \end{array} \right. \cr
\beta(c_{pq}, r, C) = \delta(c_{pr}\,|\,C)\;\delta(c_{(r+1)q}\,|\,C)
}$$
(2)

$P(c_{pq}\,|\,C)$, the probability that the character string $c_{pq}$ is generated from C, is obtained by maximum likelihood estimation as in Formula (3), where freq(x) is the frequency of x in C and D is the collection dictionary.

$$ P(c_{pq}\,|\,{C}) = \frac{{{\it freq}(c_{pq} )}}{{\sum_{c \in D} {{\it freq}(c)} }} $$
(3)

The probability function $\beta(c_{pq}, r, C)$ gives the probability that the character string $c_{pq}$ is decomposed into $c_{pr}$ and $c_{(r+1)q}$ in C. Using the recurrence relation of Formula (2), we can efficiently calculate the probability of the most likely segmentation candidate $S^*$ of a compound noun $w = c_{1n}$, as in Formula (4). To obtain $S^*$ from Formula (4), it is necessary to store the best segmentation candidate at each step of Formula (2); for this purpose, Formula (2) maintains $\sigma(c_{pq})$ to hold the partial segmentation result.

$$ P(S^*\,|\,C) = \delta(c_{1n}\,|\,C) $$
(4)
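The recurrence in Formula (2) lends itself to a bottom-up dynamic programming computation (as noted in the example below). The following sketch is one possible rendering, assuming the collection dictionary is given as a frequency table; the function and variable names are illustrative.

```python
def segment(word, freq, total, K=3):
    """Most likely segmentation of `word` under Formulas (2)-(4).

    freq  : dict mapping dictionary entries (stems) to corpus frequencies
    total : sum of all frequencies in the collection dictionary (denominator of Formula (3))
    K     : minimum length of a substring that the induction step tries to split
    """
    n = len(word)

    def prob(s):                      # Formula (3): maximum likelihood estimate P(s | C)
        return freq.get(s, 0) / total

    delta = {}                        # delta[(p, q)] = probability of the best segmentation of word[p:q+1]
    sigma = {}                        # sigma[(p, q)] = the best segmentation itself (list of segments)

    for length in range(1, n + 1):    # bottom-up over substring lengths
        for p in range(0, n - length + 1):
            q = p + length - 1
            s = word[p:q + 1]
            if length < K:            # basis step: short substrings are never split (constraint 1)
                delta[(p, q)] = prob(s)
                sigma[(p, q)] = [s]
            else:                     # induction step: find the best split point r*
                best_r, best_beta = None, 0.0
                for r in range(p, q):
                    beta = delta[(p, r)] * delta[(r + 1, q)]
                    if beta > best_beta:
                        best_r, best_beta = r, beta
                if best_beta == 0.0:  # no split supported by the dictionary: keep s whole
                    delta[(p, q)] = prob(s)
                    sigma[(p, q)] = [s]
                else:                 # a supported split exists: never keep s whole (constraint 2)
                    delta[(p, q)] = best_beta
                    sigma[(p, q)] = sigma[(p, best_r)] + sigma[(best_r + 1, q)]

    return sigma[(0, n - 1)], delta[(0, n - 1)]   # S* and P(S* | C), Formula (4)
```

Applied to the 5-character example of the next subsection (with each Hangul syllable treated as one character) and the corresponding frequency table, this routine would reproduce the trace shown in Fig. 2.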

Example

To exemplify the proposed segmentation algorithm, suppose that we want to segment the compound noun kwuk-cey-wen-yu-ka (an international crude oil price), which occurs in corpus C. We also assume that a collection dictionary has already been built from C. To obtain the most probable segmentation of $c_{15}$ = kwuk-cey-wen-yu-ka, we calculate $\delta$(kwuk-cey-wen-yu-ka | C) using Formula (2). Note that the recurrence relation in Formula (2) is implemented with a dynamic programming technique, so the segmentation process proceeds in a bottom-up manner. Figure 2 shows all intermediate values used to compute $\delta$(kwuk-cey-wen-yu-ka | C), with K=3 in Formula (2).

Fig. 2 Calculation of δ(kwuk-cey-wen-yu-ka | C) and σ(kwuk-cey-wen-yu-ka) with K=3

The bottom two rows of Fig. 2 are computed by the basis step of Formula (2), and the other rows by the induction step. For example, for the substring $c_{34}$ = wen-yu (crude oil), $\delta$(wen-yu | C) is calculated as P(wen-yu | C) using the basis step, because the length of $c_{34}$ is 2 (= 4 − 3 + 1), which is less than K (= 3). For the substring $c_{35}$ = wen-yu-ka, however, the induction step fires. First, $r^*$ is set to 4 by the following calculations.

$$\displaylines{
r^* = \mathop{{\rm arg}\;{\rm max}}\limits_{3 \le r \le 5 - 1}\; \beta(c_{35}, r, C) \cr
\beta(c_{35}, 3, C) = \delta(c_{33}\,|\,C)\,\delta(c_{45}\,|\,C) = 3.59 \times 10^{-3} \times 5.00 \times 10^{-5} = 1.79 \times 10^{-7} \cr
\beta(c_{35}, 4, C) = \delta(c_{34}\,|\,C)\,\delta(c_{55}\,|\,C) = 1.50 \times 10^{-4} \times 9.62 \times 10^{-3} = 1.44 \times 10^{-6}
}$$

Next, $\delta$(wen-yu-ka | C) is set to $\beta(c_{35}, 4, C)$. During the calculation of $\delta$(kwuk-cey-wen-yu-ka | C), $\sigma$(kwuk-cey-wen-yu-ka) is accumulated simultaneously, as shown in Fig. 2. For our example compound noun, the best segmentation candidate is as follows.

$$\displaylines{{\boldmath \sigma} ({\it kwuk\hbox{-}cey\hbox{-}wen\hbox{-}yu\hbox{-}ka}) \cr \quad= \sigma ({\it kwuk\hbox{-}cey}) + {\boldmath \sigma} ({\it wen\hbox{-}yu\hbox{-}ka}) \cr \quad= {\boldmath \sigma} ({\it kwuk\hbox{-}cey}) + {\boldmath \sigma} ({\it wen\hbox{-}yu}) + {\boldmath \sigma} (ka) \cr \quad= {\it kwuk\hbox{-}cey} + {\it wen\hbox{-}yu} + ka }$$

In this case, the decomposition kwuk-cey (international) + wen-yu (crude oil) + ka (price) is correct.

Segmentation evaluation

Corpus, collection dictionary, and test set

In order to test our segmentation algorithm, we used the NTCIR-3 Korean corpus (Chen et al., 2002) as a test collection. The corpus is a collection of 66,146 newspaper articles published by the Korean Economic Daily in 1994. From this corpus, a set of stems was automatically generated from Eojeols using the dictionary creation method described in Section 3.1. As a result, a total of 184,456 stems constituted our collection dictionary, as shown in Table 1, where the numbers of Eojeols and stems were counted only for pure Korean Eojeols written in Hangul, excluding English words, numbers, and strings consisting of special characters. To compare word-based segmentation clues with character-based ones, we also created an n-gram dictionary composed of all-length n-grams extracted from all Eojeols.

Table 1 A collection dictionary and an n-gram dictionary for segmentation evaluation
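As a point of reference, the following sketch shows one way such an all-length n-gram dictionary can be collected from Eojeols (an illustration of the idea, not the exact extraction code used here).

```python
from collections import Counter

def all_length_ngrams(eojeols, max_n=None):
    """Count every contiguous character n-gram (all lengths) over a list of Eojeols."""
    ngrams = Counter()
    for eojeol in eojeols:
        limit = len(eojeol) if max_n is None else min(max_n, len(eojeol))
        for n in range(1, limit + 1):
            for i in range(len(eojeol) - n + 1):
                ngrams[eojeol[i:i + n]] += 1
    return ngrams
```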

To create a test set, compound nouns were selected from the same corpus, since our task is to segment all compound nouns within a particular corpus. First, we randomly selected 886 documents from the corpus and extracted from them a total of 6,079 distinct compound nouns of more than 3 characters. The extracted compound nouns were then manually segmented to create our test set. For each test compound noun, we created two types of answer set: rigid and relaxed. Each answer set consists of segmentation answers that are semantically reasonable. A rigid answer set has a single segmentation answer, while a relaxed answer set may have multiple segmentation answers. Segmentation answers in rigid answer sets are composed only of simple nouns and affixes; that is, all derived nouns in rigid answer sets were further decomposed into simple nouns and derivational affixes. A relaxed answer set was obtained from its rigid answer set by adding other segmentation answers in which simple derived nouns are considered legal components; a relaxed answer set therefore includes its rigid answer set. By a simple derived noun, we mean a noun consisting of a simple noun and an affix.

For example, for a compound noun pwuk-han-oy-mwu-pwu-tay-pyen-in (the spokesperson for the Ministry of North Korea Foreign Affairs), two kinds of answer sets are as follows.

  • Rigid answer set:
    { pwuk-han (North Korea, a simple noun) + oy-mwu (foreign affairs, a simple noun) + pwu (ministry, a suffix) + tay-pyen (speaking for, a simple noun) + in (person, a suffix) }

  • Relaxed answer set:
    { pwuk-han + oy-mwu + pwu + tay-pyen + in,
      pwuk-han + oy-mwu-pwu (the Ministry of Foreign Affairs) + tay-pyen + in,
      pwuk-han + oy-mwu + pwu + tay-pyen-in (spokesperson),
      pwuk-han + oy-mwu-pwu + tay-pyen-in }

In the above, pwu and in are suffixes. Compared to the rigid answer set, the relaxed answer set includes three additional segmentation answers in which the two simple derived nouns oy-mwu-pwu and tay-pyen-in are not divided. Note that the relaxed answer set does not include further segmentation answers composed of complex derived nouns, such as pwuk-han-oy-mwu-pwu (the Ministry of North Korea Foreign Affairs) or oy-mwu-pwu-tay-pyen-in (the spokesperson for the Ministry of Foreign Affairs).

Generally, simple derived words are not further decomposed in compound noun segmentation. From the viewpoint of keyword extraction for information retrieval, however, it can be crucial to identify the stems of simple derived words to alleviate the word mismatch problem. Since the goal of this study is to develop a compound noun segmentation algorithm for Korean information retrieval, we investigate which answer set is more appropriate for segmentation evaluation for IR.

Evaluation measures

To evaluate compound noun segmentation, we define two types of metrics: compound-noun-level precision and segment-level recall/precision. The compound-noun-level measure calculates how many compound nouns are correctly segmented, and it is defined as follows.

$$ c\hbox{\it Precision} = \frac{{\sum_{w \in \hbox{\it TestWords}} {|\hbox{\it Answer}(w) \cap \hbox{\it Output}(w)|} }}{{|\hbox{\it TestWords}|}} $$

In the above, TestWords is the set of test compound nouns. Answer(w) is the answer set for test word w. Output(w) is the set of segmentation results for test word w generated by a compound noun segmentation program; in this paper, Output(w) contains only the most likely segmentation result for test word w. A compound noun is considered correctly segmented when all of its segments are correct.

Segment-level measures are applied to segments to calculate how many segments are correctly produced, and they are defined as follows. In the following, Segment(X) is a set of segments extracted from X.

$$\displaylines{
s{\it Recall} = \frac{\sum_{w \in {\it TestWords}} |{\it Segment}({\it Answer}(w)) \cap {\it Segment}({\it Output}(w))|}{\sum_{w \in {\it TestWords}} |{\it Segment}({\it Answer}(w))|} \cr
s{\it Precision} = \frac{\sum_{w \in {\it TestWords}} |{\it Segment}({\it Answer}(w)) \cap {\it Segment}({\it Output}(w))|}{\sum_{w \in {\it TestWords}} |{\it Segment}({\it Output}(w))|}
}$$

Note that for each of cPrecision, sRecall, and sPrecision, its rigid and relaxed versions are defined according to whether Answer(w) in the above measures uses a rigid answer set or a relaxed answer set described in Section 4.1.
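A minimal sketch of these measures is given below, assuming `answers` maps each test word to its answer set (a list of segmentations) and `outputs` maps each test word to the single segmentation produced by the system. Treating Segment(·) over a multi-answer relaxed set as the union of segments across its answers is our reading of the definitions above.

```python
def c_precision(answers, outputs):
    """Compound-noun-level precision: fraction of test words whose produced
    segmentation appears in the (rigid or relaxed) answer set."""
    correct = sum(1 for w, ans in answers.items()
                  if tuple(outputs[w]) in {tuple(a) for a in ans})
    return correct / len(answers)

def segments(list_of_segmentations):
    """Segment(X): the set of segments occurring in X (union over all
    segmentations in X; rigid sets hold a single segmentation)."""
    return {seg for segmentation in list_of_segmentations for seg in segmentation}

def s_recall_precision(answers, outputs):
    """Segment-level recall and precision, summed over all test words."""
    hit = ans_total = out_total = 0
    for w, ans in answers.items():
        a_segs = segments(ans)
        o_segs = segments([outputs[w]])
        hit += len(a_segs & o_segs)
        ans_total += len(a_segs)
        out_total += len(o_segs)
    return hit / ans_total, hit / out_total
```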

Evaluation results

As a baseline for compound noun segmentation, we used the commonly used left-to-right longest-matching algorithm, which iteratively selects the leftmost longest dictionary word as a segment. For example, for the compound noun se-wul-tay-kong-wen, which has two segmentation candidates, (1) se-wul (Seoul) + tay-kong-wen (a large park) and (2) se-wul-tay (Seoul National University) + kong-wen (a park), the left-to-right longest-matching algorithm selects the latter. As the manual dictionary for the baseline system, we used a list of 94,482 words extracted from the dictionary of our laboratory's Korean morphological analyzer (Kwon et al., 1997).
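For reference, a minimal sketch of this baseline follows; the one-character fallback for unmatched positions is our assumption, since the article does not specify how characters not covered by the dictionary are handled.

```python
def longest_match_segment(word, dictionary):
    """Left-to-right longest matching: repeatedly take the longest dictionary
    entry starting at the current position."""
    segments, i = [], 0
    while i < len(word):
        match = None
        for j in range(len(word), i, -1):   # try the longest remaining prefix first
            if word[i:j] in dictionary:
                match = word[i:j]
                break
        if match is None:
            match = word[i]                  # unknown character: emit it as a 1-character segment
        segments.append(match)
        i += len(match)
    return segments
```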

Tables 2 and 3 show the results of the compound-noun-level and segment-level evaluations. Base-colldic and base-mandic denote the longest-matching algorithm using a collection dictionary and a manual dictionary, respectively. Seg-c-n and Seg-w-n denote segmentation based on Formula (2) with K=n, using an n-gram dictionary and a collection dictionary, respectively. The longest-matching algorithm performed much better with the manual dictionary than with the collection dictionary. This partly results from the fact that the collection dictionary contains many long, unsegmented compound nouns, since its entries were automatically gathered from the corpus. Indeed, base-colldic created fewer segments (12,520) than base-mandic (14,367), producing many long segments that need to be decomposed further.

Table 2 Compound-noun-level evaluation (two figures of each cell correspond to cPrecision using rigid and relaxed answer sets for each of which the best is in bold. N is the value of K in Formula (2))
Table 3 Segment-level evaluation (For each column, the best is in bold)

Our unsupervised segmentation algorithm performed best when K was 3 or 4, outperforming the longest-matching algorithm. The parameter K in Formula (2) controls the minimum character length of character strings to be decomposed, so K can be viewed as the average character length of the words that need to be segmented. Considering that the average character length of Korean simple words is two or three, the results in Table 2 are reasonable. cPrecisionRigid reached its best accuracy when K was 3, since in rigid answer sets all derived words (mostly of 3-character length) were further decomposed. cPrecisionRelaxed reached its best accuracy when K was 4, since at K=4 the proposed algorithm does not decompose many 3-character words that are considered legal segments in relaxed answer sets but not in rigid ones. When K was 2, compound nouns were unnecessarily divided into many 1-character segments. When K was 5, compound nouns retained long unsegmented parts that needed to be decomposed. Table 4 shows the details of the evaluations for Seg-w-3 using rigid answer sets and Seg-w-4 using relaxed answer sets.

Table 4 Details of compound-noun-level evaluation for Seg-w-3 and Seg-w-4

In terms of selecting the value of parameter K for document indexing in information retrieval, the results of Tables 2 and 3 are not consistent: rigid evaluation recommends K=3, while relaxed evaluation recommends K=4. From the viewpoint of document representation in information retrieval, segment-level recall can stand for the completeness of the document term space, and segment-level precision for its correctness. Table 3 thus suggests K=3 for information retrieval in terms of completeness of the term space, and K=4 in terms of correctness. Considering the term mismatch problem, K=3 would be more favorable. Moreover, the F-measure suggests K=3, though the difference is marginal. The actual situation in Korean IR is presented in the next section.

The difference between Seg-c-3 and Seg-w-3 in Tables 2 and 3 implies that word-based segmentation clues of a collection dictionary are superior to character-based clues of an n-gram dictionary in terms of resolving segmentation ambiguity.

Application to Korean information retrieval

Experimental setup

We evaluate the retrieval effectiveness of the proposed method using three large Korean test collections: NTCIR-3, NTCIR-4, and HANTEC. The NTCIR-3 Korean test set (Chen et al., 2002) is composed of 66,146 documents and 30 topics. The NTCIR-4 Korean test set (Kishida et al., 2004) consists of 60 topics and 254,438 documents, which are newspaper articles published in 1998 and 1999. The HANTEC test collection (Myaeng et al., 1999) has 120,000 documents and 50 topics. Each NTCIR topic has four fields: title, description, concept, and narrative. Each HANTEC topic has four fields: title, description, query, and narrative. The concept field in NTCIR contains a list of keywords and is similar to the query field in HANTEC. We evaluated our IR systems using each of the topic fields in order to assess retrieval effectiveness over a diverse range of query lengths.

In NTCIR, human assessors categorize the relevance of each document in a pool of selected documents into four categories: "Highly Relevant", "Relevant", "Partially Relevant", and "Irrelevant". However, since the well-known IR scoring program TREC_EVAL adopts binary relevance, the NTCIR organizers mapped the four categories into two relevance sets: rigid relevance and relaxed relevance. Rigid relevance considers "Highly Relevant" and "Relevant" as relevant, while relaxed relevance regards "Highly Relevant", "Relevant", and "Partially Relevant" as relevant. This paper used relaxed relevance as the relevance judgments for the NTCIR collections.

To create the relevance judgments of HANTEC, two human assessors assigned each document a five-degree relevance score, where a higher score means higher relevance. There are 10 types of relevance judgments: G1, G2, ..., G5, L1, L2, ..., and L5. G or L indicates whether the lower or the higher of the two assessors' scores is used as the relevance score, and each of 1, 2, ..., 5 indicates the cut-off score for creating binary relevance. For example, L2 regards a score of 2 or higher as relevant, using the lower of the two assessors' scores. In this paper, L2 was used as the relevance judgments of the HANTEC collection.
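As an illustration of this graded-to-binary conversion (the mapping of L to the lower score follows the L2 example above; the exact HANTEC naming convention may differ):

```python
def binary_relevance(score_a, score_b, cutoff, use_lower=True):
    """Map two assessors' 5-point scores to a binary judgment.
    use_lower=True with cutoff=2 corresponds to the 'L2' setting as described above."""
    combined = min(score_a, score_b) if use_lower else max(score_a, score_b)
    return combined >= cutoff
```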

For each test document collection, a collection dictionary was constructed using the method described in Section 3.1. Next, our segmentation algorithm from Section 3.2 was applied to each entry of the collection dictionary to produce its segmentation, and each resulting segment was used as an index term for document indexing. Query term extraction was essentially the same: for each Eojeol in a query, its longest inflectional suffix was deleted to produce a query word w, which was segmented by the same algorithm using the collection dictionary; each segment of w then corresponds to a query term.

As the retrieval model, we used the language model with Jelinek-Mercer (JM) smoothing (Zhai and Lafferty, 2001), with the smoothing parameter λ set to 0.75. For statistical significance testing, the Wilcoxon signed rank test was used; the symbol '*' or '**' is attached to a retrieval result that is statistically significant at a significance level of 0.05 or 0.01, respectively. In addition, all retrieval results are reported as non-interpolated mean average precision (MAP), computed with the TREC_EVAL program.
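For reference, a minimal sketch of JM-smoothed query-likelihood scoring is shown below, using the standard formulation from Zhai and Lafferty (2001); the index structures (doc_tf, coll_tf) and the reading of λ as the weight on the collection model are our assumptions, not details given in this article.

```python
import math

def jm_score(query_terms, doc_tf, doc_len, coll_tf, coll_len, lam=0.75):
    """Query-likelihood score with Jelinek-Mercer smoothing:
    p(t|d) = (1 - lam) * tf(t,d)/|d| + lam * cf(t)/|C|  (Zhai and Lafferty, 2001)."""
    score = 0.0
    for t in query_terms:
        p_doc = doc_tf.get(t, 0) / doc_len if doc_len else 0.0
        p_coll = coll_tf.get(t, 0) / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p > 0:
            score += math.log(p)   # sum of log term probabilities over the query
    return score
```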

Retrieval results

To see the effectiveness of the proposed algorithm on the NTCIR-4 collection, we compared it with other index units, as shown in Table 5. Eojeol-based, stem-based, and character-based denote retrieval systems using Eojeols, stems, and character unigrams as index terms, respectively. As described earlier, stems are the entries of the collection dictionary. The character-based model is equivalent to using a segmentation system that divides an n-character stem into n characters, and it indicates the lower bound on IR performance obtainable from segmentation. The other retrieval models in Table 5 are segment-based retrieval systems whose index terms are the segments produced by the segmentation systems of Table 3. In Table 5, the average term length, the number of term types, and the average term frequency were calculated only for pure Korean Eojeols written in Hangul, excluding English words, numbers, and strings consisting of special characters.

Table 5 Retrieval results for the NTCIR-4 collection (Avg. is the average over 4 different query types. % indicates the improvement over base-mandic)

In Table 5, the difference between Eojeol-based and stem-based retrieval confirms that deleting inflectional affixes from Eojeols substantially helps to extract content terms from documents in agglutinative languages like Korean. The fact that segmenting stems with a longest-matching algorithm outperforms stem-based retrieval implies that many stems in documents are compound terms that need to be decomposed further. However, dividing stems into single characters was not a proper strategy. Base-colldic slightly under-performed base-mandic. Unlike in the segmentation evaluation of Section 4.3, however, base-colldic sliced more long stems than base-mandic, obtaining a smaller average term length. As shown in Fig. 3, base-mandic left more long stems of the NTCIR-4 collection unsegmented than base-colldic did, because our manual dictionary lacks many collection-oriented terms found in the NTCIR-4 collection.

Fig. 3 Length distribution of term types of various segmentation techniques (base-mandic, base-colldic, and seg-w-3 denote index terms of the NTCIR-4 collection segmented by each method; Eojeol and stem denote Eojeols and stems of the NTCIR-4 collection)

The proposed segmentation performed best with K=3. As shown in Figs. 3 and 4, Seg-w-3 divided most stems of more than 2 characters into 1- or 2-character terms, while the longest-matching algorithm produced many 3-character terms as segments. This means that stems of 3 or more characters should be further decomposed into shorter terms for Korean IR effectiveness. In terms of term space representation, segmentation that produces many longer compound terms creates a more specific, denser but less populated term space, while segmentation that produces many shorter terms creates a more general, sparser but more populated term space. Here, dense or sparse refers to the number of term types, and less or more populated refers to how many term tokens occur on average for each term type (compare the number of term types with the average term frequencies in Table 5). Longer terms may therefore suffer severely from the word mismatch problem, which means that short terms such as 1- or 2-character terms are more suitable for document representation, since they can alleviate the word mismatch problem of longer terms.

Fig. 4 Length distribution of term tokens of various segmentation techniques (base-mandic, base-colldic, and seg-w-3 denote index terms of the NTCIR-4 collection segmented by each method; Eojeol and stem denote Eojeols and stems of the NTCIR-4 collection)

Seg-w-3 showed an improvement of 9.6% on average over base-mandic, and the improvement was statistically significant except in the case of title queries. Unlike the longest-matching algorithm, our method relies on a collection dictionary gathered from the whole document collection to identify component words, and on the probabilities of component words in the target document collection to find the most likely segmentation. This implies that the proposed algorithm has the potential to slice stems into collection-oriented content terms that may be lacking in a manual dictionary. Compare stem (which corresponds to the NTCIR-4 collection dictionary) and the manual dictionary in Fig. 3: the collection dictionary has many unknown stems of 2 or 3 characters that are missing in the manual dictionary, even if we assume that all stems of more than 3 characters are compound terms.

For different values of K, the results of the retrieval experiments in Table 5 were roughly compatible with the rigid segmentation evaluation in Tables 2 and 3, but not with the relaxed segmentation evaluation. This implies that relaxed segmentation evaluation, which favors K=4, is not a proper single indicator for selecting the value of K in our segmentation algorithm. In addition, it is remarkable that differences in segmentation performance are not directly proportional to differences in retrieval effectiveness. For example, Seg-w-2 and Seg-w-5 obtained only 10.85% and 30.84% of Seg-w-3's F-measure under rigid segmentation evaluation, while they attained as much as 80.91% and 82.15% of Seg-w-3's mean average precision in retrieval effectiveness. The reason is as follows: segmentation performance is determined solely by how many legal component words or affixes are found, whereas retrieval performance is also affected by other factors such as co-occurrence of query terms and term weighting schemes. For example, Seg-w-2 divides kwuk-cey-wen-yu-ka (an international crude oil price) into kwuk, cey, wen, yu, and ka, which are all illegal segments, but using kwuk, cey, wen, yu, and ka as query terms could still retrieve all documents containing any legal component of kwuk-cey-wen-yu-ka.

The fact that Seg-c-3, which is based on character-based segmentation clues, performed worse than Seg-w-3, which uses word-based clues, implies that word-based segmentation clues are also superior to character-based ones in terms of IR effectiveness. Although Seg-c-3 did not employ any resources such as the compiled list of Korean suffixes, it substantially outperformed base-mandic. This means that the proposed algorithm is effective by itself, even without the suffix list.

To assess the robustness of our method, we repeated the retrieval experiments of Table 5 on two other Korean test collections: HANTEC and NTCIR-3. The retrieval results were roughly consistent with those of Table 5. As shown in Tables 6 and 7, the proposed algorithm showed the best performance with K=3, and outperformed the dictionary-based segmentation (base-mandic) by 4.72% and 23.37% for the HANTEC and NTCIR-3 collections, respectively.

Table 6 Retrieval results for the HANTEC collection (Avg. is the average over 4 different query types. % indicates the improvement over base-mandic)
Table 7 Retrieval results for the NTCIR-3 collection (Avg. is the average over 4 different query types. % indicates the improvement over base-mandic)

To see whether the retrieval system based on the proposed segmentation algorithm parallels or outperforms current best practice, we compared the performance of our method with those of the best systems at recent NTCIR official evaluations. Table 8 shows such a comparison for title and description queries. At the NTCIR-4 and NTCIR-5 evaluations, the submission of retrieval results using title queries was mandatory for all participants, and the use of description queries was mandatory at each of NTCIR-3, NTCIR-4, and NTCIR-5. For a fair comparison, the retrieval results of seg-w-3 in Table 8 were obtained after pseudo relevance feedback was performed; we used model-based feedback with the top 15 documents of the initial retrieval (Na et al., 2005). Table 8 shows that our method parallels the current NTCIR best systems. In fact, the NTCIR-5 best system was ours; it was based on the proposed segmentation algorithm and improved performance slightly with an additional combination approach.

Table 8 Comparison of the system based on the proposed algorithm with recent NTCIR best systems

Conclusion

In this paper, we have proposed an unsupervised approach to Korean compound noun segmentation. Compared with most previous unsupervised methods, our approach searches for the most likely segmentation candidate by considering all segmentation possibilities using word-based segmentation clues. To summarize: first, a collection dictionary is built automatically by generating stems from the Eojeols in a corpus, based on a set of complex suffixes; then, given a compound noun, its most probable segmentation candidate is determined by calculating the likelihood of each segmentation candidate based on the probabilities of its component words. Experiments showed that our segmentation algorithm is effective for Korean information retrieval.

The proposed algorithm segments Eojeols of the Korean language into words based purely on corpus statistics, without the need for a pre-compiled dictionary. This feature would be especially useful for processing domain-specific corpora such as patent documents and genomic documents, and we plan to apply our algorithm to patent retrieval and genomic data retrieval. In addition, we want to apply the algorithm to other languages; for example, segmenting Japanese Kanji sequences or Chinese noun-phrase chunks would be very similar to segmenting Korean compound nouns.