1 Introduction

Finding information about people on the World Wide Web is one of the most popular activities of Internet users. However, the major problem with personal names is that they are not unique, and a single name may have many variations. Variations may be caused by permutations (e.g., Simon Perez and Perez Simon might refer to the same person), abbreviations (e.g., Jan Maria Rokita may become J. M. Rokita), spelling mistakes (e.g., George Bush versus George Buhs), usage of accents and foreign characters (e.g., Schaeffer, Schaffer and Schäffer), different transcriptions (e.g., Jakub, Jacob and Giacomo may refer to the same person), postfixes (e.g., names may end with a title like Jr. or a number, as in John Paul II versus John Paul), declension paradigm (e.g., Władimirze Putinie might be a locative form of Władimir Putin in Polish), and other factors. In a multilingual data repository like the Web, the number of variants for a single person name may quickly rise to a couple of hundred (Pouliquen and Steinberger 2009).

The task of person name matching is to find synonymous and homonymous personal names in a given dataset, in particular, on the Web. Various research communities, ranging from artificial intelligence to databases, have reported a large body of work on this problem, under a variety of terms such as name disambiguation (On et al. 2005; Li et al. 2004; Cohen et al. 2003a), record linkage (Fellegi and Sunter 1969), duplicate detection (Elmagarmid et al. 2007; Bilenko and Mooney 2003), or merge/purge (Hernandez and Stolfo 1995). Up to now, research in this area has focused mainly on English texts (Agirre et al. 2007; Cucerzan 2007) and a few other major languages. Nevertheless, even considering only English web pages, most commercial search engines frequently return (for a given person name search query) either a blend of links to pages referring to different people who share the same name (e.g., Michael Jordan), or just a small fraction of all pages referring to the person of interest. This is mainly due to the aforementioned types of potential name variations and the fact that a significant number of person names in most languages are not unique. Some attempts at alleviating the web retrieval problem (including web person search) consist of applying linguistic analysis in the process of text normalization (Vilares et al. 2004; Ntoulas et al. 2001).

In this article, we explore knowledge-poor methods for supporting and tackling the (full) person name matching task in Polish, a less-studied language with particularly rich inflection and complex person name declension. In particular, the proposed methods utilize mainly well-established string distance metrics, some new variants thereof, and automatically acquired suffix-based lemmatization patterns. Furthermore, we investigate whether better accuracy can be obtained by merging different techniques, e.g., combining string distance metrics with lemmatization patterns. Finally, we also deploy techniques that utilize the local context in which person names appear. We describe the results of numerous experiments carried out on a Polish person-name dataset extracted from a web news corpus. We believe that the results presented in this article could be of importance to solving the same problem for other highly inflectional languages, e.g., for most other Slavonic languages (over 400 million speakers) or languages exhibiting similar phenomena, e.g., Ugro-Finnish.

Our work was mainly inspired by the comprehensive studies on using string distance metrics for name matching tasks presented in Cohen et al. (2003a, b) and Christen (2006). The main motivation for carrying out this research is the fact that processing highly inflectional languages adds another complication to the person name matching task. The intuitive way of tackling the inflection problem in Polish and other languages with a similar inflection paradigm would be to lemmatize person names and then to apply string-distance techniques, which turned out to work well for inflection-poor languages like English. One could argue that the set of inflectional suffixes of names in Polish is finite and that describing the combinatorial constraints between such suffixes and the corresponding stems is not out of reach. Unfortunately, the person name declension paradigm in Polish is extremely complex and knowledge-intensive. Accuracy figures of the reported knowledge-based lemmatization systems do not exceed 76%.

As reported by other authors, inflection in name-matching tasks has been dealt with in two ways. The first approach proceeds in two steps. In the first step, names are converted into some kind of canonical form, either by stripping off inflectional suffixes (Coates-Stephens 1992) or by truncating all the letters after the first k letters of a name (Klementiev and Roth 2006), and by normalizing language-specific diacritics (e.g., converting ‘ä’ into ‘a’ in German). Truncation is motivated by the fact that in most languages the inflections are affixed to the end of a word stem, with some possible minor alternation of the stem at the junction. In the second step, the canonical forms are matched by using conventional techniques, e.g., string distance metrics (Cohen et al. 2003a, b). The second approach, reported for instance in Steinberger and Pouliquen (2007), is based on generating all possible inflected morphological variants of a given person name in order to capture all the potential named mentions of the same person. Although over-generation of inflected forms does not pose a problem (non-existing names would not be matched), such an approach requires that the base forms are known, which in general might not be the case.
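The canonicalization step of the first approach can be illustrated with a minimal Python sketch. The helper names `strip_diacritics` and `canonical_form` and the cut-off `k = 5` are illustrative assumptions and are not taken from the cited systems.

```python
import unicodedata


def strip_diacritics(name: str) -> str:
    """Drop combining marks after canonical decomposition (e.g. 'ä' -> 'a').

    Letters without a Unicode decomposition, such as Polish 'ł', are kept as-is.
    """
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def canonical_form(name: str, k: int = 5) -> str:
    """Keep at most the first k letters of every token (k is an assumed cut-off)."""
    tokens = strip_diacritics(name).split()
    return " ".join(token[:k] for token in tokens)


# Both the base form and the genitive form collapse to the same canonical string.
print(canonical_form("Marian Kowalski"))
print(canonical_form("Mariana Kowalskiego"))
```

Such truncation is deliberately crude: it merges inflected variants, but it can also merge distinct names that share a long prefix.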

The aim of the work described in this article was not to provide a fully fledged solution to person name matching and lemmatization for Polish and related languages, but to explore whether the application of knowledge-poor and approximate methods based on string-distance metrics might be useful in the whole process of name matching. In particular, it can be seen as complementary to the previous work mentioned earlier in this article, i.e., Steinberger and Pouliquen (2007), Cohen et al. (2003a), Coates-Stephens (1992) and Klementiev and Roth (2006). However, it is important to note that we did not consider full name matching including person name disambiguation. We considered only matching given full person names (possibly inflected) against a set of other names, which might be seen as the first step of name matching in textual collections, i.e., collecting documents which might refer to persons with the same name.

The organization of the article is as follows. First, in Sect. 2, we describe the phenomena which complicate person name declension in Polish. In Sect. 3, we briefly report on accuracy figures achieved by some knowledge-based systems for person name lemmatization for Polish, which demonstrates the difficulty of the task and shows that there is much room for improvement. Next, in Sect. 4, an overview of string distance metrics and their modifications, which were used in our study, is given. The test data, evaluation methodology and the results of numerous experiments on using string-distance metrics, statistically learned inflection suffixes, combination of these two techniques, and utilization of contextual information are described in Sects. 5 and 6. The results are discussed in Sect. 7. We conclude and present some perspectives for future work in Sect. 8.

2 Person names in Polish

Polish is a West Slavonic language with rich nominal inflection: nouns and adjectives are inflected for case, number and gender.

Similarly to common nouns, Polish person names undergo declension, but the inflectional paradigm is more complex. In general, both the first and the last name can be inflected, e.g., Marian Kowalski (nom.) versus Mariana Kowalskiego (gen./acc.). If the surname is also a regular word form, things get more complicated. Whether it can be inflected in such cases depends on several factors, e.g., on the gender of the first name and on the category (part-of-speech) and gender of the (common) word used as a surname. For instance, if the surname is a masculine noun, it is inflected only if the first name is also masculine. The declension of the male name Stanisław Polak (Polak, ‘Pole’, is a masculine male noun) and its variant with the female first name Stanisława given in Table 1 illustrates this phenomenon. If the surname is an adjective (e.g., Niski ‘short’, as opposed to ‘tall’), it is inflected (according to the adjectival paradigm) and agrees in gender with the first name, i.e., male and female last name forms are different (e.g., Niski ‘Short’ (masc.) versus Niska ‘Short’ (fem.)).

Table 1 Declension of the male name Stanisław Polak and its variant with the female first name Stanisława (left)

The declension of foreign surnames may strongly depend on their origin, and, in particular, on the pronunciation. For example, the name Wilde is pronounced differently in English and German, which impacts its declension in Polish. If it is of English origin, a nominal declension is applied, i.e., Wilde’a (gen.), whereas if it comes from German, an adjective-like declension is adopted: Wildego (gen.). Clearly, inferring the origin of a name from the surface string alone cannot be done accurately.

Declension of surnames which are also common nouns can be different from the declension of common nouns, e.g., the genitive form of the common noun gołąb ‘dove’ is gołębia, whereas the genitive form of the surname Gołąb is Gołąba. The full declension of the common noun gołąb and the surname Gołąb is given in Table 1.

First names present problems, too. Foreign masculine first names, whose pronounced version ends in a consonant or whose written version ends in -a, -o, -y or -i generally get inflected (e.g., Jacques (nom.) versus Jacques’a (gen./acc.)), whereas names whose pronounced version ends in a vowel and are stressed on the last syllable (e.g., François) usually do not change their form. For female first names created from a male first name there is a frequent homonymy between the nominative form of the female name and the genitive/accusative form of the corresponding male form, e.g., Józefa is nominative of Józefa (fem.) and genitive/accusative of Józef (masc.).

To give a final example of the complexities, consider the person name Marka Belki. The first name Marka could be either interpreted as a genitive form of the male name Marek or Mark (foreign version of Marek), or as a nominative form of a foreign female name Marka. As for the last name Belki, it is a genitive form of the common Polish noun belka (meaning ‘beam’ in English), but due to the fact that inflection of proper names differs from that of common nouns, we cannot exclude the special proper name form Belki. Consequently, there are six potential base forms for Marka Belki, namely: Marek Belka (masc.), Marka Belka (fem.), Marek Belki (masc.), Marka Belki (fem.), Mark Belki (masc.), Mark Belka (masc.). Even considering the document-level context of the occurrence of the name Marka Belki might not be sufficient for resolving the base form ambiguity (Piskorski 2005). Intuitively, the number of ambiguities could be reduced through utilization of case information (if provided). Nevertheless, in the above example, only the combination of Marka (nom. fem.) with Belki (gen. masc.) could be possibly excluded in this way, still leaving all other five interpretations open.
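The six candidate base forms listed above are simply the Cartesian product of the possible lemmas of each token. The short Python sketch below reproduces that enumeration; the two lemma lists are copied from the example in the text, and the variable names are illustrative.

```python
from itertools import product

# Candidate lemmas for each token of the inflected name "Marka Belki",
# as enumerated in the text above.
first_name_lemmas = ["Marek", "Marka", "Mark"]
surname_lemmas = ["Belka", "Belki"]

# Every combination of a first-name lemma with a surname lemma is a
# potential base form of the full name.
candidates = [f"{first} {last}" for first, last in product(first_name_lemmas, surname_lemmas)]

for candidate in candidates:
    print(candidate)
```

The number of candidates grows multiplicatively with the ambiguity of each token, which is why even two-token names can yield several competing base forms.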

A comprehensive overview of this rather intriguing declension paradigm of Polish names is given in Grzenia (1998). Our empirical observations based on corpus analysis revealed that circa 13% of the person names occurring in Polish news articles are morphological variants, whose surface representations are different from the corresponding base forms.

3 Person name lemmatization with knowledge-based systems

We have carried out some initial experiments on applying existing knowledge-based systems for lemmatization of person names.

In our first experiment, we tested Stempelator (Weiss 2005), a full-form lexicon-based lemmatizer, which uses a set of heuristics for guessing base forms of words not found in the lexicon. It returns one or more potential base forms for a given word. We applied Stempelator to each part of the name (first name, surname) separately. Although Stempelator performs relatively well for common words (around 70% in the case of a subset of the IPI PAN Corpus (Przepiórkowski 2005), 90–95% in the case of manually created test data), the accuracy achieved on the datasets described in Sect. 5 was not better than 35% if one considers only the first result returned for the first name and surname. Even if one considers all combinations of all the results returned by Stempelator for each separate token (first name and surname), the correct base form could be found for no more than 61% of the names. Clearly, this leaves a lot of room for improvement.

In the second experiment, we tested a more complex and time-intensive system dedicated to person name recognition and person name lemmatization for Polish (Piskorski 2005), which exploits: (a) a dictionary of about 6000 of the most frequent Polish first names and their morphological variants, (b) a set of over 50,000 foreign first names, (c) a set of simple but effective patterns matching the most frequent surname suffixes to their corresponding base forms (e.g., \(skiego\rightarrow ski\)), and (d) a set of more sophisticated rules relying on higher-level linguistic information, which encode most of the types of phenomena described in Sect. 2. However, we did not consider the origin of the names since it is hard to guess. In order to evaluate this system, a set of 30 articles (containing 856 person names) on various topics (politics, finance, sports, culture and science) was randomly chosen from Rzeczpospolita (Weiss 2007), a major Polish newspaper. Of the recognized person names, only 75.6% were lemmatized correctly (the correct base form was in the set of candidate base forms returned by the system). It is important to note that for 12.4% of the recognized person names more than one base form was returned. Interestingly, the accuracy of this knowledge-intensive approach could potentially be improved through integration of sub-categorization lexica (e.g., verbs and their arguments) for ‘guessing’ the case of recognized named mentions of person names, which would most likely facilitate identification/disambiguation of the base form. Further, the utilization of the document-level context could be considered since both the base forms and inflected forms of the same person names might appear in the same document. A detailed description of this experiment is presented in Piskorski et al. (2007).

The observations made in these two experiments constituted our main motivation for studying whether string distance metrics, other knowledge-poor techniques, and the combination of such methods with systems like the first one mentioned in this section would yield comparable or better accuracy of lemmatization and person name variant matching for Polish.

4 String distance metrics

In our experiments on using string distance metrics for the name matching task and lemmatization we used mainly the metrics applied by the database community for record linkage. The point of departure is the well-known Levenshtein edit distance metric, given by the minimum number of character-level operations (insertion, deletion, or substitution) needed to transform one string into another (Levenshtein 1965). Furthermore, we used an extension of Levenshtein, namely the Smith–Waterman (SW) metric (Smith and Waterman 1981), which additionally allows for a variable cost of a gap and a variable cost of substitutions (mapping each pair of symbols from the alphabet to some cost). We tested two settings for this metric: one which normalizes the Smith–Waterman score with the length of the shorter string, and one which uses for the same purpose the Dice coefficient, i.e., the average length of the strings compared (SW-D). Further variants of the Smith–Waterman metric and other edit distance metrics, e.g., Needleman–Wunsch, were not taken into consideration since in our prior experiments (Piskorski et al. 2007) they did not perform better than the Smith–Waterman metrics. In general, most of the edit-distance metrics can be computed in O(|s| · |t|), where s and t are the two strings being compared.
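As a reference point for the edit-distance family, the following Python sketch computes the Levenshtein distance with unit costs using the standard dynamic-programming recurrence in O(|s| · |t|) time. It is a generic textbook formulation, not the implementation used in the experiments.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # prev[j] holds the distance between the processed prefix of s and t[:j].
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]  # distance between s[:i] and the empty prefix of t
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,             # delete a from s
                cur[j - 1] + 1,          # insert b into s
                prev[j - 1] + (a != b),  # substitute a by b (free if equal)
            ))
        prev = cur
    return prev[-1]


print(levenshtein("George Bush", "George Buhs"))
print(levenshtein("Belki", "Belka"))
```

Note that a swap of two adjacent characters, as in Bush versus Buhs, costs two operations under this model because transposition is not a primitive operation.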

Good results for name-matching tasks (Cohen et al. 2003a) have been reported using variants of the Jaro metric (Jaro 1989; Winkler 1999), which is not based on the edit-distance model. It considers the number and the order of the common characters between the two strings being compared. More precisely, given two strings \(s = a_1 \ldots a_K\) and \(t = b_1 \ldots b_L\), we say that \(a_i\) in s is common with t if there is a \(b_j = a_i\) in t such that \(i-R {\leq}j {\leq}i+R\), where \(R = \left\lfloor \max(|s|,|t|)/2 \right\rfloor - 1\). Furthermore, let \(s^{\prime} = a_1^{\prime} \ldots a_{K^{\prime}}^{\prime}\) be the characters in s which are common with t (with preserved order of appearance in s) and let \(t^{\prime} = b_1^{\prime} \ldots b_{L^{\prime}}^{\prime}\) be defined analogously. A transposition for s′ and t′ is defined as any position i such that \(a_i^{\prime} \neq b_i^{\prime}\). Let us denote the number of transpositions for s′ and t′ as \(T_{s^{\prime},t^{\prime}}\). The Jaro similarity is then calculated as:

$$ J(s, t) = \frac{1}{3} \cdot \left(\frac{|s^{\prime}|}{|s|} + \frac{|t^{\prime}|}{|t|} + \frac{|s^{\prime}| - \left\lfloor T_{s^{\prime},t^{\prime}}/2 \right\rfloor}{|s^{\prime}|}\right) $$
(1)

The Winkler variant of the Jaro metric boosts this similarity for strings with agreeing initial characters and is calculated as:

$$ JW(s, t) = J(s, t) + \delta \cdot boost_{p}(s, t) \cdot (1-J(s, t)) $$
(2)

where δ denotes the common prefix adjustment factor (default value is 0.1) and \(boost_{p}(s, t) = \min(|lcp(s, t)|, p)\). Here lcp(s,t) denotes the longest common prefix between s and t. Further, p stands for the upper bound of |lcp(s,t)|, i.e., once lcp(s,t) reaches a certain length the ‘boost value’ remains the same. For multi-token strings we extended \(boost_{p}\) to \(boost_{p}^{*}\). Let \(s = s_1 \ldots s_K\) and \(t = t_1 \ldots t_L\), where \(s_i\) (\(t_i\)) represents the i-th token of s (t, respectively), and let L ≤ K (without loss of generality). \(boost_{p}^{*}\) is calculated as:

$$ boost_{p}^{*}(s, t) = \frac{1}{L} \cdot \left( \left( \sum_{i=1}^{L-1} boost_{p}(s_i,t_i) \right) + boost_{p}(s_L,t_L \ldots t_K) \right) $$
(3)

We denote the metric which uses \(boost_{p}^{*}\) as JWM. The time complexity of ‘Jaro’ metrics is O(|s| · |t|).
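Equations (1) and (2) can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the text: common characters are paired greedily and one-to-one (the usual way the Jaro metric is implemented), and the prefix bound is set to p = 4, a conventional choice. The multi-token variant JWM is not covered.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity following Eq. (1), with greedy one-to-one matching (assumed)."""
    if not s or not t:
        return 0.0
    R = max(max(len(s), len(t)) // 2 - 1, 0)  # matching window
    used = [False] * len(t)
    s_common = []
    for i, a in enumerate(s):
        # Look for an unused occurrence of a in t within the window around i.
        for j in range(max(0, i - R), min(len(t), i + R + 1)):
            if not used[j] and t[j] == a:
                used[j] = True
                s_common.append(a)
                break
    if not s_common:
        return 0.0
    t_common = [b for j, b in enumerate(t) if used[j]]
    # Number of positions at which the two common-character sequences differ.
    T = sum(1 for a, b in zip(s_common, t_common) if a != b)
    c = len(s_common)
    return (c / len(s) + c / len(t) + (c - T // 2) / c) / 3


def jaro_winkler(s: str, t: str, delta: float = 0.1, p: int = 4) -> float:
    """Winkler variant following Eq. (2); p = 4 is an assumed default."""
    lcp = 0
    for a, b in zip(s, t):
        if a != b:
            break
        lcp += 1
    j = jaro(s, t)
    return j + delta * min(lcp, p) * (1 - j)


print(round(jaro_winkler("Kowalskiego", "Kowalski"), 4))
```

Because inflection in Polish mostly affects the end of a token, the prefix boost rewards exactly the part of the name that tends to stay stable across morphological variants.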

The q-gram metric (Ukkonen 1992) is based on the intuition that two strings are similar if they share a large number of character-level q-grams. We used a variant thereof, namely the so-called skip-gram metric (Keskustalo et al. 2003). It is based on the idea that, in addition to forming bigrams of adjacent characters, bigrams that skip characters are considered. Gram classes are defined that specify what kind of skip-grams are created, e.g., the {0,1} class means that normal bigrams are formed as well as bigrams that skip one character. This metric can be computed in O(max(|s|,|t|)). Our previous experiments showed that it outperforms the classic q-gram metric and other metrics based on character-level q-grams, e.g., positional q-grams, which take into account only common q-grams that occur within a maximum distance of each other (Gravano et al. 2001).
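The construction of skip-grams for a gram class such as {0,1} can be sketched as below. The text does not spell out how the two gram sets are turned into a score, so the Jaccard-style overlap used in `skip_gram_similarity` is an assumption of this sketch, and both function names are illustrative.

```python
def skip_grams(s: str, skips=(0, 1)) -> set:
    """Bigrams of s whose two characters are separated by k skipped characters,
    for every k in the gram class `skips` (k = 0 gives ordinary bigrams)."""
    return {s[i] + s[i + k + 1] for k in skips for i in range(len(s) - k - 1)}


def skip_gram_similarity(s: str, t: str, skips=(0, 1)) -> float:
    """Share of skip-grams common to both strings (Jaccard overlap, assumed)."""
    grams_s, grams_t = skip_grams(s, skips), skip_grams(t, skips)
    if not grams_s and not grams_t:
        return 1.0 if s == t else 0.0
    return len(grams_s & grams_t) / len(grams_s | grams_t)


print(sorted(skip_grams("abc")))
print(skip_gram_similarity("Belki", "Belka"))
```

Skipping characters makes the metric more tolerant of single-character substitutions inside a token than ordinary bigrams are.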

Given the declension paradigm of Polish, we also considered a basic and time-efficient metric based on the longest common prefix, which would intuitively perform well in the case of single-token names. It is calculated as: \(CP_{\delta}(s, t) = \frac{(|lcp(s, t)| + \delta(s, t))^2}{|s| \cdot |t|}\). The parameter δ(s,t) in \(CP_{\delta}(s, t)\) favors certain suffix pairs in s and t. We have experimented with two variants, \(CP_{\delta_{1}}\) and \(CP_{\delta_{2}}\). In \(CP_{\delta_{1}}\) the value of δ(s,t) is set to 0 for all s and t. In \(CP_{\delta_{2}}\), as a result of an empirical study of the data and the declension paradigm, δ(s,t) has been set to 1 if s ends with ‘o’, ‘y’, ‘ą’ or ‘ę’ and t ends with ‘a’. Otherwise δ(s,t) is set to 0. For coping with multi-token strings we introduced a new, similar metric called weighted longest common sub-strings distance (WLCS), a variant of the better-known longest common sub-strings distance metric (LCS), which recursively finds and removes the longest common sub-string in the two strings compared. Let lcs(s,t) denote the first longest common sub-string for s and t and let \(s_{-p}\) denote a string obtained by removing from s the first occurrence of p in s. The LCS metric is calculated as:

$$ LCS(s, t) = \left\{\begin{array}{ll}0 &\text{if \ } |lcs(s, t)| \leq \phi\\ |lcs(s, t)| + LCS(s_{-lcs(s, t)},t_{-lcs(s, t)}) &\text{otherwise} \end{array}\right. $$
(4)

The value of ϕ is usually set to 2 or 3. The time complexity of LCS is O(|s| · |t|). In the extended version, i.e., WLCS, an additional weighting to the |lcs(s,t)| is introduced. The main idea is to penalize longest common sub-strings which do not match the beginning of a token in at least one of the compared strings. Let α be the maximum number of non-white-space characters, which precede the first occurrence of lcs(s,t) in s or t. Then, lcs(s,t) is assigned the weight \((|lcs(s, t)|+\alpha-\hbox{max}(\alpha,p))/(|lcs(s, t)| + \alpha)\), where p has been experimentally set to 4.
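The common-prefix metric \(CP_{\delta}\) and the unweighted LCS metric of Eq. (4) can be sketched in Python as follows. The function names are illustrative, the LCS function returns the raw length sum exactly as in Eq. (4), and the additional WLCS weighting is deliberately left out of this sketch.

```python
def lcp_len(s: str, t: str) -> int:
    """Length of the longest common prefix of s and t."""
    n = 0
    for a, b in zip(s, t):
        if a != b:
            break
        n += 1
    return n


def delta_1(s: str, t: str) -> int:
    return 0


def delta_2(s: str, t: str) -> int:
    # 1 if s ends with 'o', 'y', 'ą' (U+0105) or 'ę' (U+0119) and t ends with 'a'.
    return int(s.endswith(("o", "y", "\u0105", "\u0119")) and t.endswith("a"))


def cp(s: str, t: str, delta=delta_1) -> float:
    """CP_delta(s, t) = (|lcp(s, t)| + delta(s, t))^2 / (|s| * |t|)."""
    if not s or not t:
        return 0.0
    return (lcp_len(s, t) + delta(s, t)) ** 2 / (len(s) * len(t))


def first_lcs(s: str, t: str) -> str:
    """First longest common contiguous sub-string of s and t."""
    best_len, best_end = 0, 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            if s[i - 1] == t[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return s[best_end - best_len:best_end]


def lcs(s: str, t: str, phi: int = 2) -> int:
    """Eq. (4): sum the lengths of successively removed common sub-strings."""
    p = first_lcs(s, t)
    if len(p) <= phi:
        return 0
    return len(p) + lcs(s.replace(p, "", 1), t.replace(p, "", 1), phi)


print(cp("Belki", "Belka"))
print(lcs("Marek Belka", "Belki Marka"))
```

In the second call the sub-strings "Belk" and "Mar" are removed in turn, which shows why LCS-style metrics are insensitive to the order of first name and surname.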

Finally, we tested the recursive schema known as Monge–Elkan (ME) distance (Monge and Elkan 1996). Let us assume that the strings s and t are broken into sub-strings (tokens), i.e., \(s = s_1 \ldots s_K\) and \(t = t_1 \ldots t_L\). The intuition behind the Monge–Elkan measure is the assumption that \(s_i\) in s corresponds to the \(t_j\) with which it has the highest similarity. The similarity between s and t equals the mean of these maximum scores. Formally, the Monge–Elkan metric is defined as follows, where sim denotes some secondary similarity function.

$$ ME(s, t) = \frac{1}{K} \cdot \sum_{i=1}^{K} \max_{j=1 \ldots L } sim(s_i,t_j) $$
(5)
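Equation (5) translates almost directly into code. In the Python sketch below, tokens are obtained by splitting on white space and the secondary similarity is a plain common-prefix score; both choices, as well as the function names, are assumptions made for illustration.

```python
def common_prefix_sim(a: str, b: str) -> float:
    """Secondary similarity used for illustration: |lcp(a, b)|^2 / (|a| * |b|)."""
    if not a or not b:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n * n / (len(a) * len(b))


def monge_elkan(s: str, t: str, sim=common_prefix_sim) -> float:
    """Eq. (5): mean over the tokens of s of the best score against any token of t."""
    s_tokens, t_tokens = s.split(), t.split()
    if not s_tokens or not t_tokens:
        return 0.0
    return sum(max(sim(a, b) for b in t_tokens) for a in s_tokens) / len(s_tokens)


print(monge_elkan("Marka Belki", "Marek Belka"))
print(monge_elkan("Belki Marka", "Marek Belka"))
```

Both calls give the same score because every token of s is free to pick its best partner in t, regardless of position. Note that the measure is not symmetric in general, since it is normalized by the number of tokens in s.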

Inspired by the multi-token variants of the JW metric presented in Christen (2006) we applied two additional metrics, which are similar in spirit to the Monge–Elkan metric. The first one, Sorted-Tokens (ST) is computed in two steps. Firstly, the tokens constituting the full strings are sorted alphabetically. Next, an arbitrary metric is applied to compute the similarity of the ‘sorted’ strings. The second metric, Permuted-Tokens (PT) compares all possible permutations of tokens constituting the full strings and returns the maximum calculated similarity value.
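The two token-reordering schemes can be sketched as wrappers around an arbitrary base metric. In the Python sketch below, `difflib.SequenceMatcher.ratio` from the standard library merely stands in for that base metric; it is not one of the metrics evaluated in this article, and the function names are illustrative.

```python
from difflib import SequenceMatcher
from itertools import permutations


def ratio(a: str, b: str) -> float:
    """Stand-in base metric (assumption): difflib's matching-blocks ratio."""
    return SequenceMatcher(None, a, b).ratio()


def sorted_tokens(s: str, t: str, sim=ratio) -> float:
    """Sorted-Tokens: sort the tokens alphabetically, then compare the strings."""
    return sim(" ".join(sorted(s.split())), " ".join(sorted(t.split())))


def permuted_tokens(s: str, t: str, sim=ratio) -> float:
    """Permuted-Tokens: best score over all token orders of both strings."""
    return max(
        sim(" ".join(p), " ".join(q))
        for p in permutations(s.split())
        for q in permutations(t.split())
    )


print(sorted_tokens("Belka Marek", "Marek Belka"))
print(permuted_tokens("Belki Marka", "Marek Belka"))
```

Permuted-Tokens is exhaustive and therefore factorial in the number of tokens, which is acceptable for two- or three-token person names but not for long strings.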

5 Test data and evaluation

This section describes the test data and evaluation methodology used in our experiments on using different techniques for the name matching (and lemmatization) task.

We define the problem as follows. Let A, B and C be three sets of strings over some alphabet Σ, with \(B \subseteq C\). Furthermore, let \(f\hbox{: }A \rightarrow B\) be a function representing a mapping of inflected forms into their corresponding base forms. Given A and C (the latter representing the search space), the task is to construct an approximation of f, namely \(\widehat{f}\hbox{: }A \rightarrow C\). If \(\widehat{f}(a) = f(a)\) for a ∈ A, we say that \(\widehat{f}\) returns a correct answer for a; otherwise, \(\widehat{f}\) is said to return an incorrect answer. We say that \(\widehat{f}\) returns a quasi-correct answer for a if \(\widehat{f}(a) = f(a)\) or \(f(\widehat{f}(a))=f(a)\) (the answer is the base form or another variant thereof).

Secondly, we define an additional task consisting of constructing another approximation of f, namely the function \(f^{\ast}\hbox{: }A \rightarrow 2^C\), where \(f^{\ast}\) is said to return a quasi-correct answer for a ∈ A if \(\forall a'\in f^{\ast}(a)\hbox{: } f(a)=a' \vee f(a) = f(a')\), i.e., \(f^{\ast}(a)\) contains only strings which are either the base form of a or a variant of a, e.g., a morphological variant.
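One simple way to obtain such approximations from a similarity metric is to return every element of the search space C that attains the top score for a. The Python sketch below does this; treating ties as a multi-result answer is an assumption of the sketch, and `best_matches` and `common_prefix_sim` are illustrative names.

```python
def common_prefix_sim(a: str, b: str) -> float:
    """Illustrative similarity: |lcp(a, b)|^2 / (|a| * |b|)."""
    if not a or not b:
        return 0.0
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n * n / (len(a) * len(b))


def best_matches(a: str, search_space, sim=common_prefix_sim) -> list:
    """All candidates in the search space that reach the maximum similarity to a."""
    scored = [(sim(a, c), c) for c in search_space]
    if not scored:
        return []
    top = max(score for score, _ in scored)
    return [c for score, c in scored if score == top]


print(best_matches("Belki", ["Belka", "Bela", "Marek"]))   # single-result answer
print(best_matches("Belki", ["Belka", "Belko", "Marek"]))  # multi-result answer
```

A single returned string plays the role of \(\widehat{f}(a)\), whereas a list with several entries corresponds to a multi-result answer in the sense of \(f^{\ast}(a)\).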

5.1 Test data

For the experiments we used two datasets: (a) a mapping of full person names (first name + surname) to their base forms (PFN-1) consisting of 1548 pairs, and (b) a variant of (a) with some hard-to-tackle cases, consisting of 1538 entries (PFN-2). The latter was obtained by taking the PFN-1 set and inverting the order of first name and surname in 1/3 of the cases chosen at random. The second modification was the replacement of some names that we expected to be easily distinguished from the others (due to a significantly different name prefix) with more ambiguous forms (sharing a long common prefix with other names in the set).

The PFN-1 resource was created semi-automatically as follows. We automatically extracted a list of 22,952 full person-name candidates from a corpus of 15,724 on-line news articles from the Rzeczpospolita corpus (Weiss 2007), by using a first-name lexicon consisting of over 6000 of the most popular Polish first names (including their morphological variants) and an additional list of 58,038 non-inflected foreign first names. Subsequently, we selected an excerpt of about 1900 entries (inflected forms) from this list: 1/3 of this excerpt are the most frequent names appearing in the corpus, 1/3 are the rarest names, and 1/3 of the entries were chosen randomly. Finally, this list was cleaned and duplicates were removed. The full set of the person name candidates was extended in order to include all base forms (22,064 entries) and was used as the search space in all the experiments.

5.2 Accuracy metrics

We measured the accuracy of various techniques in four ways. First, let s denote the number of strings for which a single result was returned. Analogously, m is the number of strings for which multiple results were returned. Next, let \(s_{c}\) (\(s_{qc}\)) denote the number of correct (quasi-correct) single-result answers returned. Furthermore, let \(m_{qc}\) denote the number of quasi-correct multi-result answers. The four accuracy metrics are: all-answer accuracy (AA), all-answer relaxed accuracy (AAR), single-result accuracy (SR) and relaxed accuracy (RA). They are computed as: \(AA = s_{c}/(s+m),\;AAR =s_{qc}/(s+m),\;SR = s_{c}/s\) and \(RA = (s_{qc}+m_{qc})/(s+m)\), respectively.

In the case of all-answer accuracy, we assume that a multi-result answer is incorrect; consequently, multi-result answers are penalized. The second measure, all-answer relaxed accuracy, is a relaxed variant of the AA accuracy, where quasi-correct answers are counted as true positives (the answer is either the base form or another variant of the name, e.g., an inflectional variant of the base form). The single-result accuracy measures solely the accuracy of single-result answers, i.e., multiple-result answers are disregarded. Finally, the most relaxed measure, called relaxed accuracy, is an extension of AAR. It treats a multi-result answer as a true positive if all of the returned results are quasi-correct (see the definition of \(f^{\ast}\) at the beginning of Sect. 5), i.e., the result set contains solely strings which are base forms or other variants of the given name.
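The four measures can be computed from a table of answers as in the Python sketch below. The data layout is an assumption of the sketch: `answers` maps each query to the non-empty list of returned strings, and `base_of` plays the role of f, mapping every known inflected form to its base form. The toy names are for illustration only.

```python
def accuracies(answers: dict, base_of: dict) -> dict:
    """Compute AA, AAR, SR and RA as defined in Sect. 5.2."""

    def quasi_correct(query: str, result: str) -> bool:
        # The result is the base form itself or another variant of the same name.
        return result == base_of[query] or base_of.get(result) == base_of[query]

    s = m = s_c = s_qc = m_qc = 0
    for query, results in answers.items():
        if not results:
            raise ValueError(f"no answer returned for {query!r}")
        if len(results) == 1:
            s += 1
            s_c += results[0] == base_of[query]
            s_qc += quasi_correct(query, results[0])
        else:
            m += 1
            m_qc += all(quasi_correct(query, r) for r in results)
    total = s + m
    return {
        "AA": s_c / total,
        "AAR": s_qc / total,
        "SR": s_c / s if s else 0.0,
        "RA": (s_qc + m_qc) / total,
    }


base_of = {
    "Marka Belki": "Marek Belka",
    "Marku Belce": "Marek Belka",
    "Jana Kowalskiego": "Jan Kowalski",
    "Janowi Kowalskiemu": "Jan Kowalski",
}
answers = {
    "Marka Belki": ["Marek Belka"],                           # correct
    "Marku Belce": ["Marka Belki"],                           # quasi-correct
    "Jana Kowalskiego": ["Jan Kowalski", "Marek Belka"],      # multi, rejected
    "Janowi Kowalskiemu": ["Jan Kowalski", "Jana Kowalskiego"],  # multi, quasi-correct
}
print(accuracies(answers, base_of))
```

On this toy input the gap between AA and RA comes entirely from the variant answer and the fully quasi-correct multi-result answer, which is exactly the distinction the relaxed measures are meant to capture.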

The SR and AA accuracy measures were basically defined for evaluating the usefulness of the explored string distance metrics and other techniques for performing lemmatization, whereas the intuition behind AAR and RA accuracy metrics was to measure the usability for the more general name matching task.

5.3 Statistical significance

Among other things, this paper compares the performance of various algorithms using four evaluation metrics on two different datasets. Since the problem studied in this paper is generally difficult, the differences in figures are often small and statistical significance tests are needed to support the conclusions. To achieve this, we applied the following approach.

For each pair of compared settings (an algorithm, its variant and a particular evaluation metric) on a given dataset, we created a set of N different random sub-samples of the dataset, each of size n (without repetitions), where N and n are parameters. After some experimentation the values N = 50 and n = 500 were chosen for the tests. Subsequently, both compared algorithms were run once on each of the N sub-samples. Thus, two samples of N observations each were recorded, one per compared performance metric.

For each observation sample in a pair we computed its mean. Let us denote the higher mean with Θ1 and the lower one with Θ2 (there were no ties). We tested the null hypothesis “Θ1 = Θ2” against the alternative hypothesis “Θ1 > Θ2” with the use of a one-tailed Welch’s t test, since the sample variances were slightly different. The only necessary assumption was that the distribution in each sample of observations was Gaussian (with unknown variance), which turned out to be a reasonable one after examining a couple of histograms.

Due to the low values of variance in observation samples, some differences in performance figures (reported in the next section) are statistically significant despite being quite small.
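The procedure of this section can be sketched with the Python standard library. The two evaluation callables, the function names and the fixed seed are assumptions of the sketch. It computes the Welch t statistic and the Welch–Satterthwaite degrees of freedom; turning these into a one-tailed p-value additionally requires the survival function of Student's t distribution, which is not part of the standard library.

```python
import random
from statistics import mean, variance


def paired_subsample_scores(dataset, eval_a, eval_b, N=50, n=500, seed=0):
    """Draw N random sub-samples of size n (without repetition inside a
    sub-sample) and score both compared settings on each of them."""
    rng = random.Random(seed)
    scores_a, scores_b = [], []
    for _ in range(N):
        sub = rng.sample(dataset, n)
        scores_a.append(eval_a(sub))
        scores_b.append(eval_b(sub))
    return scores_a, scores_b


def welch_t(x, y):
    """Welch's t statistic and its approximate degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


# Toy usage with hypothetical per-item correctness flags instead of real metrics.
data = [(i % 10 < 8, i % 10 < 7) for i in range(1548)]
a, b = paired_subsample_scores(
    data,
    eval_a=lambda sub: mean(item[0] for item in sub),
    eval_b=lambda sub: mean(item[1] for item in sub),
)
print(welch_t(a, b))
```

If SciPy is available, `scipy.stats.ttest_ind(a, b, equal_var=False, alternative="greater")` performs the same one-tailed Welch test and reports the p-value directly.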

6 Experiments

6.1 Simple string distance metrics

In our first experiment we tested the basic non-recursive metrics described in Sect. 4. The results are given in Table 2. Smith–Waterman achieved the best AA accuracy scores for both datasets, whereas WLCS was the best metric w.r.t. SR accuracy for PFN-1, followed by the Smith–Waterman metrics. In the case of PFN-2, the Smith–Waterman family of metrics achieved the best results in SR accuracy, although the figures of around 60% are not impressive.

Table 2 The accuracy results for simple string distance metrics

The Smith–Waterman metrics and JWM achieve the best results in AAR accuracy for PFN-1, whereas WLCS performs best for PFN-2 since it copes best with the inverted order of first name and surname in the PFN-2 dataset. Finally, WLCS significantly outperforms all other metrics in the RA category for both datasets.

As an alternative to simple string distance metrics we experimented with a simple technique which, for a given name \(s = s_1 s_2\), where \(s_1\) and \(s_2\) are tokens representing the first name and the surname, respectively, returns as an answer all names s′ in the search space for which the total length of common prefixes with s is above a certain threshold. Intuitively, with such a method one could achieve fairly good results if the order of first name and surname is known. However, our experiment on the PFN-1 dataset revealed that the top-scoring string distance metrics clearly outperformed this technique.
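The prefix-threshold technique can be sketched as follows. The sketch assumes that every name consists of exactly two white-space separated tokens in the same order, and the threshold value used in the example is arbitrary; the function names are illustrative.

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_matches(name: str, search_space, threshold: int) -> list:
    """Names whose first-name and surname prefixes shared with `name`
    have a total length above the threshold."""
    first, last = name.split()
    hits = []
    for candidate in search_space:
        c_first, c_last = candidate.split()
        if lcp_len(first, c_first) + lcp_len(last, c_last) > threshold:
            hits.append(candidate)
    return hits


space = ["Marek Belka", "Marek Belski", "Jan Kowalski"]
print(prefix_matches("Marka Belki", space, threshold=6))
```

Since the tokens are compared position by position, the method breaks down as soon as the order of first name and surname is inverted, as it is in part of PFN-2.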

6.2 Fine-tuning Smith–Waterman metrics

Encouraged by the observation that the Smith–Waterman metrics turned out to be among the best for both the PFN-1 and PFN-2 datasets, we conducted additional experiments (reported also in Piskorski et al. 2008) in order to optimize their accuracy, as follows.

The Smith–Waterman metric depends on numerous parameters including MinCost, MaxCost and GapCost (default values are −2.0, 1.0 and 0.5, respectively). We applied random search through this three-dimensional parameter space, repeating the experiment 500 times. Checking only a small fraction of the possible parameter settings resulted in an accuracy improvement for PFN-1 when compared to the default setting. The top accuracy results, achieved with MinCost = −0.55391, MaxCost = 0.29161 and GapCost = 0.11144, are presented in Table 3. In the remaining part of this article we will refer to the ‘optimized’ versions of these Smith–Waterman metrics as \(SW_2\) and \(SW\text{-}D_2\), respectively.
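The role of the three parameters can be illustrated with the Python sketch below. Several points are assumptions of the sketch rather than facts from the text: MaxCost is used as the score of a matching character pair, MinCost as the score of a mismatch, GapCost as a penalty subtracted per gap position, and the sampling ranges in `random_search` are invented for illustration.

```python
import random


def smith_waterman(s, t, min_cost=-2.0, max_cost=1.0, gap_cost=0.5):
    """Best local alignment score of s and t (raw, not normalized)."""
    best = 0.0
    prev = [0.0] * (len(t) + 1)
    for a in s:
        cur = [0.0]
        for j, b in enumerate(t, start=1):
            sub = max_cost if a == b else min_cost
            score = max(0.0, prev[j - 1] + sub, prev[j] - gap_cost, cur[j - 1] - gap_cost)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def sw(s, t, min_cost=-2.0, max_cost=1.0, gap_cost=0.5):
    """Score normalized by the length of the shorter string."""
    if not s or not t:
        return 0.0
    return smith_waterman(s, t, min_cost, max_cost, gap_cost) / (min(len(s), len(t)) * max_cost)


def sw_dice(s, t, min_cost=-2.0, max_cost=1.0, gap_cost=0.5):
    """Score normalized by the average length of the two strings (SW-D)."""
    if not s or not t:
        return 0.0
    return smith_waterman(s, t, min_cost, max_cost, gap_cost) / ((len(s) + len(t)) / 2 * max_cost)


def random_search(objective, trials=500, seed=0):
    """Sample parameter triples uniformly (assumed ranges) and keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = {
            "min_cost": rng.uniform(-3.0, 0.0),
            "max_cost": rng.uniform(0.1, 2.0),
            "gap_cost": rng.uniform(0.0, 1.0),
        }
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


print(sw("Kowalski", "Kowalskiego"), sw_dice("Kowalski", "Kowalskiego"))
```

In an actual tuning run, `objective` would evaluate one of the accuracy measures of Sect. 5.2 on the dataset for the sampled parameter triple.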

Table 3 The top results for optimized Smith–Waterman metrics

As for the substitution cost matrix, we also experimented with various search heuristics, including random search, grid search, hill climbing and simulated annealing, for searching the parameter space of around 1000 dimensions. Random search improved the AAR measure by 1.7% with respect to the values achieved for the default setting. To be more precise, the top score achieved for the random search through the substitution matrix space of the Smith–Waterman with Dice Coefficient metric was: 77.3% (AA), 78.6% (SR), 91.4% (AAR) and 92.6% (RA). Regular grid search around the best setting did not improve the results significantly. Furthermore, applying simulated annealing to the default setting yielded only an insignificant improvement over it. Therefore, we omit the details of these experiments.

6.3 Recursive string distance metrics

In some settings, recursive metrics performed significantly better than others. In particular, the Monge–Elkan scheme performed best with \(CP_{\delta_{2}}\) as the internal metric, and somewhat worse results were obtained with JWM and \(CP_{\delta_{2}}\) as the internal metrics. The top 10 results in all accuracy categories are summarized in Table 4. For the PFN-1 dataset, an improvement of about 4–5% could be achieved for AA, AAR and SR compared to the top results for the basic metrics. On PFN-1, the \(CP_{\delta_{2}}\) algorithm is the best for all four metrics, and the difference is statistically significant at the 0.1% significance level. For PFN-2, the dataset containing more complex entries, the top results in AA and SR accuracy are only slightly better (by 1.3% and 0.1%, respectively). However, the top AAR accuracy is about 10% higher. Here too, the \(CP_{\delta_{2}}\) algorithm is statistically better than the others for the AA and SR metrics, and \(CP_{\delta_{1}}\) for the AAR metric, at the 0.1% significance level; for RA, however, the superiority of the latter over the former is not statistically significant even at the 5% level (the p-value here is 0.949).

Table 4 The accuracy results for recursive string distance metrics

Summing up, the \(CP_{\delta}\) group is the best here and the difference is statistically significant.

6.4 Combining metrics

The first and obvious way of merging distance metrics is to combine the ‘best’ metrics in SR accuracy with the ‘best’ metrics in the AA category. Let us assume that two metrics m 1 (good in SR) and m 2 (good in AA accuracy) are to be merged. The idea is to first use m 1 and, if it returns a single answer, return it; otherwise, return the result of applying m 2. The pseudo-code of the corresponding algorithm CombinedMostSimilar is given in Fig. 1.

Fig. 1

The algorithm CombinedMostSimilar. s denotes the input string and Space denotes the search space. The function MostSimilar(m 1, s, Space) returns for the metric m 1 and the string s the most similar string(s) in the search space Space. Please note that there is potentially more than one string in the search space whose distance from s is the smallest
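The fallback scheme of CombinedMostSimilar can be sketched as follows. The sketch assumes that metrics are functions returning a distance (smaller is more similar) and that m 2 is applied to the whole search space when m 1 yields several answers; the function names are hypothetical and the code is not a transcription of Fig. 1.

```python
def most_similar(metric, s, space):
    """All strings in `space` at minimal distance from `s` under `metric`."""
    if not space:
        return []
    dists = [metric(s, c) for c in space]
    best = min(dists)
    return [c for c, d in zip(space, dists) if d == best]


def combined_most_similar(m1, m2, s, space):
    """Use m1 first; fall back to m2 when m1 does not give a single answer."""
    answer = most_similar(m1, s, space)
    if len(answer) == 1:
        return answer
    return most_similar(m2, s, space)  # assumption: m2 searches the full space


# Two toy distance functions, used only to exercise the sketch.
def length_distance(a, b):
    return abs(len(a) - len(b))


def mismatch_distance(a, b):
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
```

In the experiments, m 1 and m 2 would be metrics such as Monge–Elkan & \(CP_{\delta_{2}}\) and JW; the toy distances above merely make the control flow testable.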

Application of the algorithm CombinedMostSimilar to PFN-1 revealed that the best results in AA accuracy (around 87.0–87.4%) could be achieved (unsurprisingly) with Monge–Elkan & \(CP_{\delta_{2}}\) as m 1 and Jaro metrics as m 2. In particular, the best results were achieved with JW (87.4%) and JWM (87.34%). Compared to the recursive metrics, an improvement of 2.8% can be observed. The top scores for SR were similar to those for recursive metrics, i.e., around 88%. The top result (88.3%) was achieved with Monge–Elkan & \(CP_{\delta_{2}}\) (m 1) and Monge–Elkan & \(CP_{\delta_{1}}\) (m 2). The AAR accuracy could be improved by about 3.4%. The best scores (96.7%) in this category were obtained with WLCS (m 1) and Monge–Elkan & \(CP_{\delta_{2}}\) (m 2). Finally, in the RA category, the best results (97.93%) were achieved by combining WLCS (m 1) and Monge–Elkan & \(CP_{\delta_{2}}\) (m 2).

Similarly, the AA and AAR scores for PFN-2 could be improved (by 2.45% and 2.71%, respectively). Again, for AA the best results (61.25%) were achieved with Monge–Elkan & \(CP_{\delta_{2}}\) (m 1) and JW or JWM (m 2). As for AAR, many combinations with m 1 being Monge–Elkan & \(CP_{\delta_{2}}\), Sorted-Tokens & WLCS or Permuted-Tokens & WLCS, and m 2 being JWM, JW, WLCS or Smith–Waterman, yield an AAR score between 96.1% and 96.81%. In particular, the top score (96.81%) was achieved with Monge–Elkan & \(CP_{\delta_{2}}\) (m 1) and LCS (m 2). The best SR accuracy for PFN-2 (62.3%) was achieved with Monge–Elkan & \(CP_{\delta_{2}}\) (m 1) and Levenshtein (m 2). Finally, the best RA score (98.8%) was obtained with Sorted-Tokens & WLCS (m 1) combined with \({SW\hbox{-}D_{2}}\) (m 2).

Another variant of the algorithm CombinedMostSimilar first computes all strings whose distance from s under m 1 is among the k smallest distance values in the search space. These strings then constitute the search space for the metric m 2 in the second step. The corresponding pseudo-code (CombinedMostSimilar-2) is presented in Fig. 2.

Fig. 2

Algorithm CombinedMostSimilar-2. The method GetKthDistanceValue(m 1,s,Space,k) returns the k-th ‘least’ distance value for the string s in the search space Space
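A minimal sketch of the two-step variant is given below. It assumes that the k-th ‘least’ distance value is taken over distinct distance values and that fewer than k distinct values simply select the whole search space; the helper names are hypothetical and the toy distances serve only as stand-ins for real metrics.

```python
def get_kth_distance_value(m1, s, space, k):
    """k-th smallest distinct distance value of `s` to strings in `space`."""
    values = sorted({m1(s, c) for c in space})
    return values[min(k, len(values)) - 1]


def combined_most_similar_2(m1, m2, s, space, k):
    """Restrict the space to the k closest distance levels under m1,
    then return the most similar string(s) under m2 in that subspace."""
    if not space:
        return []
    kth = get_kth_distance_value(m1, s, space, k)
    subspace = [c for c in space if m1(s, c) <= kth]
    dists = [m2(s, c) for c in subspace]
    best = min(dists)
    return [c for c, d in zip(subspace, dists) if d == best]


# Toy distances used only to exercise the sketch.
def length_distance(a, b):
    return abs(len(a) - len(b))


def mismatch_distance(a, b):
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
```

With k = 1 only the strings at the smallest m 1 distance reach the second step, so m 2 acts purely as a tie-breaker; larger values of k let m 2 overrule m 1.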

Surprisingly, the application of this variant to PFN-1 did not result in accuracy figures that differ statistically significantly from those obtained with CombinedMostSimilar, even at the 10% significance level. The top-ranking settings in each category involved Jaro–Winkler, Smith–Waterman and WLCS as m 1, and Monge–Elkan & \(CP_{\delta_{2}}\) as the m 2 metric. In particular, the best score in each category was achieved with WLCS (m 1) and Monge–Elkan & \(CP_{\delta_{2}}\) (m 2). See Table 5 for details. In contrast to PFN-1, a significant improvement could be obtained with the algorithm CombinedMostSimilar-2 on PFN-2. In particular, the top scores for AA, SR and AAR were improved over the recursive metrics by 6.5%, 9.1%, and 1.2%, respectively. The top metric combinations are given in Table 6.

Table 5 Top results for CombinedMostSimilar-2, PFN-1
Table 6 Top AA, SR, AAR, and RA results for CombinedMostSimilar-2 on PFN-2 with k = 3

Finally, we experimented with ‘merging’ the results of various distance metrics by computing a global rank, which is a linear combination of the corresponding distance values. Since the top score achieved in this way with Monge–Elkan & \(CP_{\delta_{2}}\), Monge–Elkan & \(CP_{\delta_{1}}\), WLCS, SW-D, and JWM did not result in an improvement of the accuracy (AA = 82.9%, SR = 85.0%, AAR = 94.4%, and RA = 96.5%) we dropped this line of explorations.

6.5 Pattern-based method

In our next experiment, we explored whether a simplistic lemmatization model based on automatically acquired suffix-based patterns can improve the accuracy. From a large set of training data we automatically acquired a set of triples (TrainedTriples) of the form (s infl, s base, r), where s infl is a suffix of an inflected word form, s base is the corresponding suffix in the base form for s infl, and r is the rank, which is calculated as the frequency of the pair (s infl, s base) in the training data raised to the power of |s infl| (i.e., longer suffixes are promoted). We considered all pairs of suffixes of length up to five characters. The training data consisted of 1,093,149 noun entries extracted from the morphologically tagged dictionary of the Morfologik project (Miłkowski 2007). These suffix-based patterns were then used to select the base form whenever the given string distance metric returned a multi-result answer. The algorithm (PatternBased) is given in Fig. 3.

Fig. 3

The algorithm PatternBased. In line 5, a call to SelectUsingPatterns method returns for the string s the preferred base form from the list of candidates in the search space Cand, by ranking of suffix-based lemmatization patterns, which match s. The pseudo-code is given in Fig. 4
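The acquisition of the suffix-based triples can be sketched as follows. The sketch assumes that the training data is available as (inflected form, base form) pairs and that a suffix pair is generated for every split point at which both forms share the same leading stem; this extraction rule and the function names are illustrative assumptions. Only the rank formula (pair frequency raised to the power of the inflected suffix length) and the length limit of five characters are taken from the description above.

```python
from collections import Counter

MAX_SUFFIX_LEN = 5


def suffix_pairs(inflected, base, max_len=MAX_SUFFIX_LEN):
    """Suffix pairs (s_infl, s_base) obtained by cutting both forms after a
    shared stem, keeping only suffixes of at most `max_len` characters."""
    shared = 0
    for x, y in zip(inflected, base):
        if x != y:
            break
        shared += 1
    pairs = []
    for i in range(shared + 1):
        s_infl, s_base = inflected[i:], base[i:]
        if len(s_infl) <= max_len and len(s_base) <= max_len:
            pairs.append((s_infl, s_base))
    return pairs


def train_triples(word_pairs):
    """Build (s_infl, s_base, rank) triples with rank = frequency ** |s_infl|."""
    freq = Counter()
    for inflected, base in word_pairs:
        freq.update(suffix_pairs(inflected, base))
    return [(si, sb, f ** len(si)) for (si, sb), f in freq.items()]
```

For toy pairs such as (kota, kot) and (pilota, pilot), the pair (ota, ot) occurs twice and therefore receives the rank 2³ = 8, whereas the shorter pair (a, empty suffix) receives only 2¹ = 2, which illustrates how longer suffixes are promoted.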

The top results for PFN-1 and PFN-2 tested with simple and recursive metrics are given in Table 7. These results were obtained with k = 1, i.e., considering only the smallest distance value. Unfortunately, increasing the value of k did not improve the accuracy figures. As can be observed, the suffix-based algorithm turned out to perform significantly better for both PFN-1 and PFN-2 in all categories when compared to the best results obtained for simple and recursive metrics.

Table 7 Top accuracy results for PatternBased algorithm

Furthermore, in the case of PFN-1 the results are statistically significantly better than those obtained with both CombinedMostSimilar algorithms for AA and AAR at the 0.1% significance level. For PFN-2 the top AA and SR scores obtained with the suffix-based algorithm are worse by 3.9% and 8.3%, respectively, whereas the AAR score is better by 3.2% compared to CombinedMostSimilar-2, and these differences are statistically significant at the 0.1% level.

Next, we explored whether deploying the CombinedMostSimilar algorithms as the m metric in the PatternBased algorithm yields any improvement. We call this variant PatternBased-2. The top scores for both datasets are given in Table 8. Whether there is an improvement is uncertain, since the observed differences are statistically significant only at around the 5% significance level.

Table 8 Top results for PatternBased-2 algorithm (CMS and CMS-2 stand for CombinedMostSimilar and CombinedMostSimilar-2, respectively)

6.6 Pattern-based method with candidate pre-selection

Subsequently, we experimented with the suffix-based patterns in another way, i.e., by replacing the string distance metric used in the PatternBased algorithm with a candidate pre-selection heuristic, which for a given name s = s 1 … s k (where the s i denote tokens, not characters) accepts only those names s′ = s′ 1 … s′ k in the search space for which the length of the common prefix of each pair of corresponding tokens in s and s′ is at least 50% of the length of the token in s. The tokens constituting the names are sorted alphabetically before the heuristic is applied. In this manner, the ‘candidate’ sets were significantly larger than in the case of the methods introduced in previous sections. We refer to this algorithm as PatternBased-WithPreselection. The pseudo-code is given in Fig. 5.

Fig. 4

Algorithm SelectUsingPatterns for selecting base forms. Initially (lines 3–4), lemmatization patterns for the first name and the surname, respectively, are created. Subsequently, for each candidate c (line 6), we select from the lemmatization pattern sets the ones which are compatible with c, i.e., the corresponding ‘stem’ part of the pattern matches c, and which have the highest rank (calls to BestPattern in lines 9–10). Candidate c is then assigned a rank (line 11), which is a linear combination of the rank of the best first-name pattern and the rank of the best surname pattern (in our experiments α and β are set to 0.5). Finally, the candidate with the best rank (or several candidates, if they share the same rank) is returned (line 13)
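The ranking step of SelectUsingPatterns can be sketched as below, under several assumptions: names consist of exactly two whitespace-separated tokens, a pattern (s infl, s base, r) is compatible with a candidate token when replacing the suffix s infl of the input token by s base yields exactly that candidate token, and a token without any compatible pattern contributes the rank 0. The function names are hypothetical; only the linear combination with α = β = 0.5 is taken from the caption above.

```python
def matching_patterns(token, triples):
    """Patterns whose inflected suffix is a suffix of `token`."""
    return [(si, sb, r) for si, sb, r in triples if token.endswith(si)]


def best_pattern_rank(token, cand_token, patterns):
    """Highest rank among patterns that rewrite `token` into `cand_token`."""
    best = 0.0
    for s_infl, s_base, rank in patterns:
        stem = token[:len(token) - len(s_infl)]
        if stem + s_base == cand_token:
            best = max(best, rank)
    return best


def select_using_patterns(s, candidates, triples, alpha=0.5, beta=0.5):
    """Return the candidate(s) with the best combined pattern rank."""
    first, surname = s.split()  # assumption: first name followed by surname
    first_patterns = matching_patterns(first, triples)
    surname_patterns = matching_patterns(surname, triples)
    ranked = []
    for cand in candidates:
        parts = cand.split()
        if len(parts) != 2:
            ranked.append((0.0, cand))
            continue
        rank = (alpha * best_pattern_rank(first, parts[0], first_patterns)
                + beta * best_pattern_rank(surname, parts[1], surname_patterns))
        ranked.append((rank, cand))
    if not ranked:
        return []
    top = max(r for r, _ in ranked)
    return [c for r, c in ranked if r == top]
```

With made-up triples such as (a, empty suffix, 4), (a, a, 1) and (iego, i, 27), the input Jana Kowalskiego ranks the candidate Jan Kowalski at 0.5 · 4 + 0.5 · 27 = 15.5 and the candidate Jana Kowalska at 0.5, so the masculine base form is selected.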

Fig. 5

Algorithm PatternBased-WithPreselection
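The candidate pre-selection heuristic used in PatternBased-WithPreselection can be sketched as follows. The sketch covers only the pre-selection step, not the subsequent pattern-based choice, and it assumes whitespace tokenization and that names with a different number of tokens are rejected; the function names are hypothetical.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def preselect(name, space, ratio=0.5):
    """Keep names whose alphabetically sorted tokens each share a prefix with
    the corresponding sorted token of `name` covering at least `ratio` of
    that token's length."""
    tokens = sorted(name.split())
    selected = []
    for cand in space:
        cand_tokens = sorted(cand.split())
        if len(cand_tokens) != len(tokens):
            continue  # assumption: token counts must agree
        if all(common_prefix_len(t, c) >= ratio * len(t)
               for t, c in zip(tokens, cand_tokens)):
            selected.append(cand)
    return selected
```

Because the tokens are sorted first, an inverted input such as Kowalskiego Jana still selects Jan Kowalski and Janowi Kowalskiemu, which also illustrates why the resulting candidate sets are comparatively large.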

All accuracy results for PFN-1 were significantly worse than the best overall scores obtained so far. Clearly, one could not expect to gain anything w.r.t. AAR and RA due to larger candidate sets. Surprisingly, the results for PFN-2 in AA and SR category could be improved. All figures are given in Table 9.

Table 9 The results of applying PatternBased-WithPreselection algorithm

Another technique, which we experimented with is learning the suffix-based lemmatization patterns for first names and the corresponding surnames in parallel. In particular, we carried out an experiment, in which we learned from PFN-1 patterns of the form {(f infl , f base ),(s infl , s base )}, where f infl (s infl ), and f base (s base ) stand for the corresponding suffixes in the inflected first name (surname) and the base form, respectively. They were then used as follows. For an input name in PFN-1, all ‘compatible’ patterns were applied for producing candidate base forms by performing appropriate suffix transitions. In case a candidate base form was in the search space, it was added to the result. The obtained accuracy figures were significantly lower than the top scores achieved so far. Nevertheless, once larger training data is provided, this method should be studied more thoroughly.

6.7 Utilization of contextual information

A known technique for disambiguating person names in web documents is to utilize the local context in which a person name mention appears (Bagga and Baldwin 1998; Mann and Yarowsky 2003; Fleischman and Hovy 2004; Pedersen et al. 2005; Bollegalla et al. 2008). In our final, preliminary experiment we explored whether any accuracy improvement could be gained by utilizing such contextual information. In particular, we first used the PatternBased algorithm described in Sect. 6.5 to pre-select candidates and subsequently, in case of multiple answers, we used a score measuring the similarity of the contexts in which the current name and the candidates appear in order to select the correct candidate(s). The pseudo-code of this method (ContextBased-MostSimilar) is given in Fig. 6.

Fig. 6

The algorithm ContextBased-MostSimilar. Cand denotes the candidate set returned by one of the algorithms described earlier (here PatternBased). The function Context-Similarity(s,c) returns the score for the similarity of the contexts s and c appear in. Several techniques have been used for implementing the aforementioned function

The context similarity score Context-Similarity(s,c) has been computed in various ways. In the first variant, we compute for a given name s and a candidate c a ‘local context’ consisting of a bag of unique words extracted from all paragraphs in which s and c, respectively, occurred in the Rzeczpospolita corpus. We denote these sets as LC(s) and LC(c), respectively. Next, we used the Jaccard coefficient (Footnote 5) for computing the similarity between LC(s) and LC(c). We will refer to this variant as Jaccard.
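The Jaccard variant can be illustrated with the following sketch. It uses the standard definition of the Jaccard coefficient, |A ∩ B| / |A ∪ B|, and makes simplifying assumptions about how local contexts are built: paragraphs are plain strings, a name occurs in a paragraph if it is a substring of it, and words are obtained by lower-casing and whitespace splitting. The function names are hypothetical.

```python
def local_context(name, paragraphs):
    """Set of unique (lower-cased) words from all paragraphs mentioning `name`."""
    words = set()
    for paragraph in paragraphs:
        if name in paragraph:  # assumption: naive substring match
            words.update(w.lower() for w in paragraph.split())
    return words


def jaccard(a, b):
    """Jaccard coefficient of two sets; defined as 0.0 if both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def context_similarity(s, c, paragraphs):
    """Jaccard similarity of the local contexts of names `s` and `c`."""
    return jaccard(local_context(s, paragraphs), local_context(c, paragraphs))
```

A production version would need proper tokenization and name matching; the substring test above would, for example, also count a paragraph mentioning Jan Kowalskiego as a mention of Jan Kowalski.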

In the second variant, we assumed that an inflected form and the corresponding base form co-occur in the same documents. In particular, we used point-wise mutual information (PMI; Footnote 6) in the following way. For a current name s and a candidate c we compute the ratio hits(s AND c)/hits(c) (we dropped log 2 since we are looking for maximum values), i.e., the ratio of the number of documents in the corpus in which both s and c appear to the number of documents in the corpus in which c appears. This ratio is a measure of the degree of statistical dependence between s and c. We used not only the Rzeczpospolita corpus in this context, but also the Web, by submitting queries to the MSN Live search engine (http://www.live.com) in a manner similar to the one described in Bollegalla et al. (2007). We will refer to these two sub-variants as PMI-RZ and PMI-MSN, respectively.
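The co-occurrence ratio can be sketched as below. The sketch assumes that each document is represented by the set of name strings it mentions (name extraction is taken as given) and that a candidate which never occurs receives the score 0; the function names are hypothetical. Dropping the logarithm does not change which candidate maximizes the score, since the logarithm is monotonic.

```python
def hits(docs, *names):
    """Number of documents that mention all of the given names."""
    return sum(1 for doc in docs if all(n in doc for n in names))


def cooccurrence_ratio(s, c, docs):
    """hits(s AND c) / hits(c); 0.0 if the candidate never occurs."""
    denom = hits(docs, c)
    return hits(docs, s, c) / denom if denom else 0.0


def best_candidates(s, candidates, docs):
    """Candidate(s) maximizing the co-occurrence ratio with `s`."""
    if not candidates:
        return []
    scores = [cooccurrence_ratio(s, c, docs) for c in candidates]
    top = max(scores)
    return [c for c, sc in zip(candidates, scores) if sc == top]
```

For the PMI-MSN sub-variant, hits would be replaced by document counts returned by a search engine rather than counts over a local corpus.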

The evaluation revealed that such a context-based approach achieves results which are among the top ones obtained so far; however, the differences (for the settings tested up to now) with respect to the pattern-based methods are not all statistically significant. It seems that some gain can be obtained for AAR and RA accuracy. To be more precise, the best results for PFN-2 reported in Table 7 for AAR and RA (ST & WLCS setting) could be further improved by 0.2% and 0.1%, respectively (the first difference is statistically significant at the 1% significance level), by using the ContextBased-MostSimilar algorithm with the Jaccard and PMI-RZ variants of the Context-Similarity function. We also modified the algorithm ContextBased-MostSimilar by replacing PatternBased with PatternBased-2. Analogously, no improvement over the results reported in Table 8 could be gained except for AAR, which was improved by 0.2% for PFN-1 and by 0.1% for PFN-2. Thus, the best performing algorithms could not be significantly boosted by using local context information in the way described above. We believe that this is mainly due to the fact that the ‘correct’ candidates are filtered out in the pre-selection phase (line 1 in the ContextBased-MostSimilar algorithm), so that context information cannot be fully exploited. However, some initial experiments with various values of the k parameter (different sizes of the candidate set) revealed that no improvement could be gained. In Sect. 7 we give a detailed error analysis, where ideas for improving the pre-selection phase are addressed. It is also important to note that PMI-RZ and Jaccard performed on average better than PMI-MSN.

We also experimented with using contextual information for scoring candidates in case of multiple-result answers returned by the other algorithms described in previous sections. For instance, in the case of the PatternBased-WithPreselection algorithm the changes varied from a decrease in the SR measure by around 3% to an increase of around 5% for the AAR metric. Although contextual information happened to improve the accuracy in many settings for the particular algorithms, the overall top scores obtained so far could not be further improved.

7 Discussion

In Sect. 6, we have presented some selected results of our numerous experiments on measuring lemmatization and name matching accuracy for several knowledge-poor methods. In particular, we have measured AA accuracy, which says how often a single-result answer constituting the base form could be returned (multiple-result answers are counted as false positives, i.e., they are penalized). Furthermore, SR accuracy measures the precision of single-result answers w.r.t. returning a base form. Next, AAR accuracy measure gives the precision of returning the base form or some other variant of the same name, where multiple-result answers are penalized again. Finally, the RA metric, the most ‘relaxed’ one, gives the percentage of results, which are either single-result answer or multiple-result answer, where all returned strings in the answer are either the base form or other variant of the same name.
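The four accuracy measures recalled above can be made concrete with a small sketch. It assumes that every test item carries the returned answer (a list of strings), the base form, and the set of all variants of the same name (including the base form), and that AA, AAR and RA are normalized by the number of test items while SR is normalized by the number of single-result answers. These normalizations are our reading of the prose definitions and the data layout is hypothetical.

```python
def accuracy_scores(items):
    """items: iterable of (answer, base, variants) with
    answer   - list of strings returned by a method,
    base     - the correct base form,
    variants - set of all forms of the same name, including `base`.
    Returns a dict with AA, SR, AAR and RA as fractions in [0, 1]."""
    total = singles = aa = aar = ra = 0
    for answer, base, variants in items:
        total += 1
        if len(answer) == 1:
            singles += 1
            if answer[0] == base:
                aa += 1   # single answer that is the base form
            if answer[0] in variants:
                aar += 1  # single answer that is the base form or a variant
        if answer and all(a in variants for a in answer):
            ra += 1       # every returned string refers to the same name
    return {
        "AA": aa / total if total else 0.0,
        "SR": aa / singles if singles else 0.0,
        "AAR": aar / total if total else 0.0,
        "RA": ra / total if total else 0.0,
    }
```

On four toy items (a correct base form, a single inflected variant, a two-element answer containing only variants, and a wrong name), the sketch yields AA = 0.25, SR = 1/3, AAR = 0.5 and RA = 0.75, which shows how RA is the most ‘relaxed’ of the four measures.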

In order to get a better picture of all the results achieved with various techniques, an overview of the best AA, SR, AAR and RA accuracy figures is given in Figs. 7 and 8. To be more precise, Fig. 7 refers to the top results achieved for the lemmatization task (AA and SR accuracy), whereas Fig. 8 reports the best results for the name matching task (AAR and RA accuracy). The symbols S, R, CMS, CMS-2, PMS, PMS-2, PWP, and CXT correspond to simple metrics, recursive metrics, CombinedMostSimilar algorithms, PatternBased algorithms, PatternBased-WithPreselection, and the ContextBased-MostSimilar method, respectively.

Fig. 7

Summary of the AA (left) and SR (right) accuracy

Fig. 8

Summary of the AAR (left) and RA (right) accuracy

To summarize, the pattern-based and context-based algorithms perform best in our settings, and in most cases their superiority is statistically significant, as reported in detail in the previous section.

More precisely, as can be observed, in AA accuracy one can gain by combining string distance metrics and further improve the accuracy figures by integrating automatically acquired suffix-based patterns for ‘best’ candidate selection. Going beyond the 90% mark seems to be hard (see Fig. 7). In the case of the PFN-2 dataset, which contains hard-to-tackle cases (e.g., inversions), the AA accuracy figures are not very impressive (not displayed in the diagrams), but this is due to the fact that in many cases the inverted base forms are returned as the result (which is penalized). Most of the errors encountered in the AA category were due to matching a variant of the same name, but not the base form itself. Additionally, some of the errors were caused by homonymy of male and female variants of the same first name and by complex transliteration rules for inflected names in Polish. A detailed error analysis is given later in this section.

As for AAR accuracy, similarly to AA, one could obtain the best results for both datasets by combining string distance metrics and further significantly improve the accuracy by integrating automatically acquired suffix-based patterns. Due to the definition of AAR, the PFN-1 and PFN-2 results did not differ much, except for the simple metrics. Remarkably, an almost optimal score could be achieved (98.8%). Consequently, the best methods in the AAR category presented here are sufficient for performing person name matching tasks in Polish. Interestingly, the utilization of the local context in which the names occur in the news articles did not turn out to boost the accuracy significantly. As a matter of fact, the AAR accuracy was the only one which could be improved, namely by 0.2%. Most likely, deployment of more sophisticated linguistic techniques would not be highly beneficial either.

The situation with SR is a bit different. The performance of almost all techniques which go beyond the simple metrics is around 88–89% (see Fig. 7). Analogously to AA, the figures for PFN-2 are not very impressive, but we could at least improve the SR figures by amalgamating various string distance metrics and other lightweight techniques.

As for RA, most of the top accuracy figures achieved with the various methods for both datasets ranged between 97% (simple metrics) and 99% (ContextBased-MostSimilar). Notably, the relatively good results for AAR and RA accuracy should not be overestimated, since matching ‘some’ variants correctly is not equivalent to merging all variants of the same name into one cluster.

In order to get a better insight into the most prevalent errors we have calculated some statistics for all four accuracy metrics. In particular, we investigated the ‘top’ scoring algorithms for each accuracy metric. The results of this analysis for PFN-1 and PFN-2 are given in Tables 10 and 11, respectively. The third column gives the total number of errors, whereas the remaining columns give the fraction of errors of different types. We differentiate between five types of errors:

  • Type A: The returned answer is a morphological variant of the name considered, but not the base form itself. This results from the fact that for many metrics distance between inflected variants is frequently smaller than between an inflected form and the corresponding base form, e.g., dist(‘Ramazzottiemu’, ‘Ramazzottiego’) is less than the value of dist(‘Ramazzottiemu’, ‘Ramazzotti’) for some metric dist, where Ramazzotti is the base form. This error type applies only to AA and SR.

  • Type B: The returned answer contains more than one name, where each such name is either the base form or a morphological variant of the name being considered. This error type applies only to AA and AAR.

  • Type C: The returned answer contains one or more names which refer to another person with a similar name. There are three subtypes in this category: (a) identical first name and similar but different surname, (b) identical surname and similar but different first name, and (c) both first name and surname are similar but different. Similarly to type-A errors, errors of this type are caused by certain limitations of string distance metrics. Let us consider as an example the name Marka Kubiaka, a genitive form of the masculine name Marek Kubiak. One might find person names in the search space with a surname similar to Kubiak, e.g., Kubisz, for which dist(‘Marka Kubiaka’, ‘Marka Kubisza’) is less than dist(‘Marka Kubiaka’, ‘Marek Kubiak’) for some metric dist (mainly due to alternations in the stem).

  • Type D: The error is caused by homonymy of male and female variants of the same first name, i.e., the returned answer refers to a person with the same surname and first name, but the gender of the first name is different. For instance, Stanisława Polaka (masc. gen.) could be mapped to Stanisława Polak (fem. nom.) instead of being mapped to Stanisław Polak (masc. nom.). See Table 1 in Sect. 2 for the full declension of these names. Interestingly, most of the Polish first names that end in sław have a female variant (e.g., Bronisław (masc.) versus Bronisława (fem.), Czesław (masc.) versus Czesława (fem.), Wiesław (masc.) versus Wiesława (fem.)). In the whole Rzeczpospolita corpus (Weiss 2007) (see Sect. 3) we found circa 50 distinct first names ending in sław which have a female variant. There also exist other first-name suffixes which exhibit the same phenomenon, e.g., mir, but they occur less frequently than the ones ending in sław.

  • Type E: The name being considered is a foreign name and the name(s) returned as an answer is either the base form or an inflected form of the same name, but the spelling of the returned name(s) is incorrect due to the declension rules for foreign names and transliteration issues (see Sect. 2). It is important to mention in this context that in Polish a base form of a person name preserves original spelling while inflected versions use Polish transliteration, e.g., Julii Tymoszenko (gen. fem.) versus Julia Timoszenko (nom. fem.). The substitution of i by y in the middle of the aforementioned name is penalized by some metrics/techniques. The main idea of errors of type E is to penalize the accuracy if we match current name with its orthographically incorrect variants found in the corpus. As a matter of fact the information on errors in this category reflects how frequently humans incorrectly inflect foreign names in Polish.
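The examples given for errors of type A and type C can be checked with a concrete metric. The sketch below uses the plain unit-cost Levenshtein distance, which is only one possible choice for dist and is used here as an illustrative assumption, not as the metric behind the reported figures.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Type A: the dative form is closer to the genitive form than to the base form.
type_a = (levenshtein("Ramazzottiemu", "Ramazzottiego"),
          levenshtein("Ramazzottiemu", "Ramazzotti"))

# Type C: the genitive form of another person's name is closer than the base form.
type_c = (levenshtein("Marka Kubiaka", "Marka Kubisza"),
          levenshtein("Marka Kubiaka", "Marek Kubiak"))
```

Under this metric both pairs evaluate to (2, 3): in each case the wrong string is at distance 2, whereas the correct base form is at distance 3, so a nearest-neighbour search returns the wrong answer.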

Table 10 The error analysis for PFN-1 and best settings for each accuracy metric
Table 11 The error analysis for PFN-2 for AAR and RA accuracy metrics

As can be observed, the errors of type A account for the majority of errors in AA and SR accuracy. Consequently, it appears very unlikely that the accuracy results in these categories could be improved with string distance metrics alone, due to their limitations mentioned earlier, e.g., the distance between inflected variants of a given person name is often smaller than that between an inflected form and the corresponding base form. At first glance, the utilization of local context might not be very useful, since different variants of the same name might occur in the same or a similar context, i.e., in sentences or paragraphs that are related to the same event. However, if one assumes that the nominative form of a given name appears most frequently among all its variants, then one could modify the CXT algorithm by: (a) relaxing the pre-selection of candidates so that a larger number of potential candidates is returned, and (b) selecting the candidate which appears most frequently in a local context similar to that of the current name, for which the base form is being searched. To illustrate the idea more precisely, consider the name Jana Kowalskiego (masc. gen.). By relaxing the candidate pre-selection phase one would expect the following names to appear in the preliminary answer set (provided that they appear in the corpus): Janem Kowalskim (masc. ins.), Janowi Kowalskiemu (masc. dat.), Jan Kowalski (masc. nom.), Janie Kowalskim (masc. loc.). Subsequently, a corpus of text documents (e.g., results returned by a Web search engine) could be used for identifying the context in which the name Jana Kowalskiego appears most frequently. Next, for each of the names in the candidate set one could compute the number of occurrences in the text collection, provided that the context in which the candidate appears is similar to the ‘most’ frequent context of Jana Kowalskiego computed before, i.e., candidates appearing in a different context would be discarded. Finally, one would return the candidate with the highest frequency and expect the nominative form Jan Kowalski to score highest. Intuitively, such an approach might yield a gain in AA accuracy and could be explored in the future.

Since the number of errors of type B is not significant we do not discuss them here. The errors of type C constitute the second largest group of errors. These numbers indicate the importance of using local context information for improving the accuracy in the same way as described in the previous paragraph. A thorough analysis of the errors revealed that in most cases the correct base form or other variants were filtered out in the pre-selection phase. Thus the next logical step would be to experiment with relaxation of the pre-selection phase of the CXT algorithm.

As for AAR and RA accuracy the errors of type D pose a significant problem. In particular, in the context of this type of errors contextual information might come in handy for disambiguation purposes. However, the error numbers in case of the CXT method show that no significant improvements could be gained to reduce this type of errors. Once again, we believe that fine-tuning the pre-selection phase might alleviate the problem. Contrary to the above, in the case of SR accuracy the total number of errors of type D could be reduced by deploying the CXT algorithm.

Finally, we can notice that errors of type E constitute a relatively minor problem in the case of the PFN-1 and PFN-2 corpora. As a matter of fact, they are treated here as ‘errors’ only because the PFN-1 and PFN-2 data sets do not include orthographically incorrect name variants. Nevertheless, our intention was to highlight this interesting problem humans have when inflecting some foreign names in Polish due to the complex transliteration rules. Interestingly, when we consider only foreign person names in the test corpora, the corresponding error rate is significantly higher, in particular in the case of PFN-2 (see the column labeled ‘type E*’ in Table 11). Foreign names account for 46% (829) and 45% (694) of all names in PFN-1 and PFN-2, respectively.

To sum up the error analysis, we believe that a larger test data set should be used in order to get a better insight. Additionally, we pinpointed some ways in which the method using local context information for selecting the correct answer could be improved in a subsequent step.

8 Summary and outlook

In this article, we have studied the usability of several knowledge-poor methods for supporting and tackling the task of matching Polish person names and their lemmatization. The presented techniques utilize string distance metrics, combinations thereof and automatically acquired suffix-based lemmatization patterns. Furthermore, we also utilized the local context in which the names appear in news articles. The major aim of our work was to explore how good the results obtained with such lightweight techniques can be without linguistic sophistication. For solving some of the tasks they seem to be sufficient, whereas for other tasks, e.g., lemmatization, deployment of more elaborate techniques might result in better accuracy.

We hope that the results presented in this article constitute useful guidelines for developing a fully fledged solution to person name matching for Polish and similar highly inflectional languages. To our knowledge, this is one of the first efforts on tackling the person name matching and lemmatization task in Polish using linguistically poor methods.

In future work, we plan to experiment with some machine learning techniques for tackling the task. For instance, in Lindén (2008) a new probabilistic model for determining base forms for previously unseen words by analogy with a set of word and base form pairs has been introduced. This new language-independent method for automatically learning a base form guesser achieves a recall of 89–99% and a precision of 76–94%, without any a priori knowledge of the declension paradigm. It would be interesting to try utilizing this approach in the context of lemmatizing Polish person names and other tasks related to name matching. However, the aforementioned technique requires a large amount of training data.

Also, the context-based approach seems to be promising since its performance is among the best and still seems to leave some room for improvement, due to the fact that the context and similarity can be computed in numerous ways which have not been explored yet. In particular, the techniques described in Bagga and Baldwin (1998), Mann and Yarowsky (2003), Fleischman and Hovy (2004), Pedersen et al. (2005), Bollegalla et al. (2008), and Fernandez et al. (2007) could be fully explored in the future.

Finally, we intend to apply the methods presented in this article in a framework for clustering large collections of Polish web pages according to the persons mentioned in these pages.

To sum up, lemmatization of proper names and name matching in highly inflectional languages pose an interesting and challenging problem. We strongly believe that work in this area is of paramount importance in the context of improving web search quality, since the number of non-English pages on the Web is increasing. Apart from that, most (if not all) commercial search engines do not seem to be capable of dealing with person name ‘normalization’ in Polish and other similar highly inflectional languages with complex person name declension paradigms. Consequently, a significant number of web pages containing query-relevant information can never be found, unless one searches for inflected name forms.

The significance of the results presented in this article would further benefit if the experiments were carried out on a larger data set. Therefore, we are continuously extending the data set and envisage making these data publicly available to the research community.