Open Access (CC BY-NC-ND 3.0 license). Published by De Gruyter, March 24, 2016

A Memory-Based Learning Approach for Named Entity Recognition in Hindi

  • Kamal Sarkar and Sudhir Kumar Shaw

Abstract

Named entity (NE) recognition (NER) is a process to identify and classify atomic elements such as person name, organization name, place/location name, quantities, temporal expressions, and monetary expressions in running text. In this paper, the Hindi NER task has been mapped into a multiclass learning problem, where the classes are NE tags. This paper presents a solution to this Hindi NER problem using a memory-based learning method. A set of simple and composite features, which includes binary, nominal, and string features, has been defined and incorporated into the proposed model. A relatively small Hindi Gazetteer list has also been employed to enhance the system performance. A comparative study on the experimental results obtained by the memory-based NER system proposed in this paper and a hidden Markov model (HMM)-based NER system shows that the performance of the proposed memory-based NER system is comparable to the HMM-based NER system.

1 Introduction

Named entity (NE) recognition (NER) is an important preprocessing component of almost all natural language processing (NLP) applications, such as information extraction, machine translation [1], question answering [16], text summarization, information retrieval, and ontology development (e.g. clinical vocabularies may be developed by extracting terms directly from clinical reports [15]).

The objective of NER is to identify NEs in a running text and map them to predefined classes such as location names (cities, countries, places, etc.), person names (names of people), organization names (companies, government organizations, committees, etc.), temporal expressions (date, time, period), miscellaneous names (monetary expressions, materials, artifacts, quantity, measurement expressions, etc.), and “none of the above.” NER in an Indian language like Hindi is a challenging task because, unlike English, Indian languages lack the capitalization information that plays a very important role in identifying NEs.

The term NER was first coined for the 6th and 7th Message Understanding Conferences, which were convened to discuss the problem of recognizing names, temporal expressions, and monetary expressions in documents. NER was later defined as the shared task of the CoNLL 2003 conference, where the goal was to tag noun phrases with four classes: person (PER), organization (ORG), location (LOC), and miscellaneous (MISC).

In this paper, we present a memory-based learner for NER in Hindi. Hindi is the third most spoken language in the world, and it is the national language of India. The work reported in this paper is different from existing NER systems in terms of the following points:

  • In many previous works on NER for Hindi [8, 17, 19], the researchers only considered four broad NE categories, namely person names, location names, organization names, and miscellaneous, whereas we have considered a more fine-grained set of 22 NE categories that can be grouped into three broad types: (i) NE categories for name expressions – person, organization, location, facilities, locomotive, artifact, entertainment, materials, living things, plants, disease; (ii) NE categories for number expressions – distance, money, quantity, count; (iii) NE categories for time expressions – time, year, month, date, day, period, Sday (special day such as Christmas day).

  • We have used a set of composite features along with a set of individual features for NER.

  • We have employed a memory-based learning method for the Hindi NER task.

1.1 Issues Related to NER in Hindi Language

The task of building an NE recognizer for the Hindi language presents several issues related to its linguistic characteristics. Some of the issues faced by the Hindi language are the following:

  • Lack of capitalization information: Unlike English and many European languages, Indian languages (like Hindi) lack capitalization information in the NEs, which plays a very important role in identifying NEs in those languages. For example, English names always start with capital letters, while Hindi names do not have capitalization cues.

  • Ambiguous names: Hindi names are relatively ambiguous in the sense that many person names have other specific meanings. For example, Puja can be the name of a person, or it can be a common noun meaning “worship.” This issue makes Hindi NER a very difficult task.

  • Scarcity of resources and tools: Hindi is also a resource-poor language. Name dictionaries, good morphological analyzers, annotated corpora, part-of-speech (POS) taggers, etc., are not yet available with the required quality and accuracy.

  • Variation in spelling: Another important issue is the variation in the spellings of proper names. This problem increases the number of tokens to be learnt by the machine and calls for a higher-level task like co-reference resolution.

  • Inflection: Hindi is an inflected language, and it provides a rich and challenging set of linguistic and statistical features that result in long and complex word forms.

2 Related Work

The NER task has received increasing attention from NLP researchers over the last decades due to its importance in many NLP tasks: (i) information extraction, which seeks to locate and classify atomic elements in text; (ii) question answering, which also requires extracting important information from questions and generating appropriate answers; (iii) machine translation; and (iv) other applications. Three commonly used approaches to NER are the linguistic approach, the machine learning (ML)-based approach, and the hybrid approach.

The linguistic approach [4] mainly uses handcrafted rules that are created by linguists, so the skills and experience of the linguist largely determine whether the overall system reaches the desired performance. The main advantages of the rule-based approach are that the extraction of complex entities can be fine-tuned and that it does not require a large amount of annotated data. However, handcrafted rules are language dependent, so they may not be applicable to other languages.

The ML-based technique for NER tasks requires large amounts of NE-annotated training data to acquire a higher level of language knowledge. Sekine [23] used decision tree learning for Japanese NE recognition. The most commonly used ML methods for the NER task are the hidden Markov model (HMM), conditional random fields (CRFs), and support vector machine (SVM). Each of these ML approaches has its own advantages and disadvantages. HMM [2] was found to be very effective in sequential labeling problems. CRF [9] is a probabilistic approach, flexible to capture many closely connected features, including overlapping and non-independent features. SVM [24, 25] separates the classes by drawing a decision boundary between them, and a query instance is predicted to belong to a category based on which side of the gap it falls on.

Saha et al. [18] presented a hybrid approach that combines rule-based and ML-based methods, creating new methods that take the strongest points from each. They also employed Gazetteer lists in their NER system. They experimented with Hindi NER using a training set of >5 lakh (500,000) words and a test set of 38,704 words, and achieved F-measures of 66.08, 67.50, and 65.13 for maximal, nested, and lexical-level evaluation, respectively.

Cucerzan and Yarowsky [6] studied the NER task for Hindi as part of their language-independent NER work that used morphological and contextual evidence. They experimented with five languages – Romanian, English, Greek, Turkish, and Hindi – and among these, the accuracy for Hindi was the worst: a 41.70% F-measure with a very low recall of 27.84% and about 85% precision. Li and McCallum [14] presented a more successful Hindi NER system that uses CRFs with feature induction; they achieved a 71.50% F-value using a training set of 340k words. Better accuracy for Hindi was achieved by Kumar and Bhattacharyya [13], whose maximum entropy Markov model-based system gives a 79.7% F-value.

The k-nearest neighbor (KNN) method that we have used for the NER task is a memory-based learning method [20, 22, 27]. Compared to the previous ML-based approaches, the main advantage of a memory-based NER system is that it is less affected by the sparse data problem, as the KNN approach provides a solution to this problem via an implicit similarity-based smoothing scheme. Our choice of the KNN approach was motivated by its simplicity, flexibility to incorporate different data types, adaptability to irregular feature spaces, and capability to directly handle string features that facilitate defining the context of words.

3 Data Set Description

The data set that we have used in our experiment has been taken from the NLP tool contest on NER for Indian Languages, conducted in association with ICON 2013 (http://ltrc.iiit.ac.in/icon/2013/nlptools/). The data sets released for the tool contest were POS tagged and chunked. The goal of this contest was to perform NE recognition on a variety of types such as artifact, entertainment, facilities, location, locomotive, materials, organisms, organization, person, plants, count, distance, money, quantity, date, day, period, time, and year.

The data set is available in SSF (Shakti Standard Format) (http://ltrc.iiit.ac.in/nlptools2010/files/documents/SSF.pdf). Sentence-level SSF is used to store the analysis of a sentence, which gives POS and chunk information for the tokens in the sentence. Each line represents a token or group information; lines containing “))” only indicate the end of a group, and the symbol “((” indicates the start of a group. Each token or group line has three parts: the first part stores the tree address of the token or group (this is for human readability only); the second part stores the token or group itself; and the third part stores the POS tag or group/phrase category (chunk information).

To enhance the performance of our proposed model, we have treated chunk boundaries as tokens in our model. The chunk boundaries are indicated by double opening brackets “((” and double closing brackets “)).” Most of the time, these chunk boundaries act as separators for the NEs, which helps in finding NE boundaries. For example, consider the sample sentence in Figure 1 containing NEs. As we can see in Figure 1, the chunk boundaries (opening and closing double brackets) play a separator role for NE recognition; from the chunk boundaries, we can have partial knowledge of the NE boundaries within the sentence.

Figure 1: A Sample Training Sentence in SSF Format.

In the ICON 2013 NLP tool contest on NER for Indian Languages, three labeled data sets were initially given to contestants to evaluate their systems: training set, development set, and test set. The detailed description on our used experimental data sets is given in Table 1.

Table 1: Number of Words (Including Punctuations) and NE Tags Available in Training, Development, and Test Data.

| Data set | Number of words | Number of NEs present |
|---|---|---|
| Training set | 68,608 | 4646 |
| Development set | 10,678 | 1058 |
| Test set | 8944 | 652 |

3.1 Preprocessing of Data

We have considered the phrase chunk boundary information as tokens. However, in the SSF format, the closing phrase boundary indicated by “))” has no tag. Thus, to maintain uniformity, we have defined for the token “))” one POS tag “END” and one chunk tag “XXX-E,” where XXX is the name of the respective group (chunk). For example, if a “))” is the closing bracket for an NP chunk, XXX gets the value of NP and the chunk tag becomes NP-E, which is assigned for the token “)).”

In the SSF format, the token “((” indicating opening phrase boundary has a chunk tag but has no POS tag. Here, also an additional POS tag is required to maintain uniformity. Thus, we assign our defined POS tag to this kind of token. For example, when the chunk type for the token “((” is “YYY,” we assign a POS tag YYY-OPEN for the token “((”; that is, if the chunk type is NP, the POS tag for the token “((” is NP-OPEN and if the chunk type is JJP, the POS tag for the token “((” is JJP-OPEN.
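To make these conventions concrete, the following minimal Python sketch (ours, not the authors' code) assigns the POS and chunk tags described above to the two chunk-boundary tokens; the chunk type (e.g. NP, JJP) is assumed to be known from the group line in the SSF input.

```python
def tag_bracket_token(token, chunk_type):
    """Return (pos_tag, chunk_tag) for a chunk-boundary token, following
    the conventions defined in Section 3.1."""
    if token == "))":
        # Closing bracket: POS tag "END", chunk tag XXX-E
        return "END", chunk_type + "-E"
    if token == "((":
        # Opening bracket: POS tag YYY-OPEN; its chunk tag is the
        # IOB-style beginning tag of the group, as in Table 2
        return chunk_type + "-OPEN", chunk_type + "-B"
    raise ValueError("not a chunk-boundary token: %r" % token)

print(tag_bracket_token("))", "NP"))   # ('END', 'NP-E')
print(tag_bracket_token("((", "JJP"))  # ('JJP-OPEN', 'JJP-B')
```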

After making the above-mentioned changes in the input data, the data set in SSF format is further processed to convert it to the IOB (short for Inside, Outside, Beginning) format (http://www.cnts.ua.ac.be/conll2000/chunking/). In the IOB format, the entities are encoded with IOB tags: the XXX-B tag indicates the first word of an entity of type XXX, XXX-I is used for subsequent words of the entity, and the tag “O” indicates a word that is outside of an NE (i.e. not a part of an NE).

Table 2 contains the POS, chunk, and NE tag information in the new format for the sentence shown in Figure 1.

Table 2: An Example of Preprocessed Data.

| Token | POS tag | Chunk | NE tag |
|---|---|---|---|
| (( | NP-OPEN | NP-B | O |
| Sirmor | NNP | NP-I | LOC-B |
| Mein | PSP | NP-I | O |
| )) | END | NP-E | O |
| (( | NP-OPEN | NP-B | O |
| Bahoot | QF | NP-I | O |
| Se | PSP | NP-I | O |
| )) | END | NP-E | O |
| (( | NP-OPEN | NP-B | O |
| Aakarshan | NN | NP-I | O |
| )) | END | NP-E | O |
| (( | JJP-OPEN | JJP-B | O |
| Uplabdh | JJ | JJP-I | O |
| )) | END | JJP-E | O |
| (( | VGF-OPEN | VGF-B | O |
| Hain | VAUX | VGF-I | O |
| . | SYM | VGF-I | O |
| )) | END | VGF-E | O |

4 Memory-Based Learning Approach

We have used the memory-based learning method for Hindi NER, although only a limited number of NER works have used memory-based learning. The memory-based learning algorithm that we have used for the NER task is the KNN algorithm [10], which works as follows:

  • Stores the training data into memory.

  • When a query instance is encountered, the k nearest neighbors of the query are retrieved from memory by computing the similarities between the training instances and the query instance.

  • Finally, based on the class distribution of these k retrieved instances, the query instance is classified. The KNN approximates the class label of the query instance by assigning the most frequent class label occurring among the k most similar patterns retrieved from memory.

The steps of the KNN algorithm are elaborated in Figure 2; a minimal code sketch follows the figure.

Figure 2: KNN Algorithm.
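The following Python sketch (ours, for illustration) captures these steps; it assumes a similarity function sim such as the one defined in Section 4.2 and stores the training instances as (feature vector, NE tag) pairs.

```python
from collections import Counter

def knn_classify(query, training_data, k, sim):
    """Memory-based classification: retrieve the k stored instances most
    similar to the query and return the majority class label among them."""
    # training_data: list of (feature_vector, ne_tag) pairs kept in memory
    neighbors = sorted(training_data,
                       key=lambda inst: sim(query, inst[0]),
                       reverse=True)[:k]
    votes = Counter(tag for _, tag in neighbors)
    return votes.most_common(1)[0][0]  # most frequent label among the k
```

Tie situations (two or more labels with the same maximal vote count) are handled separately, as described in Section 4.1.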

4.1 Resolving Tie Situations

A simple KNN model may suffer from a tie situation: if the highest vote count among the k nearest neighbors is nmax, at least two class labels may each obtain nmax votes from the nearest neighbors.

To break a tie, we rely on the similarities between the query and its nearest neighbors. If NE classes c1, c2, …, cn (n>1) are in a tie, then for each class label ci in the set of n tied class labels, we separately compute the sum of the similarities between the query and those nearest neighbors that contribute votes for ci, and finally choose the NE class c for which this sum is maximal. The following equation presents this concept more concisely:

C = \arg\max_{i \in \text{NE classes}} \sum_{j=1}^{k} \mathrm{sim}(q, \delta(x_j, i)), \quad \text{where } \delta(x_j, i) = \begin{cases} x_j & \text{if } i = f(x_j), \\ \mho & \text{otherwise.} \end{cases}

Here, ℧ is the null vector, f(x_j) is the class label of the j-th neighbor, and sim(q, ℧) = 0.

The function δ(x_j, i) returns the instance x_j if the class label of x_j and the class label under consideration are the same; otherwise, it returns the null vector. The function sim(x, y) computes the similarity between two vectors x and y using our proposed similarity measure described in Section 4.2.

For example, if a tie situation occurs between the two NE classes PER-B and LOC-B, to break the tie, we sum up the similarity values computed between the query and its nearest neighbors contributing votes for the NE class PER-B, do the same for the nearest neighbors contributing votes for the NE class LOC-B, and finally return the NE class for which the total similarity value is the highest.
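A sketch of this tie-breaking rule in Python (names are illustrative; neighbors holds the k retrieved (feature vector, NE tag) pairs and sim is the similarity measure of Section 4.2):

```python
from collections import Counter

def classify_with_tiebreak(query, neighbors, sim):
    """Majority vote over the k nearest neighbors, with ties broken by
    the summed query-neighbor similarity per tied class."""
    votes = Counter(tag for _, tag in neighbors)
    n_max = max(votes.values())
    tied = [tag for tag, n in votes.items() if n == n_max]
    if len(tied) == 1:
        return tied[0]
    # Tie: for each tied class i, sum sim(q, delta(x_j, i)) over the
    # neighbors voting for i, and take the argmax.
    totals = {tag: sum(sim(query, vec) for vec, t in neighbors if t == tag)
              for tag in tied}
    return max(totals, key=totals.get)
```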

4.2 Similarity Measure

The similarity measure is very crucial in memory-based learning, because the classification of a query instance is based on the k most similar patterns retrieved from the training data. The strength of such a system lies in its capability to compute the similarity between the test instances and the training instances. Different similarity or distance measures are widely used in the KNN model to compute the similarities or distances between training and test instances: Euclidean distance, Minkowski distance, Mahalanobis distance, and cosine similarity. As our feature vector is a combination of nominal (e.g. POS tags), string (e.g. the word itself), and binary (e.g. digit features) attribute values, that is, the instances are characterized by attributes of mixed types, the above-stated distance or similarity measures are not useful in our case. The simplest similarity measure is the overlap measure [7], which compares the corresponding attribute values and adds 1 to the similarity value when they are the same. Thus, we have used a variant of the overlap measure as our similarity measure. According to this measure, when two instances (pattern vectors) are compared, for each categorical attribute, we apply the following rule:

For two given instances, if the value of the attribute in instance 1 is the same as that in the instance 2, then we increment their similarity value by 1.

The similarity sim(xi, xj) between two instances, containing p attributes of mixed types, is defined as

\mathrm{sim}(x_i, x_j) = \frac{\sum_{n=1}^{p} \mathrm{sim}^{(n)}(x_i, x_j)}{p},

where p is the total number of attributes.

sim^(n)(x_i, x_j) is the contribution of attribute n to the similarity between the two instances x_i and x_j. The value of sim^(n)(x_i, x_j) is computed according to the type of the attribute, as stated below:

Binary attribute: If the attribute is binary, sim^(n)(x_i, x_j) = 1 if x_in = x_jn; otherwise, sim^(n)(x_i, x_j) = 0. Here, x_in is the value of the n-th attribute in the instance x_i.

Categorical (nominal) attribute: If the attribute is categorical (nominal) and x_in ≠ x_jn, sim^(n)(x_i, x_j) = 0. If the attribute is categorical (nominal) and x_in = x_jn, sim^(n)(x_i, x_j) = w1, where w1 is set to 1 in most cases; only for a few features is the value of w1 determined based on the current token (the token under consideration for tagging) and the nature of the feature. For example, the value of the Gazetteer feature is determined by looking into the Gazetteer list: if the current token is Gandhi, but the NE with which it matches in the person name Gazetteer list is Mahatma Gandhi, then w1 is set to 0.50 [=1 / (number of words in the matched NE in the list)]. The respective similarity rules applied for other categorical features are presented in Section 5 (“Feature Set”).

String attribute: If the attribute is a string and the attribute values match, sim^(n)(x_i, x_j) = w2.

The value of w2 is set to 1 if x_in = x_jn. However, if an exact match is not found but a suffix or prefix match is found, w2 is set to a value <1. A detailed discussion of the similarity computation for this type of feature is presented in Section 5. If no match is found, sim^(n)(x_i, x_j) = 0.
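A minimal sketch of this mixed-type overlap similarity (our illustration; the simplified string rule here uses the strip-three-characters suffix/prefix matching and the 0.8 partial-match weight described in Section 5):

```python
def overlap_similarity(x, y, attr_types):
    """Variant of the overlap measure for two mixed-type instances x and
    y (equal-length attribute lists). attr_types[n] is 'binary',
    'nominal', or 'string'."""
    total = 0.0
    for n, kind in enumerate(attr_types):
        if kind in ("binary", "nominal"):
            total += 1.0 if x[n] == y[n] else 0.0   # w1 = 1 in most cases
        else:  # string attribute
            if x[n] == y[n]:
                total += 1.0                         # exact match, w2 = 1
            # remaining part must be > 4 chars, i.e. word length > 7
            elif len(x[n]) > 7 and len(y[n]) > 7 and (
                    x[n][3:] == y[n][3:] or x[n][:-3] == y[n][:-3]):
                total += 0.8                         # suffix/prefix match
    return total / len(attr_types)

x = ["Sirmor", "NNP", 1]
y = ["Sirmor", "NN", 1]
print(overlap_similarity(x, y, ["string", "nominal", "binary"]))  # 0.666...
```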

5 Feature Set

Features play a crucial role in identifying and classifying the NEs. While assigning NE tags to the tokens, a sentence is scanned from left to right and the KNN classifier is used to assign the tags to the tokens one by one. Here, a token is represented as a feature vector, that is, a vector of values of the features describing the token. In our setting, a training instance is a feature vector representing a token labeled with a particular NE tag, and a test instance is an unlabeled feature vector representing a token to be labeled. For this task, we have considered a feature set that includes a variety of simple features and their combinations. We have divided the features into two sets: single (simple) features and composite features.

5.1 Simple Features

We have used a set of simple features that are widely used for NER tasks and that we found useful in our case also. These features are detailed below.

5.1.1 Current Token

It is a string feature and an important feature in the NER task. Here, the current token can be a word itself or a chunk boundary (because we have considered chunk boundaries as tokens). When two instances are compared on this feature component for similarity computation, the feature values are first matched exactly (without stripping any characters). If an exact match is found, the value of the similarity between the two instances under consideration is incremented by 1; if the exact match is not found, then suffix or prefix matching is considered. If both cases fail, no increment is made. We consider suffixes or prefixes because they help describe the role of the word [12].

If the values of the current token feature in two different instances do not fully match, then we do the matching after stripping three characters from the beginning; that is, we consider suffix-level matching. If a suffix-level match is not found, prefix-level matching is done; in prefix-level matching, three characters from the end of the words are stripped off. If the length of the prefix (or suffix) is greater than four characters and a prefix (or suffix) match is found, then we increment the similarity value for the two instances by a value (0.8) that is <1. Here, the lower value is used to discriminate between an exact match and a partial match.
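These matching rules can be sketched as follows (our illustration; the strip width, the length threshold, and the 0.8 weight are taken from the text above, while the function name is ours):

```python
def token_feature_similarity(a, b, strip=3, min_len=5, partial=0.8):
    """Similarity contribution of the current-token string feature."""
    if a == b:
        return 1.0                            # exact match
    suf_a, suf_b = a[strip:], b[strip:]       # drop 3 leading characters
    if len(suf_a) >= min_len and suf_a == suf_b:
        return partial                        # suffix-level match
    pre_a, pre_b = a[:-strip], b[:-strip]     # drop 3 trailing characters
    if len(pre_a) >= min_len and pre_a == pre_b:
        return partial                        # prefix-level match
    return 0.0                                # no match
```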

5.1.2 Next Token

It is a string feature. While comparing two instances on this feature component to compute the value of the similarity between them, a decision of increment/no increment in similarity value is taken as per the same rule used for the current token feature mentioned above.

5.1.3 Chunk Information of the Current Token

It is a nominal feature. The chunk information of a word separates one word group from other groups of words; for example, it helps separate noun group words from adjective or verb group words. This is also a very crucial feature because NEs appear in the form of noun phrases. If the two instances to be compared have the same value for this feature component, the similarity value with respect to this feature component is set to 1; otherwise, it is set to 0.

5.1.4 POS Tag of the Current Token

This feature is actually a nominal feature. The POS tag of the current token plays an important role in finding NEs because it gives the important information about the POS type of the token. For example, the POS tag “QC” for a token specifies that it is related to a quantity or number. The similarity computation for this feature component is the same as the other nominal features mentioned above.

5.1.5 POS Tag of the Previous Token

The POS tag of the previous token is defined in the same way as the POS tag of the current token. It is also a nominal feature. The similarity computation for this feature component is the same as for the other nominal features mentioned above.

5.1.6 Infrequent Word

It is a binary feature. As some entities are infrequent in the documents, we check whether the current token is infrequent in the training corpus. If the current token is infrequent in the training corpus, then the value of this feature is set to 1; otherwise, it is set to 0. A list of infrequent words is created from the training corpus. A word is considered to be infrequent if the frequency of the word in the training corpus is ≤3.

5.1.7 First Word

It is a binary feature. The value of this feature is set to 1 if the current token is the first word of a sentence; otherwise, it is set to 0.

5.1.8 Word Length

It is a binary feature. If the length of the current token is greater than three characters, the value of this feature is set to 1; otherwise, it is set to 0.

5.1.9 Next Token’s POS Tag

It is a nominal feature. Its value is the POS tag of the next token.

5.1.10 Chunk Information of the Next Token

It is also a nominal feature. The chunk tag of the next token is used as the feature value.

5.1.11 Gazetteer List

We have employed a small Gazetteer list in our proposed NER method; its description is shown in Table 3. The Gazetteer list is incorporated into our NER task through a nominal feature “Gazetteer” whose value depends on which Gazetteer list contains the current word/token. That is, if nothing is specified and the current token is included in the person name list, the feature value is set to p-name; similarly, for the location list, the feature value becomes l-name, and so on. When two instances are compared on this feature, if a match is found, we set the similarity value to 1*w, where w is determined by how much of the current word/token matches an NE in the corresponding list; that is, w is set to m/n, where m = number of word matches and n = total number of words in the matched NE in the Gazetteer list. For example, if the current token is Gandhi but the NE with which it matches in the person name Gazetteer list is Mahatma Gandhi, then w is set to 0.50 [=1 / (number of words in the matched NE in the list)].

Table 3: Hindi Gazetteer List Used in Our Work.

| Gazetteer | Description | Number of entries | Source |
|---|---|---|---|
| Person | Contains titles of persons, some famous names | 36 | Manually prepared |
| Week days | Contains days of the week in the Hindi and English calendars | 14 | Manually created |
| Entertainment | Contains names of famous games | 41 | Manually created |
| Location | Contains names of famous locations | 101 | Manually created |
| Materials | Contains names of some materials used in daily life | 49 | Manually created |
| Measurement expressions for distance, period | Contains some measurement expressions | 26 | Manually created |
| Month names | Contains names of months in the Hindi and English calendars | 54 | Manually created |
| Organization names | Contains some famous organization names | 40 | Manually created |
| Facility names | Contains some facility names, e.g. ambulance, airline, railway | 28 | Manually created |
| Living things | Contains names of living organisms | 297 | Manually created |
| Counting expressions | Contains some counting expressions like thousand, lakh | 4 | Manually created |
| Plant names | Contains plant names | 30 | Manually created |
| Quantity measurement expressions | Contains expressions used for quantity | 4 | Manually created |
| Artifact | Contains names of religious books, tools, or materials, e.g. marble, knife | 35 | Manually created |

For the following exceptional cases, the value of the Gazetteer feature is determined as follows (a code sketch follows the list):

  1. If the current token contains digits and the next token is a month name (say October), then the value of the Gazetteer feature for the current token is set to d-name (Date name). For example, if the current token is “26” and its next token is a month name, say “October,” the NE class of the current token is closer to the “Date” NE.

  2. If the current token is a month name and the previous token contains a digit, then the value of the Gazetteer feature for the current token is set to “Date.”

  3. If the current token contains digits and the next token is a distance expression like “kilometer” or “meter,” then the value of the Gazetteer feature is set to “Distance.”

  4. Similarly, we developed rules for the NE tags Quantity, Period, Money, and Count: if the current token contains digits and the next token is a Quantity/Period/Money expression, then the value of the Gazetteer feature for the current token is set to the corresponding NE name.
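A minimal sketch of the basic lookup and the w = m/n weighting (our illustration; the dictionary layout and function name are assumptions):

```python
def gazetteer_feature(token, gazetteers):
    """gazetteers maps a feature value (e.g. 'p-name', 'l-name') to the
    entries of the corresponding list. Returns (feature value, w), where
    w = (number of word matches) / (words in the matched list entry)."""
    for value, entries in gazetteers.items():
        for entry in entries:
            if token == entry:
                return value, 1.0          # full match
            if token in entry.split():     # partial match, e.g. 'Gandhi'
                return value, 1.0 / len(entry.split())
    return None, 0.0

gaz = {"p-name": ["Mahatma Gandhi"], "l-name": ["Delhi"]}
print(gazetteer_feature("Gandhi", gaz))    # ('p-name', 0.5)
```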

5.1.12 Digit Features

We have used 11 different digit features presented in Refs. [8, 9] to find different digit patterns. These are binary features. A summary of the digit features used in our NER task is presented in Table 4; a code sketch follows the table.

Table 4: A List of Digit Features.

| Feature | Description | Type |
|---|---|---|
| CntDgt | If the token Wi contains a digit, then it is set to 1, else 0 | Binary |
| TwoDgt | If Wi contains two digits, then it is set to 1, else 0 | Binary |
| FourDgt | If Wi contains four digits, then it is set to 1, else 0 | Binary |
| CntDgtCmma | If Wi contains a digit and a comma, then it is set to 1, else 0 | Binary |
| CntDgtPrd | If Wi contains a digit and a period, then it is set to 1, else 0 | Binary |
| CntDgtSlsh | If Wi contains a digit and a slash, then it is set to 1, else 0 | Binary |
| CntDgtHph | If Wi contains a digit and a hyphen, then it is set to 1, else 0 | Binary |
| CntDgtPrctg | If Wi contains a digit and a percentage sign, then it is set to 1, else 0 | Binary |
| DgtOnly | If Wi contains digits only, then it is set to 1, else 0 | Binary |
| CmaDotAftrDgt | If Wi is a comma, hyphen, or dot and Wi−1 contains a digit, then it is set to 1, else 0 | Binary |
| CmaDotBfrDgt | If Wi is a comma, hyphen, or dot and Wi+1 contains a digit, then it is set to 1, else 0 | Binary |
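A sketch of these features in Python (ours; we read “contains two digits”/“contains four digits” as tokens consisting of exactly two or four digits, which is one plausible interpretation):

```python
import re

def digit_features(w_prev, w, w_next):
    """Binary digit features of Table 4 for token w with its neighbors."""
    has_digit = lambda s: bool(re.search(r"\d", s or ""))
    return {
        "CntDgt":        int(has_digit(w)),
        "TwoDgt":        int(bool(re.fullmatch(r"\d{2}", w))),
        "FourDgt":       int(bool(re.fullmatch(r"\d{4}", w))),
        "CntDgtCmma":    int(has_digit(w) and "," in w),
        "CntDgtPrd":     int(has_digit(w) and "." in w),
        "CntDgtSlsh":    int(has_digit(w) and "/" in w),
        "CntDgtHph":     int(has_digit(w) and "-" in w),
        "CntDgtPrctg":   int(has_digit(w) and "%" in w),
        "DgtOnly":       int(bool(re.fullmatch(r"\d+", w))),
        "CmaDotAftrDgt": int(w in {",", "-", "."} and has_digit(w_prev)),
        "CmaDotBfrDgt":  int(w in {",", "-", "."} and has_digit(w_next)),
    }

print(digit_features("1479", "-", "1531")["CmaDotAftrDgt"])  # 1
```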

5.1.13 Previous NE Tag

It is a dynamic feature because its value is obtained while assigning NE tags to the words of a sentence scanned from left to right. This feature carries important information and plays an important role in classifying the current token because, many times, the same NE class continues over consecutive tokens. It is also a nominal feature.

5.2 Composite Features

We have used several combinations of features that act as single features, i.e. combining the effect of several features into a single feature. As all NEs are noun phrases, it is quite difficult to classify them into different entity classes; we have observed that these noun entities are classified better with the help of composite features, which are nothing but combinations of several single features. The main objective of using these features is to correctly classify tokens by using their associated surrounding information. For example, consider the following two sentences.

  1. England won the World Cup (England ne World Cup jeeta).

  2. The World Cup took place in England (World Cup England mein khela gaya).

Here, in sentence 1, the word England is an Organization name and, in sentence 2, it is a Location name. For resolving this ambiguity, we can associate several contextual features with the current token (England) to generate a single feature that is expected to have the ability to resolve ambiguity to some extent.

As the objective of the composite features is to utilize contextual information in resolving tag ambiguities, we have considered both short and relatively long contexts while designing composite features. This leads to a number of composite features that are finally optimized by using the backward elimination method. The different composite features that we initially considered are discussed below. To illustrate the values of the composite features, we take the example of the tagged sentence shown in Figure 3.

Figure 3: A Sample POS-Tagged and Chunked Training Sentence (Chosen from the Training Data).

5.2.1 Current Token and POS Tag of the Next Token and NE Tag of the Previous Token

This feature is a combination of the current token, POS tag of the next token, and NE tag of the previous token. If we consider the above sequence as a composite feature, then the attribute values for different single features will be taken together. For example, for the first word in the sentence shown in Figure 3, this composite feature has the following parts:

  1. October.

  2. END [the tag “END” is not part of the IOB format; it is inserted by us to assign a tag to the token “))”].

  3. “O” (outside) (because the current token is the first word of the sentence).

If the current token is an entity, then this feature may help determine whether the current token is a single-word NE or part of a multiword NE. For example, if the current token is a single-word entity, it should be tagged as XXX-B; if it is a middle part of an entity, it should be tagged as XXX-I, where XXX is an entity name like Person, Location, etc.

When comparing two instances on this feature, the exact match (all parts match) is considered. The similarity value is set to 1 for an exact match, 0 otherwise. In other words, the value of the similarity between two instances under comparison is incremented by 1 when an exact match between the values corresponding to this feature component is found; otherwise, no increment is made.

If nothing is specified, for the remaining composite features discussed below, the same rule is also applied while computing the similarity between any two instances.
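A composite feature can be represented simply as a tuple of its single-feature values, compared by exact match (a sketch, ours):

```python
def composite_feature(*parts):
    """A composite feature value is the tuple of its single-feature
    values; two instances match on it only if all parts are equal."""
    return tuple(parts)

def composite_similarity(a, b):
    return 1.0 if a == b else 0.0  # exact match over all parts

f1 = composite_feature("October", "END", "O")   # Section 5.2.1 example
f2 = composite_feature("October", "END", "O")
print(composite_similarity(f1, f2))  # 1.0
```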

5.2.2 Current Token and Previous Token’s Chunk Information and NE Tag of the Previous Token

This feature is a combination of the current token, the previous token’s chunk information, and the NE tag of the previous token. For example, for the first word in the sentence shown in Figure 3, this composite feature has the following parts:

  1. October.

  2. NP-B [previous token “((” has the chunk tag NP-B as per the IOB format].

  3. O.

When comparing two instances on this feature component, the exact match (all parts match) is considered. The similarity value is set to 1 for an exact match, otherwise 0.

This feature is designed to detect whether the current token is the beginning of an NE or not.

5.2.3 Current Token and POS Tag of the Next Token and NE Tag of the Previous Token and Chunk Information of the Next Token

This feature is a combination of the current token, POS tag of the next token, NE tag of the previous token, and chunk information of the next token. For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. END.

  3. O.

  4. NP-E [this is also assigned by us for the token “))”].

When comparing two instances on this feature component, the exact match is considered. The component similarity value is set to 1 for a match, otherwise 0.

This feature is designed to detect whether the current token is at the end of an NE or not.

5.2.4 Current Token and POS Tag of the Next Token and NE Tag of the Previous Token and Chunk Information of the Next Token and POS Tag of the Next of Next Token

This feature is a combination of several single features that include the current token, POS tag of the next token, NE tag of the previous token, chunk information of the next token, and POS tag of the next of next token. For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. END.

  3. O.

  4. NP-E.

  5. CCP-Open.

When comparing two instances on this feature component, the exact match is considered. The component similarity value is set to 1 for a match, otherwise 0.

This feature is assumed to utilize the left and right context of the current token to distinguish among the different types of NEs.

5.2.5 Current Token and NE Tag of the Previous Token and Punctuation Symbols

This feature includes only three single features called current token, NE tag of the previous token, and punctuation symbols. For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. O.

  3. no (If the current token is a dot, comma, or hyphen, then its value is set to “yes,” and “no” otherwise.).

The main purpose of this feature is to find the continuation of an NE class where punctuation symbols (dot, comma, and hyphen) appear within NE classes. For example:

  • <Location> Uttar – Pradesh </Location>

  • <Entertainment> River – Rafting </Entertainment>

In the above example, we can see that a hyphen is present within the NE classes (Location and Entertainment). Similarly, the feature helps find the continuation of an NE within some digit patterns where digits are separated by punctuation symbols; for example, the following sequences of tokens contain hyphenated numbers and a number containing a comma inside.

  • <Period> 1479 – 1531 </Period>

  • <Count> 10, 000 </Count>

When comparing two instances on this feature component, the exact match is considered. The component similarity value is set to 1 for a match, otherwise 0.

5.2.6 Current Token and Previous Token

This feature includes only two single features: current token and previous token to create a composite feature.

For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. ((.

The value of this composite feature is the combination of the values of the current token and the previous token.

When we compare two instances on this feature component, we set the component similarity value to 1 if both have the same value for this feature component. However, if the values match partially (i.e. the value of the current token portion in the composite feature value matches, but only the suffix or the prefix of the previous token gets matched), then we set the component similarity value to a value (0.8), which is <1, to discriminate between a full match and a partial match. We set the component similarity to 0 if the current token is not matched at all.

We assume that this feature may distinguish between the types of NEs. For example, in the sentence segments “England won” and “in England,” the first occurrence of “England” is the organization name and the second occurrence is the name of a place.

5.2.7 Current Token and Next Token

This feature includes only two single features, current token and next token, to create a composite feature.

For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. )).

When we compare two instances on this feature component, the component similarity value is computed in the same way as it is done for “current token and previous token.”

We assume that this feature may also distinguish between the types of NEs. For example, in the sentence segments “England is a place for” and “England won,” the first occurrence of “England” is a place name and the second occurrence of “England” is an organization name.

5.2.8 Current Token, Previous POS, and Previous Chunk

This feature includes only three single features, current token, previous POS, and previous chunk, to create a composite feature.

For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. NP-OPEN.

  3. NP-B [previous token “((” has the chunk tag NP-B as per the IOB format].

When comparing two instances on this feature component, the exact match is considered. The component similarity value is set to 1 for a match, otherwise 0.

Using this feature, we associate the current token with the POS tag of the previous token and chunk information of the previous token. The objective here is to utilize the left context of the current token in determining its NE tag.

5.2.9 Current Token, Next POS, and Next Chunk

This feature includes only three single features: current token, next POS, and next chunk. For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. END.

  3. NP-E.

When comparing two instances on this feature component, the exact match is considered. The component similarity value is set to 1 for a match, otherwise 0.

With this feature, we associate the current token with its right context to utilize it in determining the NE tag of the current token.

5.2.10 Current Token, Previous POS, Previous Chunk, Next POS, and Next Chunk

This feature includes only five single features: current token, previous POS, previous chunk, next POS, and next chunk.

For example, for the first word in the sentence shown in Figure 3, the composite feature has the following parts:

  1. October.

  2. NP-OPEN.

  3. NP-B.

  4. END.

  5. NP-E.

When comparing two instances on this feature component, the exact match is considered. The component similarity value is set to 1 for a match, otherwise 0.

With this feature, we utilize both the left context and right context of the current token in determining the NE tag to be assigned to it.

6 Postprocessing Rules

Several postprocessing techniques have been adopted to increase the performance of the classifier by correcting erroneous tags. We analyzed the most frequent errors made by our model and crafted rules to correct them (a code sketch follows the rules):

  1. If a token is tagged as NE-I (where NE can be any NE tag such as PER, LOC, ORG, etc.) and the previous token is tagged as O (outside tag), then we replace the NE-I tag by NE-B; that is, any isolated NE-I is replaced by NE-B.

  2. A second case of erroneous tagging is encountered when the tags of more than one NE appear within a tag sequence, e.g. “O PER-B PER-I LOC-I O.” One possible solution for this type of error is to assign the NE whose tags are most frequent in the sequence. However, this may not work when at least two NEs have tags occurring in equal number in the tag sequence; for example, consider the following situation:

“O PER-B LOC-I LOC-I PER-I O,” where the tags for the NEs PER (person) and LOC (location) have the same frequency, i.e. 2.

To handle this situation, we have used the confidence value of the classifier in assigning an NE tag. The total sum of the confidence values over the tags for each possible NE is computed, and the sequence is assigned the NE category for which the highest total confidence value is obtained. The confidence value of the classifier in assigning an NE tag to a token is calculated as the ratio between the frequency of that NE tag among the labels of the k nearest neighbors and the value of k.

According to the above-mentioned method, to break the tie situation for the “O PER-B LOC-I LOC-I PER-I O,” we have to sum up the confidence values for the tags PER-B and PER-I and also sum up the confidence values for the tags LOC-I and LOC-I. Finally, the two values are compared to decide whether the token sequence corresponding to the tag sequence should be classified as PERSON NE or LOCATION NE. If the token sequence is finally classified as PERSON NE, the corresponding tag sequence is converted in the IOB format before writing the token sequence and the associated tag sequence to the output file; that is, “O PER-B LOC-I LOC-I PER-I O” is converted to “O PER-B PER-I PER-I PER-I O.”
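A sketch of these two rules in Python (ours; the per-token confidence values are assumed to be the kNN vote ratios described above):

```python
from collections import Counter

def fix_isolated_inside_tags(tags):
    """Rule 1: an NE-I tag whose previous tag is O becomes NE-B."""
    out = list(tags)
    for i, t in enumerate(out):
        if t.endswith("-I") and (i == 0 or out[i - 1] == "O"):
            out[i] = t[:-2] + "-B"
    return out

def resolve_mixed_sequence(tags, confidences):
    """Rule 2: relabel a mixed B-I...I run with the NE whose summed
    classifier confidence is highest, then rewrite it in IOB form."""
    totals = Counter()
    for t, c in zip(tags, confidences):
        if t != "O":
            totals[t.rsplit("-", 1)[0]] += c
    best = totals.most_common(1)[0][0]
    fixed, begun = [], False
    for t in tags:
        if t == "O":
            fixed.append("O")
            begun = False
        else:
            fixed.append(best + ("-I" if begun else "-B"))
            begun = True
    return fixed

print(resolve_mixed_sequence(
    ["O", "PER-B", "LOC-I", "LOC-I", "PER-I", "O"],
    [0.0, 0.8, 0.6, 0.6, 0.7, 0.0]))
# ['O', 'PER-B', 'PER-I', 'PER-I', 'PER-I', 'O']
```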

7 Backward Elimination Method

One of the problems with the KNN algorithm is that it suffers from the curse of dimensionality [26]: in a high-dimensional space, the neighborhood of a given point becomes very sparse, resulting in high variance. One solution is to shrink the unimportant dimensions of the feature space, bringing more relevant neighbors close to the target point; the backward elimination approach [5] is one way to do this. In our work, to remove the relatively irrelevant attributes, we have used the backward elimination method, which works as follows:

  • Start with the full set of features and greedily remove the feature whose removal most improves performance.

Using the backward elimination method, we have identified a set of features whose removal improves the system performance. The details of the identified unimportant features are discussed in Section 10.
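A sketch of this greedy procedure (ours; evaluate is assumed to train the system on a feature subset and return its development-set F-measure):

```python
def backward_elimination(features, evaluate):
    """Greedily drop the feature whose removal most improves the score;
    stop when no single removal helps."""
    current = list(features)
    best_score = evaluate(current)
    improved = True
    while improved and len(current) > 1:
        improved = False
        for f in list(current):
            score = evaluate([g for g in current if g != f])
            if score > best_score:
                best_score, best_drop, improved = score, f, True
        if improved:
            current.remove(best_drop)
    return current, best_score
```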

8 HMM-Based NER Model

To prove the effectiveness of our proposed method, we have compared the performance of our proposed NER system with an HMM-based NER system that was submitted and evaluated in the ICON 2013 NLP tool contest (http://ltrc.iiit.ac.in/icon/2013/nlptools/). An HMM-based NE tagging approach presented in Ref. [11] considers NE tagging as a sequence labeling task. In general, like POS tagging [21], an HMM-based NE tagging task commonly uses words in a sentence as an observation sequence. However, as the data released for the ICON 2013 NLP tool contest was POS tagged and chunked, the system presented in Ref. [11] uses this additional information: POS tag and chunk tag for the NER task. POS tag and chunk tag are incorporated into the observation symbol, which then becomes a triplet <word, POS-tag, chunk-tag> and hence the observation sequence for a sentence with the word sequence <word1, word2, …, wordn> is considered as (<word1, POS-tag1, chunk-tag1>, <word2, POS-tag2, chunk-tag2>, <word3, POS-tag3, chunk-tag3>, …, <wordn, POS-tagn, chunk-tagn>).

For an HMM-based NER model, the important issues are the data sparseness problem, decoding (finding the best hidden state sequence given an input HMM and a sequence of observations), and the handling of unknown observation tokens. In implementing the HMM-based NER model, we have handled these issues in the following ways.

To handle the data sparseness problem, the deleted interpolation-based smoothing technique [3] has been used. For decoding, the Viterbi algorithm has been used to find the best hidden state sequence given an input HMM and a sequence of observations. To handle unknown triplets in the observation sequence, the observation probability of an unknown triplet is estimated by analyzing the POS tag, chunk tag, and the suffix of the word that constitute the triplet. The observation probabilities of the unknown triplet <word, POS-tag, chunk-tag> corresponding to a word in the input sentence are decided according to the suffix of a pseudo-word formed by adding the POS tag and chunk tag to the end of the word; we find the observation probabilities of such unknown pseudo-words using suffix analysis. For this suffix analysis, when the list of rare words is created, the threshold on word frequency in the training corpus is tuned by considering different possible threshold values, and different possible suffix lengths are also considered while calculating the observation probability of an unknown observation based on its suffix information. The HMM system obtains the best results on the development set when words whose frequency in the training corpus is ≤2 are treated as rare and the maximum suffix length is 9. The details of this HMM-based NE tagger can be found in Ref. [11].
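A sketch of the pseudo-word construction used for unknown observations (ours; only the construction and the candidate suffixes are shown, not the probability estimation):

```python
def pseudo_word(word, pos, chunk):
    """Form the pseudo-word whose suffixes drive the observation
    probability estimate for an unseen triplet <word, POS, chunk>."""
    return word + pos + chunk

def suffixes(s, max_len=9):
    """Candidate suffixes (longest first); max_len = 9 is the best
    value reported above."""
    return [s[-i:] for i in range(min(max_len, len(s)), 0, -1)]

pw = pseudo_word("Sirmor", "NNP", "NP-I")
print(suffixes(pw)[:3])  # ['orNNPNP-I', 'rNNPNP-I', 'NNPNP-I']
```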

The performance of this HMM-based NER system is also tested on the test data set on which our proposed memory-based NER system is tested. As we have used the Gazetteer feature in the KNN-based NER system, we should incorporate the same in the HMM-based system to make them comparable. To incorporate this additional feature in the HMM-based NER system described in Ref. [11], we deploy the Gazetteer list in the training data when the HMM-based system is trained. This actually affects the observation probability of the words that are found in the Gazetteer list.

9 Performance Measures

For system evaluation, we use “exact match evaluation” and accordingly we calculate precision, recall, and F-measure.

Precision is defined as the percentage of NEs found by the system that are correct, and recall is defined as the percentage of NEs present in the solution (answer file) that are found by the system. As our system produces output in the IOB format, “exact match” means the system-assigned labels for the parts of an NE match in order with those for the corresponding NE in the solution.

For example, if there are five true NEs in the answer file, there are five system guesses and only one guess that exactly matches the solution. The precision is therefore 20% and the recall is 20%.

The F-measure is a combination of precision and recall and is calculated as follows:

F\text{-measure} = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 P + R},

where P and R are the precision and recall, respectively, and β weights precision against recall. For system evaluation, we set β to 1.
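As a quick check of the formula (our snippet), with β = 1 this reduces to the harmonic mean 2PR/(P + R):

```python
def f_measure(p, r, beta=1.0):
    """F-measure combining precision p and recall r with weight beta."""
    if p == 0 and r == 0:
        return 0.0
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

print(round(f_measure(75.38, 70.60), 2))  # 72.91, matching Table 6
```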

10 Results

10.1 Results on Development Set

A number of experiments are conducted to optimize the value of k (nearest neighbors) for our proposed KNN-based NER model.

We have tuned the value of k by testing the system on the development data set. Before tuning the value of k, the system is developed with all the features discussed in Section 5. The performance of our developed system on the development set with the different values of k is shown in Table 5. The bold values shown in Table 5 indicate the optimal value of k and the results obtained on the development set by our proposed memory-based NER system when k is set to the optimal value.

Table 5: Effect on Performance of Our Proposed Memory-Based NER System (with the Complete Feature Set) on the Development Set When the Value of k is Varied.

| Value of k | Precision | Recall | F-measure |
|---|---|---|---|
| 1 | 63.52 | 65.03 | 64.26 |
| 2 | 63.63 | 64.83 | 64.22 |
| 3 | 71.45 | 66.72 | 69 |
| 4 | 71.93 | 66.35 | 69.03 |
| **5** | **74.61** | **68.05** | **71.18** |
| 6 | 74.43 | 67.96 | 71.05 |
| 7 | 73.32 | 65.97 | 69.45 |
| 8 | 73.62 | 67.01 | 70.16 |
| 9 | 72.85 | 67.20 | 69.91 |

The overall performance of our proposed model on a development set with the optimized value of k (=5) is given in Table 6.

Table 6: Best Performances Obtained on the Development Set by Our Proposed NER Systems and Our Developed HMM-Based NER System.

| Model | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|
| Memory-based NER system + backward elimination | **75.38** | **70.60** | **72.91** |
| Memory-based NER system (without backward elimination) | 74.61 | 68.05 | 71.18 |
| HMM-based NER model | 68.10 | 64.65 | 66.31 |

Bold values shown in the table indicate the best results obtained on the development set.

Table 6 also shows a performance comparison between the memory-based NER systems presented in this paper and our developed version of the HMM-based NER model presented in Ref. [11], which uses the training data in which we deploy the Gazetteer list in the appropriate format.

After the development of the initial system, parameters are tuned. Then, the value of the parameter k is set to the optimum value of 5 obtained through the tuning process, and the backward elimination method is applied for the removal of unimportant features. We found that 11 features are unimportant. The features that we found unimportant are (i) simple features – next token, infrequent word, first word, word length, next token’s POS tag, and chunk information of next token; (ii) composite features – <current token and previous token>, <current token and next token>, <current token, previous POS, and previous chunk>, <current token, next POS, and next chunk>, and <current token, previous POS, previous chunk, next POS, and next chunk>.

The performance of our developed NER systems on the development set with the optimized number of features is also shown in Table 6. Table 6 shows that removal of unimportant features improves the performance of our proposed memory-based NER system.

10.2 Results on the Test Set

We test our system on the test set by setting the parameters to the values for which we obtain the best results on the development set. The performance comparison of our developed systems on the test set is given in Table 7.

Table 7: Comparisons of System Performances on the Test Data Set.

| Model | Precision (%) | Recall (%) | F-measure (%) |
|---|---|---|---|
| Memory-based NER system + backward elimination | **80.13** | **76.67** | **78.37** |
| Memory-based NER system (without backward elimination) | 78.69 | 73.62 | 76.07 |
| Our implemented HMM-based NER system | 75.55 | 73.47 | 74.50 |

Bold values shown in the table indicate the best results obtained on the test data set.

Table 7 shows that our proposed memory-based NER approach performs better than the HMM-based NER system in terms of precision, recall, and F-measure. Our proposed memory-based NER approach gives better results for many possible reasons:

  • Our proposed system is less affected by the sparse data problem as the KNN approach provides a solution to the sparse data problem via an implicit similarity-based smoothing scheme [22].

  • The KNN approach can directly handle string features that facilitate defining the context of the word.

  • We incorporate a variety of composite features in our model.

  • We also incorporate a Gazetteer list (though it is small in size) in our model.

The results shown in Table 7 also indicate that it is difficult for a Hindi NER task to achieve >80% accuracy when a relatively large number of NE categories is considered (our data set has 22 NE categories). We observe two main reasons: (i) capitalization information, which is useful in English NER, is not available for the Hindi NER task; (ii) as the number of NE categories increases, it becomes harder for a classifier to discriminate among the classes.

11 Evaluation Results of the 10-Fold Cross-Validation Test

To compare the system performances, a 10-fold cross-validation test is performed for each system. For this purpose, we use the entire labeled data (combining the three labeled data sets initially given to contestants: training set, development set, and test set) released for the NLP tool contest held in association with ICON 2013. In this evaluation method, the entire data is divided into 10 equal parts (each containing the same number of sentences); one part is held out for testing, and the remaining 9 parts are combined for training the system. Thus, for each system, 10 test results are collected for the 10 different folds.

A statistical analysis is carried out on the results of the 10-fold cross-validation test obtained by the HMM-based system and the best version of our proposed memory-based NER system. We tested the statistical significance of the difference between precisions of the two systems, as well as their recalls, using a paired t-test. Results are reported in Table 8.
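The significance test can be sketched as follows (our snippet; the per-fold scores below are placeholders, not the paper's fold-level numbers):

```python
from scipy import stats

# Paired t-test over the 10 per-fold precision scores of the two systems.
knn_precision = [72.6, 70.1, 75.3, 68.9, 77.2, 71.4, 74.8, 66.5, 73.0, 76.4]
hmm_precision = [69.3, 67.0, 71.8, 65.2, 74.1, 68.5, 70.9, 62.8, 73.1, 70.2]

t_stat, p_value = stats.ttest_rel(knn_precision, hmm_precision)
print(p_value < 0.05)  # True if significant at the 5% level
```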

Table 8: Results of the 10-Fold Cross-Validation Test.

| Measure | Memory-based NER system + backward elimination | HMM-based NER system | Significance of difference (paired t-test p-value) |
|---|---|---|---|
| Average precision ± SD | 72.62 ± 4.76 | 69.29 ± 5.00 | <0.05 |
| Average recall ± SD | 68.10 ± 5.32 | 68.53 ± 4.67 | >0.05 |

As we can see from Table 8, the average precision of the memory-based NER system is better than that of the HMM-based NER system, and the difference between the precisions of the two systems is statistically significant (p<0.05), whereas the difference between the recalls of the two systems is not statistically significant (p>0.05).

Compared to the HMM system, the proposed memory-based NER system shows an improvement in precision, but no significant change in recall is found. The main drawback of the HMM-based system is that it depends on the suffix of an unknown word (a word that is not present in the training corpus) to predict its NE tag, and a similar suffix may also be present in words of different NE categories. The data sparseness problem is another issue in HMM-based systems; though we have used a smoothing technique to handle it, no smoothing technique is error free. On the other hand, memory-based learning methods like KNN are less affected by problems related to unknown word handling. We can easily incorporate many contextual and word-level features in KNN-based NER systems, and we have done so in our work. We have also incorporated the Gazetteer feature in the KNN-based system in a way that is different from that used in HMM-based systems. This helps improve the precision of KNN.

However, we did not find any significant difference in the recall values of the two systems compared. One reason is the presence of annotation errors in the training data as well as the test data. We observed the following types of error:

  1. A word sequence has been annotated as NE type X, but it should be annotated as NE type Y.

  2. A word sequence is not an NE, but it has been annotated as NE.

  3. A word sequence is an NE, but it is not annotated as NE.

We have observed frequent occurrences of error type 1 in the training data. This type of annotation error occurs in cases where NE types are very close in meaning (e.g. COUNT and QUANTITY, PERIOD and TIME, etc.). In Figure 4, we show some examples of annotation errors; the word sequences shown in bold font have been incorrectly annotated in the data set. We have not corrected them manually because this data set was not created by us; it was taken from the NLP tool contest on NER for Indian languages, conducted in association with ICON 2013. Figure 4 shows that the assignment of the “QUANTITY” and “COUNT” NE tags to word sequences is not consistent.

Figure 4: Examples of Annotation Errors.

The annotation errors affect both precision and recall. For example, assume that an eight-word sentence <w1 w2 … w8> containing several single-word NEs is annotated as <w1/NE-X w2/O w3/NE-Y w4/O w5/O w6/NE-X w7/O w8/NE-Z>, where the annotations w1/NE-X and w6/NE-X are errors: the correct annotations are w1/NE-Y and w6/NE-Y. If the system could tag without any error, it would ideally tag the sentence as <w1/NE-Y w2/O w3/NE-Y w4/O w5/O w6/NE-Y w7/O w8/NE-Z>. Comparing the system-provided tags with the human-provided tags (in which the annotation errors are not corrected), only w3/NE-Y and w8/NE-Z match, so precision=2/4 and recall=2/4. However, if the system behaves differently and tags the sentence as <w1/O w2/O w3/NE-Y w4/O w5/O w6/O w7/O w8/NE-Z>, then precision is 2/2 and recall is 2/4. In the second case, precision improves while recall remains the same. Most likely, this is why our proposed memory-based learning approach obtains an improved precision score without any significant change in the recall score. Of the two situations, the second is preferable: recognizing a word sequence as "not an entity" is better than assigning it an incorrect entity type. We also suggest that managing noise in the data set is an important issue for improving recall.
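The arithmetic in this example can be checked mechanically. The sketch below scores the two cases against the uncorrected gold annotation; the word IDs and tags mirror the toy sentence above.

    # Gold annotation keeps the two errors (w1 and w6 wrongly tagged NE-X).
    gold = {"w1": "NE-X", "w3": "NE-Y", "w6": "NE-X", "w8": "NE-Z"}

    def precision_recall(predicted, gold):
        correct = sum(1 for w, t in predicted.items() if gold.get(w) == t)
        return correct / len(predicted), correct / len(gold)

    # Case 1: the system assigns the (truly correct) NE-Y tags to w1 and w6.
    case1 = {"w1": "NE-Y", "w3": "NE-Y", "w6": "NE-Y", "w8": "NE-Z"}
    print(precision_recall(case1, gold))  # (0.5, 0.5): precision 2/4, recall 2/4

    # Case 2: the system tags w1 and w6 as non-entities, predicting only w3, w8.
    case2 = {"w3": "NE-Y", "w8": "NE-Z"}
    print(precision_recall(case2, gold))  # (1.0, 0.5): precision 2/2, recall 2/4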

12 Conclusion and Future Work

A memory-based NE recognizer for Hindi has been presented in this paper. Its performance has been compared to that of a trigram HMM-based Hindi NE recognizer, using a comprehensive set of features. The experimental results show that the performance of the proposed memory-based NER system is comparable to that of the HMM-based NER system. The NER task considered in our study is more challenging in the sense that our data set uses 22 NE categories, whereas many previous studies on NER have considered only three or four categories, such as person name, location name, organization name, and miscellaneous.

One of the major drawbacks of our proposed memory-based NE recognizer is that it is slow at classification time, because each test instance must be compared against the stored training instances. We plan to apply an efficient KNN search algorithm [28] to speed up the proposed NE recognizer; a sketch of the general idea follows.
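As an illustration of tree-accelerated neighbor search in general (using scikit-learn's ball tree as a stand-in, not the cluster-based trees of [28]):

    import numpy as np
    from sklearn.neighbors import BallTree

    # A ball tree answers k-NN queries in sublinear time per query, instead of
    # the linear scan over all stored instances performed by naive KNN.
    rng = np.random.default_rng(0)
    memory = rng.random((10000, 20))      # 10,000 stored instances, 20 features
    tree = BallTree(memory)               # built once over the training memory
    queries = rng.random((5, 20))         # 5 test instances to classify
    dist, idx = tree.query(queries, k=3)  # 3 nearest neighbors per query
    print(idx)                            # indices into the stored memory

Note that tree indexes of this kind assume numeric features and a metric distance; the symbolic features used in memory-based NER would first need to be vectorized.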


Corresponding author: Kamal Sarkar, Department of Computer Science and Engineering, Jadavpur University, Kolkata 700032, India.

Bibliography

[1] B. Babych and A. Hartley, Improving machine translation quality with automatic named entity recognition, in: Proceedings of the 7th International EAMT Workshop on MT and Other Language Technology Tools, Improving MT through Other Language Technology Tools: Resources and Tools for Building MT, Association for Computational Linguistics, pp. 1–8, 2003. doi:10.3115/1609822.1609823

[2] D. M. Bikel, R. Schwartz and R. M. Weischedel, An algorithm that learns what’s in a name, Mach. Learn. 34 (1999), 211–231. doi:10.1023/A:1007558221122

[3] T. Brants, TnT – a statistical part-of-speech tagger, in: Proceedings of the 6th Applied NLP Conference, pp. 224–231, 2000.

[4] I. Budi and S. Bressan, Association rules mining for name entity recognition, in: Proceedings of the 4th International Conference on Web Information Systems Engineering (WISE’03), 2003.

[5] P. Cunningham and S. J. Delany, k-Nearest neighbour classifiers, Technical Report UCD-CSI-2007-4, March 27, 2007.

[6] S. Cucerzan and D. Yarowsky, Language independent named entity recognition combining morphological and contextual evidence, in: Proceedings of the 1999 Joint SIGDAT Conference on EMNLP and VLC, pp. 90–99, 1999.

[7] W. Daelemans, J. Zavrel, K. Van der Sloot and A. Van den Bosch, TiMBL: Tilburg Memory-Based Learner, Reference Guide, ILK Technical Report ILK 10-01 (2010), 1–60.

[8] A. Ekbal and S. Bandyopadhyay, Named entity recognition using support vector machine: a language independent approach, Int. J. Elect. Electron. Eng. 4 (2010), 155–170.

[9] A. Ekbal and S. Bandyopadhyay, A conditional random field approach for named entity recognition in Bengali and Hindi, Linguistic Issues in Language Technology 2 (2009), 1–44. doi:10.33011/lilt.v2i.1203

[10] E. Fix and J. L. Hodges, Jr., Discriminatory analysis, nonparametric discrimination: consistency properties, USAF School of Aviation Medicine, Randolph Field, TX, Project No. 21-49-004, Rep. No. 4, Contract No. AF41(128)-31, 1951.

[11] V. Gayen and K. Sarkar, An HMM based named entity recognition system for Indian languages: the JU system at ICON 2013, arXiv preprint arXiv:1405.7397, 2014.

[12] L. Kovács, Classification method for learning morpheme analysis, Journal of Information Technology Research (JITR) 5 (2012), 85–98. doi:10.4018/jitr.2012100106

[13] N. Kumar and P. Bhattacharyya, Named entity recognition in Hindi using MEMM, Technical Report, IIT Bombay, India, 2006.

[14] W. Li and A. McCallum, Rapid development of Hindi named entity recognition using conditional random fields and feature induction, ACM Transactions on Asian Language Information Processing (TALIP) 2 (2003), 290–294. doi:10.1145/979872.979879

[15] D. Mendes, I. P. Rodrigues, C. Rodriguez-Solano and C. F. Baeta, Enrichment/population of customized CPR (computer-based patient record) ontology from free-text reports for CSI (computer semantic interoperability), Journal of Information Technology Research (JITR) 7 (2014), 1–11. doi:10.4018/jitr.2014010101

[16] D. I. Moldovan, S. M. Harabagiu, R. Girju, P. Morarescu, V. F. Lacatusu, A. Novischi and O. Bolohan, LCC tools for question answering, in: Proceedings of TREC, Maryland, November 2002, pp. 1–10. doi:10.3115/1072228.1072395

[17] A. Nayan, B. R. K. Rao, P. Singh, S. Sanyal and R. Sanyal, Named entity recognition for Indian languages, IJCNLP (2008), 97–104.

[18] S. K. Saha, S. Chatterji, S. Dandapat, S. Sarkar and P. Mitra, A hybrid approach for named entity recognition in Indian languages, NER for South and South East Asian Languages 17 (2008), 17–24.

[19] S. Saha and A. Ekbal, Combining multiple classifiers using vote based classifier ensemble technique for named entity recognition, Data Knowl. Eng. 85 (2013), 15–39. doi:10.1016/j.datak.2012.06.003

[20] E. T. K. Sang, Memory based named entity recognition, Proceedings of the 6th Conference on Natural Language Learning 20 (2002), 1–4.

[21] K. Sarkar and V. Gayen, A practical part-of-speech tagger for Bengali, in: Proceedings of the Third International Conference on Emerging Applications of Information Technology (EAIT), IEEE, pp. 36–40, 2012. doi:10.1109/EAIT.2012.6407856

[22] K. Sarkar and A. R. Ghosh, A memory based POS tagger for Bengali, in: Proceedings of the 1st Indian Workshop on Machine Learning, IIT Kanpur, India, 2013.

[23] S. Sekine, NYU: description of the Japanese NE system used for MET-2, in: Proceedings of the Message Understanding Conference, Fairfax, Virginia, May 1998.

[24] K. Takeuchi and N. Collier, Use of support vector machine in extended named entity recognition, in: Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), pp. 119–125, 2002. doi:10.3115/1118853.1118882

[25] H. Yamada, T. Kudo and Y. Matsumoto, Japanese named entity extraction using support vector machine, Transactions of IPSJ 43 (2001), 44–53.

[26] Z. Yao and W. L. Ruzzo, A regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data, BMC Bioinformatics 7 (2006), S11. doi:10.1186/1471-2105-7-S1-S11

[27] J. Zavrel and W. Daelemans, Recent advances in memory-based part-of-speech tagging, in: VI Simposio Internacional de Comunicacion Social, pp. 590–597, 1999.

[28] B. Zhang and S. N. Srihari, Fast k-nearest neighbor classification using cluster-based trees, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2004), 525–528. doi:10.1109/TPAMI.2004.1265868

Received: 2015-2-14
Published Online: 2016-3-24
Published in Print: 2017-4-1

©2017 Walter de Gruyter GmbH, Berlin/Boston

This article is distributed under the terms of the Creative Commons Attribution Non-Commercial License, which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
