1 Introduction

Named Entity Identification (NEI) is the task of deciding whether a term is a Named Entity (NE), i.e., the name of a person, location, or organization. Machine Translation (MT) is a long-standing research area, and although much research has been done on MT for foreign languages, many challenges remain unresolved for Indian languages. This paper addresses the issue of improper term translation or transliteration, i.e., deciding whether a term needs to be translated or transliterated. Most previous MT systems do not address this issue and consequently suffer from poor-quality translations.

The proposed NEI translation disambiguation model is evaluated in a Cross-Lingual Information Retrieval (CLIR) setting. Dictionary-based and parallel/comparable corpus-based approaches are the traditional CLIR approaches; in this paper, a recently proposed parallel/comparable corpus-based Term Frequency Model (TFM) is used for evaluation [10]. Our contributions for the Hindi language are: (i) collecting and preparing NE-annotated data and gazetteer lists, and developing an NEI model augmented with linguistic patterns; (ii) analyzing and evaluating the NEI model with TFM, asking whether an NEI translation disambiguation model can resolve the improper term translation or transliteration issue. The rest of the paper is organized as follows: Sect. 2 reviews related work, Sect. 3 describes the proposed approach, Sect. 4 presents experimental results and discussion, and Sect. 5 concludes with future work.

2 Literature Review

NEI techniques are broadly categorized into (i) Rule-Based (RB) approaches and (ii) Machine Learning (ML) approaches. RB approaches use sets of rules based on grammar, gazetteer lists, and lists of trigger words. Writing such rules requires substantial grammatical knowledge of and experience with a particular language, which is the main deficiency of RB approaches. The phonetic matching technique is based on the similar-sounding property [1, 6]. A Maximum Entropy model (MaxEnt) combined with language-specific rules and a gazetteer list has been used to identify NEs [2]. ML approaches need large amounts of NE-annotated data, which is often unavailable and very cumbersome to construct manually; to alleviate this, Wikipedia's links have been transformed into NE annotations [3]. Conditional Random Fields (CRF) and Support Vector Machines (SVM) are ML approaches, with CRF shown to be superior to SVM [4]. Wikipedia inter-wiki links between English and other languages have been used in a language-independent way to identify NEs [5]. A comparison of RB and ML approaches showed that CRF outperforms both the RB and the MaxEnt ML approach [7].

Direct translation (dictionary-based, corpora-based, and MT) and indirect translation (Cross-Lingual Latent Semantic Indexing (CL-LSI), Cross-Lingual Latent Dirichlet Allocation (CL-LDA), and Cross-Lingual Explicit Semantic Analysis (CL-ESA)) are the main Cross-Lingual Information Retrieval (CLIR) approaches [8]. In dictionary-based translation, a transliteration mining algorithm has been used to handle Out-Of-Vocabulary (OOV) words [9]. The Term Frequency Model (TFM) combines a set of comparable sentences with cosine similarity [10]. The dual-semantic-space translation models CL-LSI and CL-LDA are effective but not efficient [11]. A Statistical Machine Translation (SMT) system has been trained on aligned comparable sentences [12]. Transliteration generation or mining techniques are used to handle OOV words [13], and a CRF model has been used to generate transliterations of OOV words [14, 15].

3 Proposed Approach

User queries contain three types of terms: stop words, terms that need translation, and terms that need transliteration. The proposed approach is shown in Fig. 1. Stop words are removed in the preprocessing step, and the remaining terms are passed to the NEI module and the TFM module.
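The preprocessing step described above can be sketched as follows. The stop-word list here is a tiny romanized placeholder for illustration only; the actual system would use a full Hindi stop-word list.

```python
# Sketch of the preprocessing step: strip stop words from a source query
# before the remaining terms go to the NEI and TFM modules.
# HINDI_STOP_WORDS is a hypothetical, abbreviated list (romanized).
HINDI_STOP_WORDS = {"ka", "ki", "ke", "mein", "aur", "se"}

def preprocess(query_terms):
    """Remove stop words; the surviving terms are passed on to NEI and TFM."""
    return [t for t in query_terms if t.lower() not in HINDI_STOP_WORDS]

terms = preprocess(["Taj", "Mahal", "ke", "photo"])
# terms == ["Taj", "Mahal", "photo"]
```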

Fig. 1. NEI translation disambiguation based proposed approach

Table 1. Web sources of named entities

3.1 Named Entity Identification (NEI)

The CRF algorithm outperforms other ML algorithms [4, 7], so the CRF-based Stanford Named Entity Recognizer (SNER) is used to train the NEI system. SNER requires a large amount of NE-annotated training data, which is not available for the Hindi language, so an NE-annotated dataset and gazetteer lists need to be prepared to train SNER.

An available NE-tagged dataset contains around 17,000 sentences. This dataset is parsed with the Shallow Parser developed by IIIT Hyderabad to obtain Part-Of-Speech (POS) tags. The NE tags and POS tags are then merged, and an annotated dataset is prepared for training the SNER system. Since no standard gazetteer list for NEI is available, various Indian named entity terms are collected from the Web to prepare one; the named entity terms and their sources are listed in Table 1. A test word is classified into one of four categories: Person Name (NEP), Location (NEL), Organization (NEO), and non-NE (NOP). Various stop-word phrases were analyzed, and six were identified as patterns of the form Word1 Stop-word Word2: if either word in such a pattern is an NE, the other word receives the same NE tag. The proposed patterns are presented in Table 2.
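The tag-propagation rule over the Word1 Stop-word Word2 patterns can be sketched as below. The stop-word set is a small romanized placeholder, not the six actual phrases of Table 2, and the tag names follow the four categories defined above.

```python
# Sketch of the pattern rule: in a Word1 Stop-word Word2 window, if one
# content word carries an NE tag and the other does not, propagate the tag.
STOP_WORDS = {"aur", "ke", "ki"}  # romanized placeholders for Table 2's phrases

def apply_patterns(tagged):
    """tagged: list of (word, tag) pairs with tag in {NEP, NEL, NEO, NOP}."""
    tagged = list(tagged)
    for i in range(len(tagged) - 2):
        (w1, t1), (sw, _), (w2, t2) = tagged[i], tagged[i + 1], tagged[i + 2]
        if sw in STOP_WORDS:
            if t1 != "NOP" and t2 == "NOP":
                tagged[i + 2] = (w2, t1)      # copy NE tag rightwards
            elif t2 != "NOP" and t1 == "NOP":
                tagged[i] = (w1, t2)          # copy NE tag leftwards
    return tagged

out = apply_patterns([("Dilli", "NEL"), ("aur", "NOP"), ("Mumbai", "NOP")])
# out[2] == ("Mumbai", "NEL")
```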

Table 2. Stop-word phrases

3.2 Term Frequency Model (TFM)

The TFM module is summarized in Fig. 2. A term frequency matrix is constructed from a set of comparable sentences, which are selected using the source language query terms. The Cosine Similarity Score (CSS) is then used to select the top-n target language translations. The CSS between two term vectors \(A=(a_1,a_2,\ldots,a_N)\) and \(B=(b_1,b_2,\ldots,b_N)\) is computed as:

$$\begin{aligned} CSS=\frac{\sum _{i=1}^{N}a_ib_i}{\sqrt{\sum _{i=1}^{N}a_{i}^{2}}\sqrt{\sum _{i=1}^{N}b_{i}^{2}}} \end{aligned}$$
(1)
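The TFM step can be sketched as follows: term frequency vectors are built over a small set of comparable sentences (one dimension per sentence), and Eq. (1) ranks candidate target-language terms by similarity with the source term's vector. The toy sentences below are placeholders, not the actual comparable corpus.

```python
from math import sqrt

def tf_vector(term, sentences):
    """Frequency of `term` in each comparable sentence (one dimension each)."""
    return [s.count(term) for s in sentences]

def css(a, b):
    """Cosine Similarity Score of Eq. (1)."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_n(source_vec, candidates, target_sentences, n=5):
    """Rank candidate target-language terms by CSS with the source vector."""
    scored = [(c, css(source_vec, tf_vector(c, target_sentences)))
              for c in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:n]

# Toy aligned comparable sentences (tokenized), purely illustrative.
src_sents = [["nadi", "ganga"], ["ganga", "jal"]]
tgt_sents = [["river", "ganges"], ["ganges", "water"]]
vec = tf_vector("ganga", src_sents)   # [1, 1]
best = top_n(vec, ["river", "ganges", "water"], tgt_sents, n=2)
```

In this toy example "ganges" co-occurs with "ganga" in both sentence pairs, so it receives the highest CSS and ranks first.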
Fig. 2. NEI translation disambiguation based proposed approach

Algorithm 1. Disambiguation algorithm

3.3 Disambiguation

The disambiguation module takes the NE tag from the NEI module and the top-n translations from the TFM module. A named entity word's transliteration also appears among the top-n translations if it is available in the comparable corpus, but such a transliteration has a very low translation CSS. A disambiguation algorithm is therefore proposed (Algorithm 1) to select the proper translation or transliteration. The Longest Common Subsequence (LCS) score between two strings \(S_1\) and \(S_2\) is computed by Eq. 2.

$$\begin{aligned} LCS(S_1,S_2)=\frac{Longest\_common\_subsequence\_length(S_1,S_2)}{Maximum(length(S_1),length(S_2))} \end{aligned}$$
(2)
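Eq. (2) and a plausible reading of the selection step in Algorithm 1 can be sketched as below. The algorithm figure itself is not reproduced in the text, so the rule here is an assumption consistent with the description: for an NE term, the candidate closest to the term's romanized form (highest LCS score) is preferred as its transliteration; otherwise the top-ranked TFM translation is kept. The `romanized` input is also assumed.

```python
def lcs_len(s1, s2):
    """Length of the longest common subsequence of s1 and s2 (standard DP)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if s1[i] == s2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def lcs_score(s1, s2):
    """Normalized LCS score of Eq. (2)."""
    return lcs_len(s1, s2) / max(len(s1), len(s2))

def disambiguate(tag, romanized, top_translations):
    """Assumed selection rule: NE -> transliteration-like candidate,
    non-NE -> best CSS-ranked translation."""
    if tag in {"NEP", "NEL", "NEO"}:
        return max(top_translations, key=lambda c: lcs_score(romanized, c))
    return top_translations[0]

print(disambiguate("NEL", "dilli", ["capital", "delhi", "city"]))
# prints "delhi" (highest LCS score with "dilli": 3/5)
```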

4 Experiment Results and Discussions

The proposed approach is evaluated on the FIRE 2010 and 2011 datasets, each of which contains a topic set of 50 Hindi language queries and a set of target English language documents. Each query in the topic set includes \(\left\langle title \right\rangle \), \(\left\langle desc \right\rangle \), and \(\left\langle narr \right\rangle \) fields; we experiment with only the \(\left\langle title \right\rangle \) field. A preprocessed source language query is passed through the NEI module and the TFM module separately. The outcome of the NEI module (an NE-tagged query) and the outcome of the TFM module (the top-5 translations) are passed to the disambiguation module, whose output is the target language query. The Vector Space Model (VSM) is used to retrieve query-relevant target language documents. The NEI disambiguation technique with the CLIR system is evaluated using Recall and Mean Average Precision (MAP). Recall is the fraction of relevant documents that are retrieved; precision is the fraction of retrieved documents that are relevant to the query; and MAP for a set of queries is the mean of the average precision scores of the individual queries. The experimental results are presented in Table 3.
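The Recall and MAP measures used above can be sketched as follows; this is a minimal illustration of the definitions, not the evaluation tooling actually used with the FIRE collections.

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d3", "d2"], {"d1", "d2"})
# relevant hits at ranks 1 and 3 -> (1/1 + 2/3) / 2 ≈ 0.833
```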

The inclusion of the NEI disambiguation module degrades the performance of the CLIR system: in many instances the translated form of a named entity is more popular in the target collection than its transliteration, so the proposed approach achieves a lower MAP than TFM alone on both FIRE 2010 and FIRE 2011. The significant differences between the popularity of a term's translation and its transliteration are presented in Table 4. NEI alone is thus not sufficient to select the proper translation or transliteration, because a term's popularity determines whether it needs translation or transliteration.

Table 3. Comparative result analysis
Table 4. Effectiveness of NEI technique

5 Conclusion and Future Work

The NEI technique was analyzed as a way to resolve the improper translation or transliteration issue. Indian languages suffer from a lack of NE-annotated data and gazetteer lists. The NE-annotated data was prepared with the help of IIIT Hyderabad's NE corpus and shallow parser, and gazetteer lists were prepared from different web sources. Stanford NER was trained on the NE-annotated data and gazetteer lists, and the proposed linguistic patterns were used to improve the NEI system. The TFM module selects the top-n translations for a query word, and the disambiguation module selects the proper translation or transliteration based on the outcomes of the NEI and TFM modules. The proposed approach achieves a lower MAP than TFM alone: NEI by itself is not sufficient to select the proper translation or transliteration, because a term's popularity decides this more effectively. In future work, a term's popularity will be used to identify whether a term needs to be translated or transliterated.