1 Introduction

Named Entity Identification (NEI) is the task of deciding whether a term is a Named Entity (NE), i.e., the name of a person, location, or organization. Machine Translation (MT) is a long-standing research area, and although much research has been done on MT for foreign languages, many challenges remain unresolved for Indian languages. This paper addresses the issue of improper term translation or transliteration, i.e., deciding whether a term needs to be translated or transliterated. Most previous MT systems do not address this issue and consequently suffer from poor-quality translations.

The proposed NEI translation disambiguation model is evaluated in a Cross-Lingual Information Retrieval (CLIR) setting. Dictionary-based and parallel/comparable corpus-based approaches are the traditional CLIR approaches; in this paper, a recently proposed parallel/comparable corpus-based Term Frequency Model (TFM) is used for evaluation [10]. Our contributions for the Hindi language are: (i) collecting and preparing NE-annotated data and gazetteer lists, and developing an NEI model augmented with linguistic patterns; (ii) analyzing and evaluating the NEI model with TFM, asking whether an NEI translation disambiguation model can resolve the improper term translation or transliteration issue. The rest of the paper is organized as follows: Sect. 2 reviews related work, Sect. 3 describes the proposed approach, Sect. 4 presents experimental results and discussion, and Sect. 5 concludes with future work.

2 Literature Review

NEI techniques are broadly categorized into (i) Rule-Based (RB) approaches and (ii) Machine Learning (ML) approaches. RB approaches use sets of rules based on grammar, gazetteer lists, and lists of trigger words. Writing such rules requires substantial grammatical knowledge of and experience with a particular language, which is the main deficiency of RB approaches. The phonetic matching technique is based on the similar-sounding property [1, 6]. A Maximum Entropy model (MaxEnt) combined with language-specific rules and a gazetteer list has been used to identify NEs [2]. ML approaches need large amounts of NE-annotated data, which is often unavailable and very cumbersome to construct manually; to alleviate this, Wikipedia's links have been transformed into NE annotations [3]. Conditional Random Fields (CRF) and Support Vector Machines (SVM) are ML approaches, with CRF shown to be superior to SVM [4]. Wikipedia inter-wiki links between English and other languages have been used in a language-independent way to identify NEs [5]. A comparison of RB and ML approaches showed that CRF outperforms both the RB and the MaxEnt ML approach [7].

Direct translation (dictionary-based, corpora-based, and MT) and indirect translation (Cross-Lingual Latent Semantic Indexing (CL-LSI), Cross-Lingual Latent Dirichlet Allocation (CL-LDA), and Cross-Lingual Explicit Semantic Analysis (CL-ESA)) are the main Cross-Lingual Information Retrieval (CLIR) approaches [8]. In dictionary-based translation, a transliteration mining algorithm has been used to handle Out-Of-Vocabulary (OOV) words [9]. The Term Frequency Model (TFM) combines a set of comparable sentences with cosine similarity [10]. The dual-semantic-space translation models CL-LSI and CL-LDA are effective but not efficient [11]. A Statistical Machine Translation (SMT) system has been trained on aligned comparable sentences [12]. Transliteration generation or mining techniques are used to handle OOV words [13], and a CRF model has been used to generate transliterations of OOV words [14, 15].

3 Proposed Approach

User queries contain three types of terms: stop words, terms that need translation, and terms that need transliteration. The proposed approach is shown in Fig. 1. Stop words are removed in the preprocessing step, and the remaining terms are passed to the NEI module and the TFM module.
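The preprocessing step described above can be sketched as follows. The stop-word list here is a tiny romanized placeholder for illustration only; the actual system would use a full Hindi stop-word list.

```python
# Sketch of the preprocessing step: strip stop words from a source query
# before the remaining terms go to the NEI and TFM modules.
# HINDI_STOP_WORDS is a hypothetical, abbreviated list (romanized).
HINDI_STOP_WORDS = {"ka", "ki", "ke", "mein", "aur", "se"}

def preprocess(query_terms):
    """Remove stop words; the surviving terms are passed on to NEI and TFM."""
    return [t for t in query_terms if t.lower() not in HINDI_STOP_WORDS]

terms = preprocess(["Taj", "Mahal", "ke", "photo"])
# terms == ["Taj", "Mahal", "photo"]
```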

Fig. 1. NEI translation disambiguation based proposed approach

Table 1. Web sources of named entities

3.1 Named Entity Identification (NEI)

The CRF algorithm outperforms other ML algorithms [4, 7], so the CRF-based Stanford Named Entity Recognizer (SNER) is used to train the NEI system. SNER requires a large amount of NE-annotated training data, which is not available for the Hindi language, so an NE-annotated dataset and gazetteer lists need to be prepared to train SNER.

An available NE-tagged dataset contains around 17,000 sentences. This dataset is parsed with the Shallow Parser developed by IIIT Hyderabad to obtain Part-Of-Speech (POS) tags. The NE tags and POS tags are then merged, and an annotated dataset is prepared for training the SNER system. Since no standard gazetteer list for NEI is available, various Indian named entity terms are collected from the Web to prepare one; the named entity terms and their sources are listed in Table 1. A test word is classified into one of four categories: Person Name (NEP), Location (NEL), Organization (NEO), and non-NE (NOP). Various stop-word phrases were analyzed, and six were identified as patterns of the form Word1 Stop-word Word2: if either word in such a pattern is an NE, the other word receives the same NE tag. The proposed patterns are presented in Table 2.
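The tag-propagation rule over the Word1 Stop-word Word2 patterns can be sketched as below. The stop-word set is a small romanized placeholder, not the six actual phrases of Table 2, and the tag names follow the four categories defined above.

```python
# Sketch of the pattern rule: in a Word1 Stop-word Word2 window, if one
# content word carries an NE tag and the other does not, propagate the tag.
STOP_WORDS = {"aur", "ke", "ki"}  # romanized placeholders for Table 2's phrases

def apply_patterns(tagged):
    """tagged: list of (word, tag) pairs with tag in {NEP, NEL, NEO, NOP}."""
    tagged = list(tagged)
    for i in range(len(tagged) - 2):
        (w1, t1), (sw, _), (w2, t2) = tagged[i], tagged[i + 1], tagged[i + 2]
        if sw in STOP_WORDS:
            if t1 != "NOP" and t2 == "NOP":
                tagged[i + 2] = (w2, t1)      # copy NE tag rightwards
            elif t2 != "NOP" and t1 == "NOP":
                tagged[i] = (w1, t2)          # copy NE tag leftwards
    return tagged

out = apply_patterns([("Dilli", "NEL"), ("aur", "NOP"), ("Mumbai", "NOP")])
# out[2] == ("Mumbai", "NEL")
```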

Table 2. Stop-word phrases

3.2 Term Frequency Model (TFM)

The TFM module is summarized in Fig. 2. A term frequency matrix is constructed from a set of comparable sentences, which are selected using the source language query terms. The Cosine Similarity Score (CSS) is then used to select the top-n target language translations. The CSS between two term vectors \(A=(a_1,a_2,\ldots,a_N)\) and \(B=(b_1,b_2,\ldots,b_N)\) is computed as:

$$\begin{aligned} CSS=\frac{\sum _{i=1}^{N}a_ib_i}{\sqrt{\sum _{i=1}^{N}a_{i}^{2}}\sqrt{\sum _{i=1}^{N}b_{i}^{2}}} \end{aligned}$$
(1)
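The TFM step can be sketched as follows: term frequency vectors are built over a small set of comparable sentences (one dimension per sentence), and Eq. (1) ranks candidate target-language terms by similarity with the source term's vector. The toy sentences below are placeholders, not the actual comparable corpus.

```python
from math import sqrt

def tf_vector(term, sentences):
    """Frequency of `term` in each comparable sentence (one dimension each)."""
    return [s.count(term) for s in sentences]

def css(a, b):
    """Cosine Similarity Score of Eq. (1)."""
    num = sum(x * y for x, y in zip(a, b))
    den = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def top_n(source_vec, candidates, target_sentences, n=5):
    """Rank candidate target-language terms by CSS with the source vector."""
    scored = [(c, css(source_vec, tf_vector(c, target_sentences)))
              for c in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)[:n]

# Toy aligned comparable sentences (tokenized), purely illustrative.
src_sents = [["nadi", "ganga"], ["ganga", "jal"]]
tgt_sents = [["river", "ganges"], ["ganges", "water"]]
vec = tf_vector("ganga", src_sents)   # [1, 1]
best = top_n(vec, ["river", "ganges", "water"], tgt_sents, n=2)
```

In this toy example "ganges" co-occurs with "ganga" in both sentence pairs, so it receives the highest CSS and ranks first.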
Fig. 2. NEI translation disambiguation based proposed approach

Algorithm 1. Disambiguation algorithm

3.3 Disambiguation

The disambiguation module takes the NE tag from the NEI module and the top-n translations from the TFM module. A named entity word's transliteration also appears among the top-n translations if it is available in the comparable corpus, but such a transliteration has a very low translation CSS. A disambiguation algorithm is therefore proposed (Algorithm 1) to select the proper translation or transliteration. The Longest Common Subsequence (LCS) score between two strings \(S_1\) and \(S_2\) is computed by Eq. 2.

$$\begin{aligned} LCS(S_1,S_2)=\frac{Longest\_common\_subsequence\_length(S_1,S_2)}{Maximum(length(S_1),length(S_2))} \end{aligned}$$
(2)
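Eq. (2) and a plausible reading of the selection step in Algorithm 1 can be sketched as below. The algorithm figure itself is not reproduced in the text, so the rule here is an assumption consistent with the description: for an NE term, the candidate closest to the term's romanized form (highest LCS score) is preferred as its transliteration; otherwise the top-ranked TFM translation is kept. The `romanized` input is also assumed.

```python
def lcs_len(s1, s2):
    """Length of the longest common subsequence of s1 and s2 (standard DP)."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if s1[i] == s2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def lcs_score(s1, s2):
    """Normalized LCS score of Eq. (2)."""
    return lcs_len(s1, s2) / max(len(s1), len(s2))

def disambiguate(tag, romanized, top_translations):
    """Assumed selection rule: NE -> transliteration-like candidate,
    non-NE -> best CSS-ranked translation."""
    if tag in {"NEP", "NEL", "NEO"}:
        return max(top_translations, key=lambda c: lcs_score(romanized, c))
    return top_translations[0]

print(disambiguate("NEL", "dilli", ["capital", "delhi", "city"]))
# prints "delhi" (highest LCS score with "dilli": 3/5)
```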

4 Experiment Results and Discussions

The proposed approach is evaluated on the FIRE 2010 and 2011 datasets, each of which contains a topic set of 50 Hindi language queries and a set of target English language documents. Each query in the topic set includes \(\left\langle title \right\rangle \), \(\left\langle desc \right\rangle \), and \(\left\langle narr \right\rangle \) fields; we experiment with only the \(\left\langle title \right\rangle \) field. A preprocessed source language query is passed through the NEI module and the TFM module separately. The outcome of the NEI module (an NE-tagged query) and the outcome of the TFM module (the top-5 translations) are passed to the disambiguation module, whose output is the target language query. The Vector Space Model (VSM) is used to retrieve query-relevant target language documents. The NEI disambiguation technique with the CLIR system is evaluated using Recall and Mean Average Precision (MAP). Recall is the fraction of relevant documents that are retrieved; precision is the fraction of retrieved documents that are relevant to the query; and MAP for a set of queries is the mean of the average precision scores of the individual queries. The experimental results are presented in Table 3.
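The Recall and MAP measures used above can be sketched as follows; this is a minimal illustration of the definitions, not the evaluation tooling actually used with the FIRE collections.

```python
def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of precision values at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranked_docs, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

ap = average_precision(["d1", "d3", "d2"], {"d1", "d2"})
# relevant hits at ranks 1 and 3 -> (1/1 + 2/3) / 2 ≈ 0.833
```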

The inclusion of the NEI disambiguation module degrades the performance of the CLIR system: in many instances the translated form of a named entity is more popular in the target collection than its transliteration, so the proposed approach achieves a lower MAP than TFM alone on both FIRE 2010 and FIRE 2011. The significant differences between the popularity of a term's translation and its transliteration are presented in Table 4. NEI alone is thus not sufficient to select the proper translation or transliteration, because a term's popularity determines whether it needs translation or transliteration.

Table 3. Comparative result analysis
Table 4. Effectiveness of NEI technique

5 Conclusion and Future Work

The NEI technique was analyzed as a way to resolve the improper translation or transliteration issue. Indian languages suffer from a lack of NE-annotated data and gazetteer lists. The NE-annotated data was prepared with the help of IIIT Hyderabad's NE corpus and shallow parser, and gazetteer lists were prepared from different web sources. Stanford NER was trained on the NE-annotated data and gazetteer lists, and the proposed linguistic patterns were used to improve the NEI system. The TFM module selects the top-n translations for a query word, and the disambiguation module selects the proper translation or transliteration based on the outcomes of the NEI and TFM modules. The proposed approach achieves a lower MAP than TFM alone: NEI by itself is not sufficient to select the proper translation or transliteration, because a term's popularity decides this more effectively. In future work, a term's popularity will be used to identify whether a term needs to be translated or transliterated.