Abstract
Machine Translation (MT) systems for Indian languages are still maturing; they handle a word or phrase with either a translation or a transliteration mechanism. Identifying which of the two mechanisms a word needs is still a challenge. Because Named Entity (NE) terms are pronounced similarly across languages, Named Entity Identification (NEI) can help disambiguate a word in favor of either translation or transliteration. The Term Frequency Model (TFM), a Cross-Lingual Information Retrieval (CLIR) model, is used to evaluate the NEI-based translation disambiguation model.
1 Introduction
Named Entity Identification (NEI) is the task of identifying whether a term is a Named Entity (NE), i.e., the name of a person, location, or organization. Machine Translation (MT) is a long-standing research area, and although much work has been done on MT for foreign languages, many challenges remain unresolved for Indian languages. This paper addresses the issue of improper term translation or transliteration, i.e., deciding whether a term needs to be translated or transliterated. Most previous MT systems do not address this issue and suffer from poor-quality translations.
The proposed NEI translation disambiguation model is evaluated with Cross-Lingual Information Retrieval (CLIR). Dictionary-based and parallel/comparable corpus-based approaches are the traditional CLIR approaches. In this paper, a recently proposed parallel/comparable corpus-based Term Frequency Model (TFM) is used for evaluation [10]. Our contributions for the Hindi language are: (i) collecting and preparing NE-annotated data and a gazetteer list, and developing an NEI model with some linguistic patterns; (ii) analyzing and evaluating the NEI model with TFM to answer the question: is an NEI translation disambiguation model suitable for resolving the improper term translation or transliteration issue? The paper is structured as follows. Sect. 2 presents the literature review. The proposed approach is discussed in Sect. 3. Experimental results and discussion are presented in Sect. 4. Conclusions and future work are discussed in Sect. 5.
2 Literature Review
NEI techniques are broadly categorized into (i) Rule-Based (RB) approaches and (ii) Machine Learning (ML) approaches. RB approaches rely on a set of rules based on grammar, gazetteer lists, and lists of trigger words. Writing such rules requires extensive grammatical knowledge of and experience with a particular language, which is the main deficiency of RB approaches. The phonetic matching technique exploits the similar-sounding property of NEs [1, 6]. The Maximum Entropy Model (Max-Ent) has been combined with language-specific rules and gazetteer lists to identify NEs [2]. ML approaches need a large amount of NE-annotated data, which is not available and is cumbersome to construct manually. Wikipedia's links have been transformed into NE annotations [3]. Conditional Random Fields (CRF) and Support Vector Machines (SVM) are ML approaches, and CRF has been reported to be superior to SVM [4]. Wikipedia inter-wiki links between English and other languages have been used to identify NEs in a language-independent way [5]. A comparison of RB and ML approaches showed that CRF is better than the RB and Max-Ent approaches [7].
CLIR approaches are divided into direct translation, i.e., dictionary-based, corpus-based, and MT-based approaches, and indirect translation, i.e., Cross-Lingual Latent Semantic Indexing (CL-LSI), Cross-Lingual Latent Dirichlet Allocation (CL-LDA), and Cross-Lingual Explicit Semantic Analysis (CL-ESA) [8]. A dictionary is used for translation, and a transliteration mining algorithm handles Out Of Vocabulary (OOV) words [9]. The Term Frequency Model (TFM) is built on a set of comparable sentences and cosine similarity [10]. The dual semantic space based translation models CL-LSI and CL-LDA are effective but not efficient [11]. A Statistical Machine Translation (SMT) system has been trained on aligned comparable sentences [12]. Transliteration generation or mining techniques are used to handle OOV words [13]. The CRF model has been used to generate transliterations of OOV words [14, 15].
3 Proposed Approach
User queries contain three types of terms: stop words, terms that need translation, and terms that need transliteration. The proposed approach is shown in Fig. 1. Stop words are removed in the preprocessing step, and the remaining terms are passed to the NEI module and the TFM module.
3.1 Named Entity Identification (NEI)
The CRF algorithm has been reported to outperform other ML algorithms [4, 7]. The CRF-based Stanford Named Entity Recognizer (SNER) is therefore used to train the NEI system. SNER needs a large amount of NE-annotated training data, which is not available for the Hindi language, so an NE-annotated dataset and gazetteer lists need to be prepared to train SNER.
An available NE-tagged dataset contains around 17,000 sentences. This dataset is parsed by the Shallow parser developed by IIIT Hyderabad to obtain Part Of Speech (POS) tags. The NE tags and POS tags are then merged to prepare an annotated dataset for training the SNER system. No standard gazetteer list for NEI is available, so various Indian named entity terms are collected from the Web to prepare one. The named entity terms and their sources are listed in Table 1. Each test word is classified into one of four categories, i.e., Person Name (NEP), Location (NEL), Organization (NEO), and non-NE terms (NOP). Various stop-word phrases are analyzed, and six phrases are identified as patterns. These patterns have the form Word1 Stop-word Word2: if either word in an identified pattern is an NE, then the other word is also an NE with the same NE tag. The proposed patterns are presented in Table 2.
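The Word1 Stop-word Word2 rule can be illustrated with a minimal Python sketch. It is not the authors' implementation: the connector words in `PATTERN_STOP_WORDS` are hypothetical placeholders (the six patterns of Table 2 are not reproduced here), and the function name `propagate_ne_tags` is introduced only for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical connector stop words (assumption); the paper's six
# patterns from Table 2 are not listed in this text.
PATTERN_STOP_WORDS: Set[str] = {"और", "व", "एवं"}


def propagate_ne_tags(
    tagged: List[Tuple[str, str]],
    connectors: Set[str] = PATTERN_STOP_WORDS,
    non_ne: str = "NOP",
) -> List[Tuple[str, str]]:
    """Copy an NE tag across a 'Word1 Stop-word Word2' pattern.

    `tagged` is a list of (word, tag) pairs with tags NEP, NEL, NEO or NOP.
    If exactly one neighbour of a connector carries an NE tag, the other
    neighbour receives the same tag.
    """
    tags = [tag for _, tag in tagged]
    for i in range(1, len(tagged) - 1):
        if tagged[i][0] not in connectors:
            continue
        left, right = tags[i - 1], tags[i + 1]
        if left != non_ne and right == non_ne:
            tags[i + 1] = left
        elif right != non_ne and left == non_ne:
            tags[i - 1] = right
    return [(word, tag) for (word, _), tag in zip(tagged, tags)]


if __name__ == "__main__":
    sentence = [("राम", "NEP"), ("और", "NOP"), ("श्याम", "NOP")]
    print(propagate_ne_tags(sentence))
```

When both neighbours already carry NE tags, the sketch leaves them unchanged; how the original system resolves conflicting tags is not stated in the text.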
3.2 Term Frequency Model (TFM)
An overview of the TFM module is presented in Fig. 2. A term frequency matrix is constructed from a set of comparable sentences, which are selected based on the source language query terms. The Cosine Similarity Score (CSS) is used to select the top-n target language translations. The CSS between two term vectors \(A=\{a_1,a_2,\ldots,a_N\}\) and \(B=\{b_1,b_2,\ldots,b_N\}\) is computed as

\[ CSS(A,B)=\frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^{2}}\,\sqrt{\sum_{i=1}^{N} b_i^{2}}} \qquad (1) \]
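As a rough illustration of the CSS-based ranking, the following Python sketch computes the standard cosine similarity between two term vectors and keeps the n best-scoring target terms. It assumes that each vector holds a term's frequency in each of the N selected comparable sentences; the function names and the toy vectors are illustrative and do not come from the paper.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Standard cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a term that never occurs has no similarity signal
    return dot / (norm_a * norm_b)


def top_n_translations(
    source_vec: Sequence[float],
    target_vecs: Dict[str, Sequence[float]],
    n: int = 5,
) -> List[Tuple[str, float]]:
    """Rank target-language terms by CSS against the source term vector."""
    scored = [(term, cosine_similarity(source_vec, vec))
              for term, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


if __name__ == "__main__":
    # Toy counts over three aligned sentences (made-up numbers).
    source = [1, 1, 0]
    targets = {"water": [1, 1, 0], "river": [1, 0, 0], "sky": [0, 0, 1]}
    print(top_n_translations(source, targets, n=2))
```

The zero-norm guard is a design choice of this sketch so that terms absent from every selected sentence rank last instead of raising a division error.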
3.3 Disambiguation
The disambiguation module collects the NE tag from the NEI module and the top-n translations from the TFM module. If a named entity's transliteration is available in the comparable corpus, it also appears among the top-n translations, but with a very low CSS. Therefore, a disambiguation algorithm is proposed in Algorithm 1 to select the proper translation or transliteration. The Longest Common Subsequence (LCS) score between two strings \(S_1\) and \(S_2\) is computed by Eq. 2.
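The idea can be sketched in Python under explicit assumptions, since Algorithm 1 and Eq. 2 are not reproduced in this text. The LCS length is computed with the standard dynamic program; the normalization \(2\cdot LCS/(|S_1|+|S_2|)\), the availability of a romanized form of the source word, and the selection rule in `disambiguate` are all assumptions made for illustration, not the paper's definitions.

```python
from typing import List, Optional, Tuple


def lcs_length(s1: str, s2: str) -> int:
    """Length of the longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_score(s1: str, s2: str) -> float:
    """Assumed normalization to [0, 1]; the paper's Eq. 2 may differ."""
    if not s1 or not s2:
        return 0.0
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))


def disambiguate(
    ne_tag: str,
    romanized_source: str,
    candidates: List[Tuple[str, float]],
    non_ne: str = "NOP",
) -> Optional[str]:
    """Pick one target term from (term, CSS) candidates.

    Assumed rule: a non-NE word takes the highest-CSS candidate
    (translation); an NE word takes the candidate that is most similar
    in spelling to the romanized source word (transliteration).
    """
    if not candidates:
        return None
    if ne_tag == non_ne:
        return max(candidates, key=lambda c: c[1])[0]
    source = romanized_source.lower()
    return max(candidates, key=lambda c: lcs_score(source, c[0].lower()))[0]


if __name__ == "__main__":
    print(disambiguate("NEL", "dilli", [("heart", 0.6), ("delhi", 0.1)]))
    print(disambiguate("NOP", "paani", [("water", 0.8), ("pani", 0.2)]))
```

In this sketch the NE branch ignores CSS entirely, which reflects the observation above that a transliteration can sit among the top-n candidates with a very low CSS.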
4 Experiment Results and Discussions
The proposed approach is evaluated on the FIRE 2010 and FIRE 2011 datasets, each of which contains a topic set of 50 Hindi language queries and a set of target English language documents. Each query in the topic set includes \(\left\langle title \right\rangle \), \(\left\langle desc \right\rangle \), and \(\left\langle narr \right\rangle \) tag fields. We experiment with only the \(\left\langle title \right\rangle \) tag field. A preprocessed source language query is passed through the NEI module and the TFM module separately. The output of the NEI module, i.e., an NE-tagged query, and the output of the TFM module, i.e., the top-5 translations, are passed to the disambiguation module. The target language queries are the final output of the proposed approach. The Vector Space Model (VSM) is used to retrieve the target language documents relevant to the query. The NEI disambiguation technique with the CLIR system is evaluated using Recall and Mean Average Precision (MAP). Recall is the fraction of relevant documents that are retrieved. Precision is the fraction of retrieved documents that are relevant to the query. MAP for a set of queries is the mean of the average precision scores of the individual queries. The experimental results are presented in Table 3.
Including the NEI disambiguation module degrades the performance of the CLIR system because in many instances the translated form of a term is more popular than its transliteration; consequently, the proposed approach achieves a lower MAP than TFM alone on both FIRE 2010 and FIRE 2011. The significant differences between the popularity of a term's translation and its transliteration are presented in Table 4. NEI alone is not sufficient to select the proper translation or transliteration because a term's popularity determines whether it needs translation or transliteration.
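The evaluation measures defined above can be written down compactly. The Python sketch below follows the usual definitions of recall, average precision, and MAP; it assumes each ranked list contains no duplicate document identifiers, and the document ids in the example are made up rather than taken from FIRE.

```python
from typing import List, Sequence, Set, Tuple


def recall(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Fraction of relevant documents that appear in the retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


def average_precision(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant doc appears.

    Relevant documents that are never retrieved contribute zero, so the
    sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Mean of the per-query average precision over (ranking, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


if __name__ == "__main__":
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d5", "d6"], {"d6", "d7"}),
    ]
    print(mean_average_precision(runs))
```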
5 Conclusion and Future Work
The NEI technique is analyzed as a way to resolve the improper translation or transliteration issue. Indian languages suffer from a lack of NE-annotated data and gazetteer lists. The NE-annotated data is prepared with the help of IIIT Hyderabad's NE corpus and shallow parser. Gazetteer lists are prepared from different web sources. Stanford NER is trained on the NE-annotated data and the gazetteer lists. The proposed linguistic patterns are used to improve the NEI system. The TFM module is used to select the top-n translations for a query word. The disambiguation module selects the proper translation or transliteration based on the outputs of the NEI and TFM modules. The proposed approach achieves a lower MAP than TFM alone. NEI alone is not sufficient to select the proper translation or transliteration because a term's popularity decides between translation and transliteration more effectively. In future work, a term's popularity will be used to identify whether the term needs to be translated or transliterated.
References
1. Nayan, A., Rao, B.R.K., Singh, P., Sanyal, S., Sanyal, R.: Named entity recognition for Indian languages. In: IJCNLP, pp. 97–104 (2008)
2. Saha, S.K., Chatterji, S., Dandapat, S., Sarkar, S., Mitra, P.: A hybrid approach for named entity recognition in Indian languages. In: Proceedings of the IJCNLP-08 Workshop on NER for South and South East Asian languages, pp. 17–24 (2008)
3. Nothman, J., Curran, J.R., Murphy, T.: Transforming Wikipedia into named entity training data. In: Proceedings of the Australian Language Technology Workshop, pp. 124–132 (2008)
4. Maxwell, C.J., Krishnarao, A.A., Gahlot, H., Srinet, A., Kushwaha, D.S.: A comparative study of named entity recognition for Hindi using sequential learning algorithms. In: Advance Computing Conference, 2009, IACC 2009, IEEE International, pp. 1164–1169. IEEE (2009)
5. Bhagavatula, M., GSK, S., Varma, V.: Language-independent named entity identification using Wikipedia. In: Proceedings of the First Workshop on Multilingual Modeling, Association for Computational Linguistics, pp. 11–17 (2012)
6. Mathur, S., Saxena, V.P.: Hybrid approach to English-Hindi name entity transliteration. In: IEEE Students’ Conference on Electrical, Electronics and Computer Science (2014)
7. Prasad, G., Fousiya, K.K.: Named entity recognition approaches: a study applied to English and Hindi language. In: International Conference on Circuit, Power and Computing Technologies (ICCPCT), 2015, pp. 1–4. IEEE (2015)
8. Sharma, V.K., Mittal, N.: Cross Lingual Information Retrieval (CLIR): Review of tools, challenges and translation approaches. In: Information System Design and Intelligent Application, pp. 699–708 (2016)
9. Sharma, V.K., Mittal, N.: Cross lingual information retrieval: a dictionary based query translation approach. In: Advances in Intelligent Systems and Computing (2016)
10. Sharma, V.K., Mittal, N.: Exploiting parallel sentences and cosine similarity for identifying target language translation. J. Procedia Comput. Sci. 89, 428–433 (2016)
11. Vulic, I., de Smet, W., Moens, M.-F.: Cross-language information retrieval models based on latent topic models trained with document-aligned comparable corpora. Inf. Retrieval 16(3), 331–368 (2013)
12. Jagarlamudi, J., Kumaran, A.: Cross-Lingual information retrieval system for Indian languages. In: Peters, C., Jijkoun, V., Mandl, T., Müller, H., Oard, D.W., Peñas, A., Petras, V., Santos, D. (eds.) CLEF 2007. LNCS, vol. 5152, pp. 80–87. Springer, Heidelberg (2008). doi:10.1007/978-3-540-85760-0_10
13. Saravanan, K., Udupa, R., Kumaran, A.: Crosslingual information retrieval system enhanced with transliteration generation and mining. In: Forum for Information Retrieval Evaluation (FIRE-2010) Workshop (2010)
14. Surya, G., Harsha, S., Pingali, P., Verma, V.: Statistical transliteration for cross language information retrieval using HMM alignment model and CRF. In: Proceedings of the 2nd Workshop on Cross Lingual Information Access (2008)
15. Shishtla, P., Surya, G., Sethuramalingam, S., Varma, V.: A language-independent transliteration schema using character aligned models at NEWS 2009. In: Proceedings of the 2009 Named Entities Workshop: Shared Task on Transliteration, Association for Computational Linguistics, pp. 40–43 (2009)
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Sharma, V.K., Mittal, N. (2017). Named Entity Identification Based Translation Disambiguation Model. In: Shankar, B., Ghosh, K., Mandal, D., Ray, S., Zhang, D., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2017. Lecture Notes in Computer Science(), vol 10597. Springer, Cham. https://doi.org/10.1007/978-3-319-69900-4_46
DOI: https://doi.org/10.1007/978-3-319-69900-4_46
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69899-1
Online ISBN: 978-3-319-69900-4