Elsevier

Knowledge-Based Systems

Volume 234, 25 December 2021, 107601

A deep learning-based bilingual Hindi and Punjabi named entity recognition system using enhanced word embeddings

https://doi.org/10.1016/j.knosys.2021.107601

Highlights

  • Developing enhanced word embeddings for a bilingual NER system is a novel attempt.

  • The proposed work is the first attempt to develop a bilingual Hindi-Punjabi NER system.

  • Our study reveals the effectiveness of different embeddings for Hindi and Punjabi text.

  • Blending the Bi-GRU and CNN models using EWE improves the performance of the NER system.

  • We determine that the proposed approach can be used for any resource-scarce language.

Abstract

The increasing availability of information on the web makes the task of named entity recognition (NER) more challenging. Named entity recognition is an important pre-processing tool concerned with extracting entities of interest such as person, location, organization, gene, protein, number, and measurement. The success of earlier named entity recognition systems depended heavily on rule-based techniques or traditional machine learning algorithms exploiting several linguistic and non-linguistic features. In this article, we propose a novel NER system that uses deep learning strategies together with an enhanced version of word embeddings. We develop a bilingual named entity recognition system based on a Bidirectional Gated Recurrent Unit (Bi-GRU) and Convolutional Neural Networks (CNN), built upon enhanced word embeddings (EWE). EWE are generated by concatenating FastText word embeddings with minimal feature embeddings, namely part-of-speech embeddings, word prefix embeddings, word suffix embeddings, and word length embeddings, which improve the effectiveness of the deep learning methods. We perform several experiments using corpora in two languages: the IJCNLP-08 NERSSEAL shared task corpus, which contains an annotated dataset in Hindi, and a manually annotated dataset in Punjabi. We also run several experiments on a bilingual Hindi and Punjabi dataset. The results reveal that the Bi-GRU and CNN based model with EWE achieves Precision, Recall, and F-score values of 92.60%, 90.70%, and 91.64% for Hindi; 93.87%, 93.33%, and 93.60% for Punjabi; and 93.78%, 92.66%, and 93.22% for bilingual Hindi and Punjabi named entity recognition. Enhanced word embeddings improve the performance of a Bi-GRU and CNN based named entity recognition system without using a large set of features or any gazetteers.
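The three reported F-scores can be checked against the reported Precision and Recall values. The sketch below assumes the F-score is the balanced F1 measure (the harmonic mean of Precision and Recall); the abstract does not state the formula, so this is an assumption, and the function name `f_score` is ours.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced F1, assumed here)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-score) in percent, as quoted in the abstract.
reported = {
    "Hindi": (92.60, 90.70, 91.64),
    "Punjabi": (93.87, 93.33, 93.60),
    "Bilingual": (93.78, 92.66, 93.22),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F = {f_score(p, r):.2f}, reported F = {f:.2f}")
```

Under this assumption, the computed values agree with the reported F-scores to two decimal places for all three settings.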

Introduction

Named entity recognition is an emerging field of research in NLP and information retrieval. It is the task of identifying proper nouns such as person names, location names, and organization names. Earlier work concentrated on extracting entities with ENAMEX, NUMEX, and TIMEX tags [1], but researchers now focus on recognizing entities of specific interest, such as biomedical entities, product names, and disease names. Named entity recognition acts as a vital pre-processing tool in several NLP applications, namely machine translation systems [2], [3], question answering systems [4], and text summarization systems [5].

Earlier, researchers focused on traditional machine learning algorithms, including Naive Bayes, Support Vector Machine, and Conditional Random Field. These techniques require large annotated datasets and refined features to succeed. Several named entity recognition systems have been introduced for languages such as English, Chinese, Japanese, German, Dutch, and Portuguese [6]. The capitalization feature is the main clue to identify named entities in many of these languages. These languages benefit from sufficient annotated datasets as well as language resources like morphological analyzers, chunkers, and POS taggers. On the other hand, developing a named entity recognition system for Indian languages is quite difficult. Hindi, Punjabi, and some other Indian languages pose various inherent difficulties in many natural language-related tasks. These languages have structural complexities; for instance, Hindi and Punjabi make no distinction between uppercase and lowercase letters. Hindi and Punjabi are also free word order languages. For example, the Hindi sentence “aaj mohan dilli ja raha hai” can also be written as “mohan aaj dilli ja raha hai” or “mohan dilli aaj ja raha hai”. English, in contrast, strictly follows the “Subject–verb–object” structure, so named entities in English can be found either at the beginning or end of the sentence. Hindi and Punjabi offer no such positional cue. Besides, the non-availability of large datasets and language resources is a major cause of insufficient results in Indian languages. The proposed deep learning-based technique overcomes the limitations of resource-scarce languages like Hindi and Punjabi. It removes the need for deep feature engineering such as the capitalization feature and the initial and last word features. The main architecture of our model uses Bidirectional GRU and CNN along with FastText embeddings [7]. To improve the performance of the system, we embed different relevant language-dependent and language-independent features (part of speech, word prefix, word suffix, word length, first word, last word, infrequent word, digit features) and concatenate them with FastText embeddings. Out of all these embeddings, we find the best results using only four feature embeddings, namely part of speech, word prefix, word suffix, and word length. We call the resulting representation enhanced word embeddings (EWE). Enhanced word embeddings along with character features handle contextual information and out-of-vocabulary (OOV) words well.
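The concatenation that produces an enhanced word embedding can be sketched as follows. This is a minimal illustration, not the authors' implementation: the vector sizes, the two-character prefix and suffix length, the POS tags, and the randomly initialised lookup tables (standing in for FastText and the trained feature embeddings) are all assumptions made for the example.

```python
import random

random.seed(0)

# Hypothetical dimensions; the source does not state the sizes used.
WORD_DIM, POS_DIM, AFFIX_DIM, LEN_DIM = 8, 3, 3, 2
AFFIX_LEN = 2  # assumed prefix/suffix length in characters


class EmbeddingTable:
    """Lookup table that assigns a random vector to each new key."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = {}

    def __getitem__(self, key):
        if key not in self.vectors:
            self.vectors[key] = [random.uniform(-1, 1) for _ in range(self.dim)]
        return self.vectors[key]


word_table = EmbeddingTable(WORD_DIM)      # stand-in for FastText vectors
pos_table = EmbeddingTable(POS_DIM)        # part-of-speech embeddings
prefix_table = EmbeddingTable(AFFIX_DIM)   # word prefix embeddings
suffix_table = EmbeddingTable(AFFIX_DIM)   # word suffix embeddings
length_table = EmbeddingTable(LEN_DIM)     # word length embeddings


def enhanced_embedding(word, pos_tag):
    """Concatenate the word vector with the four feature vectors."""
    return (
        word_table[word]
        + pos_table[pos_tag]
        + prefix_table[word[:AFFIX_LEN]]
        + suffix_table[word[-AFFIX_LEN:]]
        + length_table[len(word)]
    )


# Transliterated tokens with illustrative POS tags (assumed, not from the paper).
sentence = [("mohan", "NNP"), ("dilli", "NNP"), ("ja", "VM")]
vectors = [enhanced_embedding(word, tag) for word, tag in sentence]
print(len(vectors[0]))  # 8 + 3 + 3 + 3 + 2 = 19
```

Because the feature tables are keyed by prefix, suffix, and length rather than by the full word, two words that share an affix also share that slice of their enhanced embedding, even if one of them was never seen as a whole word.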

The model is evaluated on a publicly available Hindi dataset, the IJCNLP-08 [8] corpus, on a manually annotated Punjabi dataset, and on the combination of both, i.e. the bilingual Hindi and Punjabi dataset. We measure the performance of different models with different embeddings and obtain state-of-the-art results by applying the Bi-GRU and CNN model on top of enhanced word embeddings.

The major achievements of this paper are as follows:

  • To the best of our knowledge, based on an extensive literature study, we are the first to develop enhanced word embeddings (EWE) for a deep learning-based Hindi and Punjabi NER system. Enhanced word embeddings combine Gensim’s FastText word embeddings with POS2Vec, WordPrefix2Vec, WordSuffix2Vec, and WordLength2Vec embeddings.

  • The proposed work is the first attempt to develop a bilingual Hindi and Punjabi NER system that can extract named entities from Hindi, Punjabi, and the combined Hindi and Punjabi dataset.

  • Our study reveals the effectiveness of different embeddings namely random embeddings, Facebook’s FastText embeddings, Gensim’s FastText embeddings, and enhanced word embeddings using different models to extract named entities from two Indian languages, namely Hindi and Punjabi.

  • Our findings confirm that the blending of the Bi-GRU and CNN model using enhanced word embeddings improves the performance of the bilingual NER system as compared to other deep learning models such as LSTM, Bi-LSTM, GRU with the combination of CNN features and enhanced word embeddings.

  • To evaluate the effectiveness of our system, we have applied the proposed approach to a small set of Hindi and Punjabi text not used in training and found good results on this unseen data.

  • Owing to the non-availability of a Punjabi NER dataset, we run several experiments using traditional machine learning algorithms on our manually created Punjabi NER dataset and compare their results with those of our proposed method.

  • The proposed approach does not rely on a large set of features or any sort of gazetteer. Hence, it can be used for any resource-scarce language.

  • The proposed work is found efficient in comparison with several already existing approaches developed for the NER task. So, the efficiency of existing NER systems can be enhanced using our proposed method.
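The FastText embeddings [7] named in the contributions above represent a word through its character n-grams, which is what allows a vector to be composed for a word that was never seen as a whole. The sketch below shows only the n-gram extraction step, assuming the boundary markers `<` and `>` and the n-gram range 3 to 6 described in the FastText paper; the function name and the example words are ours.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    marked = f"<{word}>"
    return [
        marked[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(marked) - n + 1)
    ]


# N-grams of a word assumed to be in the training vocabulary.
seen = set(char_ngrams("dilli"))

# Illustrative word assumed to be absent from the training vocabulary.
unseen = char_ngrams("dilliwala")

# N-grams the unseen word shares with the seen one; in FastText these
# shared subword vectors contribute to the unseen word's representation.
shared = [gram for gram in unseen if gram in seen]
print(shared)
```

This subword lookup is available when a full FastText model is loaded; a plain table of pre-computed word vectors cannot compose vectors for missing words in this way.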

The remaining part of the paper is arranged as follows:

Section 2 discusses related work on named entity recognition systems for Hindi and other languages using traditional machine learning algorithms and the latest deep learning techniques. Section 3 presents our proposed approach to the NER task. Section 4 describes the datasets and their annotation framework, along with the training setup and hyperparameters used for the named entity recognition system. Section 5 highlights the experimental results and discusses the main findings of this research. Section 6 compares our work with previous works and different techniques, and Section 7 presents the conclusion and some plans for future work.

Section snippets

Related works

This section presents the research work done in the field of named entity recognition using traditional approaches and advanced approaches for the extraction of entities of interest. Traditional approaches include rule-based methods, traditional machine learning algorithms like SVM, CRF, HMM, etc., and hybrid methods. Advanced approaches include deep learning-based algorithms like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and a combination of both.

Proposed approach

In this section, we present a brief overview of the layers used in our proposed deep neural network model. Initially, we start with the introduction of the Bidirectional GRU network. Further, we define the character features extracted using Convolutional Neural Networks (CNN) and enhanced word embeddings (EWE).
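A bidirectional GRU reads the sentence once left to right and once right to left, then joins the two hidden states at each token. The sketch below is a forward pass only, with untrained random weights and hypothetical dimensions, using the common GRU formulation h' = (1 - z) * h + z * candidate; the paper's exact formulation, its character-level CNN, and its output layer are not reproduced here.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """Single GRU cell with randomly initialised (untrained) weights."""

    def __init__(self, input_dim, hidden_dim):
        self.hidden_dim = hidden_dim
        # One (W, U, b) triple per gate: update z, reset r, candidate h.
        self.params = {
            gate: (rand_matrix(hidden_dim, input_dim),
                   rand_matrix(hidden_dim, hidden_dim),
                   [0.0] * hidden_dim)
            for gate in ("z", "r", "h")
        }

    def _affine(self, gate, x, h):
        w, u, b = self.params[gate]
        return [a + c + d for a, c, d in zip(matvec(w, x), matvec(u, h), b)]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]
        r = [sigmoid(v) for v in self._affine("r", x, h)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in self._affine("h", x, reset_h)]
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, inputs):
        h = [0.0] * self.hidden_dim
        states = []
        for x in inputs:
            h = self.step(x, h)
            states.append(h)
        return states


def bi_gru(inputs, forward_cell, backward_cell):
    """Concatenate forward and backward hidden states token by token."""
    fwd = forward_cell.run(inputs)
    bwd = backward_cell.run(inputs[::-1])[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


INPUT_DIM, HIDDEN_DIM = 4, 3  # hypothetical sizes
sentence = [[random.uniform(-1, 1) for _ in range(INPUT_DIM)] for _ in range(5)]
states = bi_gru(sentence, GRUCell(INPUT_DIM, HIDDEN_DIM), GRUCell(INPUT_DIM, HIDDEN_DIM))
print(len(states), len(states[0]))  # 5 tokens, 2 * HIDDEN_DIM = 6 values each
```

In a tagger, each of these per-token vectors would feed a classification layer that predicts the entity tag, so every prediction can draw on both the left and the right context of the token.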

Datasets used

For the Hindi named entity recognition system, we use the standard dataset defined as part of the IJCNLP-08 [8] NER Shared Task for South and South East Asian Languages (SSEAL). A corpus of 23,614 sentences consisting of 502,974 tokens of Hindi is tagged with a tag set of 12 different NE classes. Originally, Shakti Standard Format (SSF) [44] is used for representing the annotated Hindi corpus, as shown in Fig. 5. The annotation is performed manually by IIIT Hyderabad. The dataset is converted

Experimental results and discussions

We report the experimental results using intrinsic performance measures, namely Precision (P), Recall (R), and F-score (F), for different deep learning-based NER models. Initially, we apply all the models on top of randomly generated word vectors, which provide good results. Then, the randomly generated word vectors are replaced with Facebook’s pre-trained word vectors, which badly affects the accuracy of the NER systems because many words are not available in the FastText vectors. Later on, all the

Comparisons

This section compares our results with previous work on Hindi NER on the IJCNLP-08 dataset and other datasets. It also compares the performance of the Punjabi NER system across different traditional machine learning-based and deep learning-based approaches, as well as against other works in Punjabi NER.

Conclusion and future works

Most research studies on named entity recognition in Hindi and Punjabi have used rule-based methods or traditional machine learning techniques. This study focuses on deep learning for the NER task. We have used different variants of RNN such as LSTM, Bi-LSTM, GRU, and Bi-GRU to recognize named entities in Hindi, Punjabi, and bilingual Hindi and Punjabi text, of which Bi-GRU is found most effective. A novel enhanced word embedding (EWE) have been

CRediT authorship contribution statement

Archana Goyal: Conceptualization, Methodology, Software, Data curation, Investigation, Writing – original draft. Vishal Gupta: Methodology, Software, Writing – review & editing, Validation, Supervision. Manish Kumar: Visualization, Validation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (52)

  • Bojanowski P. et al. Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. (2017)

  • Hindi dataset is available online at:...

  • Gupta V. et al. Named entity recognition for Punjabi language text summarization. Int. J. Comput. Appl. (2011)

  • Godeny B. Rule based product name recognition and disambiguation

  • Alfred R. et al. Malay named entity recognition based on rule-based approach. Int. J. Mach. Learn. Comput. (2014)

  • Freire N. et al. An approach for named entity recognition in poorly structured data

  • Bam S.B. et al. Named entity recognition for Nepali text using support vector machines. Intelligent Information Management (2014)

  • Yadav V. et al. A survey on recent advances in named entity recognition from deep learning models (2019)

  • Lample G. et al. Neural architectures for named entity recognition

  • Singh S.P. et al. Machine translation using deep learning: An overview

  • Mikolov T. et al. Extensions of recurrent neural network language model

  • Hinton G. et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag. (2012)

  • Goyal A. et al. Analysis of different supervised techniques for named entity recognition

  • He K. et al. Deep residual learning for image recognition

  • Epelbaum T. Deep learning: Technical introduction (2017)

  • Boden M. A guide to recurrent neural networks and backpropagation
