A deep learning-based bilingual Hindi and Punjabi named entity recognition system using enhanced word embeddings

doi:10.1016/j.knosys.2021.107601

Knowledge-Based Systems

Volume 234, 25 December 2021, 107601

Highlights

• Developing enhanced word embeddings for a bilingual NER system is a novel attempt.
• The proposed work is the first attempt to develop a bilingual Hindi-Punjabi NER system.
• Our study reveals the effectiveness of different embeddings for Hindi and Punjabi text.
• Blending the Bi-GRU and CNN models using EWE improves the performance of the NER system.
• We determine that the proposed approach can be used for any resource-scarce language.

Abstract

The increasing availability of information on the web makes the task of named entity recognition (NER) more challenging. NER is an important preprocessing tool concerned with extracting entities of interest such as person, location, organization, gene, protein, number, and measurement. The success of earlier NER systems depended heavily on rule-based techniques or traditional machine learning algorithms exploiting several linguistic and non-linguistic features. In this article, we propose a novel NER system that uses deep learning strategies together with an enhanced version of word embeddings. We develop a bilingual NER system based on a Bidirectional Gated Recurrent Unit (Bi-GRU) and Convolutional Neural Networks (CNN), built upon enhanced word embeddings (EWE). EWE are generated by concatenating FastText word embeddings with minimal feature embeddings, namely part-of-speech embeddings, word prefix embeddings, word suffix embeddings, and word length embeddings, which make the deep learning methods more effective. We perform several experiments using corpora in two languages: the IJCNLP-08 NERSSEAL shared task corpus, which contains an annotated Hindi dataset, and a manually annotated Punjabi dataset. We also run several experiments on a bilingual Hindi and Punjabi dataset. The results show that the Bi-GRU and CNN based model with EWE achieves Precision, Recall, and F-score values of 92.60%, 90.70%, and 91.64% respectively for Hindi; 93.87%, 93.33%, and 93.60% respectively for Punjabi; and 93.78%, 92.66%, and 93.22% respectively for bilingual Hindi and Punjabi NER. Enhanced word embeddings improve the performance of a Bi-GRU and CNN based NER system without using a large feature set or any gazetteers.
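
The reported F-scores can be cross-checked against the reported Precision and Recall. The sketch below assumes the F-score is the usual balanced harmonic mean, F = 2PR / (P + R); the abstract does not state the formula, and the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the balanced F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (Precision, Recall, F-score) in percent, as reported in the abstract.
reported = {
    "Hindi": (92.60, 90.70, 91.64),
    "Punjabi": (93.87, 93.33, 93.60),
    "Bilingual": (93.78, 92.66, 93.22),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F = {f_score(p, r):.2f}, reported F = {f:.2f}")
```

Under this assumption, the computed values agree with the reported F-scores to two decimal places for all three settings.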

Introduction

Named entity recognition is an emerging field of research in NLP and information retrieval. It is the task of identifying proper nouns such as person names, location names, and organization names. Earlier, entities with enamex, numex, and timex tags [1] were the main targets of extraction, but researchers now focus on recognizing entities of specific interest such as biomedical entities, product names, and disease names. Named entity recognition acts as a vital preprocessing tool in several NLP applications, namely machine translation systems [2], [3], question answering systems [4], and text summarization systems [5].

Earlier, researchers focused on traditional machine learning algorithms, which include Naive Bayes, Support Vector Machine, and Conditional Random Field. These techniques require large annotated datasets and refined features to succeed. Several named entity recognition systems have been introduced by different researchers for languages such as English, Chinese, Japanese, German, Dutch, and Portuguese [6]. Capitalization is the main clue for identifying named entities in many of these languages. These languages also benefit from the availability of annotated datasets and language resources such as morphological analyzers, chunkers, and POS taggers. Developing a named entity recognition system for Indian languages, on the other hand, is quite difficult. Hindi, Punjabi, and some other Indian languages pose various inherent difficulties in many natural language tasks. They have structural complexities: for instance, Hindi and Punjabi have no uppercase and lowercase letters, and both are free word order languages. For example, the Hindi sentence “aaj mohan dilli ja raha hai” (transliterated) can also be written as “mohan aaj dilli ja raha hai” or “mohan dilli aaj ja raha hai”. English, by contrast, strictly follows the “Subject–verb–object” structure, so named entities in English can be found either at the beginning or end of the sentence. We do not find this indication in Hindi and Punjabi. In addition, the non-availability of large datasets and language resources is a major cause of insufficient results for Indian languages. The proposed deep learning-based technique overcomes these limitations of resource-scarce languages like Hindi and Punjabi. It removes the need for deep feature engineering such as capitalization features and initial and last word features. The main architecture of our model uses Bidirectional GRU and CNN along with FastText embeddings [7]. To improve the performance of the system, we embed different relevant language-dependent and language-independent features (part of speech, word prefix, word suffix, word length, first word, last word, infrequent word, and digit features) and concatenate them with FastText embeddings. Of all these embeddings, we obtain the best results using only four feature embeddings, namely part of speech, word prefix, word suffix, and word length. We call the resulting representation enhanced word embeddings (EWE). Enhanced word embeddings, along with character features, capture contextual information and handle out-of-vocabulary (OOV) words well.
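
The concatenation described above can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the affix length of 3, the vector dimensions, the toy lookup tables, and the helper names `minimal_features`, `lookup`, and `enhanced_embedding` are hypothetical and are not taken from the paper. In a real system the tables would hold trained vectors (for example, FastText word vectors and learned feature embeddings).

```python
from typing import Dict, List

Vector = List[float]
Table = Dict[str, Vector]

AFFIX_LEN = 3  # assumed affix length; the text does not specify one


def minimal_features(word: str, pos_tag: str) -> Dict[str, str]:
    """Return the four minimal features named in the text for one token.

    Note: slicing works on Unicode code points, so for Devanagari or
    Gurmukhi text an affix may split a consonant from its vowel sign.
    """
    return {
        "pos": pos_tag,
        "prefix": word[:AFFIX_LEN],
        "suffix": word[-AFFIX_LEN:],
        "length": str(len(word)),
    }


def lookup(table: Table, key: str, dim: int) -> Vector:
    """Fetch a vector, falling back to zeros for unseen keys."""
    return table.get(key, [0.0] * dim)


def enhanced_embedding(word: str, pos_tag: str, word_table: Table,
                       feature_tables: Dict[str, Table],
                       word_dim: int, feature_dim: int) -> Vector:
    """Concatenate the word vector with the four feature vectors."""
    feats = minimal_features(word, pos_tag)
    vector = list(lookup(word_table, word, word_dim))  # copy, do not mutate the table
    for name in ("pos", "prefix", "suffix", "length"):
        vector += lookup(feature_tables[name], feats[name], feature_dim)
    return vector


# Toy tables with made-up values, standing in for trained embeddings.
word_table = {"mohan": [0.1, 0.2, 0.3, 0.4]}
feature_tables = {
    "pos": {"NNP": [1.0, 0.0]},
    "prefix": {"moh": [0.5, 0.5]},
    "suffix": {"han": [0.2, 0.8]},
    "length": {"5": [0.0, 1.0]},
}

ewe = enhanced_embedding("mohan", "NNP", word_table, feature_tables, 4, 2)
print(len(ewe))  # 4 word dims + 4 features * 2 dims = 12
```

The resulting vector has one fixed-size slot per component, so a downstream sequence model sees the word vector and the feature vectors side by side for every token.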

The model is evaluated on the publicly available Hindi IJCNLP-08 [8] corpus, on a manually annotated Punjabi dataset, and on the combination of both, i.e., a bilingual Hindi and Punjabi dataset. We measure the performance of different models with different embeddings, and obtain state-of-the-art results by applying the Bi-GRU and CNN model on top of enhanced word embeddings.

The major achievements of this paper are as follows:

• To the best of our knowledge, based on an extensive literature study, we are the first to develop enhanced word embeddings (EWE) for a deep learning-based Hindi and Punjabi NER system. Enhanced word embeddings are the amalgamation of Gensim’s FastText word embeddings, POS2Vec embeddings, WordPrefix2Vec embeddings, WordSuffix2Vec embeddings, and WordLength2Vec embeddings.
• The proposed work is the first attempt to develop a bilingual Hindi and Punjabi NER system that can extract named entities from Hindi, Punjabi, and the combined Hindi and Punjabi dataset.
• Our study reveals the effectiveness of different embeddings, namely random embeddings, Facebook’s FastText embeddings, Gensim’s FastText embeddings, and enhanced word embeddings, across different models for extracting named entities from two Indian languages, Hindi and Punjabi.
• Our findings confirm that blending the Bi-GRU and CNN models using enhanced word embeddings improves the performance of the bilingual NER system compared with other deep learning models such as LSTM, Bi-LSTM, and GRU combined with CNN features and enhanced word embeddings.
• To evaluate the effectiveness of our system, we applied the proposed approach to a small set of Hindi and Punjabi text not used in training and found good results on this unseen data.
• Because no Punjabi NER dataset is available, we run several experiments using traditional machine learning algorithms on our manually created Punjabi NER dataset and compare their results with those of our proposed method.
• The proposed approach does not use a large set of features or any gazetteer. Hence, it can be used for any resource-scarce language.
• The proposed work is found efficient in comparison with several existing approaches developed for the NER task, so the efficiency of existing NER systems can be enhanced using our proposed method.
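
The FastText embeddings compared in the contributions above build word vectors from character n-grams (Bojanowski et al., listed in the references), which is what allows a subword model to compose a vector for a word it has not seen. The sketch below shows only the n-gram extraction step. It is simplified: the actual library also hashes n-grams into buckets and treats the whole word as an additional unit. The function name `char_ngrams` is hypothetical, and the default range of 3 to 6 characters follows the FastText paper rather than anything stated in this article.

```python
from typing import List


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Character n-grams of a word wrapped in FastText-style boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


print(char_ngrams("where", 3, 3))
# ['<wh', 'whe', 'her', 'ere', 're>']

# A known word and a longer unseen word share subword units,
# so the unseen word can still receive a meaningful vector.
shared = set(char_ngrams("mohan")) & set(char_ngrams("mohanlal"))
print(sorted(shared))
```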

The remaining part of the paper is arranged as follows:

Section 2 discusses related work on named entity recognition for Hindi and other languages using traditional machine learning algorithms and the latest deep learning techniques. Section 3 presents our proposed approach to the NER task. Section 4 describes the dataset and its annotation framework, along with the training setup and hyperparameters used. Section 5 highlights the experimental results and discusses the main findings of this research. Section 6 compares our work with previous works and different techniques, and Section 7 presents the conclusion and some plans for future work.

Section snippets

Related works

This section presents the research work done in the field of named entity recognition using traditional approaches and advanced approaches for the extraction of entities of interest. Traditional approaches include rule-based methods, traditional machine learning algorithms like SVM, CRF, HMM, etc., and hybrid methods. Advanced approaches include deep learning-based algorithms like Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), and a combination of both.

Proposed approach

In this section, we present a brief overview of the layers used in our proposed deep neural network model. Initially, we start with the introduction of the Bidirectional GRU network. Further, we define the character features extracted using Convolutional Neural Networks (CNN) and enhanced word embeddings (EWE).
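
As a rough illustration of the bidirectional GRU layer introduced above, the following pure-Python sketch runs one GRU over the token sequence left to right, a second GRU right to left, and concatenates the two hidden states per token. It is not the authors' implementation: the gate equations follow one common GRU convention, the weights are random and untrained, and the names `GRUCell`, `run_gru`, and `bi_gru` are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


class GRUCell:
    """A single GRU cell with update (z), reset (r), and candidate (h) weights."""

    def __init__(self, input_dim: int, hidden_dim: int, rng: random.Random):
        def mat(rows: int, cols: int) -> List[Vector]:
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        self.W = {g: mat(hidden_dim, input_dim) for g in "zrh"}   # input weights
        self.U = {g: mat(hidden_dim, hidden_dim) for g in "zrh"}  # recurrent weights
        self.b = {g: [0.0] * hidden_dim for g in "zrh"}           # biases

    def step(self, x: Vector, h: Vector) -> Vector:
        def pre(gate: str, state: Vector) -> Vector:
            wx, uh = matvec(self.W[gate], x), matvec(self.U[gate], state)
            return [a + b + c for a, b, c in zip(wx, uh, self.b[gate])]

        z = [sigmoid(v) for v in pre("z", h)]            # update gate
        r = [sigmoid(v) for v in pre("r", h)]            # reset gate
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(v) for v in pre("h", reset_h)]  # candidate state
        # Interpolate between the old state and the candidate state.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def run_gru(cell: GRUCell, inputs: List[Vector]) -> List[Vector]:
    """Return the hidden state after each input, starting from zeros."""
    h = [0.0] * cell.hidden_dim
    states = []
    for x in inputs:
        h = cell.step(x, h)
        states.append(h)
    return states


def bi_gru(fwd: GRUCell, bwd: GRUCell, inputs: List[Vector]) -> List[Vector]:
    """Concatenate forward and backward hidden states for every position."""
    forward = run_gru(fwd, inputs)
    backward = run_gru(bwd, inputs[::-1])[::-1]  # re-align to original order
    return [f + b for f, b in zip(forward, backward)]


rng = random.Random(0)
fwd_cell, bwd_cell = GRUCell(3, 2, rng), GRUCell(3, 2, rng)
tokens = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.5], [1.0, 0.0, 0.0]]  # toy token vectors
outputs = bi_gru(fwd_cell, bwd_cell, tokens)
print(len(outputs), len(outputs[0]))  # 3 tokens, 2 + 2 hidden units each
```

Each output vector combines context from the left (forward half) and from the right (backward half), which is useful for free word order languages where a named entity can appear anywhere in the sentence.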

Datasets used

For the Hindi named entity recognition system, we use the standard dataset defined as part of the IJCNLP-08 [8] NER Shared Task for South and South East Asian Languages (SSEAL). A corpus of 23,614 sentences consisting of 502,974 tokens of Hindi is tagged with a tag set of 12 different NE classes. The annotated Hindi corpus is originally represented in Shakti Standard Format (SSF) [44], as shown in Fig. 5. The annotation was performed manually by IIIT Hyderabad. The dataset is converted

Experimental results and discussions

We report the experimental results using intrinsic performance measures, namely Precision (P), Recall (R), and F-score (F), for different deep learning-based NER models. Initially, we apply all the models on top of randomly generated word vectors, which provide good results. Then, the randomly generated word vectors are replaced with Facebook’s pre-trained word vectors, which badly degrade the accuracy of the NER systems because many words are missing from the FastText vectors. Later on, all the

Comparisons

In this section, we compare our results with those of previous works on Hindi NER on the IJCNLP-08 dataset and other datasets. We also compare the performance of the Punjabi NER system using different traditional machine learning-based and deep learning-based approaches, as well as with other works in Punjabi NER.

Conclusion and future works

Most research studies on named entity recognition in Hindi and Punjabi have used rule-based methods or traditional machine learning techniques. This study focuses on deep learning aspects of the NER task. We have used different variants of RNN such as LSTM, Bi-LSTM, GRU, and Bi-GRU for recognizing named entities in Hindi, Punjabi, and bilingual Hindi and Punjabi text, of which Bi-GRU is found most effective. A novel enhanced word embedding (EWE) have been

CRediT authorship contribution statement

Archana Goyal: Conceptualization, Methodology, Software, Data curation, Investigation, Writing – original draft. Vishal Gupta: Methodology, Software, Writing – review & editing, Validation, Supervision. Manish Kumar: Visualization, Validation, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (52)

Saha S.K. et al.
A comparative study on feature reduction approaches in Hindi and Bengali named entity recognition
Knowl.-Based Syst.
(2012)

Korkontzelos I. et al.
Boosting drug named entity recognition using an aggregate classifier
Artif. Intell. Med.
(2015)

Kaur A. et al.
Evaluation of named entity features for Punjabi language
Procedia Comput. Sci.
(2015)

Rezaeinia S.M. et al.
Sentiment analysis based on improved pre-trained word embeddings
Expert Syst. Appl.
(2019)

Jain A. et al.
Research trends for named entity recognition in Hindi language

A. Ugawa, A. Tamura, T. Ninomiya, H. Takamura, M. Okumura, Neural machine translation incorporating named entity, in:...

Dandapat S. et al.
Improved named entity recognition using machine translation-based cross-lingual information
Computación y Sistemas
(2016)

Przybyła P.
Boosting question answering by deep entity recognition
(2016)

Hassel M.
Exploitation of named entities in automatic text summarization for Swedish

Santos D. et al.
Harem: An advanced NER evaluation contest for Portuguese

Bojanowski P. et al.
Enriching word vectors with subword information
Trans. Assoc. Comput. Linguist.
(2017)

Hindi dataset is available online at:...

Gupta V. et al.
Named entity recognition for Punjabi language text summarization
Int. J. Comput. Appl.
(2011)

Godeny B.
Rule based product name recognition and disambiguation

Alfred R. et al.
Malay named entity recognition based on rule-based approach
Int. J. Mach. Learn. Comput.
(2014)

Freire N. et al.
An approach for named entity recognition in poorly structured data

Bam S.B. et al.
Named entity recognition for Nepali text using support vector machines
Intelligent Information Management
(2014)

Yadav V. et al.
A survey on recent advances in named entity recognition from deep learning models
(2019)

Lample G. et al.
Neural architectures for named entity recognition

Singh S.P. et al.
Machine translation using deep learning: An overview

Mikolov T. et al.
Extensions of recurrent neural network language model

Hinton G. et al.
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups
IEEE Signal Process. Mag.
(2012)

Goyal A. et al.
Analysis of different supervised techniques for named entity recognition

He K. et al.
Deep residual learning for image recognition

Epelbaum T.
Deep learning: Technical introduction
(2017)

Boden M.
A guide to recurrent neural networks and backpropagation
