Combining Specialized Word Embeddings and Subword Semantic Features for Lexical Entailment Recognition

https://doi.org/10.1016/j.datak.2022.102077

Abstract

Lexical Entailment Recognition (LER) aims to identify the is-a relation between words. This problem has recently received attention from researchers in the field of natural language processing because of its application to varied downstream tasks. However, almost all prior studies have focused only on datasets of single words; how to handle compound words effectively thus remains a challenge. In this study, we propose a novel method called LERC (Lexical Entailment Recognition Combination) to solve this problem by combining embedding representations and subword semantic features. To this end, a specialized word embedding model for the LER tasks is first trained. Second, subword semantic information of word pairs is exploited to compute an additional feature vector, which is combined with the embedding vectors for supervised classification. We considered three LER tasks: Lexical Entailment Detection, Lexical Entailment Directionality, and Lexical Entailment Determination. Experimental results on several benchmark datasets in English and Vietnamese demonstrate that the subword semantic feature is useful for these tasks. Moreover, LERC outperformed several recently published methods.

Introduction

Lexical entailment (LE) is an asymmetric semantic relation between a generic word (hypernym) and its specific instance (hyponym). For example, vehicle is a hypernym of car, and fruit is a hypernym of mango. This semantic relation has recently been studied extensively from different perspectives to develop the mental lexicon [1]. LE is also referred to by other terms, including taxonomic [2], is-a [3], or hypernymy [4]. In this study, we prefer the term lexical entailment because it demonstrates the nature of this relation most clearly. Prior studies have introduced a number of LE definitions [5], [6]. Since the definition given by Geffet and Dagan [5] is based on the substitutability of words holding the LE relation, it aligns closely with the distributional hypothesis, which underlies the word embedding methods that are central to solving LER tasks. Geffet and Dagan [5] define lexical entailment as follows: u entails v if two conditions are fulfilled.

1. Word meaning entailment: the meaning of a possible sense of u implies a possible sense of v;

2. Substitutability: u can substitute for v in some naturally occurring sentence, such that the meaning of the modified sentence would entail the meaning of the original one.
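The asymmetry of the is-a relation in the examples above can be illustrated with a small sketch. The toy taxonomy below is a hand-made example for exposition, not data or code from the paper:

```python
# Toy illustration of the asymmetric is-a (hypernymy) relation.
# The taxonomy below is a hand-made example, not data from the paper.
TAXONOMY = {
    "car": "vehicle",
    "vehicle": "artifact",
    "mango": "fruit",
    "fruit": "food",
}

def entails(hypo: str, hyper: str) -> bool:
    """True if `hypo` is-a `hyper` via the transitive hypernym chain."""
    node = hypo
    while node in TAXONOMY:
        node = TAXONOMY[node]
        if node == hyper:
            return True
    return False

print(entails("car", "vehicle"))   # True: car is-a vehicle
print(entails("vehicle", "car"))   # False: the relation is asymmetric
```

Note that the relation is transitive as well as asymmetric: `entails("mango", "food")` holds through the intermediate hypernym fruit.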

LE is a fundamental relation in many structured knowledge databases, including WordNet [7] and BabelNet [8]. Thus, LE has been used effectively in many natural language processing (NLP) systems, including recognizing textual entailment [9], text generation [10], and metaphor detection [11]. A good example, presented in [6], is recognizing entailment between sentences by identifying the lexical entailment relation between their words. For instance, since bitten is a hyponym of attacked and dog is a hyponym of animal, George was bitten by a dog and George was attacked by an animal have an entailment relation. Lexical entailment recognition (LER) is thus an important NLP task that aims to identify the is-a relation between words.

In recent years, word embeddings have established themselves as an integral part of NLP models, with their usefulness demonstrated across application areas such as taxonomy induction [12], machine translation [13], and natural language inference [14]. Recently, methods based on word embeddings have outperformed other approaches to LER tasks [15], [16]. However, word embedding models rely on the distributional hypothesis, which leads to their first drawback: different relations between words, such as synonymy and topical relatedness, are conflated into a single vector space. Several studies have proposed specialized word embedding methods that strengthen a specific relation in the embedded vectors for concrete tasks. These methods either integrate lexical contrast into the objective function when training word embeddings [4] or inject lexical information extracted from external knowledge resources into distributional vector spaces [17]. To obtain word representation vectors suitable for LER tasks, Luu et al. [2] introduced a dynamic weighting neural network model that learns a specialized word embedding model using not only LE pairs but also their contextual information.

In the lexicon of a language, there are often both single words and compound words. In Vietnamese, compound words make up a large proportion of the vocabulary. Table 1 shows word length statistics from a popular Vietnamese dictionary, compiled by Nguyen et al. [18]. In technical domains, the ratio of compound words is even higher. Although the proportion of compound words in English is lower than in Vietnamese, they are more common in technical domains such as medicine and bioinformatics (Table 8).

According to Zipf’s law, the frequency of any word is inversely proportional to its rank in the frequency table; that is, most words occur very infrequently [19]. Furthermore, compound words generally have lower frequencies than single words. For example, only 135 single words are needed to account for half of the Brown Corpus, a standard corpus that includes 500 samples of English-language text totaling roughly one million words. Word embedding models learn word representations from the distribution of words in a corpus, so the quality of embedded vectors depends on word frequency. This leads to the second drawback for LER tasks: embedding models may not represent compound words well. This is especially true when compound words appear only in a specific domain, such as terminologies (e.g. long-winged_web-footed_aquatic_bird, aquatic_bird, partially_observable_markov_decision_problem, markov_decision_problem). Furthermore, acquiring corpora in technical domains such as medicine and bioinformatics is very costly.
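The rank-frequency pattern that Zipf’s law describes can be sketched with a frequency count over a corpus. The tiny corpus below is illustrative only; real rank-frequency estimates require corpora of millions of tokens:

```python
# Sketch: rank-frequency view of a corpus. Zipf's law predicts that
# frequency is roughly proportional to 1 / rank, so a few words dominate
# the token count while most words occur rarely.
from collections import Counter

corpus = ("the cat sat on the mat the dog sat on the rug "
          "the cat saw the dog").split()
freqs = Counter(corpus)
ranked = freqs.most_common()  # words sorted by descending frequency

for rank, (word, count) in enumerate(ranked[:3], start=1):
    # Under Zipf's law, count * rank stays roughly constant across ranks.
    print(rank, word, count)
```

Even in this toy corpus the head of the distribution is heavy: the single word "the" accounts for more than a third of all tokens, mirroring how a handful of single words cover half the Brown Corpus.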

In pairs containing compound words, the semantic relation between the components of the two words, such as lexical entailment, synonymy, or identity, often manifests the lexical entailment relation of the pair. For example, hoa_hồng ⟨rose⟩ and hoa_hồng_bạch ⟨white rose⟩ share the common component hoa_hồng, while icosahedral_polyploid_dsRNA_virus and infectious_bursal_disease_virus share the common component virus. Intuitively, we can recognize their LE relation from these shared components. We call this information about the semantic relations between the components of the two words of a pair the subword semantic feature. However, prior studies have not exploited this feature to solve the LER task for compound words.
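A simplified sketch of this intuition is shown below. It only checks surface overlap between underscore-joined components; the paper's actual subword semantic feature also encodes the semantic similarity of components and their positions, which this illustration omits:

```python
# Hedged sketch of a subword-overlap cue between compound words. This is an
# illustration of the intuition only, not the paper's feature computation.
def shared_components(u: str, v: str):
    """Components (subwords) common to two underscore-joined compounds."""
    cu, cv = u.split("_"), v.split("_")
    return [c for c in cu if c in cv]

def head_shared(u: str, v: str) -> bool:
    """Compounds are often head-final: compare the last components."""
    return u.split("_")[-1] == v.split("_")[-1]

pair = ("icosahedral_polyploid_dsRNA_virus", "infectious_bursal_disease_virus")
print(shared_components(*pair))  # ['virus']
print(head_shared(*pair))        # True: both compounds are a kind of virus
```

A shared head component such as virus is a strong cue that both compounds denote subtypes of the same concept, which is exactly the signal the subword semantic feature is designed to capture.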

For downstream tasks, determining relations between rare compound words is more useful than recognizing relations between popular single words, which are inherently defined in lexical knowledge resources. Therefore, this study aims to propose a method that overcomes the two disadvantages of word embedding models described above to solve LER tasks effectively for compound words. For this purpose, we introduce a specialized word embedding model to handle the first drawback. Rich subword semantic features of compound words are added to surmount the second drawback, and the proposed method is compared with the most recent studies. Experimental results on both English and Vietnamese benchmark datasets demonstrate that our method achieves better results than others on several tasks.

Thus, this paper offers three main contributions:

  • Firstly, we provide reliable Vietnamese datasets for the LER tasks, carefully built with the participation of Vietnamese language experts.

  • Secondly, we propose an improvement for the dynamic weighting neural network model [2]. The improved embedding model provides higher quality embedded vectors suitable for the LER tasks. Further, we introduce a supervised classification model that combines specialized word embedding vectors and the subword semantic features to solve these tasks.

  • Last but not least, we exploit structural characteristics of compound words, named subword semantic features, for the LER tasks. Accordingly, we propose a schema to extract feature vectors representing the semantic similarity between the components of words and their positions. By exploiting this feature, our method’s performance increases significantly thanks to better handling of compound words.

The rest of this paper is structured as follows. Section 2 presents related work. Section 3 describes our method. Section 4 presents the construction of three Vietnamese datasets. Section 5 offers experimental results and analyses. Lastly, the final section provides conclusions and directions for future work.

Section snippets

Related works

Prior studies on this problem can be classified into pattern-based and distributional approaches. Pattern-based methods, used in early studies on LER, exploit lexical-syntactic patterns (e.g. x such as y) to predict an LE relation for a word pair (x, y) if x and y appear in the same sentence and match these patterns. The most influential work following this approach was introduced by Hearst [20]. His study introduced several handcrafted lexical patterns as a relation strainer on a corpus to harvest
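A minimal sketch of such a pattern matcher is shown below, using only the single "x such as y" pattern mentioned above. Real pattern-based systems such as Hearst's use many patterns plus part-of-speech or parse constraints, which this sketch omits:

```python
# Minimal sketch of a Hearst-style lexical-syntactic pattern matcher.
# Only the "x such as y" pattern is implemented, on raw word tokens;
# production systems add more patterns and syntactic filtering.
import re

SUCH_AS = re.compile(r"(\w+)\s+such\s+as\s+(\w+)")

def extract_pairs(text: str):
    """Return (hypernym, hyponym) candidates matched by 'x such as y'."""
    return [(m.group(1), m.group(2)) for m in SUCH_AS.finditer(text)]

sentence = "He likes fruit such as mango and vegetables such as kale."
print(extract_pairs(sentence))
# [('fruit', 'mango'), ('vegetables', 'kale')]
```

The precision of such patterns is typically high but their recall is low, since a hypernym–hyponym pair must co-occur in a pattern-matching sentence to be found; this sparsity is one motivation for the distributional approaches discussed next.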

LERC model

The LERC model exploits information from both lexical knowledge and corpus to solve the lexical entailment recognition problem. Fig. 1 describes the architecture of our model.

Three new Vietnamese datasets

Datasets play an important role in semantic relation studies, especially for a low-resource language such as Vietnamese. Building a reliable dataset is a challenge [58], [59]. In this study, we aim to construct three Vietnamese benchmark datasets for the LER tasks. The datasets were annotated by a language expert at the Institute of Linguistics, Vietnam Academy of Social Sciences.

Experiments

We conducted experiments to evaluate our method of solving three LER tasks in both Vietnamese and English.

Conclusions

This paper offers three significant contributions. Firstly, the paper proposes EDWN, which is a specialized word embedding model. By adding to the original DWN model an attention layer, the EDWN model can provide higher quality embedding vectors for the LER tasks. Secondly, the paper introduces the subword feature that is useful to recognize the LE relation between compound words and proposes a method to calculate this feature. Further, we propose the LERC model for solving three LER tasks by

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under Grant number 102.05-2020.26.

Van-Tan Bui is a lecturer at the Department of Information Technology, University of Economics - Technology for Industries, Vietnam. He received a BSc degree in Information Technology from the Ho Chi Minh City University of Technology and Education, Vietnam in 2007, and an MSc degree from the Le Quy Don Technical University, Vietnam in 2011. He is currently a PhD student in computer science at the VNU University of Engineering and Technology (UET), Vietnam. He is conducting a doctoral thesis on the topic "Determining Semantic Relations based on Statistical Machine Learning" under the supervision of Associate Professor Phuong-Thai Nguyen (UET). His research currently focuses on natural language processing (NLP). He works on NLP problems based on deep learning, such as word representation, word similarity, and taxonomy learning from text corpora.

References (61)

  • Phuong Thai Nguyen, Van Lam Pham, Hoang Anh Nguyen, Huy Hien Vu, Ngoc Anh Tran, Truong Thi Thu Ha, A Two-Phase Approach...
  • Luu, Anh Tuan, et al., Learning term embeddings for taxonomic relation identification using dynamic weighting neural network
  • Seitner, Julian, et al., A large database of hypernymy relations extracted from the web
  • Nguyen, Kim Anh, et al., Hierarchical embeddings for hypernymy detection and directionality
  • Geffet, Maayan, et al., Bootstrapping distributional feature vector quality, Comput. Linguist. (2009)
  • Turney, Peter D., et al., Experiments with three approaches to recognizing lexical entailment, Nat. Lang. Eng. (2015)
  • Navigli, Roberto, et al., BabelNet: Building a very large multilingual semantic network
  • Dagan, Ido, et al., Recognizing textual entailment: Models and applications
  • Biran, Or, et al., Classifying taxonomic relations between pairs of Wikipedia articles
  • Mohler, Michael, et al., Semantic signatures for example-based linguistic metaphor detection (2013)
  • Gupta, Amit, et al., Taxonomy induction using hypernym subsequences
  • Zou, Will Y., et al., Bilingual word embeddings for phrase-based machine translation
  • Bowman, Samuel R., et al., A large annotated corpus for learning natural language inference
  • Nayak, N., Learning hypernymy over word embeddings (2015)
  • Zheng Yu, Haixun Wang, Xuemin Lin, Min Wang, Learning Term Embeddings for Hypernymy Identification, in: Proceedings of...
  • Vulić, Ivan, et al., Specialising word vectors for lexical entailment
  • Nguyen, Phuong Thai, et al., Vietnamese treebank construction and entropy-based error detection, Lang. Resour. Eval. (2015)
  • Hanks, Patrick, The impact of corpora on dictionaries, Contemp. Corpus Linguist. (2009)
  • Marti A. Hearst, Automatic Acquisition of Hyponyms from Large Text Corpora, in: 14th International Conference on...
  • Snow, R., et al., Learning syntactic patterns for automatic hypernym discovery, Adv. Neural Inf. Process. Syst. (2005)
  • Anh Tuan Luu, Jung-jae Kim, See-Kiong Ng, Incorporating Trustiness and Collective Synonym/Contrastive Evidence into...
  • Shwartz, Vered, et al., Hypernyms under siege: Linguistically-motivated artillery for hypernymy detection
  • Wu, Wentao, et al., Probase: A probabilistic taxonomy for text understanding
  • Dong, Xin, et al., Knowledge vault: A web-scale approach to probabilistic knowledge fusion
  • Bannour, Nesrine, et al., Patch-based identification of lexical semantic relations
  • Lin, Dekang, Automatic retrieval and clustering of similar words
  • Maayan Geffet, Ido Dagan, The Distributional Inclusion Hypotheses and Lexical Entailment, in: ACL 2005, 43rd Annual...
  • Julie Weeds, David J. Weir, Diana McCarthy, Characterising Measures of Lexical Distributional Similarity, in: COLING,...
  • Clarke, Daoud, Context-theoretic semantics for natural language: An overview


Phuong-Thai Nguyen is the head of the Natural Language Processing Laboratory at the Institute of Artificial Intelligence, VNU University of Engineering and Technology (UET), Vietnam. He received a PhD degree in Computer Science from the Japan Advanced Institute of Science and Technology, Japan in 2008. He was appointed as an associate professor at UET in 2015. He is the author or co-author of more than 60 scientific articles and book chapters, and is in charge of many scientific research projects. His research interests include natural language processing, linguistic annotation, and machine learning.

Van-Lam Pham holds a PhD in Linguistics. He is the head of the Department of Phonology – Lexicology – Grammar, Institute of Linguistics, Vietnam Academy of Social Sciences. He received his PhD from the Vietnam National University, Hanoi in 2017. His research interests currently focus on lexical-semantic relations, linguistic annotation, machine translation, language education, language acquisition, and theoretical linguistics.
