Elsevier

Neurocomputing

Volume 171, 1 January 2016, Pages 1108-1117

A novel word embedding learning model using the dissociation between nouns and verbs

https://doi.org/10.1016/j.neucom.2015.07.046

Abstract

In recent years, there has been research on using semantic knowledge and global statistical features to guide the learning of word embeddings. Although syntactic knowledge also plays a very important role in natural language understanding, its effectiveness for word embedding learning is still far from well investigated. Inspired by the principle of the dissociation between nouns and verbs (DNV) in language acquisition observed in neuropsychology, we propose a novel model for word embedding learning using DNV, named the Continuous Dissociation between Nouns and Verbs model (CDNV). CDNV uses a three-layer feed-forward neural network to integrate DNV, derived from auto-tagged noun/verb information, into the word embedding learning process, while still preserving the word order of the local context. The advantage of CDNV lies in its ability to learn high-quality word embeddings with relatively low time complexity. Experimental results show that: (1) CDNV takes about 1.5 h to learn word embeddings on a corpus of billions of words, which is comparable with CBOW and Skip-gram and more efficient than other models; (2) the nearest neighbors of representative words derived from the word embeddings learnt by CDNV are more reasonable than those derived from other word embeddings; (3) the F1 improvement obtained from CDNV word embeddings on NER and Chunking is greater than that from other word embeddings.

Introduction

In most natural language processing (NLP) models, words are first mapped to symbolic IDs and then transformed into discrete 0/1 binary vectors using a one-hot representation. The feature vectors therefore have the same length as the size of the vocabulary. This representation often suffers from the curse of dimensionality [1] and data sparsity. To overcome these problems, researchers have proposed the distributed representation of words, which represents words as low-dimensional, continuous and dense vectors, called word embeddings, and has shown stable improvements on a variety of tasks such as paraphrase detection [2], sentiment analysis [3] and machine translation [4]. Word embeddings are usually learnt from a large-scale unlabeled corpus by a multi-layer neural network. Studies on word embedding learning mainly focus on one of the following two aspects: (1) reducing the time complexity of the learning process, and (2) improving the quality of the word embeddings. Two types of approaches have been used to reduce the time complexity, i.e., decomposing the output nodes and simplifying the structure of the neural network. To improve the quality of word embeddings, human knowledge or global statistical information has been introduced into the learning process; unfortunately, most such models have a relatively high time complexity. The goal of this paper is therefore to seek an efficient model for learning word embeddings of high quality.
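To make the contrast concrete, the following minimal sketch (ours, not from the paper; the toy vocabulary, embedding dimension and random initialization are purely illustrative) builds a one-hot vector whose length equals the vocabulary size and a low-dimensional dense embedding for the same word:

```python
import numpy as np

# Toy vocabulary; a real system would use the full corpus vocabulary.
vocab = ["the", "cat", "loves", "milk", "UNKNOWN"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Sparse 0/1 vector: as long as the vocabulary, with a single 1.
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Distributed representation: a short, continuous, dense vector per word.
# Here it is randomly initialized; in practice it is learnt from a large
# unlabeled corpus by a neural network.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

print(one_hot("cat"))                          # [0. 1. 0. 0. 0.]
print(embedding_table[word_to_id["cat"]])      # a 4-dimensional real vector
```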

More specifically, we want to answer the following question: can grammatical information (e.g. part-of-speech, POS), a basic component of a language [5], be directly used in models to learn word embeddings of high quality? Experiments in cognitive linguistics have shown that English-speaking children take the POS of a new word as a clue to understanding it [6]. In particular, nouns and verbs play an important role in language understanding. For example, the word “love” can be either a noun or a verb. Children learn the noun “love” (“love/NN”) in a different way from the verb “love” (“love/VB”), since the meanings and usages of “love/NN” and “love/VB” differ. Experiments in neuropsychology, based on analyzing brain images taken by functional magnetic resonance imaging, have shown that different parts of the brain are activated when a person learns a noun versus a verb [7], [8]. This phenomenon is known as the dissociation between nouns and verbs (DNV).

In this paper, we design an efficient word embedding learning framework that takes full advantage of the DNV characteristic of language acquisition to improve the quality of word embeddings. To ensure low time complexity, we adopt a three-layer feed-forward neural network like the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram) [9] recently proposed by Mikolov et al. For high-quality word embeddings, we preserve the word order of the local context and decompose the output nodes of the neural network according to DNV. The proposed model for learning word embeddings with DNV is named the Continuous Dissociation between Nouns and Verbs model (CDNV). For evaluation, we use it to learn word embeddings on a corpus of billions of words and compare them with other word embeddings in the following two aspects: (1) qualitative analysis: we check whether the nearest neighbors of representative words derived from the word embeddings are reasonable; (2) quantitative analysis: we compare the improvement gained from different word embeddings on two traditional NLP tasks, i.e. named entity recognition (NER) and Chunking. The word embeddings used for comparison include those learnt on the same corpus by CBOW and Skip-gram, and other public word embeddings learnt on similar corpora by other models. Experimental results show that: (1) CDNV is very efficient (taking about 1.5 h); (2) the nearest neighbors of representative words derived from the word embeddings learnt by CDNV are more reasonable than those derived from other word embeddings; and (3) on both NER and Chunking, the improvement from the word embeddings learnt by CDNV is significantly higher than that from other word embeddings.

Section snippets

Related work

According to the learning strategy, word embedding learning models can be classified into two categories: entropy criterion-based models and pairwise ranking-based models [10]. One of the most popular entropy criterion-based models was proposed by Bengio et al. [11]; it is a four-layer feed-forward neural network consisting of an input layer, a linear projection layer, a non-linear hidden layer and a softmax output layer. The main idea of this model is to predict a word in a sentence using
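As a rough illustration only (the layer sizes, initialization and variable names below are our own assumptions, not the configuration used in [11]), a forward pass through such a four-layer feed-forward language model can be sketched as follows:

```python
import numpy as np

V, d, h, n = 10000, 50, 100, 4        # vocab size, embedding dim, hidden units, context length (hypothetical)
rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(V, d))         # projection layer: one row per vocabulary word
H = rng.normal(scale=0.1, size=(n * d, h))     # projection -> hidden weights
U = rng.normal(scale=0.1, size=(h, V))         # hidden -> softmax output weights

def nnlm_forward(context_ids):
    # Input layer: IDs of the n preceding words.
    x = np.concatenate([C[i] for i in context_ids])   # linear projection (lookup + concatenation)
    a = np.tanh(x @ H)                                 # non-linear hidden layer
    scores = a @ U                                     # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                             # softmax: probability of the next word

probs = nnlm_forward([12, 7, 431, 9])                  # e.g. P(next word | 4-word context)
```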

Word embedding learning using DNV

As CDNV uses a neural network similar to CBOW, we first give a brief review of CBOW and then introduce CDNV in detail. Based on CBOW, two changes are made to construct CDNV, as follows (a brief sketch of both changes is given after this list):

  • (1)

    The word order of the local context is preserved at the input layer by replacing the summing operation of CBOW with a concatenation operation.

  • (2)

    DNV is used as a guide to construct a binary tree to decompose nodes at the output layer. According to DNV, a word may correspond to at most three groups of output nodes and
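The sketch below is ours; the helper names, window size and toy code table are hypothetical, and the full binary-tree output layer is not reproduced. It illustrates the two changes: the projection step concatenates the context embeddings in order instead of summing them, and each surface word receives a separate output code per DNV group (NN, VB or OT) in which it occurs.

```python
import numpy as np

d, window = 50, 2                      # embedding dimension and context window size (hypothetical)
rng = np.random.default_rng(0)
embedding = {}                         # word -> d-dimensional vector

def vec(word):
    if word not in embedding:
        embedding[word] = rng.normal(scale=0.1, size=d)
    return embedding[word]

def project(context_words):
    # CBOW sums the context vectors and loses word order;
    # CDNV concatenates them, so the order of the local context is preserved.
    return np.concatenate([vec(w) for w in context_words])   # length 2*window*d

def dnv_group(pos_tag):
    # Collapse auto-tagged POS labels into the three DNV groups.
    if pos_tag.startswith("NN"):
        return "NN"
    if pos_tag.startswith("VB"):
        return "VB"
    return "OT"

# Output layer: the same surface form gets one code per group it occurs in,
# so "love" may have up to three distinct output codes (here only two).
output_code = {("love", "NN"): 0, ("love", "VB"): 1}

h = project(["i", "really", "you", "so"])        # ordered context around the target word
target = ("love", dnv_group("VBP"))              # -> ("love", "VB"), i.e. the verb code
```

In the actual model, these grouped codes label the decomposed output nodes organized in a binary tree, so that the output probability is computed hierarchically rather than over the whole vocabulary.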

Experiments

Wikipedia (August 2013 snapshot) and Reuters RCV1 [22] are used to train our word embeddings. We first normalize these two corpora by removing short sentences (length less than five) and abnormal sentences (where the proportion of lowercase characters a–z is less than 90%), converting all uppercase letters to lowercase, mapping all digits to “D” and mapping low-frequency words (frequency less than 30) to “UNKNOWN”. We then combine them into a larger corpus of about 1200 million words, in which the size of the vocabulary is
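A minimal sketch of these normalization steps (the thresholds are taken from the text above; the helper names and the way the lowercase-ratio check is approximated are our own assumptions):

```python
import re
from collections import Counter

MIN_LEN, MIN_LOWER_RATIO, MIN_FREQ = 5, 0.90, 30

def keep_sentence(tokens):
    # Drop short sentences and "abnormal" sentences whose proportion
    # of lowercase characters a-z is below 90%.
    if len(tokens) < MIN_LEN:
        return False
    chars = "".join(tokens)
    lower = sum("a" <= c <= "z" for c in chars)
    return lower / max(len(chars), 1) >= MIN_LOWER_RATIO

def normalize_token(tok):
    # Lowercase everything and map every digit to the symbol "D".
    return re.sub(r"[0-9]", "D", tok.lower())

def preprocess(sentences):
    # "sentences" is a list of token lists from the raw corpora.
    kept = [[normalize_token(t) for t in s] for s in sentences if keep_sentence(s)]
    counts = Counter(t for s in kept for t in s)
    # Replace low-frequency words (frequency < 30) with "UNKNOWN".
    return [[t if counts[t] >= MIN_FREQ else "UNKNOWN" for t in s] for s in kept]
```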

Discussion

Two further questions remain: (1) can any dissociation of POSs improve word embeddings? (2) can DNV improve word embeddings when the word order of the local context is ignored?

As there are too many possible dissociations of POSs, we cannot investigate all of them. In our study, we only consider the dissociation of all POSs shown in Fig. 4, where all words with their POSs are freely mixed together to generate the codes of words. Table 10 presents the results of CBOW and

Conclusion

This paper presents a novel model to improve word embeddings. We introduce POS information into the neural network language model and dissociate nouns and verbs in the process of learning word embeddings. Guided by the DNV principle, the existing POS classes are divided into three groups (i.e., NN, VB and OT). Each word is then encoded by up to three codes, thereby maintaining a good balance between accuracy and efficiency. By replacing the summing up operation at the projection layer of CBOW

Acknowledgments

This work is supported in part by grants from the National Natural Science Foundation of China (NSFC) (61173075, 61473101 and 61272383) and the Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20140508161040764 and JCYJ20140417172417105).

References (36)

  • A. Mestres-Missé et al.

    Neural differences in the mapping of verb and noun concepts onto novel words

    NeuroImage

    (2010)
  • G. Denes et al.

    A precursor of cognitive neuropsychology? The first reported case of noun–verb dissociation following brain damage

    Brain Lang.

    (1998)
  • R.E. Bellman

    Dynamic Programming

    (2003)
  • R. Socher, E.H. Huang, A. Ng, Dynamic pooling and unfolding recursive autoencoders for paraphrase detection, in:...
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C.D. Manning, A.Y. Ng, C. Potts, Recursive deep models for semantic...
  • J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul, Fast and robust neural network joint models for...
  • C.C. Fries

    The Structure of English: An Introduction to the Construction of English Sentences

    (1952)
  • R.W. Brown, Linguistic determinism and the part of speech, J. Abnorm. Soc. Psychol. 55 (1973)...
  • C. Davide, I. Chiara, V. Ruggero, C. Antonella, S. Carlo, L. Claudio, On nouns, verbs, lexemes and lemmas: Evidence...
  • T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings...
  • R. Collobert et al.

    Natural language processing (almost) from scratch

    J. Mach. Learn. Res.

    (2011)
  • Y. Bengio et al.

    A neural probabilistic language model

    J. Mach. Learn. Res.

    (2003)
  • F. Morin, Y. Bengio, Hierarchical probabilistic neural network language model, in: Proceedings of the 10th...
  • G.A. Miller

    WordNet: a lexical database for English

    Commun. ACM

    (1995)
  • A. Mnih, G.E. Hinton, A scalable hierarchical distributed language model, in: Advances in Neural Information Processing...
  • E.H. Huang, R. Socher, C.D. Manning, A.Y. Ng, Improving word representations via global context and multiple word...
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, E. Ruppin, Placing search in context: the concept...
  • S. Chen et al.

    The dissociation between nouns and verbs in Broca's and Wernicke's aphasia: findings from Chinese

    Aphasiology

    (1998)
Cited by (28)

    • Sentiment aware word embeddings using refinement and senti-contextualized learning approach

      2020, Neurocomputing
      Citation Excerpt :

      In another approach, the POS tag information of words was considered in the word embedding models. Hu et al. [20] proposed an embedding learning algorithm similar to CBOW that integrates the noun/verb information of words into the learning process. More recently, the neural machine translation (NMT) encoder and bidirectional language model have been used to learn deep contextualized representations for words [21–23].

    • Automatic detection and interpretation of nominal metaphor based on the theory of meaning

      2017, Neurocomputing
      Citation Excerpt :

      To capture more context, we use word embedding to obtain vector representations of the concepts. The word embedding approach has been applied to many tasks [39,40]; we apply it to metaphor tasks in this paper. Important semantic information of concepts is implied in word representations, such as the relations and properties of the concepts.


    Baotian Hu received the M.S. degree in computer science from Harbin Institute of Technology Shenzhen Graduate School, China, in 2012. He is currently pursuing the Ph.D. degree in computer science and technology at Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China. His current research interests include deep learning and natural language processing.

    Buzhou Tang received the Ph.D. degree in computer science from Harbin Institute of Technology Shenzhen Graduate School, China, in 2011. From November 2011 to July 2013, he worked for Vanderbilt University and UTHealth as a postdoctoral research fellow. Since April 2015, he has been an assistant research fellow at Harbin Institute of Technology Shenzhen Graduate School. His research interests cover machine learning, natural language processing and medical informatics.

    Qingcai Chen received the Ph.D. degree in computer science from the Computer Science and Engineering Department, Harbin Institute of Technology. From September 2003 to August 2004, he worked for Intel (China) Ltd. as a senior software engineer. He is now a professor in the Computer Science and Technology Department of Harbin Institute of Technology Shenzhen Graduate School. His research interests include machine learning, pattern recognition, speech signal processing, and natural language processing.

    Longbiao Kang received the M.S. degree in computer science from Harbin Institute of Technology Shenzhen Graduate School, China, in 2014. He is currently a research and development engineer at Baidu, Inc. His current research interests include recommendation systems for web search and algorithmic mechanism design for ranking.
