Elsevier

Neurocomputing

Volume 171, 1 January 2016, Pages 1108-1117

A novel word embedding learning model using the dissociation between nouns and verbs

https://doi.org/10.1016/j.neucom.2015.07.046

Abstract

In recent years, there has been research on using semantic knowledge and global statistical features to guide the learning of word embeddings. Although syntactic knowledge also plays a very important role in natural language understanding, its effectiveness for word embedding learning is still far from well investigated. Inspired by the principle of the dissociation between nouns and verbs (DNV) in language acquisition observed in neuropsychology, we propose a novel model for word embedding learning using DNV, named the Continuous Dissociation between Nouns and Verbs model (CDNV). CDNV uses a three-layer feed-forward neural network to integrate DNV, derived from auto-tagged noun/verb information, into the word embedding learning process, while still preserving the word order of the local context. The advantage of CDNV lies in its ability to learn high-quality word embeddings with relatively low time complexity. Experimental results show that: (1) CDNV takes about 1.5 h to learn word embeddings on a corpus of billions of words, which is comparable with CBOW and Skip-gram and more efficient than other models; (2) the nearest neighbors of representative words derived from the word embeddings learnt by CDNV are more reasonable than those derived from other word embeddings; (3) the F1 improvement obtained from CDNV word embeddings on NER and Chunking is greater than that from other word embeddings.

Introduction

In most natural language processing (NLP) models, words are first mapped to symbolic IDs and then transformed into discrete 0/1 binary vectors using a one-hot representation. The feature vectors therefore have the same length as the size of the vocabulary. This representation often suffers from the curse of dimensionality [1] and data sparsity. To overcome these problems, researchers have proposed the distributed representation of words, which represents words as low-dimensional, continuous and dense vectors, called word embeddings, and has shown stable improvements on a variety of tasks such as paraphrase detection [2], sentiment analysis [3] and machine translation [4]. Word embeddings are usually learnt from a large-scale unlabeled corpus by a multi-layer neural network. Studies on word embedding learning mainly focus on one of the following two aspects: (1) reducing the time complexity of the learning process, and (2) improving the quality of the word embeddings. Two types of approaches have been used to reduce the time complexity, i.e., decomposing the output nodes and simplifying the structure of the neural network. To improve the quality of word embeddings, human knowledge or global statistical information has been introduced into the learning process; unfortunately, most such models have a relatively high time complexity. The goal of this paper is therefore to seek an efficient model for learning word embeddings of high quality.
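To make the contrast concrete, the following minimal sketch (ours, not from the paper; the toy vocabulary, embedding dimension and random initialization are purely illustrative) builds a one-hot vector whose length equals the vocabulary size and a low-dimensional dense embedding for the same word:

```python
import numpy as np

# Toy vocabulary; a real system would use the full corpus vocabulary.
vocab = ["the", "cat", "loves", "milk", "UNKNOWN"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # Sparse 0/1 vector: as long as the vocabulary, with a single 1.
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# Distributed representation: a short, continuous, dense vector per word.
# Here it is randomly initialized; in practice it is learnt from a large
# unlabeled corpus by a neural network.
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_table = rng.normal(scale=0.1, size=(len(vocab), embedding_dim))

print(one_hot("cat"))                          # [0. 1. 0. 0. 0.]
print(embedding_table[word_to_id["cat"]])      # a 4-dimensional real vector
```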

More specifically, we want to answer the following question: can grammatical information (e.g. part-of-speech, POS), a basic component of a language [5], be directly used in models to learn word embeddings of high quality? Experiments in cognitive linguistics have shown that English-speaking children take the POS of a new word as a clue to understanding it [6]. In particular, nouns and verbs play an important role in language understanding. For example, the word “love” can be either a noun or a verb. Children learn the noun “love” (“love/NN”) in a different way from the verb “love” (“love/VB”), since the meanings and usages of “love/NN” and “love/VB” differ. Experiments in neuropsychology, based on analyzing brain images taken by functional magnetic resonance imaging, have shown that different parts of the brain are activated when a person learns a noun versus a verb [7], [8]. This phenomenon is known as the dissociation between nouns and verbs (DNV).

In this paper, we design an efficient word embedding learning framework that takes full advantage of the DNV characteristic of language acquisition to improve the quality of word embeddings. To ensure low time complexity, we adopt a three-layer feed-forward neural network like the continuous bag-of-words model (CBOW) and the continuous skip-gram model (Skip-gram) [9] recently proposed by Mikolov et al. For high-quality word embeddings, we preserve the word order of the local context and decompose the output nodes of the neural network according to DNV. The proposed model for learning word embeddings with DNV is named the Continuous Dissociation between Nouns and Verbs model (CDNV). For evaluation, we use it to learn word embeddings on a corpus of billions of words and compare them with other word embeddings in the following two aspects: (1) qualitative analysis: we check whether the nearest neighbors of representative words derived from the word embeddings are reasonable; (2) quantitative analysis: we compare the improvement gained from different word embeddings on two traditional NLP tasks, i.e. named entity recognition (NER) and Chunking. The word embeddings used for comparison include those learnt on the same corpus by CBOW and Skip-gram, and other public word embeddings learnt on similar corpora by other models. Experimental results show that: (1) CDNV is very efficient (taking about 1.5 h); (2) the nearest neighbors of representative words derived from the word embeddings learnt by CDNV are more reasonable than those derived from other word embeddings; and (3) on both NER and Chunking, the improvement from the word embeddings learnt by CDNV is significantly higher than that from other word embeddings.

Section snippets

Related work

According to the learning strategy, word embedding learning models can be classified into two categories: entropy criterion-based models and pairwise ranking-based models [10]. One of the most popular entropy criterion-based models was proposed by Bengio et al. [11]; it is a four-layer feed-forward neural network consisting of an input layer, a linear projection layer, a non-linear hidden layer and a softmax output layer. The main idea of this model is to predict a word in a sentence using
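As a rough illustration only (the layer sizes, initialization and variable names below are our own assumptions, not the configuration used in [11]), a forward pass through such a four-layer feed-forward language model can be sketched as follows:

```python
import numpy as np

V, d, h, n = 10000, 50, 100, 4        # vocab size, embedding dim, hidden units, context length (hypothetical)
rng = np.random.default_rng(0)
C = rng.normal(scale=0.1, size=(V, d))         # projection layer: one row per vocabulary word
H = rng.normal(scale=0.1, size=(n * d, h))     # projection -> hidden weights
U = rng.normal(scale=0.1, size=(h, V))         # hidden -> softmax output weights

def nnlm_forward(context_ids):
    # Input layer: IDs of the n preceding words.
    x = np.concatenate([C[i] for i in context_ids])   # linear projection (lookup + concatenation)
    a = np.tanh(x @ H)                                 # non-linear hidden layer
    scores = a @ U                                     # one score per vocabulary word
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()                             # softmax: probability of the next word

probs = nnlm_forward([12, 7, 431, 9])                  # e.g. P(next word | 4-word context)
```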

Word embedding learning using DNV

As CDNV uses a neural network similar to CBOW, we first give a brief review of CBOW and then introduce CDNV in detail. Based on CBOW, two changes are made to construct CDNV, as follows (a brief sketch of both changes is given after this list):

  • (1)

    The word order of the local context is preserved at the input layer by replacing the summing operation of CBOW with a concatenation operation.

  • (2)

    DNV is used as a guide to construct a binary tree to decompose nodes at the output layer. According to DNV, a word may correspond to at most three groups of output nodes and
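The sketch below is ours; the helper names, window size and toy code table are hypothetical, and the full binary-tree output layer is not reproduced. It illustrates the two changes: the projection step concatenates the context embeddings in order instead of summing them, and each surface word receives a separate output code per DNV group (NN, VB or OT) in which it occurs.

```python
import numpy as np

d, window = 50, 2                      # embedding dimension and context window size (hypothetical)
rng = np.random.default_rng(0)
embedding = {}                         # word -> d-dimensional vector

def vec(word):
    if word not in embedding:
        embedding[word] = rng.normal(scale=0.1, size=d)
    return embedding[word]

def project(context_words):
    # CBOW sums the context vectors and loses word order;
    # CDNV concatenates them, so the order of the local context is preserved.
    return np.concatenate([vec(w) for w in context_words])   # length 2*window*d

def dnv_group(pos_tag):
    # Collapse auto-tagged POS labels into the three DNV groups.
    if pos_tag.startswith("NN"):
        return "NN"
    if pos_tag.startswith("VB"):
        return "VB"
    return "OT"

# Output layer: the same surface form gets one code per group it occurs in,
# so "love" may have up to three distinct output codes (here only two).
output_code = {("love", "NN"): 0, ("love", "VB"): 1}

h = project(["i", "really", "you", "so"])        # ordered context around the target word
target = ("love", dnv_group("VBP"))              # -> ("love", "VB"), i.e. the verb code
```

In the actual model, these grouped codes label the decomposed output nodes organized in a binary tree, so that the output probability is computed hierarchically rather than over the whole vocabulary.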

Experiments

Wikipedia (August 2013 snapshot) and Reuters RCV1 [22] are used to train our word embeddings. We first normalize these two corpora by removing short sentences (length less than five) and abnormal sentences (where the proportion of lowercase characters a–z is less than 90%), converting all uppercase letters to lowercase, mapping all digits to “D” and mapping low-frequency words (frequency less than 30) to “UNKNOWN”. We then combine them into a larger corpus of about 1200 million words, in which the size of the vocabulary is
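A minimal sketch of these normalization steps (the thresholds are taken from the text above; the helper names and the way the lowercase-ratio check is approximated are our own assumptions):

```python
import re
from collections import Counter

MIN_LEN, MIN_LOWER_RATIO, MIN_FREQ = 5, 0.90, 30

def keep_sentence(tokens):
    # Drop short sentences and "abnormal" sentences whose proportion
    # of lowercase characters a-z is below 90%.
    if len(tokens) < MIN_LEN:
        return False
    chars = "".join(tokens)
    lower = sum("a" <= c <= "z" for c in chars)
    return lower / max(len(chars), 1) >= MIN_LOWER_RATIO

def normalize_token(tok):
    # Lowercase everything and map every digit to the symbol "D".
    return re.sub(r"[0-9]", "D", tok.lower())

def preprocess(sentences):
    # "sentences" is a list of token lists from the raw corpora.
    kept = [[normalize_token(t) for t in s] for s in sentences if keep_sentence(s)]
    counts = Counter(t for s in kept for t in s)
    # Replace low-frequency words (frequency < 30) with "UNKNOWN".
    return [[t if counts[t] >= MIN_FREQ else "UNKNOWN" for t in s] for s in kept]
```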

Discussion

Two further questions remain: (1) can any dissociation of POSs improve word embeddings? (2) can DNV improve word embeddings when the word order of the local context is ignored?

As there are too many possible dissociations of POSs, we cannot investigate all of them. In our study, we only consider the dissociation of all POSs shown in Fig. 4, where all words with their POSs are freely mixed together to generate the codes of words. Table 10 presents the results of CBOW and

Conclusion

This paper presents a novel model to improve word embeddings. We introduce POS information into the neural network language model and dissociate nouns and verbs in the process of learning word embeddings. Guided by the DNV principle, the existing POS classes are divided into three groups (i.e., NN, VB and OT). Each word is then encoded by up to three codes, thereby maintaining a good balance between accuracy and efficiency. By replacing the summing up operation at the projection layer of CBOW

Acknowledgments

This work is supported in part by grants from the National Natural Science Foundation of China (NSFC) (61173075, 61473101 and 61272383) and the Strategic Emerging Industry Development Special Funds of Shenzhen (JCYJ20140508161040764 and JCYJ20140417172417105).

References (36)

  • A. Mestres-Missé et al.

    Neural differences in the mapping of verb and noun concepts onto novel words

    NeuroImage

    (2010)
  • G. Denes et al.

    A precursor of cognitive neuropsychology? The first reported case of noun–verb dissociation following brain damage

    Brain Lang.

    (1998)
  • R.E. Bellman

    Dynamic Programming

    (2003)
  • R. Socher, E.H. Huang, A. Ng, Dynamic pooling and unfolding recursive autoencoders for paraphrase detection, in:...
  • R. Socher, A. Perelygin, J. Wu, J. Chuang, C.D. Manning, A.Y. Ng, C. Potts, Recursive deep models for semantic...
  • J. Devlin, R. Zbib, Z. Huang, T. Lamar, R. Schwartz, J. Makhoul, Fast and robust neural network joint models for...
  • C.C. Fries

    The Structure of English: An Introduction to the Construction of English Sentences

    (1952)
  • R.W. Brown, Linguistic determinism and the part of speech, J. Abnorm. Soc. Psychol. 55 (1973)...
  • C. Davide, I. Chiara, V. Ruggero, C. Antonella, S. Carlo, L. Claudio, On nouns, verbs, lexemes and lemmas: Evidence...
  • T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: Proceedings...
  • R. Collobert et al.

    Natural language processing (almost) from scratch

    J. Mach. Learn. Res.

    (2011)
  • Y. Bengio et al.

    A neural probabilistic language model

    J. Mach. Learn. Res.

    (2003)
  • F. Morin, Y. Bengio, Hierarchical probabilistic neural network language model, in: Proceedings of the 10th...
  • G.A. Miller

    WordNet: a lexical database for English

    Commun. ACM

    (1995)
  • A. Mnih, G.E. Hinton, A scalable hierarchical distributed language model, in: Advances in Neural Information Processing...
  • E.H. Huang, R. Socher, C.D. Manning, A.Y. Ng, Improving word representations via global context and multiple word...
  • L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, E. Ruppin, Placing search in context: the concept...
  • S. Chen et al.

    The dissociation between nouns and verbs in Broca's and Wernicke's aphasia: findings from Chinese

    Aphasiology

    (1998)
Cited by (28)

    • Sentiment aware word embeddings using refinement and senti-contextualized learning approach

      2020, Neurocomputing
      Citation Excerpt :

      In another approach, the POS tag information of words was considered in the word embedding models. Hu et al. [20] proposed an embedding learning algorithm similar to CBOW that integrates the noun/verb information of words into the learning process. More recently, the neural machine translation (NMT) encoder and bidirectional language model have been used to learn deep contextualized representations for words [21–23].

    • Automatic detection and interpretation of nominal metaphor based on the theory of meaning

      2017, Neurocomputing
      Citation Excerpt :

      To capture more context, we use word embedding to obtain vector representations of the concepts. The word embedding approach has been applied to many tasks [39,40]; we apply it to metaphor tasks in this paper. Important semantic information of concepts is implied in word representations, such as the relations and properties of the concepts.


    Baotian Hu received the M.S. degree in computer science from Harbin Institute of Technology Shenzhen Graduate School, China, in 2012. He is currently pursuing the Ph.D. degree in computer science and technology at Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China. His current research interests include deep learning and natural language processing.

    Buzhou Tang received the Ph.D. degree in computer science from Harbin Institute of Technology Shenzhen Graduate School, China, in 2011. From November 2011 to July 2013, he worked for Vanderbilt University and UTHealth as a postdoctoral research fellow. Since April 2015, he has been an assistant research fellow at Harbin Institute of Technology Shenzhen Graduate School. His research interests cover machine learning, natural language processing and medical informatics.

    Qingcai Chen received the Ph.D. degree in computer science from the Computer Science and Engineering Department, Harbin Institute of Technology. From September 2003 to August 2004, he worked for Intel (China) Ltd. as a senior software engineer. He is now a professor in the Computer Science and Technology Department of Harbin Institute of Technology Shenzhen Graduate School. His research interests include machine learning, pattern recognition, speech signal processing, and natural language processing.

    Longbiao Kang received the M.S. degree in computer science from Harbin Institute of Technology Shenzhen Graduate School, China, in 2014. He is currently a research and development engineer at Baidu, Inc. His current research interests include recommendation systems for web search and algorithmic mechanism design for ranking.
