Improving effectiveness of mutual information for substantival multiword expression extraction

https://doi.org/10.1016/j.eswa.2009.02.026

Abstract

One deficiency of mutual information is its poor capacity to measure the association of words that co-occur unsymmetrically, a pattern that is common among multiword expressions in texts. Moreover, threshold setting, which is decisive for the success of any practical implementation of mutual information for multiword extraction, requires many parameters to be predefined manually when extracting multiword expressions of different lengths. In this paper, we propose a new method, EMICO (Enhanced Mutual Information and Collocation Optimization), to extract substantival multiword expressions from text. Specifically, enhanced mutual information is proposed to measure the association of words, and collocation optimization is proposed to automatically determine the number of individual words contained in a multiword expression when the expression occurs in a candidate set. Our experiments showed that EMICO significantly improves the performance of substantival multiword expression extraction in comparison with a classic extraction method based on mutual information.

Introduction

A word is characterized by the company it keeps (Firth, 1957), and the closer a set of terms, the more likely they are to indicate relevance (Hawking & Thistlewaite, 1996). That is, not only an individual word but also its contextual information is useful for further information processing. This simple and direct idea motivates research on multiword expressions (MWEs), which aims to capture semantic concepts expressed by multiple words in text. In the current state of the art, there is no satisfactory formal definition of MWE, only some general grammatical, syntactic, or lexical characteristics that describe it. In this paper, the substantival multiword expressions we refer to comprise only terminologies and named entities. Although this is the simplest and most frequently used kind of MWE in text, we likewise cannot give it a precise definition; instead, we characterize it by the following explicit properties.

  • It is used as a noun phrase in text to describe a concrete concept in context, such as “federal reserve board” and “Shearson Lehman Brothers Inc.”.

  • From a grammatical-parsing view, it can be parsed as an entity in a sentence, and it usually has a more stable syntactic pattern than other MWEs in text.

  • In lexical composition, it is usually a block of contiguous words in a sentence, i.e., no other word is inserted into a substantival multiword expression. This is not the case for most prepositional and conjunctional collocations such as “too…to…” and “so…that…”.

  • Like terminology, a substantival multiword expression has a length of 2–6 individual words.

The motivation for our research on MWE extraction is that we intend to use MWEs for text mining and to examine their performance in comparison with traditional indexing based on individual words combined with the vector space model (Zhang et al., 2008a, Salton et al., 1975). We conjecture that, for text representation, MWEs may be superior to individual words in both statistical and semantic quality. With this intention, we started our research on MWE extraction (Zhang et al., 2009, Zhang et al., 2008b). The focus of this paper, in particular, is on using statistical methods to extract MWEs from text. We follow the convention in this area: first propose an association measure to score candidates, and then propose a method to differentiate the substantival MWEs from all candidates automatically.

Among statistical methods for MWE extraction, the most frequently used association measure is mutual information (MI). Although there are other measures, such as the z-score and mutual expectation, their basic idea is very similar to that of MI: the joint probability is divided by the product of the marginal probabilities, under the same assumption about the two words in a pair, namely that the two words occur about equally often, i.e., their occurrence probabilities in the text are almost equal. Hence, these methods can be regarded as variants of mutual information. However, we will show later that MI is not an appropriate association measure when the two words co-occur unsymmetrically. Moreover, how to select candidates after measuring association is another problem. Usually, a predefined parameter is set to retain a proportion of candidates with the top association scores as the final extracted MWEs. Although LocalMaxs (Silva, Dias, Guilloré, & Lopes, 1999) was proposed to determine the number of individual words in a MWE automatically, it is not appropriate for extracting substantival MWEs, because a substantival MWE often has a fixed composition and the word sequence at the association maximum is sometimes not an exact substantival MWE.

In this paper, EMICO is proposed to extract multiword expressions from documents. Specifically, we propose enhanced mutual information (EMI) to cope with the problem of unsymmetrical co-occurrence, and we develop collocation optimization (CO) to automatically determine the number of individual words contained in a substantival MWE.

The key idea of EMI is to measure a word pair’s dependency as the ratio of its probability of being a multiword to its probability of not being one. By revising each individual word’s occurrence count to its total occurrences minus the pair’s co-occurrences, EMI jointly considers a word’s occurrences and the proportion of them contributed to its co-occurrence with the other word, so that the association score varies sharply with that proportion. In addition, by separating the association contributed by each word in a sequence of more than two words, rather than forcing MWE candidates into two components as in the practical implementation of mutual information, EMI reduces the negative effect of rare occurrences to some extent.
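The idea of revising each word’s count by subtracting the pair’s co-occurrences can be sketched as follows. This is an illustrative reading of the description above, not the paper’s exact EMI formula; the multiplicative denominator and the smoothing constant `eps` are assumptions made for the sketch.

```python
def emi_sketch(f_xy, f_x, f_y, n, eps=1e-9):
    """Sketch of the EMI idea: odds that the pair forms a multiword.
    Each marginal count is first reduced by the co-occurrence count,
    so a word that rarely occurs outside the pair raises the score.
    Illustrative only -- NOT the paper's exact definition."""
    p_xy = f_xy / n                  # pair observed as a unit
    p_x_alone = (f_x - f_xy) / n     # x occurring without y
    p_y_alone = (f_y - f_xy) / n     # y occurring without x
    return p_xy / (p_x_alone * p_y_alone + eps)
```

Unlike plain MI, which depends only on the product of the raw marginals, this score changes when the same co-occurrence count is split unevenly between the two words.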

Collocation optimization (CO) is proposed to automatically determine the exact number of individual words in a substantival MWE. We use the traditional N-gram method to produce word sequences sharing a head noun and pack these sequences into a candidate set (clarified in Section 5.1). In each candidate set, we retain only one candidate as a substantival MWE, because we conjecture that there is at most one correct substantival MWE in which the head noun composes a most appropriate expression with other words. The key idea of CO is similar to that of LocalMaxs: when an individual word is added to a MWE candidate (the old MWE), the cohesiveness of the new MWE increases if the word is genuinely part of the expression; otherwise, the association score of the new MWE declines compared with that of the old one.
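The grow-until-decline rule described above can be sketched as follows. The ordering of candidates and the stopping condition are illustrative assumptions for exposition; the paper’s procedure may differ in detail.

```python
def select_mwe(candidates, score):
    """Sketch of collocation optimization over one candidate set.
    `candidates` are word tuples sharing a head noun, ordered from
    shortest to longest, e.g. ("G","H"), ("F","G","H"), ...
    Keep extending while adding a word raises the association score;
    return the last candidate before the first decline."""
    best = candidates[0]
    for cand in candidates[1:]:
        if score(cand) > score(best):
            best = cand          # the added word strengthened the MWE
        else:
            break                # score declined: previous candidate wins
    return best
```

For example, with a hypothetical scoring function where “federal reserve board” outscores both “reserve board” and “the federal reserve board”, the 3-gram is retained as the set’s single substantival MWE.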

The remainder of this paper is organized as follows. Section 2 provides a literature review of MWE extraction. Section 3 introduces mutual information, particularly its practical application to MWE extraction. Section 4 proposes EMI, giving its definition, a theoretical analysis, and a numerical simulation in comparison with mutual information. Section 5 proposes CO; its mechanism is specified together with a comparison with LocalMaxs. Section 6 specifies the details of EMICO for substantival MWE extraction, together with a practical performance evaluation on a real corpus. Section 7 concludes the paper and indicates our future research.

Section snippets

Literature review

Generally speaking, four types of methods have been developed for MWE extraction: statistical, linguistic, hybrid, and machine-learning methods. These are introduced as follows.

Mutual information

Mutual information (MI) is defined as the reduction in uncertainty of one random variable due to knowing about another, or in other words, the amount of information one random variable contains about another. In multi-word detection, MI can be defined as the amount of information provided by the occurrence of the word represented by Y about the occurrence of the word represented by X. Church and Hanks (1990) proposed the association ratio for measuring word association based on the information
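The association ratio of Church and Hanks (1990) reduces, for a word pair, to pointwise mutual information, PMI(x, y) = log2(P(x, y) / (P(x)P(y))). A minimal sketch over adjacent word pairs (the tokenization and counting scheme are simplifying assumptions):

```python
import math
from collections import Counter

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information from raw counts in a corpus of
    n tokens: log2( P(x, y) / (P(x) * P(y)) )."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

def bigram_pmi(tokens):
    """Score every adjacent word pair in a token list by PMI."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    return {bg: pmi(c, unigrams[bg[0]], unigrams[bg[1]], n)
            for bg, c in bigrams.items()}
```

A high PMI indicates that the pair co-occurs far more often than the two words’ independent frequencies would predict.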

Motivation

As shown in Section 3, unsymmetrical co-occurrence arises from the unequal proportions of the two words’ occurrences that contribute to their common co-occurrence in a word pair. We make this clear with the example shown in Fig. 1. Given the two cases of word co-occurrence, which one should be regarded as having the greater word association?

Obviously, the left co-occurrence is more balanced than the right one because both X1 and X2
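A small numeric illustration of why MI cannot separate a balanced case from an unsymmetrical one (the counts below are invented for exposition, not taken from Fig. 1):

```python
import math

def pmi(f_xy, f_x, f_y, n):
    # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

N = 10_000  # hypothetical corpus size
# Balanced: each word contributes half of its occurrences to the pair.
balanced = pmi(10, 20, 20, N)
# Unsymmetrical: x occurs only inside the pair; y mostly occurs alone.
skewed = pmi(10, 10, 40, N)
# PMI depends only on the product of the marginals (20*20 == 10*40),
# so the two very different cases receive the same score.
```

This insensitivity to how the marginals are split between the two words is exactly the deficiency that EMI is designed to address.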

Related concepts

Before proceeding, some concepts related to substantival MWEs should be clarified. As pointed out in Section 1, the topic of this paper is to extract terminologies and named entities using statistical methods; usually, these are noun phrases. For this reason, substantival MWE candidates are produced by the traditional N-gram method. For instance, if we have a sentence after morphological analysis, “A B C D E F G H.”, and H is a noun, then the candidates will be generated as “G H”, “F
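The candidate generation just described can be sketched as follows, combining the N-gram scheme with the 2–6 word length bound from Section 1 (the function name and interface are illustrative):

```python
def candidate_set(tokens, head_index, max_len=6):
    """Generate MWE candidates ending at a head noun (N-gram style).
    For "A B C D E F G H" with head noun H, yields ("G","H"),
    ("F","G","H"), ... up to max_len words, forming one candidate set."""
    cands = []
    for n in range(2, max_len + 1):
        start = head_index - n + 1
        if start < 0:          # sentence too short for a longer candidate
            break
        cands.append(tuple(tokens[start:head_index + 1]))
    return cands
```

Each head noun thus yields one candidate set, within which collocation optimization later retains at most one substantival MWE.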

Corpora and standard set

Based on our previous work on text mining (Zhang et al., 2006, Zhang et al., 2007), 184 documents were downloaded from the Xiangshan Science Conference Website (http://www.xssc.ac.cn) and used as the Chinese text collection for multiword extraction. The topics of these documents mainly focus on basic research in academic fields such as nano science and life science, so there are plenty of noun multiwords (terms, noun phrases, etc.) in these documents. These documents contain a total of 16,281

Concluding remarks and future work

In this paper, a new approach, EMICO (enhanced mutual information and collocation optimization), is proposed for substantival MWE extraction from texts. Specifically, EMI is proposed to measure the association of a word pair, and collocation optimization is proposed to determine the optimal length of a MWE. With EMI, the association of a word pair is measured by the ratio of the probability of the individual words being a MWE to the probability of them not being a MWE. The benefits of EMI include the

Acknowledgement

This work is supported by the Ministry of Education, Culture, Sports, Science and Technology of Japan under the “Kanazawa Region, Ishikawa High-Tech Sensing Cluster of Knowledge-Based Cluster Creation Project”, and partially supported by the National Natural Science Foundation of China under Grant Nos. 70571078 and 70221001.

References (27)

  • Smith, T. F., et al. (1981). Identification of common molecular subsequences. Journal of Molecular Biology.
  • Zhang, W., et al. (2008). Text classification based on multi-word with support vector machine. Knowledge-Based Systems.
  • Argamon, S., Dagan, I., & Krymolowski, Y. (1998). A memory-based approach to learning shallow natural language...
  • Bourigault, D. (1992). Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings...
  • Chen, K. H., & Chen, H. H. (1994). Extracting noun phrases from large-scale texts: A hybrid approach and its automatic...
  • Chen, B. X., et al. (1992). Preparatory work on automatic extraction of collocations from corpora. Computational Linguistics and Chinese Language Processing.
  • Church, K. W., & William, A. G. (1991). Concordances for parallel text. In Proceedings of the seventh annual conference...
  • Church, K. W., et al. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics.
  • Dias, G. (2003). Multiword unit hybrid extraction. In Proceedings of the workshop on multiword expressions of the 41st...
  • Duan, J. Y., Lu, R. Z., Wu, W. L., Hu, Y., & Tian, Y. (2005). A bio-inspired approach for multi-word expression...
  • Firth, J. R. (1957). A synopsis of linguistic theory 1930–1955. Studies in linguistic analysis. Philological Society....
  • Hawking, D., et al. (1996). Proximity operators – so near and yet so far. Proceedings of TREC-4.
  • Jelinek, F. (1990). Self-organized language modeling for speech recognition.