Using Wikipedia concepts and frequency in language to extract key terms from support documents
Highlights
► Able to deal with the specific characteristics of support documents.
► Hybrid system based on frequency-based and thesaurus-based approaches.
► Frequency in language of terms as a criterion to detect support-domain-dependent key terms.
► Dictionary of concepts drawn from Wikipedia designed to detect multidomain key terms.
Introduction
Nowadays, our hectic lifestyle forces us into continuous and fast learning, not only in the working environment but also in our free time. Fortunately, there are brief documents that help non-expert users learn the main concepts of any topic. These documents, known as “support documents” (SD), include the so-called how-to guides, tutorials, FAQ lists and walkthroughs. Given the large quantity of documents published on the Internet, it would be desirable to have proper tools or techniques to support classification, search, and other maintenance activities. In this sense, an Automatic Keyword Extraction (AKE) method that helps users quickly skim over documents would unarguably be a valuable contribution. Furthermore, it could also be applied to text summarization, text clustering, and text classification (Wenchao, Lianchen, & Ting, 2009).
In classical AKE studies, the documents of the collection usually belong to general domains (containing world knowledge from public sources covering high-diffusion topics). In addition, each document presents extensive and self-describing information about its main topic. This is not the case for SD. Their topics frequently belong to specific domains (containing knowledge about specific issues, usually of limited or private diffusion). Moreover, a SD does not contain every detail of its topic, but only specific information related to significant aspects of the topic.
From a deeper analysis, we found two classes of important terms1: (i) “Multidomain key terms” (MKT), meaningful in a multidomain context (e.g. ‘virtual assistant’ in Example 1 in Table 1); and (ii) “Specific domain key terms” (SKT), proper nouns or technical terms closely related to specific aspects of the document topic (e.g. ‘Interactive Dialog’ in Example 1 in Table 1).
The above-mentioned characteristics motivated us to design a hybrid automatic system to extract key terms from SD. In this regard, a frequency-based approach seems inappropriate, because terms that appear rarely could actually be relevant to the document. Similarly, since there is no significant co-occurrence distribution between terms in SD, a word-association-based approach would also be inappropriate. For these reasons, we decided to combine two strategies. (a) A thesaurus-based approach is employed to detect MKT. Following the current trend in the literature, we develop a controlled dictionary of ‘concepts’ drawn from Wikipedia that includes acronyms, translations, and misspellings. (b) In contrast, SKT are hardly ever contained in any controlled dictionary due to their technical nature. However, since they are uncommon terms in their language, we take advantage of frequency in language analysis.2 In addition, candidate key terms are previously filtered out by means of a modular filter. Since our system is focused on FAQ documents, we believe it could be a useful tool for supporting human FAQ administrators in the task of configuring FAQ retrieval systems (Tao, Liu, & Lin, 2011).
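To make the combination concrete, here is a minimal sketch of the hybrid decision rule under stated assumptions: `wiki_concepts` stands for the Wikipedia-derived dictionary of Section 3, `lang_freq` for the frequency-in-language dictionary of Section 4, and the threshold value is illustrative rather than the paper's.

```python
# Minimal sketch of the hybrid criterion, under stated assumptions:
# wiki_concepts is a set of Wikipedia concept surface forms (Section 3),
# lang_freq maps terms to their relative frequency in English (Section 4),
# and skt_threshold is an illustrative value, not the paper's.

def classify_candidates(candidates, wiki_concepts, lang_freq, skt_threshold=1e-6):
    """Split filtered candidates into multidomain (MKT) and
    specific-domain (SKT) key terms."""
    mkt, skt = [], []
    for term in candidates:
        if term.lower() in wiki_concepts:
            # Thesaurus-based criterion: the term matches a known concept.
            mkt.append(term)
        elif lang_freq.get(term.lower(), 0.0) < skt_threshold:
            # Frequency-in-language criterion: terms that are rare in
            # general English are likely technical terms or proper nouns.
            skt.append(term)
    return mkt, skt
```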
The rest of this paper is organized as follows. Section 2 offers an overview of previous work on key term extraction. A brief summary of Wikipedia and its use to obtain a controlled vocabulary of concepts is presented in Section 3. Section 4 describes the frequency in language feature of terms and its applications. Section 5 describes the structure and functions of our system. The method of analysis and the experimental validation of our method are outlined in Section 6. Finally, Section 7 concludes with a discussion of results and future research.
Section snippets
Related works
In this section, we depict the characteristics of the main approaches in Keyword Extraction. Additionally, we analyse how each proposal could be applied to extract the two types of key terms in SD.
The first works in this field were based on Machine Learning algorithms. Supervised learning methods treat this task as a classification problem, using lexical, syntactic or statistical features (or a mixture of them) of the labelled training data to extract keywords (Csomai & Mihalcea, 2008,
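For illustration, such a supervised formulation might look like the following sketch; the features and classifier are generic examples of the lexical, positional, and statistical cues mentioned above, not the design of any cited work.

```python
# Illustrative sketch only: keyword extraction cast as binary classification,
# in the spirit of the supervised approaches cited above. The features and
# classifier here are generic assumptions, not any one paper's design.
from sklearn.linear_model import LogisticRegression

def candidate_features(term, text):
    tokens = text.lower().split()
    occurrences = tokens.count(term.lower())
    tf = occurrences / len(tokens)                      # statistical: relative frequency
    first = (tokens.index(term.lower()) / len(tokens)   # positional: first occurrence
             if occurrences else 1.0)
    return [tf, first, len(term)]                       # lexical: surface length

# Training (hypothetical): X holds feature vectors for labelled candidates,
# y holds 1 for gold keywords and 0 otherwise.
# clf = LogisticRegression().fit(X, y)
# keywords = [t for t in candidates
#             if clf.predict([candidate_features(t, text)])[0] == 1]
```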
Wikipedia-based knowledge resource
In this section, we define the dictionary of concepts extracted from Wikipedia articles that is involved in subsequent processes of our system (Section 5.3).
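As a rough illustration of how such a dictionary could be assembled, the sketch below maps Wikipedia surface forms (article titles and redirect titles, which naturally supply acronyms, translations, and misspellings) to canonical concepts. The dump parsing is omitted and the function names are ours, not the paper's; the actual construction in Section 5.3 may differ.

```python
# Hypothetical construction of the controlled dictionary of concepts.
def build_concept_dictionary(pages):
    """pages: iterable of (title, redirect_target_or_None) pairs taken
    from a Wikipedia dump."""
    concepts = {}
    for title, redirect in pages:
        # A redirect maps a variant surface form (acronym, translation,
        # misspelling) to its canonical article; an article maps to itself.
        concepts[title.lower()] = (redirect or title).lower()
    return concepts

# Example: an acronym and a misspelling both resolve to the same concept.
pages = [("United States", None),
         ("USA", "United States"),
         ("United Staes", "United States")]
print(build_concept_dictionary(pages))
# {'united states': 'united states', 'usa': 'united states',
#  'united staes': 'united states'}
```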
Term frequency in language
This section describes a term frequency in language dictionary based on the English language, which is used in Section 5.2. For terms not contained in this dictionary, we design an algorithm that calculates an approximate simulated frequency based on Google’s search engine.
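The fallback can be pictured as follows. This is a minimal sketch, not the paper's exact formula: `web_hit_count` is a hypothetical stand-in for whatever search-engine query mechanism returns result counts, and normalising against a very common reference word is our assumption.

```python
def web_hit_count(term):
    # Placeholder: query a search engine and return the estimated number
    # of result pages for `term` (implementation intentionally omitted).
    raise NotImplementedError

def frequency_in_language(term, freq_dict, reference_word="the"):
    """Return the term's relative frequency, falling back to a simulated
    estimate from web hit counts when the dictionary has no entry."""
    if term in freq_dict:
        return freq_dict[term]
    # Simulated frequency: hits for the term relative to hits for an
    # extremely common reference word of known frequency.
    return (freq_dict[reference_word]
            * web_hit_count(term) / web_hit_count(reference_word))
```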
System overview
In this section, the multi-stage architecture of our high-performance Automatic Support-Domain Key Term Extraction (ASKEx) system is explained. We have considered two classes of key terms from each SD in a collection: (i) multidomain key terms (MKT), and (ii) specific domain key terms (SKT).
We thus propose a system composed of the following modules (Fig. 2). In an initial stage, the preprocessing module prepares the content of the collection of documents. Then, the algorithm performs at
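Under the assumptions above, the modules could compose roughly as in the sketch below; the stub implementations are placeholders for the modules described in the text, and `classify` would be a hybrid MKT/SKT detector such as the sketch shown in the Introduction.

```python
import re

def preprocess(text):
    # Placeholder preprocessing: lowercase and tokenise on letter runs.
    return re.findall(r"[a-z]+", text.lower())

def modular_filter(tokens):
    # Placeholder for the modular filter: here we simply drop very short
    # tokens; the real filter chains several configurable filtering steps.
    return [t for t in tokens if len(t) > 3]

def askex_pipeline(documents, wiki_concepts, lang_freq, classify):
    """documents: {doc_id: raw text}; classify: a hybrid MKT/SKT detector,
    e.g. the classify_candidates sketch from the Introduction."""
    key_terms = {}
    for doc_id, text in documents.items():
        candidates = modular_filter(preprocess(text))
        mkt, skt = classify(candidates, wiki_concepts, lang_freq)
        key_terms[doc_id] = {"MKT": mkt, "SKT": skt}
    return key_terms
```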
Empirical evaluation
This section reports the empirical results obtained in the evaluation of our system. The validity of our system has also been contrasted with that of other state-of-the-art methods.
Conclusions and future work
In this paper, a new key term extraction system able to handle the particularities of the support document context has been proposed. The system has obtained promising results in experimental validation, having been compared with some of the most important algorithms in Keyword Extraction.
Our method is a hybrid system that takes advantage of the strengths of the frequency-based and thesaurus-based criteria, introducing some novelties. In this sense, the system is able to recognize two different
Acknowledgements
The authors thank the Junta de Andalucía, which supported this work through project TIC2009-5011.
References (42)
- et al. (2011). Word adhoc network: Using Google core distance to extract the most relevant information. Knowledge-Based Systems.
- et al. (2012). High relevance keyword extraction facility for Bayesian text classification on different domains of varying characteristic. Expert Systems with Applications.
- et al. (2012). Intelligent computer assisted blog writing system. Expert Systems with Applications.
- et al. (2011). Computer assisted writing system. Expert Systems with Applications.
- et al. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management.
- et al. (2011). Summary of FAQs from a topical forum based on the native composition structure. Expert Systems with Applications.
- et al. (1965). Handbook of mathematical functions.
- et al. Using the multilingual central repository for graph-based word sense disambiguation.
- et al. (2011). Evaluating Google queries based on language preferences. Journal of Information Science.
- Bracewell, D. B., Ren, F., & Kuriowa, S. (2005). Multilingual single document keyword extraction for information...
- Statistical power analysis for the behavioral sciences.
- Applied multiple regression/correlation analysis for the behavioral sciences.
- Linguistically motivated features for enhanced back-of-the-book indexing.
- The corpus of contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing.
- Domain-specific keyphrase extraction.
Cited by (9)
A cloud of FAQ: A highly-precise FAQ retrieval system for the Web 2.0
2013, Knowledge-Based Systems
Citation Excerpt: If that is the case, the words of these synsets are preprocessed and stored along with the query word. If no WordNet synsets are found, the same process is carried out employing a Wikipedia-based dictionary of concepts [43]. If no synonyms are found in any of the two dictionaries, the query word is preprocessed and stored alone.
Term extraction from sparse, ungrammatical domain-specific documents
2013, Expert Systems with Applications
Citation Excerpt: As a start, we plan to use the Wikipedia collection in a language L, e.g. Dutch, to identify relevant terms and filter out irrelevant ones from a domain-specific corpus, which is also expressed in L. This is similar to our Relevant Term Selection phase (Section 3.3) and to the work of Romero et al. (2012). Designing a linguistic filter and developing a multi-word term extraction algorithm are more involved as they require some knowledge of the linguistic patterns adopted by terms in the language L.
Towards automatic tweet generation: A comparative study from the text summarization perspective in the journalism genre
2013, Expert Systems with Applications
Citation Excerpt: Recent research has focused on applying the headline generation task to produce titles (Lopez, Prince, & Roche, 2012; Tseng, 2010), image captions (Woodsend & Lapata, 2010), or even story highlights (Woodsend, Feng, & Lapata, 2010). On the other hand, the techniques employed in automatic keyword extraction (Romero, Moreo, Castro, & Zurita, 2012, 2013) could be useful for tweet generation, specifically for identifying the set of relevant keywords that could be transformed into hashtags for producing the tweet, or could be combined for generating a new sentence. This would be another manner of presenting a tweet, which is out of the scope of this research.
Machine Learning Technique to Detect and Classify Mental Illness on Social Media Using Lexicon-Based Recommender System
2022, Computational Intelligence and Neuroscience
A text mining approach agent-based DSS for IT infrastructure maintenance
2021, International Journal of Decision Support System Technology
Developing an effective scheme for translation and expansion of Persian user queries
2020, Digital Scholarship in the Humanities