Using Wikipedia concepts and frequency in language to extract key terms from support documents
Highlights
► Able to deal with the specific characteristics of support documents.
► Hybrid system based on frequency-based and thesaurus-based approaches.
► Frequency in language of terms as a criterion to detect support-domain-dependent key terms.
► Dictionary of concepts drawn from Wikipedia designed to detect multidomain key terms.
Introduction
Nowadays, our hectic lifestyle forces us into continuous and fast learning, not only in the working environment but also in our free time. Fortunately, there are brief documents that help non-expert users learn the main concepts of any topic. These documents, known as “support documents” (SD), include the so-called how-to guides, tutorials, FAQ lists and walkthroughs. Given the large quantity of documents published on the Internet, it would be desirable to have proper tools or techniques to support classification, search, and other maintenance activities. In this sense, an Automatic Keyword Extraction (AKE) method that helps users quickly skim over documents would unarguably be a valuable contribution. Furthermore, it could also be applied to text summarization, text clustering, and text classification (Wenchao, Lianchen, & Ting, 2009).
In classical AKE studies, the documents of the collection usually belong to general domains (containing world knowledge from public sources covering high-diffusion topics). In addition, each document presents extensive and self-describing information about its main topic. This is not the case for SD. Their topics frequently belong to specific domains (containing knowledge about specific issues, usually of limited or private diffusion). Moreover, a SD does not contain every detail of its topic, but only specific information related to significant aspects of the topic.
From a deeper analysis, we found two classes of important terms1: (i) “Multidomain key terms” (MKT), meaningful in a multidomain context (e.g. ‘virtual assistant’ in Example 1 in Table 1); and (ii) “Specific domain key terms” (SKT), proper nouns or technical terms closely related to specific aspects of the document topic (e.g. ‘Interactive Dialog’ in Example 1 in Table 1).
The above-mentioned characteristics motivated us to design a hybrid automatic system to extract key terms from SD. In this regard, a frequency-based approach seems inappropriate, because terms that appear rarely could actually be relevant to the document. Similarly, since there is no significant co-occurrence distribution between terms in SD, a word-association-based approach would also be inappropriate. For these reasons, we decided to combine two strategies. (a) A thesaurus-based approach is employed to detect MKT. Following the current trend in the literature, we develop a controlled dictionary of ‘concepts’ drawn from Wikipedia that includes acronyms, translations, and misspellings. (b) In contrast, SKT are hardly ever contained in any controlled dictionary due to their technical nature. However, since they are uncommon terms in their language, we take advantage of frequency in language analysis.2 In addition, candidate key terms are previously filtered out by means of a modular filter. Since our system is focused on FAQ documents, we believe it could be a useful tool for supporting human FAQ administrators in the task of configuring FAQ retrieval systems (Tao, Liu, & Lin, 2011).
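To make the combination concrete, here is a minimal sketch of the hybrid decision rule under stated assumptions: `wiki_concepts` stands for the Wikipedia-derived dictionary of Section 3, `lang_freq` for the frequency-in-language dictionary of Section 4, and the threshold value is illustrative rather than the paper's.

```python
# Minimal sketch of the hybrid criterion, under stated assumptions:
# wiki_concepts is a set of Wikipedia concept surface forms (Section 3),
# lang_freq maps terms to their relative frequency in English (Section 4),
# and skt_threshold is an illustrative value, not the paper's.

def classify_candidates(candidates, wiki_concepts, lang_freq, skt_threshold=1e-6):
    """Split filtered candidates into multidomain (MKT) and
    specific-domain (SKT) key terms."""
    mkt, skt = [], []
    for term in candidates:
        if term.lower() in wiki_concepts:
            # Thesaurus-based criterion: the term matches a known concept.
            mkt.append(term)
        elif lang_freq.get(term.lower(), 0.0) < skt_threshold:
            # Frequency-in-language criterion: terms that are rare in
            # general English are likely technical terms or proper nouns.
            skt.append(term)
    return mkt, skt
```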
The rest of this paper is organized as follows. Section 2 offers an overview of previous work on key term extraction. A brief summary of Wikipedia and its use to obtain a controlled vocabulary of concepts is presented in Section 3. Section 4 describes the frequency in language feature of terms and its applications. Section 5 describes the structure and functions of our system. The method of analysis and the experimental validation of our method are outlined in Section 6. Finally, Section 7 concludes with a discussion of results and future research.
Section snippets
Related works
In this section, we depict the characteristics of the main approaches in Keyword Extraction. Additionally, we analyse how each proposal could be applied to extract the two types of key terms in SD.
The first works in this field were based on Machine Learning algorithms. Supervised learning methods treat this task as a classification problem, using lexical, syntactic or statistical features (or a mixture of them) of the labelled training data to extract keywords (Csomai & Mihalcea, 2008,
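For illustration, such a supervised formulation might look like the following sketch; the features and classifier are generic examples of the lexical, positional, and statistical cues mentioned above, not the design of any cited work.

```python
# Illustrative sketch only: keyword extraction cast as binary classification,
# in the spirit of the supervised approaches cited above. The features and
# classifier here are generic assumptions, not any one paper's design.
from sklearn.linear_model import LogisticRegression

def candidate_features(term, text):
    tokens = text.lower().split()
    occurrences = tokens.count(term.lower())
    tf = occurrences / len(tokens)                      # statistical: relative frequency
    first = (tokens.index(term.lower()) / len(tokens)   # positional: first occurrence
             if occurrences else 1.0)
    return [tf, first, len(term)]                       # lexical: surface length

# Training (hypothetical): X holds feature vectors for labelled candidates,
# y holds 1 for gold keywords and 0 otherwise.
# clf = LogisticRegression().fit(X, y)
# keywords = [t for t in candidates
#             if clf.predict([candidate_features(t, text)])[0] == 1]
```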
Wikipedia-based knowledge resource
In this section, we define the dictionary of concepts extracted from Wikipedia articles that is involved in subsequent processes of our system (Section 5.3).
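As a rough illustration of how such a dictionary could be assembled, the sketch below maps Wikipedia surface forms (article titles and redirect titles, which naturally supply acronyms, translations, and misspellings) to canonical concepts. The dump parsing is omitted and the function names are ours, not the paper's; the actual construction in Section 5.3 may differ.

```python
# Hypothetical construction of the controlled dictionary of concepts.
def build_concept_dictionary(pages):
    """pages: iterable of (title, redirect_target_or_None) pairs taken
    from a Wikipedia dump."""
    concepts = {}
    for title, redirect in pages:
        # A redirect maps a variant surface form (acronym, translation,
        # misspelling) to its canonical article; an article maps to itself.
        concepts[title.lower()] = (redirect or title).lower()
    return concepts

# Example: an acronym and a misspelling both resolve to the same concept.
pages = [("United States", None),
         ("USA", "United States"),
         ("United Staes", "United States")]
print(build_concept_dictionary(pages))
# {'united states': 'united states', 'usa': 'united states',
#  'united staes': 'united states'}
```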
Term frequency in language
This section describes a term frequency in language dictionary based on the English language, which is used in Section 5.2. For terms not contained in this dictionary, we design an algorithm that calculates an approximate simulated frequency based on Google’s search engine.
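The fallback can be pictured as follows. This is a minimal sketch, not the paper's exact formula: `web_hit_count` is a hypothetical stand-in for whatever search-engine query mechanism returns result counts, and normalising against a very common reference word is our assumption.

```python
def web_hit_count(term):
    # Placeholder: query a search engine and return the estimated number
    # of result pages for `term` (implementation intentionally omitted).
    raise NotImplementedError

def frequency_in_language(term, freq_dict, reference_word="the"):
    """Return the term's relative frequency, falling back to a simulated
    estimate from web hit counts when the dictionary has no entry."""
    if term in freq_dict:
        return freq_dict[term]
    # Simulated frequency: hits for the term relative to hits for an
    # extremely common reference word of known frequency.
    return (freq_dict[reference_word]
            * web_hit_count(term) / web_hit_count(reference_word))
```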
System overview
In this section, the multi-stage architecture of our high-performance Automatic Support-Domain Key Term Extraction (ASKEx) system is explained. We have considered two classes of key terms from each SD in a collection: (i) multidomain key terms (MKT), and (ii) specific domain key terms (SKT).
We thus propose a system composed of the following modules (Fig. 2). In an initial stage, the preprocessing module prepares the content of the collection of documents. Then, the algorithm performs at
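Under the assumptions above, the modules could compose roughly as in the sketch below; the stub implementations are placeholders for the modules described in the text, and `classify` would be a hybrid MKT/SKT detector such as the sketch shown in the Introduction.

```python
import re

def preprocess(text):
    # Placeholder preprocessing: lowercase and tokenise on letter runs.
    return re.findall(r"[a-z]+", text.lower())

def modular_filter(tokens):
    # Placeholder for the modular filter: here we simply drop very short
    # tokens; the real filter chains several configurable filtering steps.
    return [t for t in tokens if len(t) > 3]

def askex_pipeline(documents, wiki_concepts, lang_freq, classify):
    """documents: {doc_id: raw text}; classify: a hybrid MKT/SKT detector,
    e.g. the classify_candidates sketch from the Introduction."""
    key_terms = {}
    for doc_id, text in documents.items():
        candidates = modular_filter(preprocess(text))
        mkt, skt = classify(candidates, wiki_concepts, lang_freq)
        key_terms[doc_id] = {"MKT": mkt, "SKT": skt}
    return key_terms
```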
Empirical evaluation
This section reports the empirical results obtained in the evaluation of our system. The validity of our system has also been contrasted with that of other state-of-the-art methods.
Conclusions and future work
In this paper, a new key term extraction system able to handle the particularities of the support document context has been proposed. The system has obtained promising results in experimental validation, having been compared with some of the most important algorithms in Keyword Extraction.
Our method is a hybrid system that takes advantage of the strengths of the frequency-based and thesaurus-based criteria, introducing some novelties. In this sense, the system is able to recognize two different
Acknowledgements
The authors thank the Junta de Andalucía, which supported this work through project TIC2009-5011.
References (42)
- et al. (2011). Word adhoc network: Using Google core distance to extract the most relevant information. Knowledge-Based Systems.
- et al. (2012). High relevance keyword extraction facility for Bayesian text classification on different domains of varying characteristic. Expert Systems with Applications.
- et al. (2012). Intelligent computer assisted blog writing system. Expert Systems with Applications.
- et al. (2011). Computer assisted writing system. Expert Systems with Applications.
- et al. (1988). Term weighting approaches in automatic text retrieval. Information Processing and Management.
- et al. (2011). Summary of FAQs from a topical forum based on the native composition structure. Expert Systems with Applications.
- et al. (1965). Handbook of mathematical functions.
- et al. Using the multilingual central repository for graph-based word sense disambiguation.
- et al. (2011). Evaluating Google queries based on language preferences. Journal of Information Science.
- Bracewell, D. B., Ren, F., & Kuriowa, S. (2005). Multilingual single document keyword extraction for information...
- Statistical power analysis for the behavioral sciences.
- Applied multiple regression/correlation analysis for the behavioral sciences.
- Linguistically motivated features for enhanced back-of-the-book indexing.
- The corpus of contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing.
- Domain-specific keyphrase extraction.
Cited by (9)
A cloud of FAQ: A highly-precise FAQ retrieval system for the Web 2.0
2013, Knowledge-Based Systems
Citation Excerpt: If that is the case, the words of these synsets are preprocessed and stored along with the query word. If no WordNet synsets are found, the same process is carried out employing a Wikipedia-based dictionary of concepts [43]. If no synonyms are found in any of the two dictionaries, the query word is preprocessed and stored alone.
Term extraction from sparse, ungrammatical domain-specific documents
2013, Expert Systems with Applications
Citation Excerpt: As a start, we plan to use the Wikipedia collection in a language L, e.g. Dutch, to identify relevant terms and filter out irrelevant ones from a domain-specific corpus, which is also expressed in L. This is similar to our Relevant Term Selection phase (Section 3.3) and to the work of Romero et al. (2012). Designing a linguistic filter and developing a multi-word term extraction algorithm are more involved as they require some knowledge of the linguistic patterns adopted by terms in the language L.
Towards automatic tweet generation: A comparative study from the text summarization perspective in the journalism genre
2013, Expert Systems with Applications
Citation Excerpt: Recent research has focused on applying the headline generation task to produce titles (Lopez, Prince, & Roche, 2012; Tseng, 2010), image captions (Woodsend & Lapata, 2010), or even story highlights (Woodsend, Feng, & Lapata, 2010). On the other hand, the techniques employed in automatic keyword extraction (Romero, Moreo, Castro, & Zurita, 2012, 2013) could be useful for tweet generation, specifically for identifying the set of relevant keywords that could be transformed into hashtags for producing the tweet, or could be combined for generating a new sentence. This would be another manner of presenting a tweet, which is out of the scope of this research.
Machine Learning Technique to Detect and Classify Mental Illness on Social Media Using Lexicon-Based Recommender System
2022, Computational Intelligence and Neuroscience
A text mining approach agent-based DSS for IT infrastructure maintenance
2021, International Journal of Decision Support System Technology
Developing an effective scheme for translation and expansion of Persian user queries
2020, Digital Scholarship in the Humanities