Decision Support Systems

Volume 54, Issue 1, December 2012, Pages 46-62

WNavis: Navigating Wikipedia semantically with an SNA-based summarization technique

https://doi.org/10.1016/j.dss.2012.04.002

Abstract

Link-based applications like Wikipedia are becoming increasingly popular because they provide users with an efficient way to find needed knowledge, such as searching for definitions and information about a particular topic, and exploring articles on related topics. This work introduces a semantics-based navigation application, called WNavis, to facilitate information-seeking activities in Wikipedia, an internal link-based website. WNavis is based on the theories and techniques of link mining, semantic relatedness analysis, and text summarization. Our goal is to develop an application that helps users easily find articles related to a seed query (topic) and then quickly check the content of those articles to explore a new concept or topic in Wikipedia. Technically, we construct a preliminary topic network by analyzing the internal links of Wikipedia and applying the normalized Google distance algorithm to quantify the strength of the semantic relationships between articles via key terms. Because not all the content of Wikipedia articles is relevant to users' information needs, it is desirable to locate specific information for users and enable them to quickly explore and read topic-related articles. Accordingly, we propose an SNA-based single- and multiple-document summarization technique that extracts meaningful sentences from articles. We applied a number of intrinsic and extrinsic evaluation methods to demonstrate the efficacy of the summarization techniques in terms of precision and recall. The results suggest that the proposed summarization technique is effective. Our findings have implications for the design of a navigation tool that can help users explore related articles in Wikipedia quickly.

Highlights

► We introduce a semantics-based navigation application, called WNavis, for Wikipedia.
► We propose a novel SNA-based article summarization technique to help users explore topics.
► We apply intrinsic and extrinsic evaluation methods to evaluate the techniques.
► The techniques and application can be generalized to internal link-based websites.

Introduction

With the ubiquity of Web 2.0 technologies, the World Wide Web (WWW) has become the main source of information and knowledge for countless people. According to Alexa Traffic Rank (July 2011), the top ten sites on the Web are Google, Facebook, YouTube, Yahoo!, Blogger, Baidu, Wikipedia, Windows Live, Twitter, and QQ.COM. Wikipedia, the most popular web-based, free-content encyclopedia, is one of the best examples of crowdsourcing systems, which gather like-minded users into groups to collaborate and create long-lasting artifacts that benefit the whole community [2], [13]. Statistical data for Wikipedia shows that, in 2008, the site welcomed 684 million visitors, and over 91,000 contributors worked on more than 16 million articles in 270 languages. Because of the popularity of collaborative peer production systems like Wikipedia, the number of articles is constantly expanding. Hence, an increasing number of people regard Wikipedia as an efficient way to find needed knowledge, such as searching for definitions and information about a particular topic, and exploring articles on related topics. Basically, users browse Wikipedia content in the traditional manner (i.e., by following hyperlinks) when searching for information. However, users may unconsciously change their search goals or get lost when exploring or retrieving information in Wikipedia. To make searching more efficient for the vast number of Wikipedia users, more effective search and navigation tools must be developed.

Generally, users invest a great deal of time in browsing, i.e., following links or searching for specific information. Because of the rapid growth in the volume of information on the WWW, web mining and information retrieval are regarded as key techniques for finding desired information. Web mining tries to extract potentially useful implicit information, link structures, and patterns from information units or activities on the WWW. There are three types of web mining techniques: web content mining, web structure mining, and web usage mining. The main difference between web pages and static text documents is that the former contain content as well as link information and metadata [1], [8], [28]. Web content mining exploits information retrieval (IR) and artificial intelligence (AI) techniques to mine and analyze information from web pages. Generally, web content mining strategies can be divided into those that mine information or knowledge implicitly from documents and those designed to improve the search results, i.e., the information retrieved by search engines. IR technology relies primarily on content analysis techniques, but web pages are usually noisy and contain various types of content, such as text, images, and multimedia. To address this problem, some researchers have exploited the hyperlink structure, which provides hyperlink information for a collection of web pages, and proposed ranking algorithms to rank search results. Analyzing the hyperlink structure between web pages to support user search activities has attracted a great deal of attention in recent years.

Since the link structure encodes a considerable number of latent human judgments, link mining and analysis techniques are employed by commercial search engines; for example, the PageRank algorithm [6], [34] used by the Google search engine is one of the most well-known link-based algorithms. Currently, PageRank is the dominant link analysis model for web searches, partly because it does not depend on search queries. It is a query-independent measure of the static ranking of web pages, and is based on the measure of prestige used in social networks. Kleinberg [26] proposed the Hypertext Induced Topic Search (HITS) algorithm, which analyzes the link topology to find "hub" and "authority" pages. A hub is a page with several out-links, while an authoritative page contains several in-links. The algorithm analyzes both in-links and out-links to obtain two ranking scores for pages based on the user's query result; that is, it analyzes the relationships between web pages and then ranks the search results accordingly. Almpanidis and Kotropoulos [1] proposed a topical information resource discovery algorithm that applies a focused (topic-driven) crawler by combining text and link analysis techniques. Their results show that the combined content- and link-based algorithm requires little data in the initial stage and outperforms comparable methods. In this work, we show that it is efficient to analyze the articles related to a topic based on the link relationships between them, without employing tedious content analysis techniques.
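To make the hub/authority and static-ranking ideas concrete, the following minimal sketch computes HITS and PageRank scores on a toy article link graph with the networkx library; the article names and links are invented for illustration and are not data from this study.

```python
# Illustrative only: HITS hub/authority and PageRank scores on a toy article
# link graph. The article names and links are invented, not data from the paper.
import networkx as nx

# Directed graph: an edge (a, b) means article a links to article b.
links = [
    ("Knowledge Management", "Tacit knowledge"),
    ("Knowledge Management", "Organizational learning"),
    ("Tacit knowledge", "Michael Polanyi"),
    ("Organizational learning", "Tacit knowledge"),
    ("Knowledge transfer", "Knowledge Management"),
]
G = nx.DiGraph(links)

# HITS: a hub points to many authorities; an authority is pointed to by many hubs.
hubs, authorities = nx.hits(G, max_iter=1000)

# PageRank: a query-independent static ranking derived from the link structure.
pagerank = nx.pagerank(G, alpha=0.85)

for article in G:
    print(f"{article:25s} hub={hubs[article]:.3f} "
          f"auth={authorities[article]:.3f} pr={pagerank[article]:.3f}")
```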

In Wikipedia, a topic may be linked to many articles, so it is sometimes difficult for users to locate articles relevant to their particular interest simply by following the given hyperlinks. To address this problem, we propose a semantics-based navigation system that is based on the theories and techniques of link mining, semantic relatedness analysis, and social network analysis (SNA). Wu and Wu [45] proposed a link strength (LS) measure that establishes a network by analyzing the internal links between articles in Wikipedia; however, some irrelevant articles are included in the resulting network. Accordingly, we utilize the normalized Google distance algorithm [10] to quantify the strength of the semantic relationships between articles via key terms, and filter out articles that do not have strong relationships. We also propose a hybrid measure, i.e., an internal link-based semantic topic network analysis measure, to construct a topic network with stronger semantic relationships. Our preliminary evaluation results demonstrate the effectiveness of applying semantic analysis to an internal link-based network. To help users search for information, we apply centrality and cohesion measures from SNA to summarize single and multiple articles. The measures are degree centrality and the k-clique, which identify, respectively, the hub article and the sub-topics of the seed query for further summarization of multiple articles. When the user clicks on a topic node that he/she wants to explore, an SNA-based summary is presented on the interface. Intrinsic and extrinsic methods are then used to evaluate the quality of the summarization results. The intrinsic method measures the quality of the system's text summaries directly, while the extrinsic method assigns classification tasks to users and assesses summary quality through the users' task performance. To visualize the semantics-based topic network more efficiently, we use the software libraries provided by the Java Universal Network/Graph (JUNG) Framework to create a JUNG-based application called Semantics-based WNavis. Finally, an interface is generated to help users navigate Wikipedia effectively. The contributions of this work are as follows.

  • 1.

    We design a navigation support application, the semantics-based WNavis, for Wikipedia, which is an internal link-based website. In addition, we develop associated tools, such as a topic network and topic summaries, to help users explore topics of interest.

  • 2.

    We apply a series of intrinsic and extrinsic evaluation methods to confirm the effectiveness of the proposed SNA-based summarization technique. Furthermore, we simulate search tasks to evaluate the quality of multi-article summarization.

  • 3.

    The techniques proposed in this work can be generalized to navigation support tools for internal link-based, knowledge-intensive websites, e.g., user-generated encyclopedias and articles in technical forums, to help users gain an overview of topics and explore articles efficiently. Moreover, the visual navigation support application may help users acquire topic knowledge.

The remainder of this paper is organized as follows. The next section reviews some basic concepts and text summarization techniques. In Section 3, we describe the system framework. In Section 4, we incorporate semantic analysis techniques into an internal link-based network; and in Section 5, we discuss SNA-based summaries in Wikipedia. Sections 6 and 7 present the evaluation methods, tasks, and metrics, and the evaluation results and discussion, respectively; and Section 8 contains some concluding remarks.
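As an informal illustration of the pipeline sketched in this introduction, the following fragment builds a small topic network, filters weak edges with an NGD-style threshold, selects the hub article by degree centrality, and groups sub-topics with a k-clique (clique percolation) routine from networkx. The article names, NGD values, the 0.5 threshold, and the choice of clique percolation are assumptions made for illustration; the actual system is JUNG-based and uses the hybrid LS/NGD measure described later.

```python
# A minimal, hypothetical sketch of the topic-network pipeline described above:
# build a topic network, drop weak edges with an NGD-style threshold, pick the
# hub article by degree centrality, and group sub-topics with k-cliques.
# All article names, NGD values, and the threshold are invented placeholders.
import networkx as nx
from networkx.algorithms.community import k_clique_communities

# Candidate edges: (article_a, article_b, assumed NGD value); smaller NGD = closer.
candidate_edges = [
    ("Knowledge Management", "Tacit knowledge", 0.31),
    ("Knowledge Management", "Knowledge transfer", 0.28),
    ("Knowledge Management", "Organizational learning", 0.35),
    ("Tacit knowledge", "Organizational learning", 0.42),
    ("Knowledge Management", "Ice hockey", 0.91),   # weakly related; filtered out
]

NGD_THRESHOLD = 0.5  # placeholder cut-off, not a value from the paper
G = nx.Graph()
for a, b, ngd in candidate_edges:
    if ngd <= NGD_THRESHOLD:
        G.add_edge(a, b, ngd=ngd)

# Hub article of the topic network: the node with the highest degree centrality.
centrality = nx.degree_centrality(G)
hub = max(centrality, key=centrality.get)

# Sub-topics: one possible reading of "k-clique" is clique percolation with k = 3.
subtopics = [set(c) for c in k_clique_communities(G, 3)]

print("Hub article:", hub)
print("Sub-topic groups:", subtopics)
```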

Section snippets

Semantic relatedness analysis: normalized Google distance algorithm

People acquire the meaning of a word and its relations to other words based on their background knowledge. By contrast, it is difficult for computers to make judgments about the semantic relationships between keywords. As a result, enabling computers to extract the meanings of words automatically has motivated a great deal of research in the fields of natural language processing and artificial intelligence [10], [32].
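For reference, the normalized Google distance of Cilibrasi et al. [10] between two terms x and y is defined as

\[
\mathrm{NGD}(x,y) = \frac{\max\{\log f(x),\, \log f(y)\} - \log f(x,y)}{\log N - \min\{\log f(x),\, \log f(y)\}}
\]

where f(x) and f(y) are the numbers of pages containing x and y, respectively, f(x,y) is the number of pages containing both terms, and N is the total number of pages indexed. In this work the measure is applied to the key terms of Wikipedia articles; the corpus from which the counts are taken is specified in the full section text.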

Basically, three measures are used to estimate the semantic relatedness of

The system framework

Internet search engines like Google and Yahoo! provide one of the most popular ways to access information on the WWW. Furthermore, with the emergence of Web 2.0 technologies, social web sites (i.e., social networking websites and micro-blogging services) provide unprecedented opportunities for sharing user-generated content. Wikipedia, one of the most famous collaborative projects on the Web, has become an extremely popular reference database for people seeking information or knowledge.

Internal link analysis with the LS measure

We use the term “article” to denote an entry in Wikipedia rather than a page on the WWW, and the term “node” to denote a word or phrase in an article with a hyperlink to another article. The link strength (LS), which indicates the degree of closeness between two articles, is determined by considering the type and frequency of the links between the articles. Our goal is to find the specific topic or related subtopics for a seed query. An article may have three types of links: in-links,
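The exact form of the LS measure [45] is given in the full text and is not reproduced in this excerpt; the following Python fragment is only a hypothetical illustration of the underlying idea, scoring the closeness of two articles from the type and frequency of the hyperlinks that connect them, with invented link-type weights.

```python
# Hypothetical sketch only: the exact LS formula of Wu and Wu [45] is not
# reproduced in the snippet, so the link-type weights below are invented.
# The idea illustrated: score the closeness of two articles from the type
# and frequency of the hyperlinks connecting them.

def link_strength(links_a_to_b, links_b_to_a,
                  w_one_way=1.0, w_reciprocal=2.0):  # placeholder weights
    """Score the closeness of two articles from directed hyperlink counts.

    Reciprocal link pairs (links present in both directions) are weighted
    more heavily, on the assumption that mutual linking signals a closer
    relationship than a one-way link.
    """
    reciprocal = min(links_a_to_b, links_b_to_a)          # mutually linked pairs
    one_way = (links_a_to_b + links_b_to_a) - 2 * reciprocal
    return w_reciprocal * reciprocal + w_one_way * one_way

# Example: article A links to article B three times, and B links back once.
print(link_strength(3, 1))  # 2.0 * 1 + 1.0 * 2 = 4.0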

Process for generating summaries

Summaries can be divided into three types based on their purpose: indicative summaries, informative summaries and critical summaries [36]. An indicative summary provides enough information to let the user determine whether the actual document would be helpful, and whether reading the document would be worthwhile. An informative summary condenses the important content of the actual document, and the user could even utilize it instead of the document. A critical summary comments on a text by

Evaluation methods

For single document summaries, we evaluate the performance of five methods, namely, the CP&FP, WF, Hybrid(0.1), Hybrid(0.5), and Hybrid(0.9) methods. The CP&FP method selects keywords from the concept phrases and the first paragraph of the target article, as shown in Eq. (6). The parameter λ is set at 0 in the equation. The WF method selects keywords from the weighted first paragraph, as shown in Eq. (5). In the three Hybrid(λ) methods, we adjust the parameter λ in Eq. (6) to 0.1, 0.5, and 0.9
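Eqs. (5) and (6) themselves appear in the full text. Assuming that Eq. (6) is a λ-weighted combination of the weighted-first-paragraph score and the concept-phrase-and-first-paragraph score (so that λ = 0 reduces to the CP&FP method), a hypothetical sketch with placeholder keyword weights would look like the following.

```python
# Hypothetical sketch only: assumes Eq. (6) mixes a weighted-first-paragraph
# score (Eq. (5)) with a concept-phrase-and-first-paragraph score via lambda,
# so that lambda = 0 reduces to the CP&FP method. The keyword weights below
# are invented placeholders, not values from the paper.

def hybrid_score(w_wf, w_cpfp, lam):
    """Assumed form: lambda * WF score + (1 - lambda) * CP&FP score."""
    return lam * w_wf + (1.0 - lam) * w_cpfp

keyword_weights = {
    # keyword: (WF score, CP&FP score) -- invented for illustration
    "knowledge": (0.8, 0.6),
    "management": (0.7, 0.9),
    "tacit": (0.2, 0.7),
}

for lam in (0.0, 0.1, 0.5, 0.9):  # lambda = 0 corresponds to the CP&FP setting
    ranked = sorted(keyword_weights,
                    key=lambda k: hybrid_score(*keyword_weights[k], lam),
                    reverse=True)
    print(f"lambda={lam}: {ranked}")
```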

Intrinsic evaluation of single-article summaries

To select important articles for generating summaries on the interface, we select the seed article and the central articles of the topic network, i.e., as measured by degree centrality in SNA, as our candidate articles. For example, 17 articles for the "Knowledge Management" topic are selected as candidate articles for extracting summaries. The articles, with their associated compression rates, are listed in Table 4. Following Rush et al. [36], we use an average compression rate of 70%. As mentioned

Conclusion and future work

Wikipedia, the largest multi-lingual online encyclopedia, allows users to contribute their knowledge as members of a Wiki community; thus, the number of articles in Wikipedia is constantly expanding. In this study, we propose an SNA-based summarization technique and develop a navigation interface called WNavis to help Wikipedia users find and organize needed information or topics. First, we employ the NGD algorithm in the proposed LS measure to quantify the strength of the semantic

Acknowledgments

We thank the editor-in-chief, and the anonymous reviewers for their constructive comments. We also thank Prof. Pertti Vakkari at the University of Tampere for helpful comments.

This research was supported by the National Science Council of Taiwan under Grant No. 99-2410-H-030-047-MY3.

References (46)

  • P. Borlund et al., Reconsideration of the simulated work task situation: a context instrument for evaluation of information retrieval interaction
  • J. Callan et al., Meeting of the MINDS: an information retrieval research agenda
  • H. Chen et al., MetaSpider: meta-searching and categorization on the web, Journal of the American Society for Information Science (2001)
  • R.L. Cilibrasi et al., Automatic extraction of meaning from the Web
  • R.L. Cilibrasi et al., The Google similarity distance, IEEE Transactions on Knowledge and Data Engineering (2007)
  • W.B. Croft, What do people want from information retrieval? D-Lib Magazine
  • A. Doan et al., Crowdsourcing systems on the World-Wide Web, Communications of the ACM (2011)
  • A.J. Evangelista et al., Google Distance between Words, Frontiers in Undergraduate Research (2006)
  • L. Finkelstein et al., Placing search in context: the concept revisited, ACM Transactions on Information Systems (TOIS) (2002)
  • R. Forsyth et al.
  • L.C. Freeman, A set of measures of centrality based on betweenness, Sociometry (1977)
  • T. Fukusima et al., Text summarization challenge: text summarization evaluation at NTCIR Workshop2
  • J. Goldstein et al., Creating and evaluating multi-document sentence extract summaries

I-Chin Wu received a Ph.D. in Information Management from National Chiao Tung University, Taiwan in January 2006. Since 2006 she has been with the Department of Information Management, Fu-Jen Catholic University, Taipei, Taiwan, where she is currently an Associate Professor. She has been a visiting scholar in the School of Information Science, University of Tampere, Finland in 2011. Her research interests are mainly focused on Information Search and Retrieval, Knowledge Management and Web Mining. Her recent research has appeared in Decision Support Systems, Journal of the American Society for Information Science and Technology, Information Processing and Management, and Journal of Documentation.

Yi-Sheng Lin received his BBA and MS in Information Management from Fu-Jen Catholic University, Taipei, Taiwan in 2009 and 2011, respectively. His research interests are mainly focused on Information Retrieval and Web Mining.
