Generating overview timelines for major events in an RSS corpus

doi:10.1016/j.joi.2006.10.002

Journal of Informetrics

Volume 1, Issue 2, April 2007, Pages 131-144

https://doi.org/10.1016/j.joi.2006.10.002

Abstract

Really simple syndication (RSS) is becoming a ubiquitous technology for notifying users of new content in frequently updated web sites, such as blogs and news portals. This paper describes a feature-based, local clustering approach for generating overview timelines for major events, such as the tsunami tragedy, from a general-purpose corpus of RSS feeds. To identify significant events, we automatically (1) selected a set of significant terms for each day; (2) built a set of (term–co-term) pairs; and (3) clustered the pairs in an attempt to group contextually related terms. Ten people assessed the clusters and judged that, on average, 68.6% of them apparently represented significant events. Using these clusters, we generated overview timelines for three major events: the tsunami tragedy, the US election and bird flu. The results indicate that our approach is effective in identifying predominantly genuine events, but can only produce partial timelines.

Introduction

The task of identifying significant events from real-time news feed data is a standard one in data mining and event detection and tracking (Allan, Papka, & Lavrenko, 1998b; Yang, Pierce, & Carbonell, 1998). The Internet now hosts a range of readily accessible information formats that are new candidates for event detection, and these may come to replace or supplement traditional types, or may give rise to new event detection applications. Really simple syndication (RSS) is one such technology and has already become a widely used standard: it allows blogs and news sources to post timely information to subscribers, for example, hourly or daily summaries of the most recent updates. RSS feeds have great potential to be used for public-opinion gathering (Glance, Hurst, & Tomokiyo, 2004; Gruhl, Guha, Liben-Nowell, & Tomkins, 2004), mainly because of the large numbers of blog authors maintaining sites with RSS feeds, although bloggers are not typical citizens (Adar, Zhang, Adamic, & Lukose, 2004; Lin & Halavais, 2004) and have a wide variety of motives (Herring, Scheidt, Bonus, & Wright, 2004). In addition, the concise RSS formats allow relatively low-bandwidth data gathering, even for a large number of different sources. When a major (or world) event, such as the Asian tsunami (26/12/2004), occurs, RSS feeds could therefore be used to generate an overview timeline of the event.

Our contributions are to develop an automatic method to achieve the following using RSS data.

(1) Find daily sets of significant terms (either nouns or noun phrases) which may be associated with important events, i.e. the most discussed happenings (Section 3).
(2) Use the significant terms to build a set of (term–co-term) pairs and cluster the pairs. The clusters are our candidates for the day's significant events (Section 4).
(3) Generate overview timelines for major events by sorting the clusters by date (Section 5).

In this paper, we are primarily interested in the precision of the clusters in (2). More specifically, we assess the extent to which human judges agree that the automatically generated clusters genuinely describe a single event. A human-based evaluation was important to discover whether the results could be understood by potential end users, i.e. human interpreters.

To illustrate the timeline generation, three major events, ‘tsunami tragedy’, ‘US election’ and ‘bird flu spreading’, were selected as our case studies. Tables 5, 6 and 7 show the generated timelines for each major event. Each timeline refers to one particular major event along with many related, subsequent events.

Section snippets

Related work

This section reviews existing work in the area of (1) term selection, (2) topic and event detection and tracking (TDT) and (3) timeline generation.

Significant term selection

The selection of significant terms was conducted in three stages.

(1) RSS items were collected and the text found within the items was processed;
(2) χ² and Information Gain (I) values were computed;
(3) A set of significant terms was selected.
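
As an illustration of stage (2), the following Python sketch computes χ² and Information Gain for one term against one day from a 2x2 contingency table of RSS item counts. The table layout (a, b, c, d) and the function names are assumptions made for this example; the excerpt does not give the exact formulation or the selection thresholds used in the paper.

```python
import math

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 contingency table (assumed layout).

    a: items on the target day that contain the term
    b: items on other days that contain the term
    c: items on the target day without the term
    d: items on other days without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a row or column is empty, so no association can be measured
    return n * (a * d - b * c) ** 2 / denom

def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)

def information_gain(a, b, c, d):
    """Reduction in entropy of the day label once term presence is known."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    h_day = entropy([a + c, b + d])            # target day vs other days
    h_given_term = ((a + b) / n) * entropy([a, b]) \
                 + ((c + d) / n) * entropy([c, d])
    return h_day - h_given_term
```

A term that appears only on the target day scores highly on both measures, while a term spread evenly across days scores zero. A term could then be kept as significant for a day when its scores exceed a chosen cut-off, which would have to be tuned on the corpus.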

Term clustering

This section discusses the way we automatically clustered together related terms and manually evaluated the clusters.
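
To make the idea of pairing and grouping terms concrete, here is a minimal Python sketch. It is not the paper's feature-based, local clustering algorithm, whose details are not shown in this excerpt. As a stand-in, it assumes that a (term, co-term) pair is two significant terms appearing in the same RSS item at least `min_count` times, and it groups pairs by connected components using union-find. The names `build_pairs`, `cluster_pairs` and `min_count` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_pairs(items, significant_terms, min_count=2):
    """Return (term, co-term) pairs that co-occur in at least min_count items.

    items: list of term lists, one per RSS item (assumed pre-processed)
    significant_terms: set of terms selected for the day
    """
    counts = Counter()
    for terms in items:
        present = sorted(set(terms) & set(significant_terms))
        counts.update(combinations(present, 2))
    return {pair for pair, n in counts.items() if n >= min_count}

def cluster_pairs(pairs):
    """Group pairs that share a term (connected components via union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    groups = {}
    for term in list(parent):
        groups.setdefault(find(term), set()).add(term)
    return sorted(sorted(group) for group in groups.values())

# Toy data, invented purely for illustration.
items = [
    ["tsunami", "aid", "relief"],
    ["tsunami", "aid"],
    ["election", "vote"],
    ["election", "vote", "tsunami"],
]
significant = {"tsunami", "aid", "relief", "election", "vote"}
clusters = cluster_pairs(build_pairs(items, significant))
```

Connected components tend to merge unrelated topics once a single bridging pair passes the threshold, which is one reason a more local clustering criterion may be preferable in practice.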

Clustering results

We use a qualitative approach to investigate how clusters relate to major events. The objective is to gain insights into the types of information indicated by the clusters and how this may vary by major event type. We automatically selected a portion of clusters which were related to three important events, ‘tsunami’, ‘bird flu’ and ‘US presidential election’, which happened in 2004. Each major event was termed ‘an initial event’, as it was the beginning of a series of ‘subsequent events’. In

Conclusions and future work

Our method automatically produced clusters of terms from RSS feeds, which were assessed by human evaluators to see whether they appeared to signify a single news event. The low level of agreement among the 10 assessors (κ_average = 0.36) indicated the difficulty of the human task of reliably identifying an event from a small set of terms rather than problems with the clustering algorithm itself. The evaluation of 100 real clusters carried out by the assessors indicated that the average percentage
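
One common way to summarise agreement among several assessors is to average Cohen's kappa over all assessor pairs. The Python sketch below shows that calculation. It assumes each assessor gives one categorical judgement per cluster, and it is not necessarily the procedure used to obtain the reported κ_average; the function names are hypothetical.

```python
from itertools import combinations

def cohen_kappa(r1, r2):
    """Cohen's kappa for two raters' labels over the same items."""
    if len(r1) != len(r2) or not r1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(r1)
    labels = set(r1) | set(r2)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(x == y for x, y in zip(r1, r2)) / n
    # Chance agreement from each rater's own label distribution.
    p_e = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

def mean_pairwise_kappa(ratings):
    """Average Cohen's kappa over every pair of raters."""
    pairs = list(combinations(ratings, 2))
    if not pairs:
        raise ValueError("need at least two raters")
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)
```

Here `ratings` would hold one list per assessor, for example ten lists of 100 judgements each, with 1 meaning "signifies a single event" and 0 otherwise.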

Acknowledgements

The work was supported by a European Union grant for activity code NEST-2003-Path-1. It is part of the Critical Events in Evolving Networks project (CREEN, contract 012684). We thank the reviewers for their helpful comments.

References (28)

R. Prabowo et al. A comparison of feature selection methods for an evolving RSS feed corpus. IPM (2006)
E. Adar et al. Implicit structure and the dynamic of blogspace
M. Alexandrov et al. An approach to clustering abstracts
J. Allan et al. Topic detection and tracking pilot study: Final report
J. Allan et al. On-line new event detection and tracking
R. Baeza-Yates et al. Modern information retrieval (1999)
R.K. Belew. Finding out about: a cognitive perspective on search engine technology and the WWW (2000)
E. Brill. A simple rule-based part-of-speech tagger
J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement (1960)
B.D. Eugenio et al. The kappa statistics: A second look. Computational Linguistics (2004)
N.S. Glance et al. BlogPulse: Automated trend discovery for weblogs
D. Gruhl et al. Information diffusion through blogspace
S.C. Herring et al. Bridging the gap: A genre analysis of weblogs
J. Lin et al. Mapping the blogosphere in America

