Elsevier

Journal of Informetrics

Volume 1, Issue 2, April 2007, Pages 131-144

Generating overview timelines for major events in an RSS corpus

https://doi.org/10.1016/j.joi.2006.10.002

Abstract

Really simple syndication (RSS) is becoming a ubiquitous technology for notifying users of new content in frequently updated web sites, such as blogs and news portals. This paper describes a feature-based, local clustering approach for generating overview timelines for major events, such as the tsunami tragedy, from a general-purpose corpus of RSS feeds. To identify significant events, we automatically (1) selected a set of significant terms for each day; (2) built a set of (term–co-term) pairs; and (3) clustered the pairs in an attempt to group contextually related terms. Ten people assessed the clusters and judged that, on average, 68.6% of them appeared to represent significant events. Using these clusters, we generated overview timelines for three major events: the tsunami tragedy, the US election and bird flu. The results indicate that our approach is effective in identifying predominantly genuine events, but can only produce partial timelines.

Introduction

The task of identifying significant events from real-time news feed data is a standard one in data mining and event detection and tracking (Allan, Papka, & Lavrenko, 1998b; Yang, Pierce, & Carbonell, 1998). The Internet now hosts a range of readily accessible information formats that are new candidates for event detection, and these may come to replace or supplement traditional types, or may give rise to new event detection applications. Really simple syndication (RSS) is one such technology and has already become a widely used standard: it allows blogs and news sources to post timely information to subscribers, for example, hourly or daily summaries of the most recent updates. RSS feeds have great potential to be used for public-opinion gathering (Glance, Hurst, & Tomokiyo, 2004; Gruhl, Guha, Liben-Nowell, & Tomkins, 2004), mainly because of the large numbers of blog authors maintaining sites with RSS feeds, although bloggers are not typical citizens (Adar, Zhang, Adamic, & Lukose, 2004; Lin & Halavais, 2004) and have a wide variety of motives (Herring, Scheidt, Bonus, & Wright, 2004). In addition, the concise RSS formats allow relatively low-bandwidth data gathering, even for a large number of different sources. When a major (or world) event, such as the Asian tsunami (26/12/2004), occurs, RSS feeds could therefore be used to generate an overview timeline of the event.

Our contribution is an automatic method that achieves the following using RSS data.

  • (1) Find daily sets of significant terms (either nouns or noun phrases) which may be associated with important events, i.e. the most discussed happenings (Section 3).

  • (2) Use the significant terms to build a set of (term–co-term) pairs and cluster the pairs. The clusters are our candidates for the day's significant events (Section 4).

  • (3) Generate overview timelines for major events by sorting the clusters by date (Section 5).
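
Step (3) can be illustrated with a minimal Python sketch. It assumes each cluster is stored as a (date, set of terms) tuple and that clusters are matched to a major event through a hypothetical set of seed terms; neither the data layout nor the matching rule comes from the paper, and the example clusters are invented for illustration.

```python
from datetime import date


def build_timeline(clusters, seed_terms):
    """Return the clusters that mention any seed term, ordered by date.

    clusters   -- iterable of (date, set_of_terms) tuples (assumed layout)
    seed_terms -- set of terms naming the major event (hypothetical filter)
    """
    related = [(day, sorted(terms)) for day, terms in clusters if terms & seed_terms]
    # Sort on the date only, so clusters from the same day keep their input order.
    return sorted(related, key=lambda entry: entry[0])


# Invented example data; only the 26/12/2004 tsunami date comes from the text.
example_clusters = [
    (date(2004, 12, 28), {"tsunami", "aid"}),
    (date(2004, 11, 2), {"election", "vote"}),
    (date(2004, 12, 26), {"tsunami", "earthquake"}),
]

for day, terms in build_timeline(example_clusters, {"tsunami"}):
    print(day.isoformat(), ", ".join(terms))
```

Filtering before sorting keeps the timeline focused on one event; a real system would need a more robust way of deciding which clusters belong to which event.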

In this paper, we are primarily interested in the precision of the clusters in (2). More specifically, we assess the extent to which human judges agree that the automatically generated clusters genuinely describe a single event. A human-based evaluation was important to discover whether the results could be understood by potential end users, i.e. human interpreters.

To illustrate the timeline generation, three major events, ‘tsunami tragedy’, ‘US election’ and ‘bird flu spreading’, were selected as our case studies. Tables 5, 6 and 7 show the generated timelines for each major event. Each timeline refers to one particular major event along with many related, subsequent events.

Related work

This section reviews existing work in the area of (1) term selection, (2) topic and event detection and tracking (TDT) and (3) timeline generation.

Significant term selection

The selection of significant terms was conducted in three stages.

  • (1) RSS items were collected and the text found within the items was processed;

  • (2) χ² and Information Gain (I) values were computed;

  • (3) A set of significant terms was selected.
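
As a rough illustration of stage (2), the sketch below computes the χ² statistic and the information gain for one term from a 2×2 contingency table. The table layout (term present or absent, crossed with items from the target day or from other days) is an assumption made for this example; the excerpt does not specify how the authors set up the counts.

```python
import math


def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table (assumed layout).

    a -- target-day items containing the term
    b -- other-day items containing the term
    c -- target-day items without the term
    d -- other-day items without the term
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a row or column is empty, so there is no association to measure
    return n * (a * d - b * c) ** 2 / denominator


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k > 0)


def information_gain(a, b, c, d):
    """Reduction in day-class entropy from knowing whether the term is present."""
    n = a + b + c + d
    if n == 0:
        return 0.0
    class_entropy = entropy([a + c, b + d])  # target day vs other days
    conditional = ((a + b) / n) * entropy([a, b]) + ((c + d) / n) * entropy([c, d])
    return class_entropy - conditional


# A term that appears only on the target day scores high on both measures.
print(chi_square(10, 0, 0, 10), information_gain(10, 0, 0, 10))
# A term spread evenly across days scores zero on both.
print(chi_square(10, 10, 10, 10), information_gain(10, 10, 10, 10))
```

Terms could then be ranked by either score and the top-scoring ones kept as the day's significant terms; the cut-off would be a further design choice not covered here.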

Term clustering

This section discusses the way we automatically clustered together related terms and manually evaluated the clusters.
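
The general idea of pairing and grouping terms can be sketched as follows. This is not the authors' clustering algorithm, which the excerpt does not describe: the sketch assumes that a (term, co-term) pair is formed when two significant terms co-occur in at least `min_count` items, and it groups terms by connected components, a deliberately simple stand-in. All names and the example data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_pairs(items, significant_terms, min_count=2):
    """Return (term, co-term) pairs that co-occur in at least min_count items."""
    counts = defaultdict(int)
    for terms in items:
        present = sorted(set(terms) & significant_terms)
        for term, co_term in combinations(present, 2):
            counts[(term, co_term)] += 1
    return {pair for pair, count in counts.items() if count >= min_count}


def cluster_pairs(pairs):
    """Group terms into connected components using union-find."""
    parent = {}

    def find(term):
        parent.setdefault(term, term)
        while parent[term] != term:
            parent[term] = parent[parent[term]]  # path halving
            term = parent[term]
        return term

    for term, co_term in pairs:
        parent[find(term)] = find(co_term)

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return sorted(sorted(group) for group in groups.values())


# Invented items, each reduced to the terms it contains.
example_items = [
    ["tsunami", "aid", "relief"],
    ["tsunami", "aid"],
    ["election", "vote"],
    ["election", "vote"],
    ["tsunami", "vote"],
]
example_terms = {"tsunami", "aid", "relief", "election", "vote"}

print(cluster_pairs(build_pairs(example_items, example_terms)))
```

The `min_count` threshold stops a single stray co-occurrence (here "tsunami" with "vote") from merging two unrelated groups of terms.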

Clustering results

We use a qualitative approach to investigate how clusters relate to major events. The objective is to gain insights into the types of information indicated by the clusters and how this may vary by major event type. We automatically selected a portion of clusters which were related to three important events, ‘tsunami’, ‘bird flu’ and ‘US presidential election’, which happened in 2004. Each major event was termed ‘an initial event’, as it was the beginning of a series of ‘subsequent events’. In

Conclusions and future work

Our method automatically produced clusters of terms from RSS feeds, which were assessed by human evaluators to see whether they appeared to signify a single news event. The low level of agreement among the 10 assessors (average κ = 0.36) indicated the difficulty of the human task of reliably identifying an event from a small set of terms rather than problems with the clustering algorithm itself. The evaluation of 100 real clusters carried out by the assessors indicated that the average percentage

Acknowledgements

The work was supported by a European Union grant for activity code NEST-2003-Path-1. It is part of the Critical Events in Evolving Networks project (CREEN, contract 012684). We thank the reviewers for their helpful comments.

References (28)

  • R. Prabowo et al. A comparison of feature selection methods for an evolving RSS feed corpus. IPM (2006)
  • E. Adar et al. Implicit structure and the dynamic of blogspace
  • M. Alexandrov et al. An approach to clustering abstracts
  • J. Allan et al. Topic detection and tracking pilot study: Final report
  • J. Allan et al. On-line new event detection and tracking
  • R. Baeza-Yates et al. Modern information retrieval (1999)
  • R.K. Belew. Finding out about—a cognitive perspective on search engine technology and the WWW (2000)
  • E. Brill. A simple rule-based part-of-speech tagger
  • J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement (1960)
  • B.D. Eugenio et al. The kappa statistics: A second look. Computational Linguistics (2004)
  • N.S. Glance et al. BlogPulse: Automated trend discovery for weblogs
  • D. Gruhl et al. Information diffusion through blogspace
  • S.C. Herring et al. Bridging the gap: A genre analysis of weblogs
  • J. Lin et al. Mapping the blogosphere in America
