DOI:10.1145/2095536.2095546
research-article

Feeding the world: a comprehensive dataset and analysis of a real world snapshot of web feeds

Published: 05 December 2011

Abstract

Web feeds allow users to retrieve new content from pages on the World Wide Web. Feeds are offered by a multitude of web pages, ranging from conventional news sites to pages with user-generated content such as wikis, forums, or personal blogs. They notify interested readers of new content and are therefore interesting for information retrieval tasks. Unfortunately, no comprehensive dataset of feeds is publicly available, which makes it difficult for researchers to work with this kind of data and, more importantly, to compare their research results on a common dataset.
In this work we present an extensive real-world dataset of 200,000 diversified feeds, together with an analysis of it. The dataset was collected over a time span of four weeks, yielding over 54 million entries and 100 GB of compressed data. One important outcome of the analysis is that feeds show different activity patterns, which aggregators such as feed reader software should consider to improve their polling strategies. The dataset has been made publicly available for use by research communities around the world.
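
To illustrate why activity patterns matter for polling, the following is a minimal sketch of one possible activity-aware polling heuristic. It is not the method used in the paper: the function name `next_poll_interval`, the median-gap rule, and the default bounds of five minutes and one day are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta
from statistics import median


def next_poll_interval(
    entry_times,
    min_interval=timedelta(minutes=5),   # assumed lower bound, not from the paper
    max_interval=timedelta(days=1),      # assumed upper bound, not from the paper
):
    """Estimate how long to wait before polling a feed again.

    Hypothetical heuristic: take the median gap between the feed's
    observed entry timestamps and clamp it to [min_interval, max_interval].
    """
    times = sorted(entry_times)
    if len(times) < 2:
        # Too little history to estimate activity: poll rarely.
        return max_interval
    # Gaps between consecutive entries, in seconds.
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    interval = timedelta(seconds=median(gaps))
    return max(min_interval, min(interval, max_interval))


if __name__ == "__main__":
    start = datetime(2011, 12, 5, 12, 0)
    # A feed that publishes every ten minutes.
    busy_feed = [start + timedelta(minutes=10 * i) for i in range(6)]
    print(next_poll_interval(busy_feed))
```

Under these assumptions, a feed that publishes every few minutes is polled far more often than one that publishes every few days, while the clamping keeps very bursty or nearly silent feeds within sensible bounds.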

Published In

iiWAS '11: Proceedings of the 13th International Conference on Information Integration and Web-based Applications and Services
December 2011
572 pages
ISBN:9781450307840
DOI:10.1145/2095536
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. RSS
  2. analysis
  3. atom
  4. classification
  5. dataset
  6. feed

Qualifiers

  • Research-article

Conference

MoMM '11

Cited By

  • (2024) Intelligent algorithm selection for efficient update predictions in social media feeds. Social Network Analysis and Mining 14:1. DOI:10.1007/s13278-024-01315-9. Online publication date: 20-Aug-2024
  • (2021) When she posts next? A comparison of refresh strategies for Online Social Networks. The 23rd International Conference on Information Integration and Web Intelligence, pages 123-129. DOI:10.1145/3487664.3487778. Online publication date: 29-Nov-2021
  • (2018) Social Networks Serving Web Feeds. Proceedings of the 2018 10th International Conference on Information Management and Engineering, pages 115-121. DOI:10.1145/3285957.3285966. Online publication date: 22-Sep-2018
  • (2015) History based heuristic feed querying scheme in web feed aggregator. 2015 IEEE 4th Global Conference on Consumer Electronics (GCCE), pages 446-450. DOI:10.1109/GCCE.2015.7398633. Online publication date: Oct-2015
  • (2015) Online refresh strategies for content based feed aggregation. World Wide Web 18:4, pages 913-947. DOI:10.1007/s11280-014-0288-y. Online publication date: 1-Jul-2015
