Online refresh strategies for content based feed aggregation

Horincar, Roxana; Amann, Bernd; Artières, Thierry

doi:10.1007/s11280-014-0288-y

Published: 02 May 2014

World Wide Web, Volume 18, pages 913–947 (2015)

Roxana Horincar¹,
Bernd Amann¹ &
Thierry Artières¹

Abstract

With the rapid growth of data sources, services and devices connected to the Internet, web content available online is becoming increasingly diverse and dynamic. To facilitate the efficient dissemination of evolving and temporary information, many web applications publish their new information as RSS and Atom documents, which are then collected and transformed by RSS aggregators such as Feedly or Yahoo! News. This article addresses the large-scale aggregation of highly dynamic information sources by focusing on the design of optimal refresh strategies for large collections of RSS feed documents. First, we introduce two quality measures specific to RSS aggregation, which reflect the information completeness and average freshness of the result feeds. Then, we propose a best-effort feed refresh strategy that achieves maximum aggregation quality compared with all other existing policies using the same average number of refreshes. This strategy is based on specific online change estimation models developed after an in-depth analysis of the temporal publication characteristics of a representative collection of real-world RSS feeds. The presented methods have been implemented and tested against synthetic and real-world RSS feed data sets.
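The abstract does not give the definitions of the quality measures or of the change estimation models, so the following is only an illustrative sketch of the general idea of a budgeted, best-effort refresh with online rate estimation. It is not the authors' strategy. The `Feed` class, the constant-rate assumption, the exponential smoothing factor `alpha`, and the greedy ranking by expected pending items are all assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Feed:
    """Hypothetical feed record; not taken from the article."""
    name: str
    rate: float                # assumed estimate: new items per time unit
    last_refresh: float = 0.0  # time of the most recent refresh

    def expected_pending(self, now: float) -> float:
        # Expected number of unfetched items under a constant-rate assumption.
        return self.rate * (now - self.last_refresh)

    def observe(self, now: float, new_items: int, alpha: float = 0.3) -> None:
        # Online update of the rate estimate by exponential smoothing
        # (an assumed estimator, chosen only for simplicity).
        elapsed = now - self.last_refresh
        if elapsed > 0:
            self.rate = (1 - alpha) * self.rate + alpha * (new_items / elapsed)
        self.last_refresh = now


def pick_feeds(feeds: list[Feed], now: float, budget: int) -> list[Feed]:
    # Greedy choice: refresh the `budget` feeds with the most expected pending items.
    ranked = sorted(feeds, key=lambda f: f.expected_pending(now), reverse=True)
    return ranked[:budget]


feeds = [Feed("news", rate=5.0), Feed("blog", rate=0.2), Feed("forum", rate=1.0)]
chosen = pick_feeds(feeds, now=2.0, budget=2)
chosen_names = [f.name for f in chosen]
```

With a budget of two refreshes at time 2.0, the sketch selects the two feeds whose estimated backlog is largest. After each refresh, calling `observe` with the number of items actually found adjusts the feed's rate estimate, which is the sense in which the estimation is online.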

Author information

Authors and Affiliations

LIP6, University Pierre et Marie Curie, 4 Place Jussieu, 75005, Paris, France
Roxana Horincar, Bernd Amann & Thierry Artières

Corresponding author

Correspondence to Roxana Horincar.

About this article

Cite this article

Horincar, R., Amann, B. & Artières, T. Online refresh strategies for content based feed aggregation. World Wide Web 18, 913–947 (2015). https://doi.org/10.1007/s11280-014-0288-y

Received: 23 March 2013
Revised: 21 November 2013
Accepted: 02 March 2014
Published: 02 May 2014
Issue Date: July 2015
DOI: https://doi.org/10.1007/s11280-014-0288-y
