Online refresh strategies for content based feed aggregation

World Wide Web

Abstract

With the rapid growth of data sources, services and devices connected to the Internet, web content available online is becoming increasingly diverse and dynamic. To facilitate the efficient dissemination of evolving and temporary information, many web applications publish their new information as RSS and Atom documents, which are then collected and transformed by RSS aggregators such as Feedly or Yahoo! News. This article addresses the problem of large-scale aggregation of highly dynamic information sources by focusing on the design of optimal refresh strategies for large collections of RSS feed documents. First, we introduce two quality measures specific to RSS aggregation, which reflect the information completeness and average freshness of the result feeds. Then, we propose a best-effort feed refresh strategy that achieves maximum aggregation quality compared with all other existing policies that use the same average number of refreshes. This strategy is based on specific online change estimation models developed after an in-depth analysis of the temporal publication characteristics of a representative collection of real-world RSS feeds. The presented methods have been implemented and tested against synthetic and real-world RSS feed data sets.

Author information

Corresponding author

Correspondence to Roxana Horincar.

About this article

Cite this article

Horincar, R., Amann, B. & Artières, T. Online refresh strategies for content based feed aggregation. World Wide Web 18, 913–947 (2015). https://doi.org/10.1007/s11280-014-0288-y

  • DOI: https://doi.org/10.1007/s11280-014-0288-y
