Abstract
With the rapid growth of data sources, services and devices connected to the Internet, web content available online is becoming increasingly diverse and dynamic. To disseminate evolving and temporary information efficiently, many web applications publish their new information as RSS and Atom documents, which are then collected and transformed by RSS aggregators such as Feedly or Yahoo! News. This article addresses the large-scale aggregation of highly dynamic information sources, focusing on the design of optimal refresh strategies for large collections of RSS feed documents. First, we introduce two quality measures specific to RSS aggregation, which reflect the information completeness and the average freshness of the result feeds. Then, we propose a best-effort feed refresh strategy that achieves the highest aggregation quality among existing policies that use the same average number of refreshes. This strategy is based on online change estimation models developed after an in-depth analysis of the temporal publication characteristics of a representative collection of real-world RSS feeds. The presented methods have been implemented and tested on synthetic and real-world RSS feed data sets.
Cite this article
Horincar, R., Amann, B. & Artières, T. Online refresh strategies for content based feed aggregation. World Wide Web 18, 913–947 (2015). https://doi.org/10.1007/s11280-014-0288-y
DOI: https://doi.org/10.1007/s11280-014-0288-y