Abstract
Content syndication has become a popular means of delivering frequently updated information on the Web in a timely manner. Today, web syndication technologies such as RSS and Atom are used in a wide variety of applications, ranging from large-scale news broadcasting to medium-scale information sharing in scientific and professional communities. However, they exhibit serious limitations in dealing with information overload in Web 2.0. There is a vital need for efficient real-time filtering methods across feeds that allow users to effectively follow the information that interests them. In this paper, we investigate three indexing techniques for users’ subscriptions, based on inverted lists or on an ordered trie, for exact and partial matching. We present analytical models of memory requirements and matching time, and we conduct a thorough experimental evaluation to show the impact of critical parameters of realistic web syndication workloads.
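To make the idea of indexing subscriptions with inverted lists concrete, here is a minimal sketch of a count-based inverted list. It is not the implementation evaluated in the paper. It assumes that a subscription is a set of keywords and that it matches an item when all of its keywords occur in the item. The class name `CountingInvertedIndex`, its methods, and the sample subscriptions are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


class CountingInvertedIndex:
    """Illustrative count-based inverted list over keyword subscriptions.

    Assumption: a subscription matches an item when every one of its
    terms occurs in the item.
    """

    def __init__(self):
        # term -> ids of the subscriptions that contain this term
        self.postings = defaultdict(list)
        # subscription id -> number of distinct terms it contains
        self.sizes = {}

    def subscribe(self, sub_id, terms):
        distinct = set(terms)
        if not distinct:
            raise ValueError("a subscription needs at least one term")
        self.sizes[sub_id] = len(distinct)
        for term in distinct:
            self.postings[term].append(sub_id)

    def match(self, item_terms):
        """Return the sorted ids of subscriptions fully covered by the item."""
        hits = defaultdict(int)
        # Scan one posting list per distinct item term and count hits.
        for term in set(item_terms):
            for sub_id in self.postings.get(term, ()):
                hits[sub_id] += 1
        # A subscription matches once all of its terms have been seen.
        return sorted(s for s, n in hits.items() if n == self.sizes[s])


index = CountingInvertedIndex()
index.subscribe("s1", ["rss", "filtering"])
index.subscribe("s2", ["atom"])
index.subscribe("s3", ["rss", "trie", "index"])
matched = index.match("real-time filtering of rss and atom feeds".split())
```

In this sketch, matching cost depends on the lengths of the posting lists of the item's terms, which is the kind of quantity an analytical model of matching time would need to capture.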
Additional information
A preliminary version of the paper was published in the Proceedings of EDBT 2012.
Cite this article
Hmedeh, Z., Kourdounakis, H., Christophides, V. et al. Content-Based Publish/Subscribe System for Web Syndication. J. Comput. Sci. Technol. 31, 359–380 (2016). https://doi.org/10.1007/s11390-016-1632-8
DOI: https://doi.org/10.1007/s11390-016-1632-8