ABSTRACT
Systems such as social networks, search engines, or trading platforms operate geographically distant sites that continuously generate streams of events at a high rate. Such events can be access logs of web servers, feeds of messages from participants in a social network, or financial data, among others. The ability to detect trends and popularity variations in a timely manner is of paramount importance in such systems. In particular, determining the most popular events across all sites makes it possible to capture the most relevant information in near real-time and to quickly adapt the system to the load. This paper presents TOPiCo, a protocol that computes the most popular events across geo-distributed sites in a low-cost, bandwidth-efficient, and timely manner. TOPiCo starts by building the set of most popular events locally at each site. Then, it disseminates only the events that have a chance of being among the most popular ones across all sites, significantly reducing the required bandwidth. We give a correctness proof of our algorithm and evaluate TOPiCo using a real-world trace of more than 240 million events spread across 32 sites. Our empirical results show that (i) TOPiCo is timely and cost-efficient for detecting popular events in a large-scale setting, (ii) it adapts dynamically to the distribution of the events, and (iii) it is particularly efficient for skewed distributions.
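The two steps described above (local top-k first, then dissemination of only those events that could still rank globally) can be illustrated with a minimal sketch. This is not TOPiCo's algorithm, whose details the abstract does not give; it is a generic three-phase threshold filter of the kind used in distributed top-k query processing, and the function name `global_top_k`, the `sites` list of per-site counters, and the string event identifiers are all assumptions made for illustration.

```python
from collections import Counter
from typing import List, Tuple


def global_top_k(sites: List[Counter], k: int) -> Tuple[List[Tuple[str, int]], int]:
    """Exact global top-k over per-site event counts using threshold pruning.

    Hypothetical helper, NOT the TOPiCo protocol. Returns the top-k
    (event, global_count) pairs and the number of candidate events
    that had to be exchanged between sites.
    """
    m = len(sites)

    # Phase 1: every site reports only its local top-k events.
    partial = Counter()
    for local in sites:
        for event, count in local.most_common(k):
            partial[event] += count

    # The k-th largest partial sum is a lower bound on the k-th largest
    # global count, because partial sums never exceed true global counts.
    ranked = sorted(partial.values(), reverse=True)
    tau = ranked[k - 1] if len(ranked) >= k else 0

    # Phase 2: an event with global count >= tau must have a local count
    # >= tau / m on at least one site (pigeonhole), so each site only
    # needs to disseminate events that reach this local threshold.
    threshold = tau / m
    candidates = set(partial)
    for local in sites:
        for event, count in local.items():
            if count >= threshold:
                candidates.add(event)

    # Phase 3: fetch exact counts for the surviving candidates only.
    totals = {e: sum(local[e] for local in sites) for e in candidates}

    # Sort by descending count, breaking ties by event name.
    top = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
    return top, len(candidates)
```

With skewed counts, few events reach the local threshold on any site, so the candidate set stays small relative to the number of distinct events. That matches the intuition behind the abstract's claim about skewed distributions, although the sketch says nothing about TOPiCo's actual bandwidth cost or timeliness.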
Index Terms
- TOPiCo: detecting most frequent items from multiple high-rate event streams