Abstract
Approximate stream processing algorithms, such as the Count-Min sketch and Space-Saving, support numerous applications in areas such as databases, storage systems, and networking. However, the unbalanced distribution of real data streams is challenging for existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter, that enables faster and more accurate stream processing. Unlike existing filters, which mainly focus on hot (frequent) items, our filter captures cold (infrequent) items in the first stage and hot items in the second stage. Existing filters also require two-direction communication, with frequent exchanges between the two stages; our filter, on the other hand, is one-direction: each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, providing a level of genericity that makes it applicable to many stream processing tasks. To illustrate the benefits of our filter, we deploy it on four typical stream processing tasks. Experimental results show speed improvements of up to 4.7 times and accuracy improvements of up to 51 times.
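The two-stage, one-direction flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `ColdFilterSketch`, the parameters `width`, `depth`, and `threshold`, the single counter layer, the conservative-update rule, and the exact `Counter` standing in for the second-stage algorithm are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class ColdFilterSketch:
    """Illustrative one-layer filter in front of a second-stage counter."""

    def __init__(self, width=1024, depth=3, threshold=15):
        self.width = width          # number of stage-one counters (assumed)
        self.depth = depth          # hash functions per item (assumed)
        self.threshold = threshold  # counts below this are treated as cold
        self.counters = [0] * width
        self.stage2 = Counter()     # stand-in for any stream algorithm

    def _slots(self, item):
        # Derive `depth` counter positions from seeded BLAKE2b digests.
        slots = []
        for i in range(self.depth):
            digest = hashlib.blake2b(f"{i}:{item}".encode(), digest_size=8).digest()
            slots.append(int.from_bytes(digest, "big") % self.width)
        return slots

    def insert(self, item):
        slots = self._slots(item)
        low = min(self.counters[s] for s in slots)
        if low < self.threshold:
            # Still cold: absorb the arrival in stage one by raising only
            # the smallest mapped counters (conservative update).
            for s in set(slots):
                if self.counters[s] == low:
                    self.counters[s] += 1
            return "stage1"
        # Mapped counters are saturated: forward the arrival to stage two.
        # Nothing is ever sent back to stage one.
        self.stage2[item] += 1
        return "stage2"

    def estimate(self, item):
        low = min(self.counters[s] for s in self._slots(item))
        if low < self.threshold:
            return low
        return low + self.stage2[item]


cf = ColdFilterSketch(threshold=5)
routes = [cf.insert("hot-item") for _ in range(20)]
print(routes.count("stage1"), routes.count("stage2"), cf.estimate("hot-item"))
```

In this sketch every arrival is handled by exactly one stage, so the second stage only sees traffic from items whose stage-one counters have already reached the threshold. With several distinct items, hash collisions in stage one can inflate estimates, which is the usual trade-off of counter-based sketches.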
Notes
An optional part can exist in the first stage in Fig. 1, in case more information is required about the cold items, i.e., not only their frequencies.
In the rest of this paper, “CF” refers to our two-layer Cold Filter.
\(f_{e_t}[t-1]\) is the frequency of item \(e_t\) before updating, and \(f_{e_t}[t]= f_{e_t}[t-1] + 1\).
\([E] = \{1, 2, \ldots , E\}\).
\(Z^{+}\) is the set of non-negative integers.
The literature [2] reported skewness \(> 1.4\) in real data streams.
References
Cormode, G., Johnson, T., Korn, F., Muthukrishnan, S., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: Proceedings of ACM SIGMOD, pp. 35–46 (2004)
Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)
Zhao, P., Aggarwal, C.C., Wang, M.: gSketch: on query estimation in graph streams. Proc. VLDB 5, 193–204 (2011)
Roy, P., Khan, A., Alonso, G.: Augmented sketch: faster and more accurate stream processing. In: Proceedings of ACM SIGMOD, pp. 1449–1463 (2016)
Chen, B., Shrivastava, A.: Densified winner take all (WTA) hashing for sparse datasets. In: Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6–10, 2018, pp. 906–916 (2018)
Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: Proceedings of ACM SIGMOD, pp. 61–72. ACM (2002)
Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB 1(2), 1530–1541 (2008)
Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4(1–3), 1–294 (2012)
Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Alg. 55(1), 58–75 (2005)
Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory, pp. 398–412. Springer (2005)
Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R., Hennessy, M. (eds.) Automata, Languages and Programming. Springer, Berlin (2002)
Schweller, R., Gupta, A., Parsons, E., Chen, Y.: Reversible sketches for efficient and accurate change detection over network data streams. In: Proceedings of ACM IMC, pp. 207–212. ACM (2004)
Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of VLDB, pp. 454–465. VLDB Endowment (2002)
Luo, C., Shrivastava, A.: SSH (sketch, shingle, & hash) for indexing massive-scale time series. In: NIPS 2016 Time Series Workshop, pp. 38–58 (2017)
Shrivastava, A., Konig, A.C., Bilenko, M.: Time adaptive sketches (ada-sketches) for summarizing data streams. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1417–1432. ACM (2016)
Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)
Garofalakis, M., Gibbons, P.B.: Wavelet synopses with error guarantees. In: Proceedings of ACM SIGMOD, pp. 476–487. ACM (2002)
Guha, S., Koudas, N., Shim, K.: Data-streams and histograms. In: Proceedings of STOC, pp. 471–475. ACM (2001)
Kirsch, A., Mitzenmacher, M., Varghese, G.: Hash-based techniques for high-speed packet processing. In: Cormode, G., Thottan, M. (eds.) Algorithms for Next Generation Networks, pp. 181–218. Springer, London (2010)
Pandey, P., Bender, M.A., Johnson, R., Patro, R.: A general-purpose counting filter: making every bit count. In: Proceedings of ACM SIGMOD, pp. 775–787 (2017)
Thomas, D., Bordawekar, R., et al.: On efficient query processing of stream counts on the cell processor. In: Proceedings of IEEE ICDE (2009)
Yang, T., Liu, A.X., Shahzad, M., Zhong, Y., Fu, Q., Li, Z., Xie, G., Li, X.: A shifting bloom filter framework for set queries. Proc. VLDB 9(5), 408–419 (2016)
Yang, T., Zhou, Y., Jin, H., Chen, S., Li, X.: Pyramid sketch: a sketch framework for frequency estimation of data streams. Proc. VLDB 10(11), 1442–1453 (2017)
Zhou, Y., Liu, P., Jin, H., Yang, T., Dang, S., Li, X.: One memory access sketch: a more accurate and faster sketch for per-flow measurement. In: IEEE Globecom (2017)
Gong, J., Yang, T., Zhou, Y., Yang, D., Chen, S., Cui, B., Li, X.: ABC: a practicable sketch framework for non-uniform multisets. In: IEEE BigData (2017)
Wang, L., Cai, Z., Wang, H., Jiang, J., Yang, T., Cui, B., Li, X.: Fine-grained probability counting: refined loglog algorithm. In: IEEE BigComp (2018)
Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of EMNLP-CoNLL. Association for Computational Linguistics (1998)
Adamic, L.A., Huberman, B.A.: Power-law distribution of the world wide web. Science 287(5461), 2115–2115 (2000)
Goyal, A., Daumé III, H., Cormode, G.: Sketch algorithms for estimating point queries in NLP. In: Proceedings of EMNLP (2012)
Mandal, A., Jiang, H., Shrivastava, A., Sarkar, V.: Topkapi: parallel and fast sketches for finding top-k frequent elements. In: Advances in Neural Information Processing Systems, pp. 10898–10908 (2018)
Henzinger, M.R.: Algorithmic challenges in web search engines. Internet Math. 1(1), 115–123 (2004)
Li, Y., Miao, R., Kim, C., Yu, M.: Flowradar: a better netflow for data centers. In: Proceedings of USENIX NSDI, pp. 311–324 (2016)
Goodrich, M.T., Mitzenmacher, M.: Invertible bloom lookup tables. In: Proceedings of the 49th Annual Allerton Conference on Communication, Control, and Computing, pp. 792–799. IEEE (2011)
Xiao, Q., Qiao, Y., Zhen, M., Chen, S.: Estimating the persistent spreads in high-speed networks. In: 2014 IEEE 22nd International Conference on Network Protocols (ICNP), pp. 131–142. IEEE (2014)
Dai, H., Shahzad, M., Liu, A.X., Zhong, Y.: Finding persistent items in data streams. Proc. VLDB Endow. 10(4), 289–300 (2016)
Shokrollahi, A.: Raptor codes. IEEE Trans. Inf. Theory 52(6), 2551–2567 (2006)
Ganguly, S., Garofalakis, M., Rastogi, R.: Processing data-stream join aggregates using skimmed sketches. In: International Conference on Extending Database Technology, pp. 569–586. Springer (2004)
Source code related to cold filter meta-framework. https://github.com/zhouyangpkuer/ColdFilter. Accessed May 2018
Ting, D.: Data sketches for disaggregated subset sum and frequent item estimation. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1129–1140. ACM (2018)
Wei, Z., Luo, G., Yi, K., Du, X., Wen, J.-R.: Persistent data sketching. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 795–810. ACM (2015)
Peng, Y., Guo, J., Li, F., Qian, W., Zhou, A.: Persistent bloom filter: membership testing for the entire history. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1037–1052. ACM (2018)
Chen, J., Zhang, Q.: Bias-aware sketches. Proc. VLDB Endow. 10(9), 961–972 (2017)
Wei, Z., Liu, X., Li, F., Shang, S., Du, X., Wen, J.-R.: Matrix sketching over sliding windows. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1465–1480. ACM (2016)
Agrawal, N., Vulimiri, A.: Low-latency analytics on colossal data streams with summarystore. In: Proceedings of the 26th Symposium on Operating Systems Principles, pp. 647–664. ACM (2017)
Cui, H., Keeton, K., Roy, I., Viswanathan, K., Ganger, G.R.: Using data transformations for low-latency time series analysis. In: Proceedings of the Sixth ACM Symposium on Cloud Computing, pp. 395–407. ACM (2015)
Rabkin, A., Arye, M., Sen, S., Pai, V.S., Freedman, M.J.: Aggregation and degradation in jetstream: streaming analytics in the wide area. NSDI 14, 275–288 (2014)
Jiang, J., Fu, F., Yang, T., Cui, B.: SketchML: Accelerating distributed machine learning with data sketches. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1269–1284. ACM (2018)
Aghazadeh, A., Spring, R., LeJeune, D., Dasarathy, G., Shrivastava, A., Baraniuk, R.G.: MISSION: ultra large-scale feature selection using count-sketches. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, pp. 80–88 (2018)
Shrivastava, A.: Fast and accurate training of 100,000 classes on a single titan x. (Preprint)
Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of ACM PODS, pp. 1–16. ACM (2002)
Muthukrishnan, S. et al.: Data streams: algorithms and applications. Found. Trends® Theor. Comput. Sci. 1(2), 117–236 (2005)
Guo, C., Yuan, L., Xiang, D., et al.: Pingmesh: a large-scale system for data center network latency measurement and analysis. ACM SIGCOMM CCR 45(4), 139–152 (2015)
Zhu, Y., Kang, N., Cao, J., et al.: Packet-level telemetry in large datacenter networks. In: ACM SIGCOMM CCR, vol. 45, pp. 479–491. ACM (2015)
Pagh, R., Rodler, F.: Lossy dictionaries. Algorithms—ESA 2001, pp. 300–311 (2001)
Intel SSE2 Documentation. https://software.intel.com/en-us/node/683883. Accessed May 2018
Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: a meta-framework for faster and more accurate stream processing. In: Proceedings of SIGMOD (2018)
Lu, Y., Montanari, A., Prabhakar, B., Dharmapurikar, S., Kabbani, A.: Counter braids: a novel counter architecture for per-flow measurement. ACM Sigmetrics Perform. Eval. Rev. 36(1), 121–132 (2008)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of VLDB, pp. 346–357. VLDB Endowment (2002)
Golab, L., DeHaan, D., Demaine, E.D., Lopez-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: Proceedings of ACM IMC, pp. 173–178. ACM (2003)
Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. (TODS) 28(1), 51–55 (2003)
Roberts, S.: Control chart tests based on geometric moving averages. Technometrics 1(3), 239–250 (1959)
Indyk, P.: Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 189–197. IEEE (2000)
Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of ACM IMC, pp. 234–247. ACM (2003)
Schweller, R., Li, Z., Chen, Y., et al.: Reversible sketches: enabling monitoring and analysis over high-speed data streams. IEEE/ACM Trans. Netw. (ToN) 15(5), 1059–1072 (2007)
Guha, S., McGregor, A.: Stream order and order statistics: quantile estimation in random-order streams. SIAM J. Comput. 38(5), 2044–2059 (2009)
Wei, Z., Luo, G., Yi, K., Du, X., Wen, J.-R.: Persistent data sketching. In: Proceedings of ACM SIGMOD, pp. 795–810. ACM (2015)
The caida anonymized 2016 internet traces. http://www.caida.org/data/overview/. Accessed May 2018
Real-life transactional dataset. http://fimi.ua.ac.be/data/. Accessed May 2018
Rousskov, A., Wessels, D.: High-performance benchmarking with web polygraph. Softw.: Pract. Exp. 34(2), 187–211 (2004)
Hash website. http://burtleburtle.net/bob/hash/evahash.html. Accessed May 2018
Ji, M., Yan, J., Gu, S., Han, J., He, X., Zhang, W.V., Chen, Z.: Learning search tasks in queries and web pages via graph regularization. In: Proceedings of ACM SIGIR, pp. 55–64. ACM (2011)
Goyal, A., Daumé III, H., Cormode, G.: Sketch algorithms for estimating point queries in NLP. In: EMNLP-CoNLL, pp. 1093–1103 (2012)
Qiao, Y., Li, T., Chen, S.: One memory access bloom filters and their generalization. In: INFOCOM, 2011 Proceedings IEEE, pp. 1745–1753. IEEE (2011)
Roy, P., Teubner, J., Alonso, G.: Efficient frequent item counting in multi-core hardware. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2012)
Acknowledgements
This work is supported by the National Key Research and Development Program of China (2018YFB1004403, 2016YFB1000304), NSFC (61672061, 61832001, and 61572039).
About this article
Cite this article
Yang, T., Jiang, J., Zhou, Y. et al. Fast and accurate stream processing by filtering the cold. The VLDB Journal 28, 735–763 (2019). https://doi.org/10.1007/s00778-019-00560-1
DOI: https://doi.org/10.1007/s00778-019-00560-1