Fast and accurate stream processing by filtering the cold

  • Regular Paper
The VLDB Journal

Abstract

Approximate stream processing algorithms, such as the Count-Min sketch and Space-Saving, support numerous applications in areas such as databases, storage systems, and networking. However, the unbalanced distribution in real data streams is challenging for existing algorithms. To enhance these algorithms, we propose a meta-framework, called Cold Filter, that enables faster and more accurate stream processing. Unlike existing filters, which mainly focus on hot (frequent) items, our filter captures cold (infrequent) items in the first stage and hot items in the second stage. Existing filters also require two-direction communication, with frequent exchanges between the two stages; our filter, on the other hand, is one-direction: each item enters one stage at most once. Our filter can accurately estimate both cold and hot items, providing a level of genericity that makes it applicable to many stream processing tasks. To illustrate the benefits of our filter, we deploy it on four typical stream processing tasks. Experimental results show speed improvements of up to 4.7 times and accuracy improvements of up to 51 times.
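
The two-stage, one-direction idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a single-layer filter (the paper's CF is two-layer, see Note 2), a conservative-update rule for the first stage, a saturation threshold, and an exact dictionary standing in for whatever second-stage algorithm (for example Count-Min or Space-Saving) would be used in practice. The class name `ColdFilterSketch` and the parameters `width`, `num_hashes`, and `threshold` are hypothetical names chosen for illustration.

```python
import hashlib


class ColdFilterSketch:
    """Toy one-layer filter (an assumption; the paper's CF has two layers)."""

    def __init__(self, width=1024, num_hashes=3, threshold=15):
        self.width = width              # number of stage-1 counters
        self.num_hashes = num_hashes    # counters touched per item
        self.threshold = threshold      # stage-1 counters saturate here
        self.counters = [0] * width     # stage 1: small counters for cold items
        self.hot = {}                   # stage 2 stand-in: exact counts

    def _positions(self, item):
        # Derive num_hashes counter indexes from seeded BLAKE2b digests.
        positions = []
        for seed in range(self.num_hashes):
            data = f"{seed}:{item}".encode()
            digest = hashlib.blake2b(data, digest_size=8).digest()
            positions.append(int.from_bytes(digest, "big") % self.width)
        return positions

    def insert(self, item):
        positions = self._positions(item)
        smallest = min(self.counters[p] for p in positions)
        if smallest < self.threshold:
            # Still cold: conservative update, raise only the smallest counters.
            for p in positions:
                if self.counters[p] == smallest:
                    self.counters[p] += 1
            return "stage1"
        # All hashed counters are saturated: this arrival goes to stage 2 only.
        self.hot[item] = self.hot.get(item, 0) + 1
        return "stage2"

    def estimate(self, item):
        smallest = min(self.counters[p] for p in self._positions(item))
        if smallest < self.threshold:
            return smallest                           # answered by stage 1
        return self.threshold + self.hot.get(item, 0)  # stage 1 + stage 2


if __name__ == "__main__":
    cf = ColdFilterSketch()
    for _ in range(20):
        cf.insert("hot-item")
    for _ in range(3):
        cf.insert("cold-item")
    print(cf.estimate("hot-item"), cf.estimate("cold-item"))
```

In this sketch every arrival is handled by exactly one stage and nothing ever flows back from the second stage to the first, which is the one-direction property. Estimates can overshoot the true frequency when items collide in the first stage, but under these assumptions they never undershoot it.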

Notes

  1. An optional part can exist in the first stage in Fig. 1, in case more information is required about the cold items, i.e., not only their frequencies.

  2. In the rest of this paper, “CF” refers to our two-layer Cold Filter.

  3. \(f_{e_t}[t-1]\) is the frequency of item \(e_t\) before updating, and \(f_{e_t}[t]= f_{e_t}[t-1] + 1\).

  4. \([E] = \{1, 2, \ldots , E\}\).

  5. \(Z^{+}\) is the set of non-negative integers.

  6. The literature [57, 72] recommends using a small number of hash functions.

  7. The literature [2] reported skewness \(> 1.4\) in real data streams.

References

  1. Cormode, G., Johnson, T., Korn, F., Muthukrishnan, S., Spatscheck, O., Srivastava, D.: Holistic UDAFs at streaming speeds. In: Proceedings of ACM SIGMOD, pp. 35–46 (2004)

  2. Manerikar, N., Palpanas, T.: Frequent items in streaming data: an experimental evaluation of the state-of-the-art. Data Knowl. Eng. 68(4), 415–430 (2009)

  3. Zhao, P., Aggarwal, C.C., Wang, M.: gSketch: on query estimation in graph streams. Proc. VLDB 5, 193–204 (2011)

  4. Roy, P., Khan, A., Alonso, G.: Augmented sketch: faster and more accurate stream processing. In: Proceedings of ACM SIGMOD, pp. 1449–1463 (2016)

  5. Chen, B., Shrivastava, A.: Densified winner take all (WTA) hashing for sparse datasets. In: Proceedings of the Thirty-Fourth Conference on Uncertainty in Artificial Intelligence, UAI 2018, Monterey, California, USA, August 6–10, 2018, pp. 906–916 (2018)

  6. Dobra, A., Garofalakis, M., Gehrke, J., Rastogi, R.: Processing complex aggregate queries over data streams. In: Proceedings of ACM SIGMOD, pp. 61–72. ACM (2002)

  7. Cormode, G., Hadjieleftheriou, M.: Finding frequent items in data streams. Proc. VLDB 1(2), 1530–1541 (2008)

  8. Cormode, G., Garofalakis, M., Haas, P.J., Jermaine, C.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends Databases 4(1–3), 1–294 (2012)

  9. Cormode, G., Muthukrishnan, S.: An improved data stream summary: the count-min sketch and its applications. J. Alg. 55(1), 58–75 (2005)

  10. Metwally, A., Agrawal, D., El Abbadi, A.: Efficient computation of frequent and top-k elements in data streams. In: International Conference on Database Theory, pp. 398–412. Springer (2005)

  11. Charikar, M., Chen, K., Farach-Colton, M.: Finding frequent items in data streams. In: Widmayer, P., Eidenbenz, S., Triguero, F., Morales, R., Conejo, R., Hennessy, M. (eds.) Automata, Languages and Programming. Springer, Berlin (2002)

  12. Schweller, R., Gupta, A., Parsons, E., Chen, Y.: Reversible sketches for efficient and accurate change detection over network data streams. In: Proceedings of ACM IMC, pp. 207–212. ACM (2004)

  13. Gilbert, A.C., Kotidis, Y., Muthukrishnan, S., Strauss, M.J.: How to summarize the universe: dynamic maintenance of quantiles. In: Proceedings of VLDB, pp. 454–465. VLDB Endowment (2002)

  14. Luo, C., Shrivastava, A.: SSH (sketch, shingle, & hash) for indexing massive-scale time series. In: NIPS 2016 Time Series Workshop, pp. 38–58 (2017)

  15. Shrivastava, A., Konig, A.C., Bilenko, M.: Time adaptive sketches (ada-sketches) for summarizing data streams. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1417–1432. ACM (2016)

  16. Bloom, B.H.: Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13(7), 422–426 (1970)

  17. Garofalakis, M., Gibbons, P.B.: Wavelet synopses with error guarantees. In: Proceedings of ACM SIGMOD, pp. 476–487. ACM (2002)

  18. Guha, S., Koudas, N., Shim, K.: Data-streams and histograms. In: Proceedings of STOC, pp. 471–475. ACM (2001)

  19. Kirsch, A., Mitzenmacher, M., Varghese, G.: Hash-based techniques for high-speed packet processing. In: Cormode, G., Thottan, M. (eds.) Algorithms for Next Generation Networks, pp. 181–218. Springer, London (2010)

  20. Pandey, P., Bender, M.A., Johnson, R., Patro, R.: A general-purpose counting filter: Making every bit count. In: Proceedings of ACM SIGMOD, pp. 775–787

  21. Thomas, D., Bordawekar, R., et al.: On efficient query processing of stream counts on the cell processor. In: Proceedings of IEEE ICDE (2009)

  22. Yang, T., Liu, A.X., Shahzad, M., Zhong, Y., Fu, Q., Li, Z., Xie, G., Li, X.: A shifting bloom filter framework for set queries. Proc. VLDB 9(5), 408–419 (2016)

  23. Yang, T., Zhou, Y., Jin, H., Chen, S., Li, X.: Pyramid sketch: a sketch framework for frequency estimation of data streams. Proc. VLDB 10(11), 1442–1453 (2017)

  24. Zhou, Y., Liu, P., Jin, H., Yang, T., Dang, S., Li, X.: One memory access sketch: a more accurate and faster sketch for per-flow measurement. In: IEEE Globecom (2017)

  25. Gong, J., Yang, T., Zhou, Y., Yang, D., Chen, S., Cui, B., Li, X.: Abc: a practicable sketch framework for non-uniform multisets. IEEE Bigdata (2017)

  26. Wang, L., Cai, Z., Wang, H., Jiang, J., Yang, T., Cui, B., Li, X.: Fine-grained probability counting: Refined loglog algorithm. IEEE Bigcomp (2018)

  27. Powers, D.M.: Applications and explanations of Zipf’s law. In: Proceedings of EMNLP-CoNLL. Association for Computational Linguistics (1998)

  28. Adamic, L.A., Huberman, B.A.: Power-law distribution of the world wide web. Science 287(5461), 2115–2115 (2000)

  29. Goyal, A., Daumé III, H., Cormode, G.: Sketch algorithms for estimating point queries in NLP. In: Proceedings of EMNLP (2012)

  30. Mandal, A., Jiang, H., Shrivastava, A., Sarkar, V.: Topkapi: parallel and fast sketches for finding top-k frequent elements. In: Advances in Neural Information Processing Systems, pp. 10898–10908 (2018)

  31. Henzinger, M.R.: Algorithmic challenges in web search engines. Internet Math. 1(1), 115–123 (2004)

  32. Li, Y., Miao, R., Kim, C., Yu, M.: Flowradar: a better netflow for data centers. In: Proceedings of USENIX NSDI, pp. 311–324 (2016)

  33. Goodrich, M.T., Mitzenmacher, M.: Invertible bloom lookup tables. In: Proceedings of the 49th Annual Allerton Conference on Communication, Control, and Computing, pp. 792–799. IEEE (2011)

  34. Xiao, Q., Qiao, Y., Zhen, M., Chen, S.: Estimating the persistent spreads in high-speed networks. In: 2014 IEEE 22nd International Conference on Network Protocols (ICNP), pp. 131–142. IEEE (2014)

  35. Dai, H., Shahzad, M., Liu, A.X., Zhong, Y.: Finding persistent items in data streams. Proc. VLDB Endow. 10(4), 289–300 (2016)

  36. Shokrollahi, A.: Raptor codes. IEEE Trans. Inf. Theory 52(6), 2551–2567 (2006)

  37. Ganguly, S., Garofalakis, M., Rastogi, R.: Processing data-stream join aggregates using skimmed sketches. In: International Conference on Extending Database Technology, pp. 569–586. Springer (2004)

  38. Source code related to cold filter meta-framework. https://github.com/zhouyangpkuer/ColdFilter. Accessed May 2018

  39. Ting, D.: Data sketches for disaggregated subset sum and frequent item estimation. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1129–1140. ACM (2018)

  40. Wei, Z., Luo, G., Yi, K., Du, X., Wen, J.-R.: Persistent data sketching. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 795–810. ACM (2015)

  41. Peng, Y., Guo, J., Li, F., Qian, W., Zhou, A.: Persistent bloom filter: membership testing for the entire history. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1037–1052. ACM (2018)

  42. Chen, J., Zhang, Q.: Bias-aware sketches. Proc. VLDB Endow. 10(9), 961–972 (2017)

  43. Wei, Z., Liu, X., Li, F., Shang, S., Du, X., Wen, J.-R.: Matrix sketching over sliding windows. In: Proceedings of the 2016 International Conference on Management of Data, pp. 1465–1480. ACM (2016)

  44. Agrawal, N., Vulimiri, A.: Low-latency analytics on colossal data streams with summarystore. In: Proceedings of the 26th Symposium on Operating Systems Principles, pp. 647–664. ACM (2017)

  45. Cui, H., Keeton, K., Roy, I., Viswanathan, K., Ganger, G.R.: Using data transformations for low-latency time series analysis. In: Proceedings of the Sixth ACM Symposium on Cloud Computing, pp. 395–407. ACM (2015)

  46. Rabkin, A., Arye, M., Sen, S., Pai, V.S., Freedman, M.J.: Aggregation and degradation in jetstream: streaming analytics in the wide area. NSDI 14, 275–288 (2014)

  47. Jiang, J., Fu, F., Yang, T., Cui, B.: SketchML: Accelerating distributed machine learning with data sketches. In: Proceedings of the 2018 International Conference on Management of Data, pp. 1269–1284. ACM (2018)

  48. Aghazadeh, A., Spring, R., LeJeune, D., Dasarathy, G., Shrivastava, A., Baraniuk, R.G.: MISSION: ultra large-scale feature selection using count-sketches. In: Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10–15, 2018, pp. 80–88 (2018)

  49. Shrivastava, A.: Fast and accurate training of 100,000 classes on a single titan x. (Preprint)

  50. Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J.: Models and issues in data stream systems. In: Proceedings of ACM PODS, pp. 1–16. ACM (2002)

  51. Muthukrishnan, S. et al.: Data streams: algorithms and applications. Found. Trends® Theor. Comput. Sci. 1(2), 117–236 (2005)

  52. Guo, C., Yuan, L., Xiang, D., et al.: Pingmesh: a large-scale system for data center network latency measurement and analysis. ACM SIGCOMM CCR 45(4), 139–152 (2015)

  53. Zhu, Y., Kang, N., Cao, J., et al.: Packet-level telemetry in large datacenter networks. In: ACM SIGCOMM CCR, vol. 45, pp. 479–491. ACM (2015)

  54. Pagh, R., Rodler, F.: Lossy dictionaries. Algorithms—ESA 2001, pp. 300–311 (2001)

  55. Intel SSE2 Documentation. https://software.intel.com/en-us/node/683883. Accessed May 2018

  56. Zhou, Y., Yang, T., Jiang, J., Cui, B., Yu, M., Li, X., Uhlig, S.: Cold filter: a meta-framework for faster and more accurate stream processing. In: Proceedings of SIGMOD (2018)

  57. Lu, Y., Montanari, A., Prabhakar, B., Dharmapurikar, S., Kabbani, A.: Counter braids: a novel counter architecture for per-flow measurement. ACM Sigmetrics Perform. Eval. Rev. 36(1), 121–132 (2008)

  58. Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of VLDB, pp. 346–357. VLDB Endowment (2002)

  59. Golab, L., DeHaan, D., Demaine, E.D., Lopez-Ortiz, A., Munro, J.I.: Identifying frequent items in sliding windows over on-line packet streams. In: Proceedings of ACM IMC, pp. 173–178. ACM (2003)

  60. Karp, R.M., Shenker, S., Papadimitriou, C.H.: A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst. (TODS) 28(1), 51–55 (2003)

  61. Roberts, S.: Control chart tests based on geometric moving averages. Technometrics 1(3), 239–250 (1959)

  62. Indyk, P.: Stable distributions, pseudorandom generators, embeddings and data stream computation. In: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, pp. 189–197. IEEE (2000)

  63. Krishnamurthy, B., Sen, S., Zhang, Y., Chen, Y.: Sketch-based change detection: methods, evaluation, and applications. In: Proceedings of ACM IMC, pp. 234–247. ACM (2003)

  64. Schweller, R., Li, Z., Chen, Y., et al.: Reversible sketches: enabling monitoring and analysis over high-speed data streams. IEEE/ACM Trans. Netw. (ToN) 15(5), 1059–1072 (2007)

  65. Guha, S., McGregor, A.: Stream order and order statistics: quantile estimation in random-order streams. SIAM J. Comput. 38(5), 2044–2059 (2009)

  66. Wei, Z., Luo, G., Yi, K., Du, X., Wen, J.-R.: Persistent data sketching. In: Proceedings of ACM SIGMOD, pp. 795–810. ACM (2015)

  67. The CAIDA anonymized 2016 Internet traces. http://www.caida.org/data/overview/. Accessed May 2018

  68. Real-life transactional dataset. http://fimi.ua.ac.be/data/. Accessed May 2018

  69. Rousskov, A., Wessels, D.: High-performance benchmarking with web polygraph. Softw.: Pract. Exp. 34(2), 187–211 (2004)

  70. Hash website. http://burtleburtle.net/bob/hash/evahash.html. Accessed May 2018

  71. Ji, M., Yan, J., Gu, S., Han, J., He, X., Zhang, W.V., Chen, Z.: Learning search tasks in queries and web pages via graph regularization. In: Proceedings of ACM SIGIR, pp. 55–64. ACM (2011)

  72. Goyal, A., Daumé III, H., Cormode, G.: Sketch algorithms for estimating point queries in NLP. In: EMNLP-CoNLL, pp. 1093–1103 (2012)

  73. Qiao, Y., Li, T., Chen, S.: One memory access bloom filters and their generalization. In: INFOCOM, 2011 Proceedings IEEE, pp. 1745–1753. IEEE (2011)

  74. Roy, P., Teubner, J., Alonso, G.: Efficient frequent item counting in multi-core hardware. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (2012)

Acknowledgements

This work is supported by the National Key Research and Development Program of China (2018YFB1004403, 2016YFB1000304) and NSFC (61672061, 61832001, and 61572039).

Author information

Corresponding author

Correspondence to Bin Cui.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yang, T., Jiang, J., Zhou, Y. et al. Fast and accurate stream processing by filtering the cold. The VLDB Journal 28, 735–763 (2019). https://doi.org/10.1007/s00778-019-00560-1

  • DOI: https://doi.org/10.1007/s00778-019-00560-1
