Abstract
Failures in large-scale clusters can cause severe performance degradation and compromise system availability, so fault tolerance is critical for distributed stream processing systems (DSPSs). Many fault tolerance approaches have been proposed over the last decade, but no systematic work evaluates and compares them in detail. Previous work either evaluates overall performance during failure-free runtime or merely measures throughput loss when a failure happens. This paper is the first to propose an evaluation framework customized for quantitatively comparing the runtime overhead and recovery efficiency of fault tolerance mechanisms in DSPSs. We define three typical configurable workloads that are widely adopted in previous DSPS evaluations, and construct five workload suites from them to investigate the effects of different factors on fault tolerance performance. We carry out extensive experiments on two well-known open-source DSPSs. The results demonstrate the performance gap between the two systems, which is useful for choosing and evolving fault tolerance approaches.
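As a rough illustration of the two quantities the abstract names, the sketch below computes a runtime overhead and a recovery time from throughput measurements. The definitions used here (relative throughput loss against a run without fault tolerance, and the time until throughput returns to within a tolerance of its steady value) are assumptions made for illustration, not the paper's own metric definitions, and the function names and the `tolerance` parameter are hypothetical.

```python
from typing import Optional, Sequence


def runtime_overhead(baseline_throughput: float, ft_throughput: float) -> float:
    """Relative throughput loss from enabling fault tolerance (assumed definition).

    0.0 means no overhead; 0.1 means throughput dropped by 10%.
    """
    if baseline_throughput <= 0:
        raise ValueError("baseline_throughput must be positive")
    return (baseline_throughput - ft_throughput) / baseline_throughput


def recovery_time(
    timestamps: Sequence[float],
    throughputs: Sequence[float],
    failure_time: float,
    steady_throughput: float,
    tolerance: float = 0.05,  # hypothetical parameter: allowed deviation from steady state
) -> Optional[float]:
    """Time from failure injection until throughput is back near its steady value.

    Returns 0.0 if throughput never dips below the threshold after the failure,
    and None if it dips and never recovers within the trace.
    """
    threshold = steady_throughput * (1.0 - tolerance)
    dipped = False
    for t, tp in zip(timestamps, throughputs):
        if t < failure_time:
            continue  # ignore samples taken before the failure
        if tp < threshold:
            dipped = True  # the system is still degraded
        elif dipped:
            return t - failure_time  # first healthy sample after the dip
    return 0.0 if not dipped else None


if __name__ == "__main__":
    # Made-up sample values, for demonstration only.
    print(runtime_overhead(100_000.0, 90_000.0))
    print(recovery_time([0, 1, 2, 3, 4, 5], [100, 100, 10, 40, 97, 100], 2, 100))
```

Separating the two functions mirrors the distinction between failure-free cost and behavior after a failure; a real evaluation would also need to fix how the steady-state throughput and the failure injection time are determined.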
Notes
- 1. Circles in the same column are the parallel instances of an operator.
Acknowledgement
This work is supported by the National Key Research and Development Plan Project (No. 2018YFB1003402) and the National Science Foundation of China (NSFC) (Nos. 61672233, 61802273). Ke Shu is supported by PingCAP.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, X. et al. (2020). Evaluating Fault Tolerance of Distributed Stream Processing Systems. In: Wang, X., Zhang, R., Lee, YK., Sun, L., Moon, YS. (eds) Web and Big Data. APWeb-WAIM 2020. Lecture Notes in Computer Science, vol 12318. Springer, Cham. https://doi.org/10.1007/978-3-030-60290-1_8
DOI: https://doi.org/10.1007/978-3-030-60290-1_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60289-5
Online ISBN: 978-3-030-60290-1
eBook Packages: Computer Science, Computer Science (R0)