Abstract
Stream processing has emerged as a useful technology for applications which require continuous and low latency computation on infinite streaming data. Since stream processing systems (SPSs) usually require distributed deployment on clusters of servers in face of large-scale of data, it is especially common to meet with failures of processing nodes or communication networks, but should be handled seriously considering service quality. A failed system may produce wrong results or become unavailable, resulting in a decline in user experience or even significant financial loss. Hence, a large amount of fault tolerance approaches have been proposed for SPSs. These approaches often have their own priorities on specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, there is a lack of a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs, which will become an obstacle for the development of SPSs. Therefore, we investigate the existing achievements and develop a taxonomy of the fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, demonstrate the experimental results on two representative open-sourced SPSs and exposit the possible disadvantages in current designs. Finally, we specify future research directions in this domain.
Similar content being viewed by others
References
Naughton J F, DeWitt D J, Maier D, Aboulnaga A, Chen J J, Galanis L, Kang J, Krishnamurthy R, Luo Q, Prakash N, Ramamurthy R, Shanmugasundaram J, Tian F, Tufte K, Viglas S, Wang Y, Zhang C, Jackson B, Chen R. The niagara internet query system. IEEE Data Engineering Bulletin, 2001, 24(2): 27–33
Abadi D J, Carney D, Çetintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N. B Zdonik S. Aurora: a new model and architecture for data stream management. The International Journal on Very Large Data Bases, 2003, 12(2): 120–139
Motwani R, Arasu J, Widomand A, Babcock B, Babu S, Datar M, Olston G S, Mankuand C, Varma J, Rosensteinand R. Query processing, approximation, and resource management in a data stream management system. In: Proceedings of International Conference on Innovative Data Systems Research. 2003
Chandrasekaran S, Cooper O, Deshpande A, J Franklin M, M Hellerstein J, Hong W, Krishnamurthy S, Madden S, Raman V, Reiss F A Shah M. Telegraphcq: continuous dataflow processing for an uncertain world. In: Proceedings of Conference on Innovative Data Systems Research. 2003
Heinze T, Aniello L, Querzoni L, Jerzak Z. Cloud-based data stream processing. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2014. 238–245
Cherniack M, Balakrishnan H, Balazinska M, Carney D, Çetintemel U, Xing Y, Zdonik S B. Scalable distributed stream processing. In: Proceedings of International Conference on Innovative Data Systems Research. 2003
Abadi D J, Ahmad Y, Balazinska M, Çetintemel U, Cherniack M, Hwang J H, Lindner W, Maskey A, Rasin A, Ryvkina E, Tatbul N, Xing Y B Zdonik S. The design of the borealis stream processing engine. In: Proceedings of Conference on Innovative Data Systems Research. 2005. 277–289
Dean J Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proceedings of USENIX Symposium on Operating System Design and Implementation. 2004, 137–150
Toshniwal A, Taneja S, Shukla A, Ramasamy K, M Patel J, Kulkarni S, Jackson J, Gade K, Fu M S, Donham J, Bhagat N, Mittal S, V Ryaboy D. Storm@twitter. In: Proceedings of ACM International Conference on Management of Data. 2014, 147–156
Zaharia M, Das T, Li H Y, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 423–438
Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache flink™: stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 2015, 38(4): 28–38
Noghabi S A, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, H Campbell R. Stateful scalable stream processing at linkedin. Proceedings of the VLDB Endowment, 2017, 10(12): 1634–1645
Gill P, Jain N, Nagappan N. Understanding network failures in data centers: measurement, analysis, and implications. In: Proceedings of ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 2011, 350–361
Dean J. Handling large datasets at google: current systems and future directions. In Data-Intensive Computing Symposium, 2008
V Vishwanath K Nagappan N. Characterizing cloud computing hardware reliability. In: Proceedings of Symposium on Cloud Computing. 2010, 193–204
Shankland S. Google spotlights data center inner workings. CNET News, 2008
Raphael J R. In pictures: the worst cloud outages of 2013. PCWorld, 2013
Shah M A, Hellerstein J M, Brewer E A. Highly-available, faulttolerant, parallel dataflows. In: Proceedings of ACM International Conference on Management of Data. 2004, 827–838
Hwang J H, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S B. High-availability algorithms for distributed stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2005, 779–790
Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Faulttolerance in the borealis distributed stream processing system. In: Proceedings of ACM International Conference on Management of Data. 2005, 13–24
Hwang J H, Çetintemel U, B Zdonik S. Fast and highly-available stream processing over wide area networks. In: Proceedings of IEEE International Conference on Data Engineering. 2008, 804–813
Hwang J H, Xing Y, Çetintemel U, B Zdonik S. A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2007, 176–185
Kwon Y C, Balazinska M, G Greenberg A. Fault-tolerant stream processing using a distributed, replicated file system. Proceedings of the VLDB Endowment, 2008, 1(1): 574–585
Gu Y, Zhang Z, Ye F, Yang H, Kim M, Lei H, Liu Z. An empirical study of high availability in stream processing systems. In: Proceedings of ACM/IFIP/USENIX International Middleware Conference. 2009
Sebepou Z Magoutis K. Cec: continuous eventual checkpointing for data stream processing operators. In: Proceedings of International Conference on Dependable Systems and Networks. 2011, 145–156
Fernandez R C, Migliavacca M, Kalyvianaki E, R Pietzuch P. Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of ACM International Conference on Management of Data. 2013, 725–736
Koldehofe B, Mayer R, Ramachandran U, Rothermel K, Völz M. Rollback-recovery without checkpoints in distributed event processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2013, 27–38
Huang Q Lee P. Toward high-performance distributed stream processing via approximate fault tolerance. Proceedings of the VLDB Endowment, 2016, 10(3): 73–84
Zhang Z, Gu Y, Ye F, Yang H, Kim M, Lei H, Liu Z. A hybrid approach to high availability in stream processing systems. In: Proceedings of IEEE International Conference on Distributed Computing Systems. 2010, 138–148
Heinze T, Zia M, Krahn R, Jerzak Z, Fetzer C. An adaptive replication scheme for elastic data stream processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2015, 150–161
Su L Zhou Y L. Tolerating correlated failures in massively parallel stream processing engines. In: Proceedings of IEEE International Conference on Data Engineering. 2016, 517–528
Su L Zhou Y L. Passive and partially active fault tolerance for massively parallel stream processing engines. IEEE Transactions on Knowledge and Data Engineering, 2019, 31(1): 32–45
Martin A, Smaneoto T, Dietze T, Brito A, Fetzer C. User-constraint and self-adaptive fault tolerance for event stream processing systems. In: Proceedings og IEEE/IFIP International Conference on Dependable Systems and Networks. 2015
Schneider F B. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 1990, 22(4): 299–319
Chandra T D, Griesemer R, Redstone J. Paxos made live: an engineering perspective. In: Proceedings of ACM Symposium on Principles of Distributed Computing. 2007, 398–407
Elnozahy E N, Alvisi L, Wang Y M, Johnson D B. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 2002, 34(3): 375–408
Balazinska M, Hwang J H, Shah M A. Fault-tolerance and high availability in data stream management systems. In Encyclopedia of Database Systems. 2009, 1109–1115
Gradvohl A L S, Senger H, Arantes L, Sens P. Comparing distributed online stream processing systems considering fault tolerance issues. Journal of Emerging Technologies in Web Intelligence, 2014, 6(2): 174–179
Nasir M A U. Fault tolerance for stream processing engines. arXiv preprint, abs/1605.00928, 2016
Arasu A, Cherniack M, Galvez E F, Maier D, Maskey A, Ryvkina E, Stonebraker M, Tibbetts R. Linear road: a stream data management benchmark. In: Proceedings of ACM International Conference on Very Large Data Bases. 2004, 480–491
Lu R R, Wu G, Xie B, Hu J T. Streambench: towards benchmarking modern distributed stream computing frameworks. In: Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing. 2014, 69–78
Chintapalli S, Dagit D, Evans B, Farivar R, Graves T, Holderbaugh M, Liu Z, Nusbaum K, Patil K, Peng B Y, Poulosky P. Benchmarking streaming computation engines: storm, flink and spark streaming. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium Workshops. 2016, 1789–1792
Grier J. Extending the yahoo! streaming benchmark. Ververica, 2016
Wang Y J. Stream processing systems benchmark: streambench. Master’s thesis, Aalto University, 2016
Shukla A, Chaturvedi S, Simmhan Y. Riotbench: a real-time iot benchmark for distributed stream processing platforms. arXiv preprint, abs/1701.08530, 2017
Bordin M V. A benchmark suite for distributed stream processing systems. PhD thesis, Universidade Federal do Rio Grande Do Su, 2017
Karimov J, Rabl T, Katsifodimos A, Samarev R, Heiskanen H, Markl V. Benchmarking distributed stream data processing systems. In: Proceedings of IEEE International Conference on Data Engineering. 2018, 1507–1518
Venkataraman S, Panda A, Ousterhout K, Armbrust M, Ghodsi A, Franklin M J, Recht B, Stoica I. Drizzle: fast and adaptable stream processing at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2017, 374–389
Akidau T, Bradshaw R, Chambers C, Chernyak S, FernándezMoctezuma R, Lax R, McVeety S, Mills D, Perry F, Schmidt E, Whittle S. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-oforder data processing. Proceedings of the VLDB Endowment, 2015, 8(12): 1792–1803
Tucker P A, Maier D, Sheard T, Fegaras L. Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3): 555–568
Chandy K M Lamport L. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 1985, 3(1): 63–75
Wang Y M, Chung P Y, Lin I J, Fuchs W K. Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1995, 6(5): 546–554
Neumeyer L, Robbins B, Nair A, Kesari A. S4: distributed stream computing platform. In: Proceedings of IEEE International Conference on Data Mining Workshops. 2010, 170–177
Randell B. System structure for software fault tolerance. IEEE Transactions on Software Engineering, 1975, 1(2): 221–232
Murray D G, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: a timely dataflow system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 439–455
Akidau T, Balikov A, Bekiroglu K, Chernyak S, Haberman J, Lax R, McVeety S, Mills D, Nordstrom P, Whittle S. Millwheel: faulttolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 2013, 6(11): 1033–1044
Qian Z P, He Y, Su C Z, Wu Z J, Zhu H Y, Zhang T Z, Zhou L D, Yu Y, Zhang Z. Timestream: reliable stream computation in the cloud. In: Proceedings of Eurosys Conference. 2013, 1–14
Bhargava B K S Lian. Independent checkpointing and concurrent rollback for recovery in distributed systems — an optimistic approach. In: Proceedings of Symposium on Reliable Distributed Systems. 1988, 3–12
Feldman S I Brown C B. Igor: a system for program debugging via reversible execution. In: Proceedings of ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging. 1988, 112–123
Ghemawat S, Gobioff H, Leung S. The google file system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2003, 29–43
Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of IEEE Symposium on Mass Storage Systems and Technologies. 2010, 1–10
Pohl C, Götze P, Sattler K. A cost model for data stream processing on modern hardware. In: Proceedings of International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures. 2017
Zeuch S, Breß S, Rabl T, Monte B D, Karimov J, Lutz C, Renz M, Traub J, Markl V. Analyzing efficient stream processing on modern hardware. Proceedings of the VLDB Endowment, 2019, 12(5): 516–530
Acknowledgements
The work was supported by the National Key Research and Development Plan Project (2018YFB1003404).
Author information
Authors and Affiliations
Corresponding author
Additional information
Xiaotong Wang received her BE degree in computer science from Nanjing Normal University, China in 2012 and PhD degree in software engineering from East China Normal University in 2021, under the supervision of Professor Rong Zhang. Her research interests include data manage management and distributed stream processing.
Chunxi Zhang received her BE degree in software engineering from Shandong University of Science and Technology, China in 2011 and PhD degree in software engineering from East China Normal University in 2021. Her research interest is performance evaluation of databases, including intensive workload generation, contention simulation and isolation validation.
Junhua Fang received his PhD degree from East China Normal University, under the supervision of Professor Aoying Zhou. He is an associate professor in Advanced Data Analytics Group at School of Computer Science and Technology, Soochow Univerisity, China. His research focues on distributed database and parallel stream processing. He is a member of China Computer Federation.
Rong Zhang received her PhD degree in computer science from Fudan University, China in 2007. She joined East China Normal University since 2011 and is currently a professor there. From 2007 to 2010, she worked as an expert researcher with NICT, Japan. Her current research interests include knowledge management, distributed data management, and database benchmarking. She is a member of China Computer Federation.
Weining Qian received his PhD degree in computer science from Fudan University, China in 2004. His is a professor and dean of the School of Data Science and Engineering, East China Normal University, China. He is now serving as a standing committee member of Database Technology Committee of China Computer Federation, and committee member of ACM SIFMOD China Chapter. His research interests include scalable transaction processing, benchmarking big data systems, and management and analysis of massive datasets.
Aoying Zhou is a professor on computer science in East China Normal University (ECNU), where he is heading the School of Data Science and Engineering. Before joining ECNU in 2008, he worked for Fudan University at the Computer Science Department for 15 years. He is the winner of the National Science Fund for Distinguished Young Scholars supported by NSFC. He is now acting as a vice-director of ACM SIGMOD China and Database Technology Committee of China Computer Federation. He is serving as a member of the editorial boards of the VLDB Journal, the WWW Journal, and so on. His research interests include data management, in-memory cluster computing, big data benchmarking, and performance optimization.
Electronic supplementary material
Rights and permissions
About this article
Cite this article
Wang, X., Zhang, C., Fang, J. et al. A comprehensive study on fault tolerance in stream processing systems. Front. Comput. Sci. 16, 162603 (2022). https://doi.org/10.1007/s11704-020-0248-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11704-020-0248-x