Skip to main content
Log in

A comprehensive study on fault tolerance in stream processing systems

  • Review Article
  • Published:
Frontiers of Computer Science Aims and scope Submit manuscript

Abstract

Stream processing has emerged as a useful technology for applications which require continuous and low latency computation on infinite streaming data. Since stream processing systems (SPSs) usually require distributed deployment on clusters of servers in face of large-scale of data, it is especially common to meet with failures of processing nodes or communication networks, but should be handled seriously considering service quality. A failed system may produce wrong results or become unavailable, resulting in a decline in user experience or even significant financial loss. Hence, a large amount of fault tolerance approaches have been proposed for SPSs. These approaches often have their own priorities on specific performance concerns, e.g., runtime overhead and recovery efficiency. Nevertheless, there is a lack of a systematic overview and classification of the state-of-the-art fault tolerance approaches in SPSs, which will become an obstacle for the development of SPSs. Therefore, we investigate the existing achievements and develop a taxonomy of the fault tolerance in SPSs. Furthermore, we propose an evaluation framework tailored for fault tolerance, demonstrate the experimental results on two representative open-sourced SPSs and exposit the possible disadvantages in current designs. Finally, we specify future research directions in this domain.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Naughton J F, DeWitt D J, Maier D, Aboulnaga A, Chen J J, Galanis L, Kang J, Krishnamurthy R, Luo Q, Prakash N, Ramamurthy R, Shanmugasundaram J, Tian F, Tufte K, Viglas S, Wang Y, Zhang C, Jackson B, Chen R. The niagara internet query system. IEEE Data Engineering Bulletin, 2001, 24(2): 27–33

    Google Scholar 

  2. Abadi D J, Carney D, Çetintemel U, Cherniack M, Convey C, Lee S, Stonebraker M, Tatbul N. B Zdonik S. Aurora: a new model and architecture for data stream management. The International Journal on Very Large Data Bases, 2003, 12(2): 120–139

    Article  Google Scholar 

  3. Motwani R, Arasu J, Widomand A, Babcock B, Babu S, Datar M, Olston G S, Mankuand C, Varma J, Rosensteinand R. Query processing, approximation, and resource management in a data stream management system. In: Proceedings of International Conference on Innovative Data Systems Research. 2003

  4. Chandrasekaran S, Cooper O, Deshpande A, J Franklin M, M Hellerstein J, Hong W, Krishnamurthy S, Madden S, Raman V, Reiss F A Shah M. Telegraphcq: continuous dataflow processing for an uncertain world. In: Proceedings of Conference on Innovative Data Systems Research. 2003

  5. Heinze T, Aniello L, Querzoni L, Jerzak Z. Cloud-based data stream processing. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2014. 238–245

  6. Cherniack M, Balakrishnan H, Balazinska M, Carney D, Çetintemel U, Xing Y, Zdonik S B. Scalable distributed stream processing. In: Proceedings of International Conference on Innovative Data Systems Research. 2003

  7. Abadi D J, Ahmad Y, Balazinska M, Çetintemel U, Cherniack M, Hwang J H, Lindner W, Maskey A, Rasin A, Ryvkina E, Tatbul N, Xing Y B Zdonik S. The design of the borealis stream processing engine. In: Proceedings of Conference on Innovative Data Systems Research. 2005. 277–289

  8. Dean J Ghemawat S. Mapreduce: simplified data processing on large clusters. In: Proceedings of USENIX Symposium on Operating System Design and Implementation. 2004, 137–150

  9. Toshniwal A, Taneja S, Shukla A, Ramasamy K, M Patel J, Kulkarni S, Jackson J, Gade K, Fu M S, Donham J, Bhagat N, Mittal S, V Ryaboy D. Storm@twitter. In: Proceedings of ACM International Conference on Management of Data. 2014, 147–156

  10. Zaharia M, Das T, Li H Y, Hunter T, Shenker S, Stoica I. Discretized streams: fault-tolerant streaming computation at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 423–438

  11. Carbone P, Katsifodimos A, Ewen S, Markl V, Haridi S, Tzoumas K. Apache flink™: stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 2015, 38(4): 28–38

    Google Scholar 

  12. Noghabi S A, Paramasivam K, Pan Y, Ramesh N, Bringhurst J, Gupta I, H Campbell R. Stateful scalable stream processing at linkedin. Proceedings of the VLDB Endowment, 2017, 10(12): 1634–1645

    Article  Google Scholar 

  13. Gill P, Jain N, Nagappan N. Understanding network failures in data centers: measurement, analysis, and implications. In: Proceedings of ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 2011, 350–361

  14. Dean J. Handling large datasets at google: current systems and future directions. In Data-Intensive Computing Symposium, 2008

  15. V Vishwanath K Nagappan N. Characterizing cloud computing hardware reliability. In: Proceedings of Symposium on Cloud Computing. 2010, 193–204

  16. Shankland S. Google spotlights data center inner workings. CNET News, 2008

  17. Raphael J R. In pictures: the worst cloud outages of 2013. PCWorld, 2013

  18. Shah M A, Hellerstein J M, Brewer E A. Highly-available, faulttolerant, parallel dataflows. In: Proceedings of ACM International Conference on Management of Data. 2004, 827–838

  19. Hwang J H, Balazinska M, Rasin A, Çetintemel U, Stonebraker M, Zdonik S B. High-availability algorithms for distributed stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2005, 779–790

  20. Balazinska M, Balakrishnan H, Madden S, Stonebraker M. Faulttolerance in the borealis distributed stream processing system. In: Proceedings of ACM International Conference on Management of Data. 2005, 13–24

  21. Hwang J H, Çetintemel U, B Zdonik S. Fast and highly-available stream processing over wide area networks. In: Proceedings of IEEE International Conference on Data Engineering. 2008, 804–813

  22. Hwang J H, Xing Y, Çetintemel U, B Zdonik S. A cooperative, self-configuring high-availability solution for stream processing. In: Proceedings of IEEE International Conference on Data Engineering. 2007, 176–185

  23. Kwon Y C, Balazinska M, G Greenberg A. Fault-tolerant stream processing using a distributed, replicated file system. Proceedings of the VLDB Endowment, 2008, 1(1): 574–585

    Article  Google Scholar 

  24. Gu Y, Zhang Z, Ye F, Yang H, Kim M, Lei H, Liu Z. An empirical study of high availability in stream processing systems. In: Proceedings of ACM/IFIP/USENIX International Middleware Conference. 2009

  25. Sebepou Z Magoutis K. Cec: continuous eventual checkpointing for data stream processing operators. In: Proceedings of International Conference on Dependable Systems and Networks. 2011, 145–156

  26. Fernandez R C, Migliavacca M, Kalyvianaki E, R Pietzuch P. Integrating scale out and fault tolerance in stream processing using operator state management. In: Proceedings of ACM International Conference on Management of Data. 2013, 725–736

  27. Koldehofe B, Mayer R, Ramachandran U, Rothermel K, Völz M. Rollback-recovery without checkpoints in distributed event processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2013, 27–38

  28. Huang Q Lee P. Toward high-performance distributed stream processing via approximate fault tolerance. Proceedings of the VLDB Endowment, 2016, 10(3): 73–84

    Article  Google Scholar 

  29. Zhang Z, Gu Y, Ye F, Yang H, Kim M, Lei H, Liu Z. A hybrid approach to high availability in stream processing systems. In: Proceedings of IEEE International Conference on Distributed Computing Systems. 2010, 138–148

  30. Heinze T, Zia M, Krahn R, Jerzak Z, Fetzer C. An adaptive replication scheme for elastic data stream processing systems. In: Proceedings of ACM International Conference on Distributed Event-Based Systems. 2015, 150–161

  31. Su L Zhou Y L. Tolerating correlated failures in massively parallel stream processing engines. In: Proceedings of IEEE International Conference on Data Engineering. 2016, 517–528

  32. Su L Zhou Y L. Passive and partially active fault tolerance for massively parallel stream processing engines. IEEE Transactions on Knowledge and Data Engineering, 2019, 31(1): 32–45

    Article  Google Scholar 

  33. Martin A, Smaneoto T, Dietze T, Brito A, Fetzer C. User-constraint and self-adaptive fault tolerance for event stream processing systems. In: Proceedings og IEEE/IFIP International Conference on Dependable Systems and Networks. 2015

  34. Schneider F B. Implementing fault-tolerant services using the state machine approach: a tutorial. ACM Computing Surveys, 1990, 22(4): 299–319

    Article  Google Scholar 

  35. Chandra T D, Griesemer R, Redstone J. Paxos made live: an engineering perspective. In: Proceedings of ACM Symposium on Principles of Distributed Computing. 2007, 398–407

  36. Elnozahy E N, Alvisi L, Wang Y M, Johnson D B. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 2002, 34(3): 375–408

    Article  Google Scholar 

  37. Balazinska M, Hwang J H, Shah M A. Fault-tolerance and high availability in data stream management systems. In Encyclopedia of Database Systems. 2009, 1109–1115

  38. Gradvohl A L S, Senger H, Arantes L, Sens P. Comparing distributed online stream processing systems considering fault tolerance issues. Journal of Emerging Technologies in Web Intelligence, 2014, 6(2): 174–179

    Article  Google Scholar 

  39. Nasir M A U. Fault tolerance for stream processing engines. arXiv preprint, abs/1605.00928, 2016

  40. Arasu A, Cherniack M, Galvez E F, Maier D, Maskey A, Ryvkina E, Stonebraker M, Tibbetts R. Linear road: a stream data management benchmark. In: Proceedings of ACM International Conference on Very Large Data Bases. 2004, 480–491

  41. Lu R R, Wu G, Xie B, Hu J T. Streambench: towards benchmarking modern distributed stream computing frameworks. In: Proceedings of IEEE/ACM International Conference on Utility and Cloud Computing. 2014, 69–78

  42. Chintapalli S, Dagit D, Evans B, Farivar R, Graves T, Holderbaugh M, Liu Z, Nusbaum K, Patil K, Peng B Y, Poulosky P. Benchmarking streaming computation engines: storm, flink and spark streaming. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium Workshops. 2016, 1789–1792

  43. Grier J. Extending the yahoo! streaming benchmark. Ververica, 2016

  44. Wang Y J. Stream processing systems benchmark: streambench. Master’s thesis, Aalto University, 2016

  45. Shukla A, Chaturvedi S, Simmhan Y. Riotbench: a real-time iot benchmark for distributed stream processing platforms. arXiv preprint, abs/1701.08530, 2017

  46. Bordin M V. A benchmark suite for distributed stream processing systems. PhD thesis, Universidade Federal do Rio Grande Do Su, 2017

  47. Karimov J, Rabl T, Katsifodimos A, Samarev R, Heiskanen H, Markl V. Benchmarking distributed stream data processing systems. In: Proceedings of IEEE International Conference on Data Engineering. 2018, 1507–1518

  48. Venkataraman S, Panda A, Ousterhout K, Armbrust M, Ghodsi A, Franklin M J, Recht B, Stoica I. Drizzle: fast and adaptable stream processing at scale. In: Proceedings of ACM Symposium on Operating Systems Principles. 2017, 374–389

  49. Akidau T, Bradshaw R, Chambers C, Chernyak S, FernándezMoctezuma R, Lax R, McVeety S, Mills D, Perry F, Schmidt E, Whittle S. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-oforder data processing. Proceedings of the VLDB Endowment, 2015, 8(12): 1792–1803

    Article  Google Scholar 

  50. Tucker P A, Maier D, Sheard T, Fegaras L. Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering, 2003, 15(3): 555–568

    Article  Google Scholar 

  51. Chandy K M Lamport L. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 1985, 3(1): 63–75

    Article  Google Scholar 

  52. Wang Y M, Chung P Y, Lin I J, Fuchs W K. Checkpoint space reclamation for uncoordinated checkpointing in message-passing systems. IEEE Transactions on Parallel and Distributed Systems, 1995, 6(5): 546–554

    Article  Google Scholar 

  53. Neumeyer L, Robbins B, Nair A, Kesari A. S4: distributed stream computing platform. In: Proceedings of IEEE International Conference on Data Mining Workshops. 2010, 170–177

  54. Randell B. System structure for software fault tolerance. IEEE Transactions on Software Engineering, 1975, 1(2): 221–232

    MathSciNet  Google Scholar 

  55. Murray D G, McSherry F, Isaacs R, Isard M, Barham P, Abadi M. Naiad: a timely dataflow system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2013, 439–455

  56. Akidau T, Balikov A, Bekiroglu K, Chernyak S, Haberman J, Lax R, McVeety S, Mills D, Nordstrom P, Whittle S. Millwheel: faulttolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 2013, 6(11): 1033–1044

    Article  Google Scholar 

  57. Qian Z P, He Y, Su C Z, Wu Z J, Zhu H Y, Zhang T Z, Zhou L D, Yu Y, Zhang Z. Timestream: reliable stream computation in the cloud. In: Proceedings of Eurosys Conference. 2013, 1–14

  58. Bhargava B K S Lian. Independent checkpointing and concurrent rollback for recovery in distributed systems — an optimistic approach. In: Proceedings of Symposium on Reliable Distributed Systems. 1988, 3–12

  59. Feldman S I Brown C B. Igor: a system for program debugging via reversible execution. In: Proceedings of ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging. 1988, 112–123

  60. Ghemawat S, Gobioff H, Leung S. The google file system. In: Proceedings of ACM Symposium on Operating Systems Principles. 2003, 29–43

  61. Shvachko K, Kuang H, Radia S, Chansler R. The hadoop distributed file system. In: Proceedings of IEEE Symposium on Mass Storage Systems and Technologies. 2010, 1–10

  62. Pohl C, Götze P, Sattler K. A cost model for data stream processing on modern hardware. In: Proceedings of International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures. 2017

  63. Zeuch S, Breß S, Rabl T, Monte B D, Karimov J, Lutz C, Renz M, Traub J, Markl V. Analyzing efficient stream processing on modern hardware. Proceedings of the VLDB Endowment, 2019, 12(5): 516–530

    Article  Google Scholar 

Download references

Acknowledgements

The work was supported by the National Key Research and Development Plan Project (2018YFB1003404).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Rong Zhang.

Additional information

Xiaotong Wang received her BE degree in computer science from Nanjing Normal University, China in 2012 and PhD degree in software engineering from East China Normal University in 2021, under the supervision of Professor Rong Zhang. Her research interests include data manage management and distributed stream processing.

Chunxi Zhang received her BE degree in software engineering from Shandong University of Science and Technology, China in 2011 and PhD degree in software engineering from East China Normal University in 2021. Her research interest is performance evaluation of databases, including intensive workload generation, contention simulation and isolation validation.

Junhua Fang received his PhD degree from East China Normal University, under the supervision of Professor Aoying Zhou. He is an associate professor in Advanced Data Analytics Group at School of Computer Science and Technology, Soochow Univerisity, China. His research focues on distributed database and parallel stream processing. He is a member of China Computer Federation.

Rong Zhang received her PhD degree in computer science from Fudan University, China in 2007. She joined East China Normal University since 2011 and is currently a professor there. From 2007 to 2010, she worked as an expert researcher with NICT, Japan. Her current research interests include knowledge management, distributed data management, and database benchmarking. She is a member of China Computer Federation.

Weining Qian received his PhD degree in computer science from Fudan University, China in 2004. His is a professor and dean of the School of Data Science and Engineering, East China Normal University, China. He is now serving as a standing committee member of Database Technology Committee of China Computer Federation, and committee member of ACM SIFMOD China Chapter. His research interests include scalable transaction processing, benchmarking big data systems, and management and analysis of massive datasets.

Aoying Zhou is a professor on computer science in East China Normal University (ECNU), where he is heading the School of Data Science and Engineering. Before joining ECNU in 2008, he worked for Fudan University at the Computer Science Department for 15 years. He is the winner of the National Science Fund for Distinguished Young Scholars supported by NSFC. He is now acting as a vice-director of ACM SIGMOD China and Database Technology Committee of China Computer Federation. He is serving as a member of the editorial boards of the VLDB Journal, the WWW Journal, and so on. His research interests include data management, in-memory cluster computing, big data benchmarking, and performance optimization.

Electronic supplementary material

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Wang, X., Zhang, C., Fang, J. et al. A comprehensive study on fault tolerance in stream processing systems. Front. Comput. Sci. 16, 162603 (2022). https://doi.org/10.1007/s11704-020-0248-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s11704-020-0248-x

Keywords

Navigation