DOI: 10.1145/2675743.2771831
research-article

An adaptive replication scheme for elastic data stream processing systems

Published: 24 June 2015

Abstract

A major challenge for cloud-based systems is fault tolerance: they must cope with an increasing probability of faults in cloud environments. This is especially true for in-memory computing solutions such as data stream processing systems, where a single host failure can cause unrecoverable information loss.
State-of-the-art data streaming systems ensure fault tolerance through either active replication or upstream backup, which incur a high resource overhead or a high recovery time, respectively. This paper combines the two mechanisms in one system to minimize the number of violations of a user-defined recovery time threshold and to reduce overall resource consumption compared to active replication. The system dynamically switches individual operators between the two replication techniques based on current workload characteristics. Our approach is implemented as an extension of an elastic data stream processing engine, which can reduce the number of hosts used thanks to the smaller replication overhead. In a real-world evaluation, we show that our system reduces resource usage by up to 19% compared to an active replication scheme.
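
The abstract does not spell out the switching rule, so the following is only a minimal sketch of what a per-operator decision could look like. It assumes a simple, hypothetical cost model in which recovery under upstream backup takes a fixed restart time plus the time to replay buffered tuples. The names `OperatorStats`, `estimated_recovery_seconds`, `choose_scheme`, and `plan` are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Scheme(Enum):
    ACTIVE_REPLICATION = "active_replication"
    UPSTREAM_BACKUP = "upstream_backup"


@dataclass
class OperatorStats:
    """Hypothetical workload statistics for one operator."""
    name: str
    buffered_tuples: int    # tuples an upstream buffer would have to replay
    replay_rate: float      # tuples per second the operator can reprocess
    restart_seconds: float  # fixed cost to redeploy the operator


def estimated_recovery_seconds(op: OperatorStats) -> float:
    # Assumed model: restart cost plus time to replay the buffered tuples.
    return op.restart_seconds + op.buffered_tuples / op.replay_rate


def choose_scheme(op: OperatorStats, threshold_seconds: float) -> Scheme:
    # Use the cheaper upstream backup unless its estimated recovery time
    # would violate the user-defined threshold.
    if estimated_recovery_seconds(op) > threshold_seconds:
        return Scheme.ACTIVE_REPLICATION
    return Scheme.UPSTREAM_BACKUP


def plan(operators: list[OperatorStats], threshold_seconds: float) -> dict[str, Scheme]:
    # Re-running this as workload statistics change gives a dynamic switch.
    return {op.name: choose_scheme(op, threshold_seconds) for op in operators}


if __name__ == "__main__":
    ops = [
        OperatorStats("filter", buffered_tuples=1_000, replay_rate=10_000.0, restart_seconds=1.0),
        OperatorStats("join", buffered_tuples=500_000, replay_rate=10_000.0, restart_seconds=1.0),
    ]
    for name, scheme in plan(ops, threshold_seconds=5.0).items():
        print(name, scheme.value)
```

In this sketch, only operators whose estimated recovery time exceeds the threshold pay for a second replica, which is one way the resource savings described above could arise. The actual system may use a different estimator and placement logic.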


Published In

DEBS '15: Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems
June 2015
385 pages
ISBN:9781450332866
DOI:10.1145/2675743

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. active replication
  2. distributed data stream processing
  3. fault tolerance
  4. upstream backup

Qualifiers

  • Research-article

Conference

DEBS '15

Acceptance Rates

Overall Acceptance Rate 145 of 583 submissions, 25%

Cited By

  • (2023) A survey on the evolution of stream processing systems. The VLDB Journal 33:2 (507-541). DOI: 10.1007/s00778-023-00819-8. Online publication date: 22-Nov-2023
  • (2022) Dalton. Proceedings of the VLDB Endowment 16:3 (491-504). DOI: 10.14778/3570690.3570699. Online publication date: 1-Nov-2022
  • (2022) A comprehensive study on fault tolerance in stream processing systems. Frontiers of Computer Science: Selected Publications from Chinese Universities 16:2. DOI: 10.1007/s11704-020-0248-x. Online publication date: 1-Apr-2022
  • (2021) Cost-aware & Fault-tolerant Geo-distributed Edge Computing for Low-latency Stream Processing. 2021 IEEE 7th International Conference on Collaboration and Internet Computing (CIC) (117-124). DOI: 10.1109/CIC52973.2021.00026. Online publication date: Dec-2021
  • (2021) Resilient Stream Processing in Edge Computing. 2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid) (504-513). DOI: 10.1109/CCGrid51090.2021.00060. Online publication date: May-2021
  • (2021) A Performance Analysis of Fault Recovery in Stream Processing Frameworks. IEEE Access 9 (93745-93763). DOI: 10.1109/ACCESS.2021.3093208. Online publication date: 2021
  • (2021) TCEP: Transitions in operator placement to adapt to dynamic network environments. Journal of Computer and System Sciences 122 (94-125). DOI: 10.1016/j.jcss.2021.05.003. Online publication date: Dec-2021
  • (2021) Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications. Mobile Networks and Applications. DOI: 10.1007/s11036-020-01729-7. Online publication date: 6-Jan-2021
  • (2020) SPEAr: Expediting Stream Processing with Accuracy Guarantees. 2020 IEEE 36th International Conference on Data Engineering (ICDE) (1105-1116). DOI: 10.1109/ICDE48307.2020.00100. Online publication date: Apr-2020
  • (2020) Evaluating Fault Tolerance of Distributed Stream Processing Systems. Web and Big Data (101-116). DOI: 10.1007/978-3-030-60290-1_8. Online publication date: 14-Oct-2020