Adaptive Fragment-Based Parallel State Recovery for Stream Processing Systems


Abstract:

Today, large-scale cloud organizations are deploying datacenters and “edge” clusters globally to provide low-latency access to services. Running stream applications across geo-distributed sites is emerging as a daily requirement. However, existing efforts have predominantly centered on stateless stream processing, leaving another urgent trend, stateful stream processing, much less explored. A driving need is to store and update states during processing and, most importantly, to successfully recover large distributed states when faults and failures happen. Existing studies exhibit major limitations: (1) they mostly inherit MapReduce's “single master/many workers” architecture, where the central master can easily become a scalability bottleneck; (2) they offer state recovery mainly through three approaches (replication recovery, checkpointing recovery, and DStream-based lineage recovery), which are either slow, resource-expensive, or unable to handle multiple failures; and (3) they are not adaptive to heterogeneous hardware settings. We present A-FP4S, a novel adaptive fragment-based parallel state recovery mechanism for stream processing systems. A-FP4S organizes stream operators into a peer-to-peer overlay based on a distributed hash table and divides each node's local state into many fragments. These fragments are periodically stored on multiple neighbors of the node, ensuring that different sets of available fragments can reconstruct failed states in parallel. This mechanism is highly scalable with respect to the size of the lost state, significantly reduces failure recovery time, and can tolerate multiple node failures. A-FP4S adapts to heterogeneous hardware settings through automatic parameter tuning over phases. Compared to Apache Storm, A-FP4S achieves a 31.8% to 50.5% reduction in recovery latency. Large-scale experiments using real-world datasets demonstrate A-FP4S's attractive scalability and adaptivity properties.
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 34, Issue: 8, August 2023)
Page(s): 2464 - 2478
Date of Publication: 21 March 2023
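
The fragment idea described in the abstract can be illustrated with a minimal Python sketch. This is not the A-FP4S algorithm: the abstract does not specify how fragments are encoded or placed, so the sketch assumes plain replication of each fragment on a few successor nodes of a hash ring. All names (`ring_position`, `split_state`, `place_fragments`, `recover`) and the parameters `num_fragments` and `copies` are hypothetical and chosen only for illustration.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor


def ring_position(name: str) -> int:
    """Map a node name to a position on a hash ring (assumed SHA-1 based)."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)


def split_state(state: bytes, num_fragments: int) -> list[bytes]:
    """Cut a serialized state into num_fragments contiguous pieces."""
    size = -(-len(state) // num_fragments)  # ceiling division
    return [state[i * size:(i + 1) * size] for i in range(num_fragments)]


def place_fragments(owner, nodes, fragments, copies):
    """Store each fragment on `copies` distinct ring neighbors of `owner`.

    Assumption: plain replication; requires copies <= len(nodes) - 1.
    Returns {neighbor: {fragment_index: fragment_bytes}}.
    """
    ring = sorted(nodes, key=ring_position)
    start = ring.index(owner)
    # Neighbors in clockwise order, starting right after the owner.
    neighbors = [ring[(start + k) % len(ring)] for k in range(1, len(ring))]
    stores = {n: {} for n in neighbors}
    for idx, frag in enumerate(fragments):
        for c in range(copies):
            holder = neighbors[(idx + c) % len(neighbors)]
            stores[holder][idx] = frag
    return stores


def recover(stores, failed, num_fragments):
    """Rebuild the owner's state from surviving neighbors, fetching in parallel."""
    live = {n: held for n, held in stores.items() if n not in failed}

    def fetch(idx):
        # Any surviving holder of this fragment will do.
        for held in live.values():
            if idx in held:
                return held[idx]
        raise LookupError(f"fragment {idx} has no surviving copy")

    with ThreadPoolExecutor() as pool:
        # pool.map preserves fragment order, so joining restores the state.
        return b"".join(pool.map(fetch, range(num_fragments)))


if __name__ == "__main__":
    nodes = [f"op-{i}" for i in range(6)]
    state = bytes(range(256)) * 4
    frags = split_state(state, 8)
    stores = place_fragments("op-0", nodes, frags, copies=3)
    lost = set(list(stores)[:2])  # two neighbors fail along with op-0
    assert recover(stores, lost, len(frags)) == state
```

With three copies per fragment, any two neighbor failures still leave one holder for every fragment, so different subsets of survivors can serve the recovery. A real system would more likely use an erasure code than full copies to reduce storage overhead; that choice is outside what the abstract states.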