Abstract:
Increasing algorithmic complexity and dataset sizes necessitate running many graph-parallel algorithms on networked machines, which in turn makes fault tolerance a must as the number of machines grows. Unfortunately, existing large-scale graph-parallel systems usually adopt a distributed checkpoint mechanism for fault tolerance, which incurs not only notable performance overhead but also lengthy recovery time. This paper observes that the vertex replicas created for distributed graph computation can be naturally extended for fast in-memory recovery of graph states. It describes Imitator, a new fault tolerance mechanism that maintains vertex states cheaply by propagating them to their replicas during normal message exchanges, and that provides fast in-memory reconstruction of failed vertices from replicas on other machines. Imitator has been implemented on Cyclops with edge-cut and on PowerLyra with vertex-cut. Evaluation on a 50-node EC2-like cluster shows that Imitator incurs an average performance overhead of 1.37 percent for Cyclops and 2.32 percent for PowerLyra (ranging from -0.6 to 3.7 percent), and can recover from failures of more than one million vertices in less than 3.4 seconds.
Published in: IEEE Transactions on Parallel and Distributed Systems (Volume: 29, Issue: 7, 01 July 2018)