SA-RSR: a read-optimal data recovery strategy for XOR-coded distributed storage systems

SA-RSR: a read-optimal data recovery method for XOR-based erasure-coded distributed storage systems (translated from the Chinese title)

  • Research Articles
Frontiers of Information Technology & Electronic Engineering

Abstract

To ensure the reliability and availability of data, redundancy strategies are always required for distributed storage systems. Erasure coding, one of the representative redundancy strategies, has the advantage of low storage overhead, which facilitates its employment in distributed storage systems. Among the various erasure coding schemes, XOR-based erasure codes are becoming popular due to their high computing speed. When a single-node failure occurs in such coding schemes, a process called data recovery takes place to retrieve the failed node’s lost data from surviving nodes. However, data transmission during the data recovery process usually requires a considerable amount of time. Current research has focused mainly on reducing the amount of data needed for data recovery to reduce the time required for data transmission, but it has encountered problems such as significant complexity and local optima. In this paper, we propose a random search recovery algorithm, named SA-RSR, to speed up single-node failure recovery of XOR-based erasure codes. SA-RSR uses a simulated annealing technique to search for an optimal recovery solution that reads and transmits a minimum amount of data. In addition, this search process can be done in polynomial time. We evaluate SA-RSR with a variety of XOR-based erasure codes in simulations and in a real storage system, Ceph. Experimental results in Ceph show that SA-RSR reduces the amount of data required for recovery by up to 30.0% and improves the performance of data recovery by up to 20.36% compared to the conventional recovery method.
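The abstract does not give the details of SA-RSR, so the following is only a minimal sketch of the general idea: simulated annealing over the choice of parity equation used to rebuild each lost symbol, with the number of distinct symbols read as the cost. The toy layout (a small array code with row and diagonal parities, loosely modelled on RDP-style codes), the function names `build_options`, `read_cost`, and `anneal`, and all parameter values (`t0`, `alpha`, `steps_per_temp`, `t_min`) are assumptions made for illustration, not the authors' algorithm or settings.

```python
import math
import random


def build_options(p=5):
    """Hypothetical toy layout: p-1 rows, columns 0..p-2 hold data,
    column p-1 holds row parity, column p holds diagonal parity.
    Column 0 is assumed lost. For each lost symbol (r, 0), return the
    two sets of surviving symbols that could rebuild it."""
    options = []
    for r in range(p - 1):
        # Row equation: every other symbol in row r, including row parity.
        row_eq = frozenset((r, c) for c in range(1, p))
        # Diagonal equation: symbols with (row + col) % p == r, plus the
        # stored diagonal parity (r, p). Row p-1 is imaginary, so skip it.
        diag_eq = {(r, p)}
        for c in range(1, p):
            rr = (r - c) % p
            if rr != p - 1:
                diag_eq.add((rr, c))
        options.append([row_eq, frozenset(diag_eq)])
    return options


def read_cost(options, choice):
    """Number of distinct surviving symbols read for a choice vector."""
    needed = set()
    for opts, c in zip(options, choice):
        needed |= opts[c]
    return len(needed)


def anneal(options, t0=2.0, alpha=0.95, steps_per_temp=20, t_min=1e-3, seed=0):
    """Generic simulated annealing with geometric cooling (assumed schedule)."""
    rng = random.Random(seed)
    current = [rng.randrange(len(o)) for o in options]
    cur_cost = read_cost(options, current)
    best, best_cost = list(current), cur_cost
    t = t0
    while t > t_min:
        for _ in range(steps_per_temp):
            # Neighbour: re-pick the equation used for one lost symbol.
            i = rng.randrange(len(options))
            cand = list(current)
            cand[i] = rng.randrange(len(options[i]))
            delta = read_cost(options, cand) - cur_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t) to escape local optima.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, cur_cost = cand, cur_cost + delta
                if cur_cost < best_cost:
                    best, best_cost = list(current), cur_cost
        t *= alpha
    return best, best_cost


options = build_options(p=5)
# Conventional recovery in this toy model: use the row equation everywhere.
conventional = read_cost(options, [0] * len(options))
choice, cost = anneal(options)
print("conventional reads:", conventional, "annealed reads:", cost)
```

In this toy model, mixing row and diagonal equations lets several lost symbols share the same surviving symbols, which is where the reduction in reads comes from. The search space here is small enough to enumerate, so the sketch only illustrates the mechanics; a random search becomes worthwhile when the number of candidate equation combinations grows too large for exhaustive search.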

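Abstract (translated from Chinese)

Redundancy strategies are often used in distributed storage systems to ensure the reliability and availability of data. Erasure coding is a representative redundancy strategy with the advantage of low storage overhead, which has promoted its use in distributed storage systems. Among the various erasure coding schemes, XOR-based erasure codes are becoming increasingly popular thanks to their high computational efficiency. In a storage system that uses XOR-based erasure codes, a single-node failure triggers data recovery, which downloads data from the surviving nodes and then rebuilds the data of the failed node. However, data transmission during recovery usually takes a considerable amount of time. Current research focuses mainly on shortening transmission time by reducing the amount of data needed for recovery, but it suffers from problems such as high complexity and local optima. This paper proposes a random search recovery algorithm, SA-RSR, that accelerates single-node failure recovery for XOR-based erasure codes. SA-RSR uses simulated annealing to find an optimal recovery scheme that reads and transmits the least data, and the search can be completed in polynomial time. Finally, to validate the method, it is evaluated in simulations with a variety of XOR-based erasure codes and in a real storage system, Ceph. Experimental results show that, compared with the conventional recovery method, SA-RSR reduces the amount of data read and transmitted by 30% and improves data recovery performance by 20.36%.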

Author information

Contributions

Xingjun ZHANG designed the research. Ningjing LIANG and Yunfei LIU processed the data. Ningjing LIANG drafted the paper. Changjiang ZHANG helped organize the paper. Ningjing LIANG, Changjiang ZHANG, and Yang LI revised and finalized the paper.

Corresponding author

Correspondence to Xingjun Zhang (张兴军).

Additional information

Compliance with ethics guidelines

Xingjun ZHANG, Ningjing LIANG, Yunfei LIU, Changjiang ZHANG, and Yang LI declare that they have no conflict of interest.

Project supported by the National Natural Science Foundation of China (No. 62172327).

About this article

Cite this article

Zhang, X., Liang, N., Liu, Y. et al. SA-RSR: a read-optimal data recovery strategy for XOR-coded distributed storage systems. Front Inform Technol Electron Eng 23, 858–875 (2022). https://doi.org/10.1631/FITEE.2100242

  • DOI: https://doi.org/10.1631/FITEE.2100242
