A Proactive Fault Tolerance Scheme for Large Scale Storage Systems

Ji, Xinpu; Ma, Yuxiang; Ma, Rui; Li, Peng; Ma, Jingwei; Wang, Gang; Liu, Xiaoguang; Li, Zhongwei

doi:10.1007/978-3-319-27137-8_26


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9530)

Included in the following conference series:

International Conference on Algorithms and Architectures for Parallel Processing

Abstract

As drive failure rates in data centers rise, reactive fault tolerance mechanisms alone can hardly guarantee high reliability. Several hard drive failure prediction models that identify soon-to-fail drives in advance have therefore been proposed, but few researchers have applied these models to distributed systems to improve reliability.

This paper proposes SSM (Self-Scheduling Migration), which monitors drives' health status and uses the results produced by prediction models to migrate data from soon-to-fail drives to other drives in advance. We integrate a self-scheduling migration algorithm into distributed systems to transfer data off soon-to-fail drives. The algorithm dynamically adjusts migration rates according to each drive's severity level, which is derived from real-time prediction results. Moreover, it makes full use of available resources and balances the load when selecting migration source and destination drives. Migration bandwidth is allocated so as to minimize the side effects of migration on system services. We implement a prototype based on the Sheepdog distributed system. Migration causes performance drops of only 8% on read operations and 13% on write operations. Compared with reactive fault tolerance, SSM significantly improves system reliability and availability.
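The abstract does not give the details of the scheduling algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a hypothetical integer severity scale (0 means healthy, larger means closer to predicted failure), a fixed migration bandwidth budget split in proportion to severity, and a destination chosen as the least-loaded healthy drive. The names `Drive`, `allocate_migration_bandwidth`, and `pick_destination` are illustrative and do not come from the paper or from Sheepdog.

```python
from dataclasses import dataclass


@dataclass
class Drive:
    name: str
    severity: int = 0       # assumed scale: 0 = healthy, larger = more urgent
    load_mbps: float = 0.0  # current foreground I/O load on this drive


def allocate_migration_bandwidth(drives, budget_mbps):
    """Split a fixed migration budget among at-risk drives by severity."""
    at_risk = [d for d in drives if d.severity > 0]
    total = sum(d.severity for d in at_risk)
    if total == 0:
        return {}  # nothing is predicted to fail, so nothing migrates
    return {d.name: budget_mbps * d.severity / total for d in at_risk}


def pick_destination(drives):
    """Choose the healthy drive with the lowest current load as the target."""
    healthy = [d for d in drives if d.severity == 0]
    if not healthy:
        raise ValueError("no healthy destination drive available")
    return min(healthy, key=lambda d: d.load_mbps)


drives = [
    Drive("sda", severity=3, load_mbps=40.0),
    Drive("sdb", severity=1, load_mbps=10.0),
    Drive("sdc", load_mbps=25.0),
    Drive("sdd", load_mbps=5.0),
]

# Cap total migration traffic at 80 MB/s so foreground services keep headroom.
rates = allocate_migration_bandwidth(drives, budget_mbps=80.0)
target = pick_destination(drives)
print(rates, target.name)
```

Re-running the allocation each time the prediction model updates the severity levels would let the rates follow the drives' changing health, which is the kind of dynamic adjustment the abstract describes.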

Acknowledgments

This work is partially supported by NSF of China (grant numbers: 61373018, 11301288), Program for New Century Excellent Talents in University (grant number: NCET130301) and the Fundamental Research Funds for the Central Universities (grant number: 65141021).

Author information

Authors and Affiliations

College of Computer and Control Engineering, Nankai University, Tianjin, 300350, China
Xinpu Ji, Yuxiang Ma, Rui Ma, Peng Li, Jingwei Ma, Gang Wang, Xiaoguang Liu & Zhongwei Li

Corresponding authors

Correspondence to Xiaoguang Liu or Zhongwei Li.

Editor information

Editors and Affiliations

Central South University, Changsha, China
Guojun Wang
The University of Sydney, Sydney, New South Wales, Australia
Albert Zomaya
University of Murcia, Murcia, Murcia, Spain
Gregorio Martinez
Hunan University, Changsha, China
Kenli Li

About this paper

Cite this paper

Ji, X. et al. (2015). A Proactive Fault Tolerance Scheme for Large Scale Storage Systems. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science, vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_26

DOI: https://doi.org/10.1007/978-3-319-27137-8_26
Published: 16 December 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27136-1
Online ISBN: 978-3-319-27137-8
eBook Packages: Computer Science, Computer Science (R0)
