Skip to main content

A Proactive Fault Tolerance Scheme for Large Scale Storage Systems

  • Conference paper
  • First Online:
Algorithms and Architectures for Parallel Processing (ICA3PP 2015)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9530))

Abstract

Facing increasingly high failure rate of drives in data centers, reactive fault tolerance mechanisms alone can hardly guarantee high reliability. Therefore, some hard drive failure prediction models that can predict soon-to-fail drives in advance have been raised. But few researchers applied these models to distributed systems to improve the reliability.

This paper proposes SSM (Self-Scheduling Migration) which can monitor drives’ health status and reasonably migrate data from the soon-to-fail drives to others in advance using the results produced by the prediction models. We adopt a self-scheduling migration algorithm into distributed systems to transfer the data from soon-to-fail drives. This algorithm can dynamically adjust the migration rates according to drives’ severity level, which is generated from the realtime prediction results. Moreover, the algorithm can make full use of the resources and balance the load when selecting migration source and destination drives. On the premise of minimizing the side effects of migration to system services, the migration bandwidth is reasonably allocated. We implement a prototype based on Sheepdog distributed system. The system only sees respectively \(8\,\%\) and \(13\,\%\) performance drops on read and write operations caused by migration. Compared with reactive fault tolerance, SSM significantly improves system reliability and availability.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Vishwanath, K.V., Nagappan, N.: Characterizing cloud computing hardware reliability. In: Proceedings of the 1st ACM Symposium on Cloud Computing, pp. 193–204. ACM (2010)

    Google Scholar 

  2. Bairavasundaram, L.N., Goodson, G.R., Pasupathy, S., Schindler, J.: An analysis of latent sector errors in disk drives. ACM SIGMETRICS Perform. Eval. Rev. 35, 289–300 (2007)

    Article  Google Scholar 

  3. Bairavasundaram, L.N., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H., Goodson, G.R., Schroeder, B.: An analysis of data corruption in the storage stack. ACM Trans. Storage (TOS) 4(3), 8 (2008)

    Google Scholar 

  4. Allen, B.: Monitoring hard disks with smart. Linux J. (117), 74–77 (2004)

    Google Scholar 

  5. Li, J., Ji, X., Zhu, B., Wang, G., Liu, X.: Hard drive failure prediction using classication and regression trees. In: DSN (2014)

    Google Scholar 

  6. Qin, A., Hu, D., Liu, J., Yang, W., Tan, D.: Fatman: cost-saving and reliable archival storage based on volunteer resources. Proc. VLDB Endow. 7(13), 1748–1753 (2014)

    Article  Google Scholar 

  7. Wu, S., Jiang, H., Mao, B.: Proactive data migration for improved storage availability in large-scale data centers (2014)

    Google Scholar 

  8. Patterson, D.A., Gibson, G., Katz, R.H.: A case for redundant arrays of inexpensive disks (RAID) 17(3), 109–116 (1988)

    Google Scholar 

  9. Blaum, M., Brady, J., Bruck, J., Menon, J.: Evenodd: an effcient scheme for tolerating double disk failures in raid architectures. IEEE Trans. Comput. 44(2), 192–202 (1995)

    Article  MATH  Google Scholar 

  10. Cidon, A., Rumble, S.M., Stutsman, R., Katti, S., Ousterhout, J.K., Rosenblum, M.: Copysets: reducing the frequency of data loss in cloud storage. In: USENIX Annual Technical Conference, pp. 37–48. Citeseer (2013)

    Google Scholar 

  11. Ford, D., Labelle, F., Popovici, F.I., Stokely, M., Truong, V.A., Barroso, L., Grimes, C., Quinlan, S.: Availability in globally distributed storage systems. In: OSDI, pp. 61–74 (2010)

    Google Scholar 

  12. Hafner, J.L.: Weaver codes: highly fault tolerant erasure codes for storage systems. In: FAST, vol. 5, pp. 16–16 (2005)

    Google Scholar 

  13. Papailiopoulos, D.S., Luo, J., Dimakis, A.G., Huang, C., Li, J.: Simple regenerating codes: network coding for cloud storage. In: INFOCOM, 2012 Proceedings IEEE, pp. 2801–2805. IEEE (2012)

    Google Scholar 

  14. Murray, J.F., Hughes, G.F., Kreutz-Delgado, K.: Machine learning methods for predicting failures in hard drives: a multiple-instance application. J. Mach. Learn. Res. 6, 783–816 (2005)

    MathSciNet  MATH  Google Scholar 

  15. Ma, A., Douglis, F., Lu, G., Sawyer, D., Chandra, S., Hsu, W.: Raidshield: characterizing, monitoring, and proactively protecting against disk failures. In: Proceedings of the 13th USENIX Conference on File and Storage Technologies, pp. 241–256. USENIX Association (2015)

    Google Scholar 

Download references

Acknowledgments

This work is partially supported by NSF of China (grant numbers: 61373018, 11301288), Program for New Century Excellent Talents in University (grant number: NCET130301) and the Fundamental Research Funds for the Central Universities (grant number: 65141021).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Xiaoguang Liu or Zhongwei Li .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ji, X. et al. (2015). A Proactive Fault Tolerance Scheme for Large Scale Storage Systems. In: Wang, G., Zomaya, A., Martinez, G., Li, K. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2015. Lecture Notes in Computer Science(), vol 9530. Springer, Cham. https://doi.org/10.1007/978-3-319-27137-8_26

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-27137-8_26

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27136-1

  • Online ISBN: 978-3-319-27137-8

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics