A practical cross-datacenter fault-tolerance algorithm in the cloud storage system

Cheng, Yuxia; Yu, Xinjie; Chen, Wenzhi; Chang, Rui; Xiang, Yang

doi:10.1007/s10586-017-0840-5

Published: 05 April 2017

Cluster Computing, Volume 20, pages 1801–1813 (2017)

Yuxia Cheng ORCID: orcid.org/0000-0002-6928-9549¹,
Xinjie Yu²,
Wenzhi Chen²,
Rui Chang³ &
Yang Xiang¹

Abstract

In most cloud storage systems, fault tolerance is designed within the scope of a single datacenter. An entire datacenter may become unreachable or crash because of severe problems such as broken network links, power supply interruptions, and natural disasters. Designing an effective cross-datacenter fault-tolerant storage system is therefore important for protecting data security in the cloud. However, building such a system is challenging because of the high latency, low throughput, and high bandwidth costs between datacenters. In this paper, we propose a practical cross-datacenter fault-tolerant (CDFT) algorithm for cloud storage systems. The design of our algorithm considers the difficult tradeoffs among fault tolerance, latency, throughput, and network and storage costs. We propose Domain Fault Codes (DFC) and topology-aware scheduling techniques, which can tolerate the breakdown of a whole datacenter. We implemented the DFC-CDFT algorithm in a prototype cloud storage system. The experimental results show that the proposed DFC-CDFT algorithm can effectively recover data blocks after a single datacenter failure while achieving low storage and bandwidth costs.
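
To make the idea of tolerating a whole-datacenter failure with less storage than full replication concrete, the following is a minimal sketch of a single XOR parity stripe spread across datacenters. It is not the DFC construction, whose details are not given in this preview; the function names (`xor_blocks`, `place_stripe`, `recover`) and the datacenter labels are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def place_stripe(data_blocks, datacenters):
    """Place k data blocks plus one XOR parity block, one block per datacenter.

    Assumption for this sketch: exactly k + 1 datacenters are available.
    """
    if len(datacenters) != len(data_blocks) + 1:
        raise ValueError("need exactly one datacenter per block (k data + 1 parity)")
    blocks = list(data_blocks) + [xor_blocks(data_blocks)]
    return dict(zip(datacenters, blocks))


def recover(placement, failed_dc):
    """Rebuild the block held by the failed datacenter from all surviving blocks."""
    survivors = [blk for dc, blk in placement.items() if dc != failed_dc]
    return xor_blocks(survivors)


if __name__ == "__main__":
    placement = place_stripe(
        [b"abcd", b"efgh", b"ijkl"],          # k = 3 data blocks
        ["dc1", "dc2", "dc3", "dc4"],         # hypothetical datacenter names
    )
    # Losing dc2 entirely: its block is rebuilt from the other three datacenters.
    print(recover(placement, "dc2"))
```

In this toy layout the storage overhead is (k + 1)/k, here 4/3, compared with 2x for keeping a full second copy in another datacenter. The price is that recovery must read one block from every surviving datacenter, which is the kind of cross-datacenter bandwidth cost the abstract identifies as a tradeoff.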

Author information

Authors and Affiliations

Deakin University, 221 Burwood Highway, Burwood, VIC, 3125, Australia
Yuxia Cheng & Yang Xiang
Zhejiang University, 38 Zheda Road, Xihu, Hangzhou, China
Xinjie Yu & Wenzhi Chen
State Key Laboratory of Mathematical Engineering and Advanced Computing, NO.62 Science Avenue, High-Tech Zone, Zhengzhou, China
Rui Chang

Corresponding author

Correspondence to Yuxia Cheng.

About this article

Cite this article

Cheng, Y., Yu, X., Chen, W. et al. A practical cross-datacenter fault-tolerance algorithm in the cloud storage system. Cluster Comput 20, 1801–1813 (2017). https://doi.org/10.1007/s10586-017-0840-5

Received: 18 October 2016
Revised: 21 January 2017
Accepted: 27 March 2017
Published: 05 April 2017
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10586-017-0840-5
