Multi-site data distribution for disaster recovery—A planning framework

https://doi.org/10.1016/j.future.2014.07.007

Highlights

  • We describe a fault-tolerant multi-cloud data backup scheme using erasure coding.

  • The data is distributed using a plan driven by a multi-criteria optimization.

  • The plan uses parameters such as cost, replication level, and recoverability objectives.

  • Both single customer and multiple customer cases are tackled.

  • Simulation results for the plans and sensitivity analyses are discussed.

Abstract

In this paper, we present DDP-DR: a Data Distribution Planner for Disaster Recovery. DDP-DR provides an optimal way of backing up critical business data into data centers (DCs) across several geographic locations. DDP-DR provides a plan for replication of backup data across a potentially large number of data centers so that (i) the client data is recoverable in the event of catastrophic failure at one or more data centers (disaster recovery) and (ii) the client data is replicated and distributed in an optimal way, taking into consideration major business criteria such as cost of storage, protection level against site failures, and other business and operational parameters such as recovery point objective (RPO) and recovery time objective (RTO). The planner uses Erasure Coding (EC) to divide and codify data chunks into fragments and to distribute the fragments across DR sites or storage zones so that failure of one or more sites/zones can be tolerated and the data can be regenerated. We describe data distribution planning approaches for both single-customer and multiple-customer scenarios.

Introduction

In today’s enterprise computing, data centers generate an overwhelming volume of data. Applications such as particle physics  [1], storing of web pages and indexes  [2], social networking applications, and engineering applications of pharmaceutical and semiconductor companies can easily generate petabytes of data over days and weeks. Disaster Recovery (DR) and Business Continuity Planning (BCP) require that critical enterprise data be backed up periodically and kept in geographically separate and secure locations. In the event of operational disruption at the primary site, operations can be resumed at an alternate site to which the backed-up data and log files are shipped and where applications/services can be instantiated again. Additionally, recent regulatory and compliance standards such as HIPAA, SOX and GLBA mandate that all operational data be retained for a certain period of time and made available for auditing. With the increasing volume of data and the increasing emphasis on service availability and data retention, the technology and process of handling backup and recovery have come under renewed scrutiny.

Traditionally, data backup and archival have been done using magnetic tapes, which are processed and transported to a remote location. However, such a procedure is manual and cumbersome (and therefore slow), and rapid data restoration and service resumption are often not possible. Recently, with the advent of cheap, improved storage, online disk backup technology, and advances in networking, online remote backup options have become attractive [3]. Storage area network and virtualization technologies have become sophisticated enough to ship a storage volume snapshot to a remote site  [4]. Increasingly, open-source tools such as RSync  [5] are being used to achieve the same goals, albeit with lower efficiency.

Cloud computing and cheap online storage technologies promise to change the landscape of disaster recovery. Data from the primary site is now backed up in the cloud and/or in multiple geographically separated data centers to improve fault tolerance and availability  [6]. Several cloud infrastructure and storage vendors, such as Amazon S3 and Glacier  [7] and Rackspace  [8], provide storage for backup. Other vendors, such as Zamanda  [9], use cloud storage like Amazon S3 to provide backup services. Organizations are also adopting a hybrid approach, where very critical or sensitive data is stored within the enterprise and non-sensitive data is dispatched to the cloud. While backup using a single cloud or online storage provider is cheap and practical, storing encrypted backup data with a single third-party storage provider may not be prudent, owing to lack of operational control and to security, reliability and availability issues. It is advisable that organizations hedge their bets by replicating data to multiple cloud locations and data centers. In large organizations with data centers (DCs) in multiple geographies, DR may also involve using one regional data center as an alternate site for another by replicating data. Replicating DR data across sites improves availability by reducing the risk of simultaneous correlated failures.

In this context, we present a schematic diagram for multi-site DR in Fig. 1. The primary site (DC1) hosts the servers and storage for production, test, and development. Historical operational data is periodically copied to staging servers, where aggregation and de-duplication are run. The “backup”-ready data is then replicated to multiple data centers that the firm owns (DC2 and DC3) and/or to public cloud storage providers. In the event of failure at the primary site, the data can be recovered to the recovery (also called secondary) site on demand. Data recovery or retrieval may require additional compute resources to carry out costly operations such as decompression and decryption of data. Therefore, recovery can optionally be offloaded to a server farm (DC4) or to dedicated processing hardware in the DR sites that can do bulk recovery for multiple customers within stipulated time bounds.

Since we propose a distributed storage substrate, one possible mechanism to maintain data consistency across backup sites is to create a peer-to-peer storage overlay layer across the sites. Various distributed archival storage substrates are discussed in the literature  [10], [11], but this is not the primary concern of this paper.

The current approach to multi-site backup is to replicate data to one or more remote sites so that correlated storage or network failures do not hamper data availability. Replication, however, increases data redundancy linearly with the number of sites. Plain replication, even with data compression technologies, makes the data footprint quite large. Additionally, the strategy of data placement and distribution is often not driven from a recovery standpoint (there is no way of telling whether data recovery can happen on time if any of the primary sites fails), and the overall storage topology may become sub-optimal.
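To make the redundancy gap concrete, the following sketch compares the footprint of plain n-way replication with that of a (k, m) erasure code that tolerates the loss of any m fragments. The numbers are purely illustrative and are not taken from the paper:

```python
# Illustrative comparison (hypothetical figures, not the paper's data):
# n-way replication vs. a (k, m) maximum-distance-separable erasure code.

def replication_footprint(data_gb: float, n_sites: int) -> float:
    """Full copy at every site: footprint grows linearly with sites."""
    return data_gb * n_sites

def erasure_footprint(data_gb: float, k: int, m: int) -> float:
    """k data fragments plus m coded fragments: overhead is (k + m) / k."""
    return data_gb * (k + m) / k

data_gb = 1000.0  # 1 TB of backup data

# Surviving 2 site losses with 3-way replication: 3x footprint.
rep = replication_footprint(data_gb, 3)

# Surviving any 2 fragment losses with a (4, 2) code: only 1.5x footprint.
ec = erasure_footprint(data_gb, k=4, m=2)

print(f"replication: {rep:.0f} GB, erasure coding: {ec:.0f} GB")
```

With the same failure-tolerance target, the erasure-coded footprint is half that of triple replication, which is the gap the planner exploits.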

Therefore, there is a need for rationalizing and optimizing distributed backup storage—from the point of view of data footprint, cost, security, availability and data recoverability (within time and cost). Disaster Recovery planning  [12] often overlooks this critical issue.

In this paper, we propose a novel planning approach, called the Data Distribution Planner for Disaster Recovery (DDP-DR in short). The planner creates a plan for distributing data across heterogeneous DR sites that satisfies multiple objectives: the overall cost of storage can be kept below a customer-specified bound, the data can survive outages of up to a pre-specified number of sites, and the distributed data can be recovered and re-assembled within a customer-specified time bound. To reduce the data replication footprint, we take recourse to Erasure Coding (EC)  [13]. We combine an EC-based data encoding technique with a Linear Programming (LP) based constraint satisfaction algorithm. Backup data files are broken up through EC into multiple data and code fragments. The coding rate of the EC determines how many data and code fragments are created from a data file. In this work, we drive the coding rate, and therefore the data footprint, through a set of optimized parameters based on the failure protection level required by the customers. This data fragmentation and coding technique creates considerably less data redundancy overhead than the conventional multi-site replication method. The optimization, a key part of the work, is done through constraint-based mathematical problem formulation and the Linear Programming method.

In our research problem formulation, we analyze two likely scenarios for cloud-based distributed DR system planning. In the first scenario, we discuss the problem from the viewpoint of a customer (an institution or an individual) who wishes to back up and archive data remotely across different cloud-based data centers. The customer wants a data distribution strategy such that the data is distributed in a redundant manner to achieve a certain degree of tolerance to the failure of the backup data centers; at the same time, the customer has a threshold on the time taken to back up incremental data at a certain interval and to recover data within a certain period to a preferred secondary site if the primary site is struck by disaster. Additionally, the customer has a limit on how much can be spent as the rental cost of storage. We term this the Single Customer Problem. In the second scenario, we discuss the formulation from the viewpoint of a DR service provider running (or renting out) multiple storage sites, which must accommodate individual customers with different needs for data redundancy, cost, and data backup/recovery time bounds. We call this the Multi-Customer Problem.
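The shape of the Single Customer Problem can be sketched in miniature: choose how many fragments of a (k, m) erasure-coded backup to place at each data center so that every fragment is placed, no site holds more fragments than a per-site cap (bounding the loss from any one site failure), and total storage cost is minimized. The paper solves this with Linear Programming (CPLEX); the brute-force enumeration below is only a toy illustration of the constraint structure, and all numbers (costs, caps, the (4, 2) code) are hypothetical:

```python
# Toy version of the Single Customer placement problem (hypothetical data).
# The paper uses an LP solver; here we enumerate placements to show the
# objective and constraints in their simplest form.
from itertools import product

k, m = 4, 2                      # (k, m) erasure code: survives any m losses
n_frag = k + m                   # fragments to distribute
cost = [0.10, 0.12, 0.08, 0.15]  # $/fragment-month at each of 4 DCs
cap = [2, 2, 2, 2]               # per-DC fragment cap (limits loss per site)

best = None
for x in product(*(range(c + 1) for c in cap)):  # x[i] = fragments at DC i
    if sum(x) != n_frag:
        continue                                 # every fragment must be placed
    total = sum(ci * xi for ci, xi in zip(cost, x))
    if best is None or total < best[0]:
        best = (total, x)

print(best)  # cheapest feasible placement
```

In the full formulation the decision variables also interact with backup/recovery time bounds and heterogeneous storage types, which is why an LP solver rather than enumeration is needed at realistic scale.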

As a research methodology, we take recourse to mathematical formulation of the problem and validation through simulation. We take hypothetical but realistic values for storage media cost, cost and bandwidth of network links, and representative data volumes for archival and backup. The representative cost values are obtained from the public data of storage vendors. The storage costs of commercial public cloud providers are not used in our calculation, as the actual type and nature of the underlying storage media are not known to us. It is more likely that a large corporation or scientific institution adopting multi-data-center archival will rent storage directly from providers (such as large Telcos) and will therefore have the option to select the type of media, replication and bandwidth. However, our formulation can work for public cloud providers as well if similar data is made available by them.

To our knowledge, the proposed formulation and approach for planning data placement in DDP-DR is unique, as no similar approach has been reported. Cloud storage vendors and DR service providers highlight the virtues of distributing data replicas across regions and data centers for availability reasons  [14], but no structured mathematical formulation has been reported so far. Similarly, traditional work on static and dynamic data placement and replication in Grid environments focuses on efficient job execution through the enablement of data locality  [15], [16]. The usage of distributed cloud-based infrastructure as a storage substrate has been studied extensively from the perspective of system development, not from the perspective of data placement planning.

The organization of the paper is as follows: in Section  2, we describe the topology, data backup and recovery processes associated with DR, and also state our assumptions. We present the formulation for the Single Customer Problem in Section  3. The model extension for the Multi-Customer variant is presented in Section  4. The implementation and sample results are described in Section  5. Related work is described in Section  6, followed by the conclusion and possible extensions of the work (Section  7).

Section snippets

DR process and topological assumptions

In DR, one of the critical IT processes involves copying and distributing data offsite to remote location(s) and keeping the data synchronized as closely as possible with the primary copy. This is because, in the event of failure, we want to resume operations with data that is as near in time as possible to the point when the operation failed. The backup data is treated as a single file even though it can be an aggregate of multiple user and application data, log files, system configuration files etc. We call

DDP-DR approach for Single Customer Problem

Consider a scenario where a customer (an institution or an individual) wants to back up data to a multitude of remotely located cloud-based data centers to support a disaster recovery strategy. As motivated in Section  1.3, such a customer will want to devise an optimized plan for data distribution, where data can be backed up across possibly heterogeneous storage nodes in remote data centers across different zones or geographies in a manner such that the time limits for backup and retrieval are met

DDP-DR approach for Multi-Customer Problem

For the multiple-customer case, we motivate the problem from the perspective of a storage or backup service provider (or department) that provides multi-data-center or cloud-based storage service. Consider a DR provider or a consulting company which owns or leases storage zones and network links across different data centers so that it can host its tenants' (customers') archived data redundantly across the centers, providing data availability and recovery if the primary operation site of any of its

Implementation and sample results

DDP-DR has been implemented as a standalone software tool. The tool accepts the business SLAs through an input file as a set of values, bounds and parameters. For erasure coding, we use a standard MDS module implementing the Reed–Solomon technique. A linear programming solver (IBM ILOG CPLEX Optimizer) was used. We have taken extensive runs with the number of DCs varying from 3 to 50. For infrastructure and customer parameters similar to those in the illustrative example to be discussed

Scalability of the solution approach

The scalability of our solution, i.e. the ability to produce results for large planning sets, depends primarily on the scalability of the underlying linear programming solver. With the IBM ILOG CPLEX Optimizer, we have taken extensive runs with the number of DCs varying from 3 to 50. A summary of the results for the single-customer case, with 12 and 24 DCs, is presented below.

We generated a set of input values for Storage Types,

Related work

There is little published literature on the planning of distributed storage on the basis of performance and recoverability SLAs or for DR purposes. The Minerva system, discussed by Alvarez et al.  [25], works on creating an array of storage nodes based on performance (transaction I/O supported, disk characteristics etc.). This work, however, focuses on how to design and optimize the storage array for optimal transaction rates and does not deal with the case of archival data with incremental data addition

Conclusion and future work

With the advent of sophisticated online backup, multi-geography sites and cloud computing, multi-site DR is becoming increasingly popular for better reliability and availability of operational data. In this paper, we describe DDP-DR, a novel data distribution planner for multi-site disaster recovery, where backup data can reside in multiple data centers including the public cloud. The plan takes customer policy-level constraints and infrastructural constraints into consideration to suggest a series

Shubhashis Sengupta is a senior researcher with Accenture Technology Labs. He received a Bachelor's degree in Computer Science from Jadavpur University, Calcutta, and a Ph.D. in Management Information Systems from the Indian Institute of Management. His research interests include distributed computing, performance engineering, Grids, cloud and software architecture. Shubhashis is a senior member of ACM and a member of the IEEE Computer Society.

References (44)

  • U. Cibej et al.

    The complexity of static data replication in data grids

    J. Parallel Comput.

    (2005)
  • J. Schiers, Multi-PB distributed databases; IT Division, DB Group, CERN, Presentation available at:...
  • A. Gulli, A. Signorini, The indexable web is more than 11.5 billion pages,...
  • Mozy Online Backup Storage at...
  • NetApp SnapMirror technical documentation at...
  • K.S. Zahed, P.S. Rani, U.V. Saradhi, A. Potluri, Reducing storage requirements of snapshot backups based on rsync...
  • J. Chidambaram, C. Prabhu, et al. A methodology for high availability of data for business continuity planning/disaster...
  • Amazon Glacier storage system at...
  • Rackspace Cloudfile storage system at...
  • Zamanda cloud backup at...
  • J. Kubiatowicz, D. Bindel, Y. Chen, S. Czerwinski, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer,...
  • A. Adya, W.J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J.R. Douceur, J. Howell, J.R. Lorch, M. Theimer, R.P....
  • M. Wallace et al.

    The Disaster Recovery Handbook—A Step-by-Step Plan to Ensure Business Continuity and Protect Vital Operations, Facilities, and Assets

    (2007)
  • J.S. Plank, Erasure codes for storage applications; Tutorial Given at FAST-2005: 4th Usenix Conference on File and...
  • Disaster Recovery issues whitepaper at...
  • W.H. Bell, D.G. Cameron, R. Carvajal-Schiaffino, A.P. Millar, K. Stockinger, F. Zini, Evaluation of an economy-based...
  • X. Tang et al.

    QoS-aware replica placement for content distribution

    IEEE Trans. Parallel Distrib. Syst.

    (2005)
  • H. Bin

    A general architecture for monitoring data storage with openstack cloud storage and RDBMS

    Appl. Mech. Mater.

    (2013)
  • R. Rodrigues et al.

    High availability in DHTs: erasure coding vs. replication

  • R. Rodrigues, B. Liskov, High availability in DHTs: erasure coding vs replication, in: Proceedings of International...
  • Z. Chen, X. Wang, Y. Jin, W. Zhou, Exploring fault-tolerant distributed storage system using GE code, in: Proceedings...
  • H. Weatherspoon, J.D. Kubiatowicz, Erasure coding vs. replication: A quantitative comparison, Lecture Notes in Computer...


Annervaz K M is a researcher at Accenture Technology Labs, Bangalore. He graduated with a Master's in Computer Science and Engineering from the Indian Institute of Technology Bombay. His main research interests are in Algorithms, System Models, Formal Verification, Machine Learning and Optimization Techniques. Currently his research focus is on the application of Optimization and Machine Learning techniques to real-world problems, especially in the areas of Software Engineering and System Architectures.
