
Computer Networks

Volume 55, Issue 5, 1 April 2011, Pages 1100-1113

Towards the design of optimal data redundancy schemes for heterogeneous cloud storage infrastructures

https://doi.org/10.1016/j.comnet.2010.11.004

Abstract

Nowadays, end-users' data storage requirements are growing, demanding more capacity, more reliability and the ability to access information from anywhere. Cloud storage services meet this demand by providing transparent and reliable storage solutions. Most of these solutions are built on distributed infrastructures that rely on data redundancy to guarantee 100% data availability. Unfortunately, existing redundancy schemes very often assume that resources are homogeneous, an assumption that may increase storage costs in heterogeneous infrastructures – e.g., clouds built of voluntary resources.

In this work, we analyze how distributed redundancy schemes can be optimally deployed over heterogeneous infrastructures. Specifically, we are interested in infrastructures where nodes present different online availabilities. Considering these heterogeneities, we present a mechanism to measure data availability more precisely than existing works. Using this mechanism, we infer the optimal data placement policy that reduces the redundancy used, and hence its associated overheads. In heterogeneous settings, our results show that data redundancy can be reduced by up to 70%.

Introduction

We are witnessing today the rapid proliferation of cloud storage services as a means to provision reliable storage and backup of files. Amazon S3 [1] is a representative example, as are Mosso [27], Wuala [38] and Cleversafe [10]. All of these services offer users clean and simple storage interfaces, hiding the details of the actual location and management of resources. Most of these clouds (e.g., [1], [27]) are built on well-provisioned and well-managed infrastructures, typically data centers, that are responsible for provisioning users with storage services. Very often, these data centers are controlled exclusively by cloud providers (e.g., Amazon, Google, Microsoft, etc.), while users pay a price for the use of their resources. There is also a notion of a resource-performance guarantee between the cloud provider and the user, which ensures that users see the performance they expect.

Current cloud storage infrastructures are focused on providing users with easy interfaces and high-performance services. However, there are some classes of storage services for which the current cloud model may not fit well. For example, consider a research institute that wishes to freely share its results with other institutions as a “public service” (e.g., in the form of a digital repository to make protein research more accessible to scientists), but needs resources on which to deploy the service. Since such a service may not be commercial, its deployers may be unwilling to pay the cost of running it.

To host such services, Chandra and Weissman in [7] proposed the idea of using voluntary resources (those donated by end-users in @home systems [2] and peer-to-peer (P2P) networks [33]) to form nebulas, i.e., more dispersed, less managed clouds. Nebulas draw on many of the ideas advanced in Grids, P2P systems and distributed data centers [9].

For cloud storage, volunteer resources are attractive for several reasons:

  • Scalability: Many volunteer systems (e.g., @home systems) consist of millions of hosts; thereby providing a large amount of resource capacity and scalability;

  • Geographic dispersion: Volunteer resources are highly distributed (e.g., P2P systems); and

  • Low cost of deployment: Volunteer resources are available for free or at very low cost, which implies that a large amount of disk capacity is available for storage services.

As an example, a large nebula may run a cloud service over an @home system such as Folding@home [32]. This platform has approximately 250,000 hosts providing over 1 petaflop, which makes the Folding@home platform comparable to some of the fastest supercomputers in the world today. Since these settings exhibit a high degree of heterogeneity, both in storage and in computational capacity, along with churn and failures, providing reliability poses a huge challenge to democratizing cloud computing.

This vision is also advocated by the developers of Open Cirrus [5]. Open Cirrus is a cloud testbed for the research community that federates heterogeneous distributed data centers to offer global services, such as sign-on, monitoring and storage. Handling heterogeneity is therefore critical for these cloud platforms to be operative for the community. Indeed, heterogeneity will be one of the major challenges of cloud computing platforms in the future [35].

To meet this challenge, Chandra and Weissman [7] state that the solution resides in incorporating the reliability of nodes into service deployment decisions, in addition to basic criteria such as computational speed and storage capacity. For cloud storage, this means that the storage process must be aware of the stability of nodes when storing data on them. Since preserving data availability requires redundancy and scattering each object across multiple hosts, considering the heterogeneity of hosts is essential to optimize the amount of redundancy. An excess of redundancy results in a higher storage and communication burden, with no benefit for the user.

Unfortunately, existing storage solutions [37], [25], [29], [14] have considered homogeneous settings, where all nodes are treated equally regarding their online/offline behavior. Although this model is appropriate for commercial clouds, it is a poor fit for clouds built of heterogeneous hosts such as nebulas and distributed data centers.

Goals. The aim of this paper is to examine whether cloud storage systems can reduce the required redundancy by considering heterogeneous node availabilities:

At first glance, it seems quite intuitive that one can increase data availability by assigning more redundant information to the most stable nodes. However, if we take this approach to the extreme, i.e., by considering only the most stable nodes, we may experience a decrease in data availability, for the simple reason that there are fewer hosts over which to distribute the same amount of redundancy. This illustrates the importance of finding an optimal trade-off between the number of hosts and their online availability, which is the novel contribution of this article. By finding the appropriate trade-off, cloud systems will be able to maximize the data availability they provide while reducing the redundancy required to do it.
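To make the trade-off concrete, here is a minimal numeric sketch (all node availabilities and parameters are hypothetical, and the helper function is ours). An object coded with parameters (k, n) is retrievable whenever at least k of its n blocks sit on online nodes, so the availability of any placement can be computed exactly with a small dynamic program:

```python
def object_availability(block_counts, availabilities, k):
    """P(at least k blocks retrievable) when node i stores block_counts[i]
    blocks and is online independently with probability availabilities[i]."""
    dist = {0: 1.0}  # distribution of the number of retrievable blocks
    for b, a in zip(block_counts, availabilities):
        nxt = {}
        for s, p in dist.items():
            nxt[s + b] = nxt.get(s + b, 0.0) + p * a      # node online
            nxt[s] = nxt.get(s, 0.0) + p * (1.0 - a)      # node offline
        dist = nxt
    return sum(p for s, p in dist.items() if s >= k)

# Hypothetical population: 4 stable nodes (a = 0.9) and 12 unstable ones (a = 0.5).
nodes = sorted([0.9] * 4 + [0.5] * 12, reverse=True)
k, n = 8, 16
for m in (16, 8, 4, 2, 1):      # spread the n blocks over the m most stable nodes
    counts = [n // m + (1 if i < n % m else 0) for i in range(m)]
    print(m, round(object_availability(counts, nodes[:m], k), 4))
```

For these numbers, availability rises from about 0.87 (all 16 nodes) to about 0.996 (the 4 stable nodes only), but then falls to 0.99 and finally 0.9 as the blocks concentrate on 2 and then 1 node: with too few hosts, a single node failure loses too many blocks at once.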

However, this maximization implies that availability-aware cloud systems should be able:

  • To measure data availability accurately;

  • To determine the optimal quantity of information to be assigned to each host; and

  • To find the minimum data redundancy that maintains the targeted data availability.

Throughout the manuscript, we build our results on a redundancy scheme based on generic erasure codes. The main reason is that erasure codes are more flexible than traditional schemes based on replication, a feature that makes them very appropriate for heterogeneous platforms. Additionally, they are usually more efficient in terms of data overhead than replication [29], [36], [25].
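As a rough, self-contained illustration of this overhead gap (numbers hypothetical; one block per node and independent node failures assumed), one can compare the minimum redundancy each scheme needs to hit the same availability target:

```python
from math import comb

def erasure_availability(n, k, a):
    """P(at least k of n blocks survive) with one block per node,
    each node online independently with probability a."""
    return sum(comb(n, i) * a**i * (1 - a)**(n - i) for i in range(k, n + 1))

a, target, k = 0.7, 0.999, 8

# Replication: d = 1 - (1 - a)**c, and the redundancy ratio is the copy count c.
c = 1
while 1 - (1 - a)**c < target:
    c += 1

# Erasure coding: grow n until the target is met; the redundancy ratio is n / k.
n = k
while erasure_availability(n, k, a) < target:
    n += 1

print("replication r =", c)       # 6 (six full copies)
print("erasure     r =", n / k)   # 2.625 (n = 21, k = 8)
```

Under these assumptions the erasure code reaches the same 99.9% availability target with less than half the redundancy of replication.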

Contributions. In this article, we address the above three aspects to build a heterogeneity-aware cloud storage system. Our three main contributions mirror these aspects:

  • 1. We develop an analytical framework to compute data availability in heterogeneous environments. Since calculating data availability becomes computationally costly as the set of storage nodes grows, we propose a Monte Carlo method to estimate the real value in a less computationally expensive way.

  • 2. Since determining the optimal amount of redundancy to assign to each host is computationally hard, we propose a novel heuristic based on particle swarm optimization (PSO) [21]. From our results, we infer a simple allocation function that finds the minimum required redundancy optimally.

  • 3. Finally, we provide a simple iterative algorithm to determine the minimum redundancy required to guarantee the data availability requirements of different possible storage applications.

Our theoretical and simulation-based results show that a storage system implementing our heterogeneity-aware erasure code scheme could reduce redundancy by up to 70% in highly heterogeneous clouds.

The rest of the paper is organized as follows. Section 2 describes related work. Sections 3 and 4 analytically pose our storage model and the problem statement. Then, the successive sections address each of the three stated contributions. Section 5 describes how data availability can be measured in heterogeneous environments, gives the computational complexity of this measurement, and proposes a Monte Carlo method to approximate this value when it cannot be measured analytically. In Section 6 we find the optimal number of redundant blocks to assign to each storage node, and in Section 7 we find the data redundancy required to guarantee high data availability. Section 8 presents the benefits of heterogeneity-aware redundancy schemes. Finally, in Section 9, we present the conclusions and further research lines.

Section snippets

Related work

Cloud storage services have gained popularity in recent years [10], [38], [1], [27]. These services allow users to store data off-site, masking the complexities of the infrastructure supporting the storage service. To store large amounts of data from thousands of users, cloud storage systems build their services over distributed storage infrastructures, which are more scalable and more reliable than centralized solutions. Different works have proposed distributed data storage infrastructures, some …

Data storage model

Throughout this paper, we will consider a distributed storage system working in its steady state: the number of nodes and the number of stored objects are kept constant. Let $H$ represent the set of all storage nodes in the system. The storage process then stores each data object in a small subset of nodes, $N \subseteq H$. In this work we do not address how the nodes in $N$ are chosen. A possible solution is to ask a centralized directory service for a list of nodes with free storage capacity. However, …

Problem statement

Existing storage systems do not consider the individual node availabilities $a_i$; they use the mean node availability $\bar{a}$ as the basis of their models. Recall that the mean node availability is given by

$$\bar{a} = \frac{1}{|N|} \sum_{i \in N} a_i.$$

Without considering heterogeneities, all nodes are treated equally, and thus the assignment function stores the same number of redundant blocks on every node. Doing otherwise would raise the superfluous question of which nodes should store more redundancy. Under homogeneous …
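The error introduced by collapsing heterogeneous availabilities into $\bar{a}$ is easy to exhibit. A minimal sketch (hypothetical numbers, one block per node) comparing the homogeneous model, i.e., a binomial tail evaluated at the mean, against the exact heterogeneous value:

```python
from itertools import product
from math import comb, prod

# Hypothetical node set: five stable (0.9) and five unstable (0.3) nodes; k = 5.
avail = [0.9] * 5 + [0.3] * 5
k, n = 5, len(avail)
a_bar = sum(avail) / n                       # mean availability = 0.6

# Homogeneous model: binomial tail computed with the mean availability.
d_homo = sum(comb(n, i) * a_bar**i * (1 - a_bar)**(n - i) for i in range(k, n + 1))

# Exact heterogeneous value: enumerate all on/off patterns (fine for small n).
d_exact = sum(
    prod(a if up else 1 - a for up, a in zip(pattern, avail))
    for pattern in product((1, 0), repeat=n)
    if sum(pattern) >= k
)
print(round(d_homo, 4), round(d_exact, 4))   # ~0.8338 vs ~0.8991
```

Here the homogeneous model underestimates the real availability, so a system tuned with it would provision more redundancy than necessary; with other availability mixes the bias can go the other way.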

Measuring data availability

In this section, we provide the basic framework to measure data availability, $d$, in a specific storage scenario. As previously defined, data availability is a probabilistic metric that depends on several parameters. Among all these parameters, we are interested in measuring $d$ as a function of: (1) the erasure code's parameters $k$ and $n$, (2) the assignment function $g$, and (3) the set of nodes $N$. Hence, we aim to obtain a function $D$ to measure data availability, such that $d = D(k, n, g, N)$. …
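A minimal Monte Carlo sketch of such an estimator (the function name and the sample scenario are ours): repeatedly sample which nodes are online from their individual availabilities, count the retrievable blocks, and average the indicator that at least $k$ of them survive.

```python
import random

def estimate_availability(block_counts, availabilities, k, trials=100_000):
    """Monte Carlo estimate of d = D(k, n, g, N): node i holds block_counts[i]
    of the n blocks (the assignment g) and is online with probability
    availabilities[i]."""
    hits = 0
    for _ in range(trials):
        retrievable = sum(b for b, a in zip(block_counts, availabilities)
                          if random.random() < a)
        if retrievable >= k:
            hits += 1
    return hits / trials

# Hypothetical scenario: n = 16 blocks spread unevenly over 7 nodes, k = 8.
block_counts = [4, 3, 3, 2, 2, 1, 1]
availabilities = [0.95, 0.9, 0.9, 0.7, 0.7, 0.5, 0.5]
print(estimate_availability(block_counts, availabilities, k=8))
```

The estimator's standard error shrinks as $1/\sqrt{\text{trials}}$, so the precision/cost trade-off is controlled by a single parameter.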

Finding optimal assignment function

Having solved the problem of how to measure data availability, in this section we face the problem of how to assign the $n$ redundant blocks in order to maximize data availability. Unlike the homogeneous case, where each node is responsible for one block, in the heterogeneous case we will try to increase data availability by assigning more blocks to the more stable nodes. However, finding the optimal data assignment function $g$ is another computationally hard task. Determining all the possible …
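For flavor, here is a minimal PSO sketch under an encoding of our own choosing (a sketch, not the paper's exact formulation): each particle is a vector of continuous per-node weights, decoded into an integer block assignment summing to $n$, and scored by the exact availability of that assignment.

```python
import random

def availability(counts, avail, k):
    """Exact P(at least k blocks retrievable), via a small dynamic program."""
    dist = {0: 1.0}
    for b, a in zip(counts, avail):
        nxt = {}
        for s, p in dist.items():
            nxt[s + b] = nxt.get(s + b, 0.0) + p * a
            nxt[s] = nxt.get(s, 0.0) + p * (1.0 - a)
        dist = nxt
    return sum(p for s, p in dist.items() if s >= k)

def decode(weights, n):
    """Turn continuous positive weights into integer block counts summing to n."""
    total = sum(weights)
    shares = [n * w / total for w in weights]
    counts = [int(s) for s in shares]
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[: n - sum(counts)]:       # largest fractional parts win
        counts[i] += 1
    return counts

def pso_assignment(avail, k, n, particles=30, iters=100):
    dim = len(avail)
    xs = [[random.uniform(0.01, 1.0) for _ in range(dim)] for _ in range(particles)]
    vs = [[0.0] * dim for _ in range(particles)]
    pbest = [x[:] for x in xs]
    pscore = [availability(decode(x, n), avail, k) for x in xs]
    best = max(range(particles), key=pscore.__getitem__)
    gbest, gscore = pbest[best][:], pscore[best]
    for _ in range(iters):
        for i in range(particles):
            for d in range(dim):   # standard inertia + cognitive + social update
                r1, r2 = random.random(), random.random()
                vs[i][d] = (0.7 * vs[i][d]
                            + 1.5 * r1 * (pbest[i][d] - xs[i][d])
                            + 1.5 * r2 * (gbest[d] - xs[i][d]))
                xs[i][d] = min(1.0, max(0.01, xs[i][d] + vs[i][d]))
            score = availability(decode(xs[i], n), avail, k)
            if score > pscore[i]:
                pbest[i], pscore[i] = xs[i][:], score
                if score > gscore:
                    gbest, gscore = xs[i][:], score
    return decode(gbest, n), gscore

avail = [0.95, 0.9, 0.8, 0.6, 0.5, 0.4, 0.3, 0.3]   # hypothetical availabilities
print(pso_assignment(avail, k=4, n=12))
```

The decoding step keeps the search space continuous, which is what lets an off-the-shelf PSO update rule explore what is fundamentally an integer assignment problem.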

Finding optimal redundancy

Once we know how to measure data availability and the optimal assignment function, it only remains to find the minimum amount of redundancy required to guarantee a minimum data availability $\hat{d}$. Finding the redundancy means finding the minimum ratio $r = n/k$ that achieves $\hat{d}$. However, since two different parameters, $k$ and $n$, are involved in $r$, different pairs of $k$ and $n$ can be used to achieve $\hat{d}$. In the existing literature, storage systems initially set $k$ to a fixed value and increase $n$ from $n = k$ …
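A sketch of such an iterative search (for self-containment we assume a simple availability-proportional allocation as a stand-in for the optimal assignment function, and reuse the exact availability routine from the earlier sketches):

```python
def availability(counts, avail, k):
    """Exact P(at least k blocks retrievable), via a small dynamic program."""
    dist = {0: 1.0}
    for b, a in zip(counts, avail):
        nxt = {}
        for s, p in dist.items():
            nxt[s + b] = nxt.get(s + b, 0.0) + p * a
            nxt[s] = nxt.get(s, 0.0) + p * (1.0 - a)
        dist = nxt
    return sum(p for s, p in dist.items() if s >= k)

def assign(avail, n):
    """Stand-in allocation: block counts roughly proportional to availability."""
    total = sum(avail)
    shares = [n * a / total for a in avail]
    counts = [int(s) for s in shares]
    order = sorted(range(len(avail)), key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[: n - sum(counts)]:
        counts[i] += 1
    return counts

def min_redundancy(avail, k, d_hat):
    """Fix k and increase n from n = k until the target availability is met."""
    n = k
    while availability(assign(avail, n), avail, k) < d_hat:
        n += 1
    return n, n / k

avail = [0.9, 0.9, 0.8, 0.7, 0.6, 0.5, 0.5, 0.4]    # hypothetical availabilities
print(min_redundancy(avail, k=4, d_hat=0.999))      # minimum n and r = n/k
```

Since the availability is non-decreasing in $n$ for a sensible allocation, the first $n$ that meets $\hat{d}$ yields the minimum ratio $r = n/k$ for that $k$.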

Redundancy savings

Our validation is focused on answering the following question: Is it worthwhile to use a heterogeneity-aware redundancy scheme instead of a typical, simple homogeneous one?

To answer this question we define the redundancy saving ratio (RSR) metric. This metric measures the savings in redundancy obtained by considering a heterogeneous storage system instead of a homogeneous one. Let $r_{homo}$ and $r_{heter}$ be the data redundancies that a homogeneous system and a heterogeneous system need in order to …
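Although the snippet cuts off before the definition, a plausible formalization consistent with the reported savings of up to 70% is the relative redundancy reduction

$$\mathrm{RSR} = \frac{r_{homo} - r_{heter}}{r_{homo}},$$

so that, for example, lowering the redundancy ratio from 4 to 1.2 would yield an RSR of 0.7.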

Conclusions and further work

Existing cloud storage services are designed and built on the assumption that all storage backends constitute a homogeneous set of distributed resources. This assumption leads these systems to consider a single online availability for all nodes in the cloud, and thus to optimize their data redundancy according to this assumption. However, as we showed, this assumption simplifies the way data availability is measured, but it introduces an error that causes an increase in the data redundancy, and …

Acknowledgments

We would like to express our gratitude to the anonymous reviewers for the insights and comments provided during the review process, which have greatly contributed to improving the quality of the original manuscript.


References (39)

  • Amazon.com, 2009. Amazon S3....
  • D.P. Anderson, BOINC: a system for public-resource computing and storage, in: Proceedings of the Fifth IEEE/ACM...
  • C. Blake, R. Rodrigues, High availability, scalable storage, dynamic peer networks: pick two, in: Proceedings of the...
  • W.J. Bolosky, J.R. Douceur, D. Ely, M. Theimer, Feasibility of a serverless distributed file system deployed on an...
  • R. Campbell, I. Gupta, M. Heath, S.Y. Ko, M. Kozuch, M. Kunze, T. Kwan, K. Lai, H.Y. Lee, M. Lyons, D. Milojicic, D....
  • A.L. Cauchy, Cours d’analyse de l’École Royale Polytechnique, première partie: analyse algébrique,...
  • Chandra, A., Weissman, J., 2009. Nebulas: Using distributed voluntary resources to build clouds, in: Proceedings of the...
  • B.G. Chun, F. Dabek, A. Haeberlen, E. Sit, H. Weatherspoon, M.F. Kaashoek, J. Kubiatowicz, Efficient replica...
  • K. Church, A. Greenberg, J. Hamilton, On delivering embarrassingly distributed cloud services, in: Proceedings of...
  • CleverSafe, 2010, Cleversafe,...
  • L. Cox, B. Noble, Pastiche: making backup cheap and easy, in: Proceedings of Fifth USENIX Symposium on Operating...
  • A. Datta, K. Aberer, Internet-scale storage systems under churn. A study of the steady-state using Markov models, in:...
  • Y. Deng et al., A heterogeneous storage grid enabled by grid service, SIGOPS Oper. Syst. Rev. (2007)
  • A. Dimakis, P. Godfrey, M. Wainwright, K. Ramchandran, Network coding for distributed storage systems, in: Proceedings...
  • A. Duminuco, E.W. Biersack, Hierarchical codes: how to make erasure codes attractive for peer-to-peer storage systems,...
  • A. Duminuco, E.W. Biersack, T. En-Najjary, Proactive replication in distributed storage systems using machine...
  • M.L. Fakult, PeerStore: better performance by relaxing in peer-to-peer backup, in: Proceedings of the Fourth...
  • B. Godfrey, Repository of availability traces, 2010,...
  • S. Guha, N. Daswani, R. Jain, An experimental study of the Skype peer-to-peer VoIP system, in: Proceedings of the Fifth...

    Lluís Pàmies i Juárez, University Rovira i Virgili, Spain. Ph.D. student at Universitat Rovira i Virgili, Tarragona, Spain, in the Department of Computer Engineering and Mathematics, under the supervision of Dr. Pedro García López. He received his M.Sc. in Computer Science Engineering from Universitat Rovira i Virgili in 2007. His research interests are in distributed systems and peer-to-peer networks.

    Pedro García López, University Rovira i Virgili, Spain. Pedro García is a professor in the Computer Engineering and Mathematics Department at the University Rovira i Virgili (Spain). He obtained his Ph.D. in 2003 from the University of Murcia, on collaborative distributed systems. During his Ph.D. he also worked at the University of Ghent (Belgium) and GMD-FIT in Bonn. His research topics are distributed systems, peer-to-peer systems, software architectures, middleware and collaborative environments. He has published more than 50 papers and participated in several Spanish and European research projects. He currently leads the “Architectures and Telematic Services” research group in Tarragona and coordinates the URV team in the projects IST-POPEYE (Peer-to-Peer Collaborative Working Environments over Mobile Ad-Hoc Networks) and P2PGRID (Self-Adjusting Peer-to-Peer and Grid Systems).

    Marc Sánchez-Artigas, University Rovira i Virgili, Spain. Marc Sànchez Artigas is a Ph.D. student in the Department of Computer Science and Mathematics at Universitat Rovira i Virgili, Spain. His research interests include distributed systems and overlay infrastructures for structured P2P networks. Artigas has a BS and an MS in computer science from Universitat Rovira i Virgili. Contact him at [email protected].

    Blas Herrera, University Rovira i Virgili, Spain. Blas Herrera is a professor in the Computer Engineering and Mathematics Department at the University Rovira i Virgili (Spain). He obtained his Ph.D. in 1994 from the Universitat Autònoma de Barcelona (Spain), on differential geometry. He is currently an active researcher in mechanics, geometry and computer engineering.

    An early version of this work was presented at ICPP’09 [26]. This work has been partially funded by the Spanish Ministry of Science and Innovation through project DELFIN, Ref. TIN2010-20140-C03-03.
