To move or not to move: Cost optimization in a dual cloud-based storage architecture

https://doi.org/10.1016/j.jnca.2016.08.029

Abstract

IT enterprises have recently witnessed a dramatic increase in data volume and face challenges in storing and retrieving their data. Because cloud infrastructures offer storage and network resources in several geographically dispersed data centers (DCs), data can be stored and shared in a scalable and highly available manner with little or no capital investment. Owing to the diversity of pricing options and the variety of storage and network resources offered by cloud providers, enterprises face a nontrivial choice of which combination of storage options to use in order to minimize the monetary cost of managing large volumes of data. To minimize the cost of data storage management in the cloud, we propose two data object placement algorithms, one optimal and the other near optimal, that minimize residential (i.e., storage and data access operations), delay, and potential migration costs in a dual cloud-based storage architecture (i.e., the combination of a temporal and a backup DC). We evaluate our algorithms using real-world traces from Twitter. Results confirm the importance and effectiveness of the proposed algorithms and highlight the benefits of leveraging pricing differences and data migration across cloud storage providers (CSPs).

Introduction

Data volume is one of the important characteristics of cloud-based applications (e.g., Online Social Networks) and has grown from TB to PB, with an inevitable move to ZB in current IT enterprises. Statistically, 8 × 10^5 PB of data were stored in the world by the year 2000, and this number is expected to increase to 35 ZB by 2020 (Yu et al., 2015). Storing and retrieving such a data volume demands a highly available, scalable, and cost-efficient infrastructure.

Thanks to cloud infrastructures, managing such large volumes of data has been simplified and IT companies no longer need capital investment. However, this raises a major concern for these companies regarding the cost of data management in the cloud. The cost of data storage management (simply, the cost of data management) is a vital factor from the companies' perspective since it is the essential driver behind the migration to the cloud. Thus, companies favor optimizing data management cost in their cloud deployments. To optimize data management cost, choosing a suitable storage option across CSPs at the right time becomes a nontrivial task, for the following two reasons.

First, there is an array of pricing options for the variety of storage and network services across CSPs (e.g., Amazon, Google, and Microsoft Azure). CSPs currently offer at least two classes of storage service: Standard Storage (SS) and Reduced Redundancy Storage (RRS). RRS enables users to reduce cost at the expense of lower levels of redundancy (i.e., less reliability and availability) as compared to SS. These services provide users with an API to Get (read) data from storage and to Put (write) data into it. In mid-2015, Amazon and Google respectively introduced Infrequent Access Storage (IAS) and Nearline storage services, which aim at hosting objects with infrequent Gets/Puts. Both services charge a lower storage cost than their corresponding RRS but a higher cost for Gets and Puts.

Furthermore, CSPs charge users different out-network costs to read data from a DC to the Internet (typically, in-network data transfer is free). They also offer discounts for data transfer between their DCs. For example, Amazon reduces the out-network cost when data is transferred across its DCs in different regions, and Google offers free data exchange between its DCs in the same region. Thus, taking advantage of the diversity in storage and network prices (as well as service types) is an important factor in the residential cost (i.e., storage and data access operations), which is a major part of the data management cost. Note that in this paper data access operations are Get and Put.
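To make the residential cost concrete, the following is a minimal sketch of how storage, Get, and Put charges could be combined for one object in one time slot. It is not the paper's cost model: the `Pricing` fields, the choice to fold the out-network charge into each Get, and all price values are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-DC prices; every number used below is made up."""
    storage_per_gb: float      # $ per GB per time slot
    get_per_request: float     # $ per Get request
    put_per_request: float     # $ per Put request
    out_network_per_gb: float  # $ per GB read out of the DC


def residential_cost(p: Pricing, size_gb: float, gets: int, puts: int) -> float:
    """Storage plus Get/Put cost of one object for one time slot.

    Assumption: each Get also pays the out-network charge for the object size.
    """
    storage = p.storage_per_gb * size_gb
    reads = gets * (p.get_per_request + p.out_network_per_gb * size_gb)
    writes = puts * p.put_per_request
    return storage + reads + writes


# Illustrative prices only: one DC is cheap to read from, the other cheap to store in.
temporal_dc = Pricing(0.030, 0.000004, 0.00005, 0.02)
backup_dc = Pricing(0.010, 0.000010, 0.00010, 0.09)

hot = dict(size_gb=0.001, gets=10_000, puts=100)   # busy object
cold = dict(size_gb=0.001, gets=0, puts=0)         # idle object

hot_temporal = residential_cost(temporal_dc, **hot)
hot_backup = residential_cost(backup_dc, **hot)
cold_temporal = residential_cost(temporal_dc, **cold)
cold_backup = residential_cost(backup_dc, **cold)
```

Under these assumed prices, the busy object is cheaper in the DC with the lower out-network price, while the idle object is cheaper in the DC with the lower storage price.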

Second, the workload on an object stored in the cloud varies over time. Suppose that an object is a tweet or photo posted on a user's feed (e.g., the timeline in Facebook) by the user herself or her friends. Gets and Puts are usually high in the early lifetime of the object, and we say that such an object is in hot-spot status. As time passes, Gets and Puts decrease, and we say that the object is in cold-spot status. Thus, it is cost-efficient to store the object in a DC with lower out-network cost (referred to as a temporal DC) in its early lifetime, and then migrate it to a DC with lower storage cost (referred to as a backup DC). If the object is migrated from a temporal DC to a backup DC, the user incurs a migration cost. This cost is another part of the data management cost and is affected by the number of Gets and Puts and by the object size. Note that the migration cost might be zero in some cases: (i) if both DCs belong to the same provider and are in the same region, then transferring objects between them is free, as with Google, and (ii) if the temporal and backup DCs are the same and the object is just moved from one storage class to another (e.g., from SS to IAS) within the same DC.
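The migration trade-off described above can be sketched as a break-even check. This is a simplified illustration rather than the paper's formulation: the `DataCenter` fields, the `free_intra_region_transfer` flag, and charging migration only by object size at the source DC's out-network price are all assumptions, and the provider names and prices are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCenter:
    """Hypothetical DC description; all fields are illustrative assumptions."""
    name: str
    provider: str
    region: str
    free_intra_region_transfer: bool  # provider does not charge within a region
    out_network_per_gb: float         # $ per GB leaving this DC


def migration_cost(src: DataCenter, dst: DataCenter, size_gb: float) -> float:
    """One-off cost of moving an object from src to dst (simplified)."""
    if src.name == dst.name:
        # Case (ii): only the storage class changes inside the same DC.
        return 0.0
    same_place = src.provider == dst.provider and src.region == dst.region
    if same_place and src.free_intra_region_transfer:
        # Case (i): same provider and region, with free transfer between DCs.
        return 0.0
    return src.out_network_per_gb * size_gb


def worth_migrating(stay_cost_per_slot: float, move_cost_per_slot: float,
                    one_off_migration_cost: float, remaining_slots: int) -> bool:
    """Migrate only if the savings over the remaining slots exceed the move cost."""
    savings = (stay_cost_per_slot - move_cost_per_slot) * remaining_slots
    return savings > one_off_migration_cost


dc_a = DataCenter("dc-a", "ProviderX", "region-1", True, 0.12)
dc_b = DataCenter("dc-b", "ProviderX", "region-1", True, 0.12)
dc_c = DataCenter("dc-c", "ProviderY", "region-2", False, 0.09)

free_move = migration_cost(dc_a, dc_b, 2.0)   # same provider and region
paid_move = migration_cost(dc_c, dc_a, 2.0)   # across providers
long_horizon = worth_migrating(0.05, 0.01, paid_move, remaining_slots=10)
short_horizon = worth_migrating(0.05, 0.01, paid_move, remaining_slots=3)
```

With these made-up numbers, the paid migration is recovered only when enough time slots remain; over a short horizon the object is better left where it is.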

Besides discussed costs, latency for reading from (writing into) the data store is also a vital performance criterion from the user's perspective. The latency is defined as the elapsed time between issuing a Get/Put and receiving the required object. To respect this criterion, we convert latency into monetary cost, as a latency cost, and integrate it in our cost model.

In summary, by carefully taking into account the discussed price differences across CSPs and the time-varying workload, we can reduce the data management cost (i.e., residential, latency, and migration costs), which is one of users' main concerns with regard to cloud deployment. To address this issue, we make the following contributions:

  • We propose an optimal algorithm that optimizes data management cost in the dual cloud-based architecture when the workload in terms of Gets and Puts on the objects is known.

  • We also propose a near-optimal algorithm that achieves competitive cost as compared to that obtained by the optimal algorithm in the absence of future workload knowledge.

  • We demonstrate the effectiveness of the proposed algorithms by using the real-world traces from Twitter in a simulation.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the system and cost model. In Section 4, we describe our object placement algorithms to save cost. Section 5 presents our simulation experiments and the evaluation of the proposed algorithms. Finally, in Section 6, we conclude the paper with future work issues related to cost optimization of data management across Geo-replicated cloud-based data stores.

Section snippets

Related work

We compare our work in this paper with state-of-the-art works in five categories: benefits of cloud deployments, Geo-distributed cloud storage services, cloud-based Content Delivery Network (CDN), hierarchical storage management, and computing resource allocation.

Benefits of cloud deployments: Some recent studies investigate when to use cloud-based services, in particular focusing on how and when to migrate applications from a private cloud to a public one (Hajjat et al., 2010, Tak et al., 2011

System and cost model

We first describe the dual cloud-based architecture, which can lead to reduced monetary cost for applications. Then, we discuss the cost model and the objective function that should be minimized considering the objective and specifications of the architecture.

Data management cost optimization

To solve the aforementioned cost optimization problem, we first propose a dynamic algorithm that minimizes the overall cost when the future workload is assumed to be known a priori. Then, we present a heuristic algorithm that achieves a cost competitive with that of the dynamic algorithm when the objects' workload is unknown.
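As an illustration of how a known future workload allows an exact solution, the sketch below uses a generic dynamic program over time slots and candidate DCs. It is an assumption about what such an algorithm might look like, not the paper's algorithm: the function name `optimal_placement`, the input layout, and all cost values are hypothetical.

```python
from typing import Dict, List, Tuple


def optimal_placement(
    slot_costs: List[Dict[str, float]],
    migration_cost: Dict[Tuple[str, str], float],
) -> Tuple[float, List[str]]:
    """Return the minimum total cost and the DC chosen for each time slot.

    slot_costs[t][dc] is the (known) cost of keeping the object in `dc`
    during slot t; migration_cost[(src, dst)] is the one-off cost of a move.
    """
    dcs = list(slot_costs[0])
    # best[dc] = (cheapest cost so far ending in dc, placement path)
    best = {dc: (slot_costs[0][dc], [dc]) for dc in dcs}
    for costs in slot_costs[1:]:
        new_best = {}
        for dst in dcs:
            # Cheapest way to be in `dst` now: stay, or arrive from another DC.
            arrive_cost, path = min(
                (
                    (best[src][0] + (0.0 if src == dst else migration_cost[(src, dst)]),
                     best[src][1])
                    for src in dcs
                ),
                key=lambda item: item[0],
            )
            new_best[dst] = (arrive_cost + costs[dst], path + [dst])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


# Made-up per-slot costs: the object is hot for two slots, then cold.
slot_costs = [
    {"temporal": 1.0, "backup": 4.0},
    {"temporal": 1.0, "backup": 3.0},
    {"temporal": 2.0, "backup": 0.5},
    {"temporal": 2.0, "backup": 0.5},
]
moves = {("temporal", "backup"): 1.0, ("backup", "temporal"): 1.0}

total, plan = optimal_placement(slot_costs, moves)
```

In this toy instance the cheapest plan keeps the object in the temporal DC while it is hot and migrates it once it turns cold. A heuristic without future knowledge would have to make the same decision from past workload only.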

Performance evaluation

In this section, we first discuss the experimental settings in terms of workload characteristics, DCs specifications, and assignment of users to DCs. Then, we study the performance of the proposed algorithms in terms of cost saving and investigate the effect of the various parameters on the cost saving.

Conclusions and future work

Choosing storage options across CSPs for time-varying workload is critical for optimizing data management cost. In particular, issues such as when an object should be migrated and in which storage class it should be stored need to be addressed. We consider a fine-grained architecture and propose two algorithms that determine the optimal (resp. near-optimal) placement of the object with (resp. without) knowledge of the future workload. Such a fine-grained architecture provides evidence that one

Acknowledgments

We thank Rodrigo N. Calheiros, Adel Nadjaran Toosi, Sareh Fotuhi Piraghaj, Chenhao Qu, and Yali Zhao for their valuable comments in improving the quality of the paper. This work is supported by the Australian Research Council Future Fellowship and Discovery Project Grant.

References (37)

  • Elmore, A.J., Das, S., Agrawal, D., El Abbadi, A., 2011. Zephyr: Live migration in shared nothing databases for elastic...
  • Fang, Y., Wang, F., Ge, J., 2010. A task scheduling algorithm based on load balancing in cloud computing. In:...
  • Gao, P.X., Curtis, A.R., Wong, B., Keshav, S., 2012. It's not easy being green, in: Proceedings of the ACM SIGCOMM 2012...
  • Hajjat, M., et al., 2010. Cloudward bound: planning for beneficial migration of enterprise applications to the cloud. SIGCOMM Comput. Commun. Rev.
  • Jiao, L., et al., 2014. Optimizing cost for online social networks on geo-distributed clouds. Netw. IEEE/ACM Trans.
  • Kotla, R., Alvisi, L., Dahlin, M., 2007. Safestore: a durable and practical storage system. In: Proceedings of the...
  • Latency-it Costs You....
  • Li, R., Wang, S., Deng, H., Wang, R., Chang, K.C., 2012. Towards social user profiling: unified and discriminative...

Cited by (25)

    • Online cost optimization algorithms for tiered cloud storage services

      2020, Journal of Systems and Software
      Citation Excerpt :

      Users Location: Users are assigned to DCs as follows. We allocate users to DCs in South USA and West USA with the help of Google Maps Geocoding API (Mansouri and Buyya, 2016). For DC in Canada Central, we redirect the users to DC in AM-USW(Oregon) since it is the closest to these users.

    • Dynamic replication and migration of data objects with hot-spot and cold-spot statuses across storage data centers

      2019, Journal of Parallel and Distributed Computing
      Citation Excerpt :

      In contrast to all described solutions above, our proposed algorithm exploits pricing differences across data stores (offered by different CSPs) with different storage classes. This work is in line with our previous studies [14,15] and aims at reducing the monetary cost when objects receive Gets and Puts from a wide range of DCs. This work is closely aligned with applications that host a large number of users since it uses a lightweight solution as compared to [15].
