
CAST: Tiering Storage for Data Analytics in the Cloud

Published: 15 June 2015

Abstract

Enterprises are increasingly moving their big data analytics to the cloud with the goal of reducing costs without sacrificing application performance. Cloud service providers offer their tenants a myriad of storage options which, while flexible, make the choice of storage deployment nontrivial. Crafting deployment scenarios that leverage these choices in a cost-effective manner, under the unique pricing models and multi-tenancy dynamics of the cloud environment, is a key challenge in designing cloud-based data analytics frameworks.
In this paper, we propose CAST, a Cloud Analytics Storage Tiering solution that cloud tenants can use to reduce monetary cost and improve the performance of analytics workloads. The approach takes a first step towards providing storage tiering support for data analytics in the cloud. CAST performs offline workload profiling to construct job performance prediction models for different cloud storage services, and combines these models with workload specifications and high-level tenant goals to generate a cost-effective data placement and storage provisioning plan. Furthermore, we build CAST++, which enhances CAST's optimization model by incorporating the data reuse patterns and cross-job interdependencies common in realistic analytics workloads. Tests with production workload traces from Facebook and a 400-core Google Cloud-based Hadoop cluster demonstrate that CAST++ achieves a 1.21x performance improvement and reduces deployment costs by 51.4% compared to local storage configurations.
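The planning step the abstract describes can be illustrated with a toy sketch. This is a hypothetical illustration, not the actual CAST optimizer: the job names, tier names, and profiled numbers below are invented, and the real system's model is richer. The idea shown is the core one, though: given offline-profiled (runtime, cost) estimates for each job on each candidate storage service, pick the cheapest placement that still meets a tenant-specified runtime goal.

```python
# Toy placement planner (hypothetical sketch, not CAST itself): choose, for each
# job, a storage tier so that total cost is minimized subject to a runtime budget.
from itertools import product

# Offline-profiled estimates: job -> {tier: (runtime_seconds, dollar_cost)}.
# All values here are invented for illustration.
PROFILE = {
    "sort": {"local-ssd": (120, 4.0), "local-hdd": (200, 2.5), "object": (260, 1.0)},
    "grep": {"local-ssd": (60, 2.0),  "local-hdd": (90, 1.2),  "object": (110, 0.5)},
}

def plan(profile, runtime_budget):
    """Exhaustively search tier assignments; return the cheapest plan
    (assignment, total_runtime, total_cost) within the runtime budget."""
    jobs = list(profile)
    best = None
    for tiers in product(*(profile[j] for j in jobs)):
        runtime = sum(profile[j][t][0] for j, t in zip(jobs, tiers))
        cost = sum(profile[j][t][1] for j, t in zip(jobs, tiers))
        if runtime <= runtime_budget and (best is None or cost < best[2]):
            best = (dict(zip(jobs, tiers)), runtime, cost)
    return best

assignment, runtime, cost = plan(PROFILE, runtime_budget=330)
print(assignment, runtime, cost)
```

Exhaustive search is only viable for a handful of jobs; with many jobs and tiers the search space grows exponentially, which is why a real tiering planner would rely on a heuristic or approximate optimization instead.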



Published In

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
June 2015
296 pages
ISBN: 9781450335508
DOI: 10.1145/2749246

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. big data analytics
  2. cloud computing
  3. mapreduce
  4. storage tiering

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

HPDC'15

Acceptance Rates

HPDC '15 Paper Acceptance Rate: 19 of 116 submissions (16%)
Overall Acceptance Rate: 166 of 966 submissions (17%)

Cited By

  • (2024) CAMS: A Cost-Aware Migration Scheme for Cloud Object Storage Systems. 2024 International Conference on Networking, Architecture and Storage (NAS), pp. 1-4, 9 Nov 2024. DOI: 10.1109/NAS63802.2024.10781371
  • (2024) To Store or Not to Store: A Graph Theoretical Approach for Dataset Versioning. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 479-493, 27 May 2024. DOI: 10.1109/IPDPS57955.2024.00049
  • (2024) Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures. Future Generation Computer Systems, 150:171-185, Jan 2024. DOI: 10.1016/j.future.2023.08.022
  • (2024) Orchestration Extensions for Interference- and Heterogeneity-Aware Placement for Data-Analytics. International Journal of Parallel Programming, 52(4):298-323, 28 May 2024. DOI: 10.1007/s10766-024-00771-2
  • (2023) InfiniStore: Elastic Serverless Cloud Storage. Proceedings of the VLDB Endowment, 16(7):1629-1642, 1 Mar 2023. DOI: 10.14778/3587136.3587139
  • (2023) RLTiering: A Cost-Driven Auto-Tiering System for Two-Tier Cloud Storage Using Deep Reinforcement Learning. IEEE Transactions on Parallel and Distributed Systems, 34(2):501-518, Feb 2023. DOI: 10.1109/TPDS.2022.3224865
  • (2023) Towards Optimizing Storage Costs on the Cloud. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 2919-2932, Apr 2023. DOI: 10.1109/ICDE55515.2023.00223
  • (2022) HintStor: A Framework to Study I/O Hints in Heterogeneous Storage. ACM Transactions on Storage, 18(2):1-24, 10 Mar 2022. DOI: 10.1145/3489143
  • (2022) MPEC: Distributed Matrix Multiplication Performance Modeling on a Scale-Out Cloud Environment for Data Mining Jobs. IEEE Transactions on Cloud Computing, 10(1):521-538, Jan 2022. DOI: 10.1109/TCC.2019.2950400
  • (2021) Keep Hot or Go Cold: A Randomized Online Migration Algorithm for Cost Optimization in STaaS Clouds. IEEE Transactions on Network and Service Management, 18(4):4563-4575, Dec 2021. DOI: 10.1109/TNSM.2021.3096533
