
CAST: Tiering Storage for Data Analytics in the Cloud

Published: 15 June 2015

Abstract

Enterprises are increasingly moving their big data analytics to the cloud with the goal of reducing costs without sacrificing application performance. Cloud service providers offer their tenants a myriad of storage options which, while flexible, make the choice of storage deployment nontrivial. Crafting deployment scenarios that leverage these choices in a cost-effective manner, under the unique pricing models and multi-tenancy dynamics of the cloud environment, is a key challenge in designing cloud-based data analytics frameworks.
In this paper, we propose CAST, a Cloud Analytics Storage Tiering solution that cloud tenants can use to reduce monetary cost and improve the performance of analytics workloads. The approach takes a first step towards providing storage tiering support for data analytics in the cloud. CAST performs offline workload profiling to construct job performance prediction models for different cloud storage services, and combines these models with workload specifications and high-level tenant goals to generate a cost-effective data placement and storage provisioning plan. Furthermore, we build CAST++, which enhances CAST's optimization model by incorporating the data reuse patterns and cross-job interdependencies common in realistic analytics workloads. Tests with production workload traces from Facebook and a 400-core Google Cloud-based Hadoop cluster demonstrate that CAST++ achieves a 1.21x performance improvement and reduces deployment costs by 51.4% compared to local storage configurations.
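The planning step the abstract describes can be illustrated with a toy sketch. This is a hypothetical illustration, not the actual CAST optimizer: the job names, tier names, and profiled numbers below are invented, and the real system's model is richer. The idea shown is the core one, though: given offline-profiled (runtime, cost) estimates for each job on each candidate storage service, pick the cheapest placement that still meets a tenant-specified runtime goal.

```python
# Toy placement planner (hypothetical sketch, not CAST itself): choose, for each
# job, a storage tier so that total cost is minimized subject to a runtime budget.
from itertools import product

# Offline-profiled estimates: job -> {tier: (runtime_seconds, dollar_cost)}.
# All values here are invented for illustration.
PROFILE = {
    "sort": {"local-ssd": (120, 4.0), "local-hdd": (200, 2.5), "object": (260, 1.0)},
    "grep": {"local-ssd": (60, 2.0),  "local-hdd": (90, 1.2),  "object": (110, 0.5)},
}

def plan(profile, runtime_budget):
    """Exhaustively search tier assignments; return the cheapest plan
    (assignment, total_runtime, total_cost) within the runtime budget."""
    jobs = list(profile)
    best = None
    for tiers in product(*(profile[j] for j in jobs)):
        runtime = sum(profile[j][t][0] for j, t in zip(jobs, tiers))
        cost = sum(profile[j][t][1] for j, t in zip(jobs, tiers))
        if runtime <= runtime_budget and (best is None or cost < best[2]):
            best = (dict(zip(jobs, tiers)), runtime, cost)
    return best

assignment, runtime, cost = plan(PROFILE, runtime_budget=330)
print(assignment, runtime, cost)
```

Exhaustive search is only viable for a handful of jobs; with many jobs and tiers the search space grows exponentially, which is why a real tiering planner would rely on a heuristic or approximate optimization instead.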



Published In

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
June 2015
296 pages
ISBN: 9781450335508
DOI: 10.1145/2749246

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. big data analytics
  2. cloud computing
  3. mapreduce
  4. storage tiering

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

HPDC'15

Acceptance Rates

HPDC '15 Paper Acceptance Rate: 19 of 116 submissions (16%)
Overall Acceptance Rate: 166 of 966 submissions (17%)

Cited By

  • (2024) CAMS: A Cost-Aware Migration Scheme for Cloud Object Storage Systems. 2024 International Conference on Networking, Architecture and Storage (NAS), pp. 1-4, 9 Nov 2024. DOI: 10.1109/NAS63802.2024.10781371
  • (2024) To Store or Not to Store: A Graph Theoretical Approach for Dataset Versioning. 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 479-493, 27 May 2024. DOI: 10.1109/IPDPS57955.2024.00049
  • (2024) Lotaru: Locally Predicting Workflow Task Runtimes for Resource Management on Heterogeneous Infrastructures. Future Generation Computer Systems, 150:171-185, Jan 2024. DOI: 10.1016/j.future.2023.08.022
  • (2024) Orchestration Extensions for Interference- and Heterogeneity-Aware Placement for Data-Analytics. International Journal of Parallel Programming, 52(4):298-323, 28 May 2024. DOI: 10.1007/s10766-024-00771-2
  • (2023) InfiniStore: Elastic Serverless Cloud Storage. Proceedings of the VLDB Endowment, 16(7):1629-1642, 1 Mar 2023. DOI: 10.14778/3587136.3587139
  • (2023) RLTiering: A Cost-Driven Auto-Tiering System for Two-Tier Cloud Storage Using Deep Reinforcement Learning. IEEE Transactions on Parallel and Distributed Systems, 34(2):501-518, Feb 2023. DOI: 10.1109/TPDS.2022.3224865
  • (2023) Towards Optimizing Storage Costs on the Cloud. 2023 IEEE 39th International Conference on Data Engineering (ICDE), pp. 2919-2932, Apr 2023. DOI: 10.1109/ICDE55515.2023.00223
  • (2022) HintStor: A Framework to Study I/O Hints in Heterogeneous Storage. ACM Transactions on Storage, 18(2):1-24, 10 Mar 2022. DOI: 10.1145/3489143
  • (2022) MPEC: Distributed Matrix Multiplication Performance Modeling on a Scale-Out Cloud Environment for Data Mining Jobs. IEEE Transactions on Cloud Computing, 10(1):521-538, Jan 2022. DOI: 10.1109/TCC.2019.2950400
  • (2021) Keep Hot or Go Cold: A Randomized Online Migration Algorithm for Cost Optimization in STaaS Clouds. IEEE Transactions on Network and Service Management, 18(4):4563-4575, Dec 2021. DOI: 10.1109/TNSM.2021.3096533
