DOI: 10.1145/3127479.3131623
research-article

Latency reduction and load balancing in coded storage systems

Published: 24 September 2017

Abstract

Erasure coding has been used in storage systems to enhance data durability at a lower storage overhead. However, these systems suffer from long access latency tails due to a lack of flexible load balancing mechanisms and passively launched degraded reads when the original storage node of the requested data becomes a hotspot. We provide a new perspective on load balancing in coded storage systems by proactively and intelligently launching degraded reads, and we propose a variety of schemes to make optimal decisions either per request or across requests statistically. Experiments on a 98-machine cluster based on the request traces of 12 million objects collected from Windows Azure Storage (WAS) show that our schemes can reduce the median latency by 44.7% and the 95th-percentile tail latency by 77.8% in coded storage systems.
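
The idea of proactively launching a degraded read can be illustrated with a minimal sketch. It is not the paper's scheme: the toy code with three data chunks and one XOR parity chunk, the queue-length cost model, and the names `xor_bytes`, `choose_read`, and `read_chunk` are all assumptions made for illustration. A request for a chunk is served either by its original node (normal read) or by reconstructing the chunk from the other nodes (degraded read), whichever looks cheaper under the assumed cost model.

```python
from functools import reduce

def xor_bytes(blocks):
    """XOR equal-length byte strings together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Toy layout (an assumption): three data chunks plus one XOR parity chunk,
# stored on nodes 0..3, one chunk per node.
data = [b"aaaa", b"bbbb", b"cccc"]
chunks = data + [xor_bytes(data)]  # node 3 holds the parity chunk

def choose_read(target, queue_len):
    """Pick a normal or a degraded read for the chunk on node `target`.

    Hypothetical cost model: a normal read costs the target node's queue
    length; a degraded read waits for the slowest of the other nodes.
    """
    others = [n for n in range(len(queue_len)) if n != target]
    normal_cost = queue_len[target]
    degraded_cost = max(queue_len[n] for n in others)
    if degraded_cost < normal_cost:
        return "degraded", others
    return "normal", [target]

def read_chunk(target, queue_len):
    """Return (mode, bytes) for the requested chunk."""
    mode, nodes = choose_read(target, queue_len)
    if mode == "normal":
        return mode, chunks[target]
    # Degraded read: XOR of all other chunks reconstructs the target chunk.
    return mode, xor_bytes([chunks[n] for n in nodes])

if __name__ == "__main__":
    print(read_chunk(0, [9, 1, 2, 1]))  # node 0 is a hotspot
    print(read_chunk(0, [1, 5, 5, 5]))  # node 0 is lightly loaded
```

In this sketch a degraded read touches three nodes instead of one, so it trades extra total work for a shorter wait at a hotspot. That trade-off is why the decision needs to be made carefully rather than applied to every request.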

Information

Published In

SoCC '17: Proceedings of the 2017 Symposium on Cloud Computing
September 2017
672 pages
ISBN:9781450350280
DOI:10.1145/3127479
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. erasure coded system
  2. load balancing
  3. optimization
  4. tail latency reduction

Qualifiers

  • Research-article

Conference

SoCC '17: ACM Symposium on Cloud Computing
September 24 - 27, 2017
Santa Clara, California

Acceptance Rates

Overall Acceptance Rate 169 of 722 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) A Survey of the Past, Present, and Future of Erasure Coding for Storage Systems. ACM Transactions on Storage 21:1 (1-39). DOI: 10.1145/3708994. Online publication date: 8-Jan-2025
  • (2024) Flash-oriented Coded Storage: Research Status and Future Directions. ACM Transactions on Storage 21:1 (1-37). DOI: 10.1145/3708995. Online publication date: 19-Dec-2024
  • (2024) HSM: A Hybrid Storage Method Based on the Heat of Data and Global Disk Space Utilization. IEEE Access 12 (48630-48639). DOI: 10.1109/ACCESS.2024.3382987. Online publication date: 2024
  • (2024) Olsync: Object-level tiering and coordination in tiered storage systems based on software-defined network. Future Generation Computer Systems (107521). DOI: 10.1016/j.future.2024.107521. Online publication date: Sep-2024
  • (2023) Erasure Codes for Cold Data in Distributed Storage Systems. Applied Sciences 13:4 (2170). DOI: 10.3390/app13042170. Online publication date: 8-Feb-2023
  • (2023) Extending and Programming the NVMe I/O Determinism Interface for Flash Arrays. ACM Transactions on Storage 19:1 (1-33). DOI: 10.1145/3568427. Online publication date: 11-Jan-2023
  • (2023) Sampling-Based Caching for Low Latency in Distributed Coded Storage Systems. IEEE Transactions on Services Computing 16:6 (4275-4287). DOI: 10.1109/TSC.2023.3318315. Online publication date: Nov-2023
  • (2023) Adaptive and Scalable Caching With Erasure Codes in Distributed Cloud-Edge Storage Systems. IEEE Transactions on Cloud Computing 11:2 (1840-1853). DOI: 10.1109/TCC.2022.3168662. Online publication date: 1-Apr-2023
  • (2023) FACHS: Adaptive Hybrid Storage Strategy Based on File Access Characteristics. IEEE Access 11 (16855-16862). DOI: 10.1109/ACCESS.2023.3243098. Online publication date: 2023
  • (2022) Short Tail: taming tail latency for erasure-code-based in-memory systems. Frontiers of Information Technology & Electronic Engineering 23:11 (1646-1657). DOI: 10.1631/FITEE.2100566. Online publication date: 1-Jun-2022