
Derrick: A Three-layer Balancer for Self-managed Continuous Scalability

Published: 19 June 2023

Abstract

Data arrangement determines the capacity, resilience, and performance of a distributed storage system. A scalable self-managed system must place its data efficiently not only during stable operation but also after an expansion, planned downscaling, or device failures. In this article, we present Derrick, a data balancing algorithm addressing these needs, which has been developed for HYDRAstor, a highly scalable commercial storage system. Derrick makes its decisions quickly in the event of failures, but when the device population changes it takes additional time to find a nearly optimal data arrangement and a plan for reaching it. Compared to the balancing algorithms of two other state-of-the-art systems, Derrick provides better capacity utilization, reduced data movement, and improved performance. Moreover, it can be easily adapted to meet custom placement requirements.



Published In

cover image ACM Transactions on Storage
ACM Transactions on Storage  Volume 19, Issue 3
August 2023
233 pages
ISSN:1553-3077
EISSN:1553-3093
DOI:10.1145/3604654

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2023
Online AM: 28 April 2023
Accepted: 17 March 2023
Revised: 30 September 2022
Received: 05 May 2022


Author Tags

  1. Data balancing
  2. distributed storage
  3. capacity utilization

Qualifiers

  • Research-article
