Kinesis: A new approach to replica placement in distributed storage systems

Authors:

John MacCormick,

Nicholas Murphy,

Venugopalan Ramasubramanian,

Lidong Zhou

ACM Transactions on Storage (TOS), Volume 4, Issue 4

Article No.: 11, Pages 1 - 28

https://doi.org/10.1145/1480439.1480440

Published: 09 February 2009

Abstract

Kinesis is a novel data placement model for distributed storage systems. It exemplifies three design principles: structure (division of servers into a few failure-isolated segments), freedom of choice (freedom to allocate the best servers to store and retrieve data based on current resource availability), and scattered distribution (independent, pseudo-random spread of replicas in the system). These design principles enable storage systems to achieve balanced utilization of storage and network resources in the presence of incremental system expansions, failures of single and shared components, and skewed distributions of data size and popularity. In turn, this ability leads to significantly reduced resource provisioning costs, good user-perceived response times, and fast, parallelized recovery from independent and correlated failures.
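As an illustration only, and not the authors' implementation, the three principles can be sketched in a few lines of Python. The sketch assumes that each segment contributes one hash-selected candidate server per object and that replicas go to the least-utilized candidates. The names `candidate_servers`, `place_replicas`, and `used_bytes` are hypothetical.

```python
import hashlib


def candidate_servers(key, segments):
    """Return one pseudo-randomly chosen candidate server per segment.

    `segments` is a list of server-name lists; each inner list stands for one
    failure-isolated segment (structure). Hashing the key together with the
    segment index gives an independent choice per segment (scattered
    distribution). This hashing scheme is an assumption for illustration.
    """
    candidates = []
    for seg_index, servers in enumerate(segments):
        digest = hashlib.sha256(f"{seg_index}:{key}".encode()).digest()
        slot = int.from_bytes(digest[:8], "big") % len(servers)
        candidates.append(servers[slot])
    return candidates


def place_replicas(key, size, segments, used_bytes, replicas):
    """Store `replicas` copies on the least-utilized candidates.

    Because every candidate comes from a different segment, the chosen
    replicas never share a segment (freedom of choice within the structure).
    """
    candidates = candidate_servers(key, segments)
    chosen = sorted(candidates, key=lambda s: used_bytes[s])[:replicas]
    for server in chosen:
        used_bytes[server] += size
    return chosen


if __name__ == "__main__":
    segments = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
    used_bytes = {s: 0 for seg in segments for s in seg}
    print(place_replicas("object-1", 10, segments, used_bytes, replicas=2))
```

With three segments and two replicas, each object has three candidate locations and uses the two emptiest, which is the kind of slack that lets a placement scheme absorb skewed object sizes.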

This article validates Kinesis through theoretical analysis, simulations, and experiments on a prototype implementation. Evaluations driven by real-world traces show that Kinesis can significantly outperform the widely used Chain replica-placement strategy in terms of resource requirements, end-to-end delay, and failure recovery.



Index Terms

Kinesis: A new approach to replica placement in distributed storage systems
1. Information systems
  1. Information retrieval
    1. Search engine architectures and scalability
      1. Distributed retrieval
      2. Peer-to-peer retrieval
  2. Information storage systems
    1. Storage architectures
      1. Distributed storage


Published In

ACM Transactions on Storage Volume 4, Issue 4

January 2009

116 pages

ISSN:1553-3077

EISSN:1553-3093

DOI:10.1145/1480439

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 February 2009

Accepted: 01 May 2008

Revised: 01 May 2008

Received: 01 February 2008

Qualifiers

Research-article
Research
Refereed
