research-article

Leveraging data deduplication to improve the performance of primary storage systems in the cloud

Authors:

Bo Mao,

Hong Jiang,

Suzhen Wu,

Lei Tian

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

Article No.: 24, Pages 1 - 2

https://doi.org/10.1145/2523616.2525939

Published: 01 October 2013

Abstract

Recent studies have shown that moderate to high data redundancy exists in primary storage systems, such as VM-based, enterprise, and HPC storage systems, which indicates that data deduplication can effectively reduce write traffic and storage space in such environments. However, our experimental studies reveal that applying data deduplication to primary storage systems causes space contention in main memory and data fragmentation on disks. This is partly because deduplication adds significant index memory overhead to the existing system, and partly because a file or block is split into multiple small data chunks that, after deduplication, are often stored in non-sequential locations on disk. This fragmentation can cause a subsequent read operation to issue many disk I/O requests, degrading performance.
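The fragmentation effect can be made concrete with a small sketch. It assumes a simplified disk model in which one I/O request can fetch any run of consecutive block addresses; the function name and the block addresses are illustrative and do not come from the paper.

```python
def count_read_ios(physical_blocks):
    """Count the disk requests needed to read chunks at the given addresses.

    Assumption: a single request can fetch any run of consecutive block
    addresses, so every break in the sequence costs one extra request.
    """
    if not physical_blocks:
        return 0
    ios = 1
    for prev, cur in zip(physical_blocks, physical_blocks[1:]):
        if cur != prev + 1:
            ios += 1
    return ios


# A four-chunk file stored contiguously, and the same file after
# deduplication remapped two chunks to older copies elsewhere on disk.
contiguous = [100, 101, 102, 103]
fragmented = [100, 17, 102, 530]

print(count_read_ios(contiguous))  # 1
print(count_read_ios(fragmented))  # 4
```

Under this model, the same logical read costs four requests instead of one once two of its chunks point at scattered earlier copies.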

Existing primary-storage data deduplication schemes, such as iDedup [1], leverage spatial locality: they deduplicate only large requests and exclude small requests (e.g., 4KB, 8KB or less) because the latter account for only a tiny fraction of the storage capacity requirement [2]. Moreover, these schemes tend to overlook the importance of cache management and manage the index cache and the read cache separately. However, previous workload studies have revealed that small I/O requests dominate in primary storage systems (more than 50%) and are at the root of the system performance bottleneck. Furthermore, accesses in primary storage systems exhibit obvious I/O burstiness. Existing primary-storage data deduplication schemes fail to consider these workload characteristics from a performance perspective. We argue that primary-storage data deduplication schemes should take the workload characteristics of primary storage into account in their design.

To address these two problems and take the primary-storage workload characteristics into consideration, we propose a Performance-Oriented I/O Deduplication approach, POD, to improve the I/O performance of primary storage systems in the Cloud. POD takes a two-pronged approach: a request-based I/O and data deduplication scheme, called Select-Dedupe, aimed at alleviating data fragmentation, and an adaptive memory management scheme, called iCache, aimed at easing main memory contention. More specifically, Select-Dedupe is designed around the dominance of small I/O requests. It deduplicates every write request whose write data is already stored sequentially on disk, including the small write requests that capacity-oriented deduplication schemes would otherwise exclude. For all other write requests, Select-Dedupe does not deduplicate their redundant write data, in order to maintain the performance of subsequent read requests to that data. iCache is designed around I/O burstiness. It dynamically adjusts the split of cache space between the index cache and the read cache according to the workload characteristics, and swaps the corresponding data between memory and back-end storage devices accordingly. During write-intensive bursty periods, iCache enlarges the index cache and shrinks the read cache to detect more redundant write requests, thus improving write performance. During read-intensive bursty periods, it enlarges the read cache to hold more hot read data, thus improving read performance.
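The two policies described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the index is modeled as a hypothetical dictionary from chunk fingerprint to disk block address, SHA-1 is an assumed fingerprint function, and the cache rebalancing rule (a fixed step driven by the read/write mix of the last interval) is an invented heuristic standing in for whatever policy iCache actually uses. Swapping data to back-end storage is not modeled.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    # Assumption: SHA-1 as the chunk fingerprint; the abstract names no hash.
    return hashlib.sha1(chunk).hexdigest()


def select_dedupe(chunks, index):
    """Decide whether a whole write request can be deduplicated.

    `index` is a hypothetical dict: fingerprint -> disk block address.
    Returns True only if every chunk is already stored and the stored
    copies sit at consecutive addresses, so a later read stays sequential.
    """
    addrs = []
    for chunk in chunks:
        addr = index.get(fingerprint(chunk))
        if addr is None:
            return False  # new data: write the request as usual
        addrs.append(addr)
    return all(b == a + 1 for a, b in zip(addrs, addrs[1:]))


def rebalance_index_share(index_pct, writes, reads, step=10, lo=10, hi=90):
    """Return the new percentage of cache memory given to the index cache.

    Hypothetical heuristic: grow the index cache after a write-heavy
    interval, grow the read cache (shrink the index) after a read-heavy one.
    """
    if writes > reads:
        index_pct += step
    elif reads > writes:
        index_pct -= step
    return max(lo, min(hi, index_pct))


index = {fingerprint(b"a"): 40, fingerprint(b"b"): 41, fingerprint(b"c"): 900}
print(select_dedupe([b"a", b"b"], index))  # True: duplicates stored back to back
print(select_dedupe([b"a", b"c"], index))  # False: duplicates are scattered
print(select_dedupe([b"c"], index))        # True: a small single-chunk write
print(rebalance_index_share(50, writes=800, reads=200))  # 60
print(rebalance_index_share(50, writes=100, reads=900))  # 40
```

Note that the second request is written in full even though both chunks are redundant, which trades some capacity savings for sequential reads later.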

The POD prototype is implemented as an embedded module at the block-device level and uses fixed-size chunking. Preliminary trace-driven evaluations on our lightweight POD prototype, using real-world traces, show that POD significantly outperforms iDedup in improving the performance of primary storage systems in the Cloud.
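As a rough illustration of fixed-size chunking, the sketch below splits a write buffer into equal-sized chunks and counts how many repeat an earlier chunk. The 4KB chunk size and the SHA-1 fingerprint are assumptions made for the example; the abstract specifies neither.

```python
import hashlib

CHUNK_SIZE = 4096  # assumption: 4KB chunks, matching a common block size


def fixed_size_chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into consecutive chunks of `size` bytes (last may be short)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def find_duplicates(data: bytes, size: int = CHUNK_SIZE):
    """Return (unique_count, duplicate_count) for the chunks of `data`."""
    seen = set()
    duplicates = 0
    for chunk in fixed_size_chunks(data, size):
        fp = hashlib.sha1(chunk).digest()  # assumed fingerprint function
        if fp in seen:
            duplicates += 1
        else:
            seen.add(fp)
    return len(seen), duplicates


data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
print(find_duplicates(data))  # (2, 1)
```

Fixed-size chunking keeps chunk boundaries aligned with block addresses, which is a natural fit for a module that sits at the block-device level.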

References

[1] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In FAST'12, Feb. 2012.

[2] D. Frey, A. Kermarrec, and K. Kloudas. Probabilistic Deduplication for Cluster-Based Storage Systems. In SOCC'12, Nov. 2012.

Cited By

Y. Dong, B. Chen, Y. Pan, X. Zou, and W. Xia. H2C-Dedup: Reducing I/O and GC Amplification for QLC SSDs from the Deduplication Metadata Perspective. In Proceedings of the 2024 ACM Symposium on Cloud Computing, pages 704-719, Nov. 2024. https://dl.acm.org/doi/10.1145/3698038.3698507

Published In

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

October 2013

427 pages

ISBN: 9781450324281

DOI: 10.1145/2523616

General Chair:
Guy Lohman

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SOCC '13: ACM Symposium on Cloud Computing

October 1 - 3, 2013

Santa Clara, California

Acceptance Rates

SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;

Overall Acceptance Rate 169 of 722 submissions, 23%
