DOI: 10.1145/2523616.2525939
research-article

Leveraging data deduplication to improve the performance of primary storage systems in the cloud

Published: 01 October 2013

Abstract

Recent studies have shown that moderate to high data redundancy exists in primary storage systems, such as VM-based, enterprise, and HPC storage systems, which indicates that data deduplication can effectively reduce write traffic and storage space in such environments. However, our experimental studies reveal that applying data deduplication to primary storage systems causes space contention in main memory and data fragmentation on disks. This is partly because deduplication adds significant index memory overhead to the existing system, and partly because, after deduplication, a file or block is split into multiple small data chunks that are often stored in non-sequential locations on disk. This fragmentation can cause a subsequent read operation to issue many disk I/O requests, degrading performance.
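The read-side cost of fragmentation can be illustrated with a small sketch. It is not taken from the paper: the function name `count_read_ios`, the block addresses, and the assumption that physically adjacent chunks are served by one merged disk request are all illustrative.

```python
def count_read_ios(chunk_addresses):
    """Count the disk requests needed to read a file's chunks in order.

    chunk_addresses: physical block address of each chunk, in file order.
    Assumption (illustrative): chunks at consecutive addresses are merged
    into a single disk request; any gap forces a new request.
    """
    if not chunk_addresses:
        return 0
    ios = 1
    for prev, cur in zip(chunk_addresses, chunk_addresses[1:]):
        if cur != prev + 1:  # next chunk is not physically adjacent
            ios += 1
    return ios


# A four-chunk file written without deduplication: one contiguous extent.
sequential = [100, 101, 102, 103]
# The same file after two of its chunks were deduplicated against older
# copies stored elsewhere on disk (hypothetical addresses).
fragmented = [100, 57, 102, 9]

print(count_read_ios(sequential))
print(count_read_ios(fragmented))
```

Under these assumptions the contiguous layout needs a single request, while the deduplicated layout needs one request per chunk.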
Existing primary-storage deduplication schemes, such as iDedup [1], leverage spatial locality: they deduplicate only large requests and exclude small requests (e.g., 4KB, 8KB or less), because the latter account for only a tiny fraction of the storage capacity requirement [2]. Moreover, these schemes tend to overlook the importance of cache management and manage the index cache and the read cache separately. However, previous workload studies have revealed that small I/O requests dominate in primary storage systems (more than 50%) and are at the root of the system performance bottleneck. Furthermore, accesses in primary storage systems exhibit obvious I/O burstiness. Existing primary-storage deduplication schemes fail to consider these workload characteristics from a performance perspective. We argue that primary-storage deduplication schemes should take the workload characteristics of primary storage into account in their design.
To address these two problems and take primary-storage workload characteristics into account, we propose a Performance-Oriented I/O Deduplication approach, POD, to improve the I/O performance of primary storage systems in the Cloud. POD takes a two-pronged approach: a request-based I/O and data deduplication scheme, called Select-Dedupe, aimed at alleviating data fragmentation, and an adaptive memory management scheme, called iCache, aimed at easing main memory contention. More specifically, Select-Dedupe is designed around the dominance of small I/O requests. It deduplicates every write request whose write data is already stored sequentially on disk, including the small write requests that capacity-oriented deduplication schemes would otherwise exclude. For other write requests, Select-Dedupe does not deduplicate their redundant write data, in order to maintain the performance of subsequent read requests to that data. iCache is designed around I/O burstiness. It dynamically adjusts the split of cache space between the index cache and the read cache according to the workload characteristics, and swaps the corresponding data between memory and back-end storage devices accordingly. During write-intensive bursty periods, iCache enlarges the index cache and shrinks the read cache to detect more redundant write requests, thus improving write performance. During read-intensive bursty periods, it enlarges the read cache to hold more hot read data, thus improving read performance.
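The two mechanisms can be sketched in a few lines, reading only what the abstract states. This is a simplified illustration, not the authors' implementation: the names `select_dedupe` and `ICache`, the dictionary-based index, and in particular the burst-detection rule (comparing write and read counts per interval and moving the boundary by a fixed step) are assumptions, since the abstract does not say how iCache detects bursts or sizes its adjustments.

```python
def select_dedupe(fingerprints, index):
    """Decide whether one write request is deduplicated as a whole.

    fingerprints: chunk fingerprints of the request, in order.
    index: dict mapping fingerprint -> physical block address (assumed layout).
    Returns True only if every chunk is already stored and the stored copies
    are sequential on disk; otherwise the request is written normally.
    """
    addrs = [index.get(fp) for fp in fingerprints]
    if not addrs or any(a is None for a in addrs):
        return False  # at least one chunk is new data
    return all(b == a + 1 for a, b in zip(addrs, addrs[1:]))


class ICache:
    """Fixed memory budget split between an index cache and a read cache."""

    def __init__(self, total, index_size, step=1, min_size=1):
        self.total = total          # total cache budget (arbitrary units)
        self.index_size = index_size
        self.step = step            # hypothetical adjustment granularity
        self.min_size = min_size    # neither cache shrinks below this

    @property
    def read_size(self):
        return self.total - self.index_size

    def adapt(self, writes, reads):
        """Shift the boundary after one observation interval (assumed rule)."""
        if writes > reads:          # write-intensive burst: grow index cache
            target = self.index_size + self.step
        elif reads > writes:        # read-intensive burst: grow read cache
            target = self.index_size - self.step
        else:
            return
        upper = self.total - self.min_size
        self.index_size = max(self.min_size, min(upper, target))


index = {"a": 10, "b": 11, "c": 40}
print(select_dedupe(["a", "b"], index))  # stored sequentially
print(select_dedupe(["a", "c"], index))  # redundant but scattered
print(select_dedupe(["c"], index))       # small single-chunk write

cache = ICache(total=100, index_size=50, step=10, min_size=10)
cache.adapt(writes=90, reads=10)
print(cache.index_size, cache.read_size)
```

In this sketch a single-chunk request is trivially sequential, which is how small redundant writes end up deduplicated, while a redundant request whose stored copies are scattered is written again so that later reads stay sequential.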
The POD prototype is implemented as an embedded module at the block-device level and uses fixed-size chunking. Preliminary evaluations of our lightweight POD prototype, driven by real traces, show that POD significantly outperforms iDedup in improving the performance of primary storage systems in the Cloud.
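Fixed-size chunking itself is simple to sketch. The example below is illustrative only: the 4 KB chunk size and the use of SHA-1 as the fingerprint function are assumptions, as the abstract names neither.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed chunk size; the abstract does not state one


def fixed_size_chunks(data, chunk_size=CHUNK_SIZE):
    """Split a byte string into consecutive chunks of chunk_size bytes.

    The last chunk may be shorter if len(data) is not a multiple.
    """
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprints(data, chunk_size=CHUNK_SIZE):
    """Fingerprint each chunk; SHA-1 is an illustrative choice of hash."""
    return [hashlib.sha1(chunk).hexdigest()
            for chunk in fixed_size_chunks(data, chunk_size)]


# Three 4 KB blocks where the first and third carry identical content.
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
fps = fingerprints(data)
print(len(fps), fps[0] == fps[2], fps[0] == fps[1])
```

Because chunk boundaries fall at fixed offsets, identical blocks written at aligned positions produce identical fingerprints, which is what a block-level index lookup relies on.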

References

[1] K. Srinivasan, T. Bisson, G. Goodson, and K. Voruganti. iDedup: Latency-aware, Inline Data Deduplication for Primary Storage. In FAST'12, Feb. 2012.
[2] D. Frey, A. Kermarrec, and K. Kloudas. Probabilistic Deduplication for Cluster-Based Storage Systems. In SOCC'12, Nov. 2012.

Cited By

  • (2024) H2C-Dedup: Reducing I/O and GC Amplification for QLC SSDs from the Deduplication Metadata Perspective. Proceedings of the 2024 ACM Symposium on Cloud Computing, pp. 704-719. DOI: 10.1145/3698038.3698507. Online publication date: 20-Nov-2024

Published In

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
October 2013
427 pages
ISBN:9781450324281
DOI:10.1145/2523616
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cloud
  2. data deduplication
  3. performance

Qualifiers

  • Research-article

Conference

SOCC '13: ACM Symposium on Cloud Computing
October 1-3, 2013
Santa Clara, California

Acceptance Rates

SOCC '13 Paper Acceptance Rate 23 of 114 submissions, 20%;
Overall Acceptance Rate 169 of 722 submissions, 23%
