
Read-Performance Optimization for Deduplication-Based Storage Systems in the Cloud

Published: 01 March 2014

Abstract

Data deduplication has been demonstrated to be an effective technique for reducing both the total data transferred over the network and the storage space consumed in cloud backup, archiving, and primary storage systems, such as VM (virtual machine) platforms. However, the performance of restore operations from a deduplicated backup can be significantly lower than that without deduplication. The main reason is that, after deduplication, a file or block is split into multiple small data chunks that are often located on different disks, so a subsequent read operation can invoke many disk I/Os involving multiple disks, which degrades read performance significantly. While this problem has been by and large ignored in the literature thus far, we argue that the time is ripe to pay significant attention to it in light of emerging cloud storage applications and the increasing popularity of the VM platform in the cloud. This is because, in a cloud storage or VM environment, a simple read request on the client side may translate into a restore operation if the data to be read, or a VM suspended by the user, was previously deduplicated when written to the cloud or the VM storage server, a likely scenario considering the network bandwidth and storage capacity concerns in such an environment.
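The fragmentation effect described above can be illustrated with a small sketch. It is not taken from the article: the recipe layout (a list of `(disk, offset, length)` tuples) and the function name `count_read_ios` are assumptions made for illustration, and the model simply counts one disk I/O for every run of chunks that is physically contiguous on the same disk.

```python
from typing import List, Tuple

# Hypothetical chunk location: (disk id, byte offset on that disk, length in bytes).
ChunkLocation = Tuple[int, int, int]


def count_read_ios(recipe: List[ChunkLocation]) -> int:
    """Count the disk I/Os needed to read a file back in logical order.

    Simplified model: consecutive chunks that sit back to back on the same
    disk are served by one sequential I/O; any other transition costs a new I/O.
    """
    ios = 0
    prev = None
    for disk, offset, length in recipe:
        contiguous = (
            prev is not None
            and prev[0] == disk
            and prev[1] + prev[2] == offset
        )
        if not contiguous:
            ios += 1
        prev = (disk, offset, length)
    return ios


# A 16 KB file stored without deduplication: four adjacent 4 KB chunks on disk 0.
plain_layout = [(0, 0, 4096), (0, 4096, 4096), (0, 8192, 4096), (0, 12288, 4096)]

# The same file after deduplication: two chunks now point at copies that were
# already stored elsewhere, so the logical order jumps between disks.
dedup_layout = [(0, 0, 4096), (2, 40960, 4096), (0, 4096, 4096), (1, 8192, 4096)]

plain_ios = count_read_ios(plain_layout)
dedup_ios = count_read_ios(dedup_layout)
```

Under this model the contiguous layout is read with a single I/O, while the deduplicated layout needs one I/O per chunk, which is the read amplification that the abstract attributes to deduplication.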
To address this problem, in this article we propose SAR, an SSD (solid-state drive)-Assisted Read scheme that exploits the high random-read performance of SSDs and the unique data-sharing characteristic of deduplication-based storage systems by storing in SSDs the unique data chunks with high reference counts, small sizes, and nonsequential characteristics. In this way, many read requests to HDDs are replaced by read requests to SSDs, thus significantly improving the read performance of deduplication-based storage systems in the cloud. Extensive trace-driven and VM restore evaluations on a prototype implementation of SAR show that SAR significantly outperforms traditional deduplication-based and flash-based cache schemes in terms of average response time.
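The selection idea in the paragraph above can be sketched as follows. This is only an illustrative reading of the abstract, not the authors' algorithm: the `Chunk` fields, the thresholds `min_refs` and `max_size`, the greedy ordering, and the capacity handling are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Chunk:
    """Hypothetical metadata for one unique chunk in a deduplicated store."""
    fingerprint: str
    size: int          # chunk size in bytes
    ref_count: int     # how many files or blocks reference this chunk
    sequential: bool   # True if it lies contiguously with its neighbors on HDD


def select_for_ssd(
    chunks: Iterable[Chunk],
    min_refs: int,
    max_size: int,
    ssd_capacity: int,
) -> List[Chunk]:
    """Pick chunks to place on SSD: highly referenced, small, and nonsequential.

    Assumed policy: filter by the three properties named in the abstract, then
    greedily fill the SSD, preferring higher reference counts and smaller sizes.
    """
    candidates = [
        c for c in chunks
        if c.ref_count >= min_refs and c.size <= max_size and not c.sequential
    ]
    candidates.sort(key=lambda c: (-c.ref_count, c.size))

    selected: List[Chunk] = []
    used = 0
    for c in candidates:
        if used + c.size <= ssd_capacity:
            selected.append(c)
            used += c.size
    return selected


catalog = [
    Chunk("a", 4096, 10, False),   # shared, small, scattered: good candidate
    Chunk("b", 4096, 1, False),    # referenced only once
    Chunk("c", 65536, 20, False),  # too large under the assumed threshold
    Chunk("d", 2048, 5, True),     # sequential on HDD, so HDD reads are cheap
    Chunk("e", 8192, 3, False),    # shared, small, scattered: good candidate
]

on_ssd = [c.fingerprint for c in select_for_ssd(catalog, 2, 8192, 16384)]
on_small_ssd = [c.fingerprint for c in select_for_ssd(catalog, 2, 8192, 8192)]
```

With the larger assumed capacity, chunks `a` and `e` are chosen; with only 8192 bytes available, `a` is chosen and `e` no longer fits. Reads of the selected chunks would then be served from SSD instead of incurring random HDD I/Os.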



Published In

ACM Transactions on Storage, Volume 10, Issue 2
March 2014
86 pages
ISSN: 1553-3077
EISSN: 1553-3093
DOI: 10.1145/2600090
Editor: Darrell Long
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 March 2014
Accepted: 01 July 2013
Revised: 01 June 2013
Received: 01 December 2012
Published in TOS Volume 10, Issue 2


Author Tags

  1. Storage systems
  2. data deduplication
  3. read performance
  4. solid-state drive
  5. virtual machine

Qualifiers

  • Research-article
  • Research
  • Refereed


