Exploiting application-level similarity to improve SSD cache performance in Hadoop

Chen, Zhijian; Luo, Wenhai; Wu, Dan; Huang, Xiang; He, Jian; Zheng, Yuanhuan; Wu, Di

doi:10.1007/s11227-014-1230-x


Published: 19 June 2014

Volume 70, pages 1331–1344, (2014)

The Journal of Supercomputing

Zhijian Chen¹,
Wenhai Luo²,
Dan Wu³,
Xiang Huang¹,
Jian He²,
Yuanhuan Zheng² &
Di Wu²


Abstract

To boost the performance of massive data processing, solid-state drives (SSDs) have been used as a cache in Hadoop. However, most existing SSD cache management algorithms ignore the characteristics of upper-level applications. In this paper, we propose a novel SSD cache management algorithm called DSA, which exploits application-level data similarity to improve SSD cache performance in Hadoop. Our algorithm takes into account both temporal similarity and user similarity in querying behavior. We evaluate the proposed DSA algorithm on a small-scale Hadoop cluster. Our experimental results show that it achieves much better performance than other well-known algorithms (e.g., LRU, FIFO). We also point out the underlying tradeoff between cache performance and SSD deployment cost, and identify a number of key factors that affect SSD cache performance. Our findings provide useful guidelines on how to effectively integrate SSDs into Hadoop.
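
For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of how LRU and FIFO replacement behave on the same access trace. It is an illustration only and does not reproduce the DSA algorithm or the authors' Hadoop setup, neither of which is described on this page. The function names, the block identifiers, and the sample trace are hypothetical.

```python
from collections import OrderedDict, deque


def lru_hit_ratio(trace, capacity):
    """Replay a trace against an LRU cache and return the hit ratio."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # a hit refreshes recency
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return hits / len(trace) if trace else 0.0


def fifo_hit_ratio(trace, capacity):
    """Replay a trace against a FIFO cache and return the hit ratio."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    cache = set()
    order = deque()  # insertion order; hits do not change it
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
        else:
            if len(cache) >= capacity:
                cache.discard(order.popleft())  # evict the oldest inserted block
            cache.add(block)
            order.append(block)
    return hits / len(trace) if trace else 0.0


if __name__ == "__main__":
    # Hypothetical trace in which block "a" is re-read often.
    trace = ["a", "b", "a", "c", "a", "b"]
    print("LRU :", lru_hit_ratio(trace, capacity=2))
    print("FIFO:", fifo_hit_ratio(trace, capacity=2))
```

On this toy trace LRU keeps the frequently re-read block cached while FIFO evicts it by insertion age. Neither policy uses any knowledge of which users or queries produced the accesses, which is the kind of application-level information the abstract says existing algorithms overlook.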



Acknowledgments

This work was supported in part by the NSFC under Grant 61272397, the Fundamental Research Funds for the Central Universities under Grant 12LGPY53, the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S20120011187, the Program for New Century Excellent Talents in University under Grant NCET-11-0542, and the Guangzhou Pearl River Sci. and Tech. Rising Star Project under Grant No. 2011J2200086.

Author information

Authors and Affiliations

Network and Information Branch, Guangdong Electric Power Design Institute, Guangzhou, China
Zhijian Chen & Xiang Huang
Department of Computer Science, Sun Yat-sen University, Guangzhou, China
Wenhai Luo, Jian He, Yuanhuan Zheng & Di Wu
China Southern Power Grid Co., Ltd., Guangzhou, China
Dan Wu

Corresponding author

Correspondence to Di Wu.

About this article

Cite this article

Chen, Z., Luo, W., Wu, D. et al. Exploiting application-level similarity to improve SSD cache performance in Hadoop. J Supercomput 70, 1331–1344 (2014). https://doi.org/10.1007/s11227-014-1230-x

Published: 19 June 2014
Issue Date: December 2014
DOI: https://doi.org/10.1007/s11227-014-1230-x
