Abstract
To boost the performance of massive data processing, solid-state drives (SSDs) have been used as a cache in the Hadoop system. However, most existing SSD cache management algorithms ignore the characteristics of upper-level applications. In this paper, we propose a novel SSD cache management algorithm called DSA, which exploits application-level data similarity to improve SSD cache performance in Hadoop. Our algorithm takes into account both temporal similarity and user similarity in querying behavior. We evaluate the effectiveness of DSA on a small-scale Hadoop cluster. Our experimental results show that DSA achieves much better performance than other well-known algorithms (e.g., LRU, FIFO). We also point out the underlying tradeoff between cache performance and SSD deployment cost, and identify a number of key factors that affect SSD cache performance. Our findings provide useful guidelines on how to effectively integrate SSDs into Hadoop.
Acknowledgments
This work was supported in part by the NSFC under Grant 61272397, the Fundamental Research Funds for the Central Universities under Grant 12LGPY53, the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S20120011187, the Program for New Century Excellent Talents in University under Grant NCET-11-0542, and the Guangzhou Pearl River Sci. and Tech. Rising Star Project under Grant No. 2011J2200086.
Cite this article
Chen, Z., Luo, W., Wu, D. et al. Exploiting application-level similarity to improve SSD cache performance in Hadoop. J Supercomput 70, 1331–1344 (2014). https://doi.org/10.1007/s11227-014-1230-x
DOI: https://doi.org/10.1007/s11227-014-1230-x