Exploiting application-level similarity to improve SSD cache performance in Hadoop

The Journal of Supercomputing

Abstract

To speed up massive data processing, solid-state drives (SSDs) have been used as a cache in Hadoop systems. However, most existing SSD cache management algorithms ignore the characteristics of upper-level applications. In this paper, we propose a novel SSD cache management algorithm called DSA, which exploits application-level data similarity to improve SSD cache performance in Hadoop. The algorithm takes into account both temporal similarity and user similarity in querying behavior. We evaluate DSA on a small-scale Hadoop cluster. Our experimental results show that it performs much better than well-known algorithms such as LRU and FIFO. We also point out the underlying tradeoff between cache performance and SSD deployment cost, and identify a number of key factors that affect SSD cache performance. Our findings provide useful guidelines on how to integrate SSDs into Hadoop effectively.
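For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of how LRU and FIFO replacement differ on the same access trace. It is not the DSA algorithm, whose details are not given on this page. The function names, the toy trace, and the cache capacity are all hypothetical and chosen only for illustration.

```python
from collections import OrderedDict, deque


def lru_hit_ratio(trace, capacity):
    """Hit ratio of a least-recently-used cache on an access trace."""
    if not trace:
        return 0.0
    cache = OrderedDict()  # keys ordered from least to most recently used
    hits = 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)  # a hit refreshes the block's recency
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[block] = True
    return hits / len(trace)


def fifo_hit_ratio(trace, capacity):
    """Hit ratio of a first-in-first-out cache on an access trace."""
    if not trace:
        return 0.0
    queue = deque()   # insertion order; hits do not change it
    resident = set()  # blocks currently cached
    hits = 0
    for block in trace:
        if block in resident:
            hits += 1
        else:
            if len(queue) >= capacity:
                resident.discard(queue.popleft())  # evict the oldest insert
            queue.append(block)
            resident.add(block)
    return hits / len(trace)


# Hypothetical toy trace: block "a" is requested repeatedly.
trace = ["a", "b", "a", "c", "a", "d", "a", "b"]
print(lru_hit_ratio(trace, 2))   # 0.375
print(fifo_hit_ratio(trace, 2))  # 0.25
```

On this toy trace LRU keeps the frequently requested block resident while FIFO evicts it by insertion age. Neither policy uses any knowledge of the application issuing the requests, which is the gap the abstract says DSA targets.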

Acknowledgments

This work was supported in part by the NSFC under Grant 61272397, the Fundamental Research Funds for the Central Universities under Grant 12LGPY53, the Guangdong Natural Science Funds for Distinguished Young Scholar under Grant S20120011187, the Program for New Century Excellent Talents in University under Grant NCET-11-0542, and the Guangzhou Pearl River Sci. and Tech. Rising Star Project under Grant No. 2011J2200086.

Author information

Corresponding author

Correspondence to Di Wu.

Cite this article

Chen, Z., Luo, W., Wu, D. et al. Exploiting application-level similarity to improve SSD cache performance in Hadoop. J Supercomput 70, 1331–1344 (2014). https://doi.org/10.1007/s11227-014-1230-x

  • DOI: https://doi.org/10.1007/s11227-014-1230-x
