A workload independent energy reduction strategy for D-NUCA caches

The Journal of Supercomputing

Abstract

Wire delays and leakage energy consumption are both growing problems in the design of large on-chip caches built in deep submicron technologies. D-NUCA (Dynamic Non-Uniform Cache Architecture) caches combine aggressive sub-banking with a migration mechanism that reduces the access latency of frequently accessed data, limiting the effect of wire delays on performance. Way Adaptable D-NUCA is a leakage power reduction technique specifically suited to D-NUCA caches. It dynamically varies the portion of the cache that is powered on according to the caching needs of the running workload, but it relies on application-dependent parameters that must be evaluated off-line. This limits its effectiveness in general-purpose, multiprogrammed environments. In this paper, we propose a new power reduction technique for D-NUCA caches that still adapts the powered-on cache area to the needs of the running workload but does not rely on application-dependent parameters. Results show that our proposal saves around 49% of total cache energy consumption in a single-core environment and 44% in a CMP environment. With the addition of a timer, it performs similarly to previously proposed leakage power reduction techniques, and it outperforms them when they are applied in a workload-independent manner.
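To make the two ideas in the abstract concrete (migration toward the closest banks, and powering off far ways that see little use), the following toy Python model may help. It is only an illustrative sketch and not the technique proposed in the paper: the `BankSet` class, the one-block-per-way simplification, the insertion into the farthest active way, and the `min_share` threshold are all assumptions made for this example. A fixed threshold like `min_share` is exactly the kind of tuned parameter the paper aims to avoid.

```python
from collections import Counter


class BankSet:
    """Toy model of one D-NUCA bank set; ways[0] is the way closest to the controller."""

    def __init__(self, num_ways):
        self.ways = [None] * num_ways  # one block tag per way (simplification)
        self.active = num_ways         # ways [0, active) are powered on
        self.hits = Counter()          # hits per way in the current interval

    def access(self, tag):
        """Return True on a hit. A hit promotes the block one way closer."""
        for i in range(self.active):
            if self.ways[i] == tag:
                self.hits[i] += 1
                if i > 0:
                    # Migration: swap with the block in the next closer way.
                    self.ways[i - 1], self.ways[i] = self.ways[i], self.ways[i - 1]
                return True
        # Miss: place the block in the farthest active way (assumed policy).
        self.ways[self.active - 1] = tag
        return False

    def end_interval(self, min_share=0.05):
        """Hypothetical policy: power off the farthest active way when its
        share of this interval's hits is below min_share."""
        total = sum(self.hits.values())
        far = self.active - 1
        if total and self.active > 1 and self.hits[far] / total < min_share:
            self.ways[far] = None  # contents are lost when the way is gated
            self.active -= 1
        self.hits.clear()
```

In this model, a block that is hit repeatedly drifts to way 0, so hits concentrate in the near ways and the far ways become candidates for power gating.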

Author information

Corresponding author

Correspondence to Pierfrancesco Foglia.

About this article

Cite this article

Foglia, P., Comparetti, M. A workload independent energy reduction strategy for D-NUCA caches. J Supercomput 68, 157–182 (2014). https://doi.org/10.1007/s11227-013-1033-5

  • DOI: https://doi.org/10.1007/s11227-013-1033-5
