DOI: 10.1145/2830772.2830804
research-article

Modeling the implications of DRAM failures and protection techniques on datacenter TCO

Published: 05 December 2015

Abstract

Total Cost of Ownership (TCO) is a key optimization metric in datacenter design. This paper proposes, for the first time, a framework for modeling the implications of DRAM failures and DRAM error protection techniques on the TCO of a datacenter. The framework captures the effects and interactions of several key parameters, including the choice of DRAM protection technique (e.g., single- vs. dual-channel Chipkill), device width (x4 or x8), memory size, power, FIT rates for various failure modes, the performance, power, and temperature overheads of a protection technique for a given service, and mixes of collocated services. The usefulness of the framework is demonstrated through several case studies that identify, for each case, the DRAM protection technique with the best TCO. Interestingly, our analysis reveals that none of the three DRAM protection techniques considered is always superior to the others; each is the best choice in some cases. This underlines the need for the proposed framework when making memory protection decisions in datacenter design. As part of this work, we analyze and report performance and power with single-channel and dual-channel Chipkill on real hardware when running a web search benchmark alone and collocated with benchmarks of varying memory intensity. This analysis reveals that the choice of memory protection can have serious performance and TCO ramifications, depending on the memory characteristics of the collocated services. Further analysis reveals that, for the datacenter and services assumed in this study, when using Chipkill protection it can be beneficial for TCO to use DRAM with 100x the failure rate of a baseline DRAM, as long as the cost per DIMM is at least a dollar less than the baseline.
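The final claim of the abstract is a break-even argument: a less reliable DIMM pays off when its price discount exceeds the extra failure-related cost it causes. As a rough illustration only, and not the paper's model, the sketch below uses the standard definition of FIT (failures per 10^9 device-hours) to estimate that break-even discount. Every parameter name and example value here (`uncorrected_fraction`, `cost_per_uncorrected_failure`, 18 devices per DIMM, a 3-year deployment) is a hypothetical placeholder, and the model ignores the performance, power, and temperature effects that the framework accounts for.

```python
HOURS_PER_YEAR = 24 * 365  # leap years ignored for simplicity


def expected_failures(fit_per_device: float, devices_per_dimm: int, years: float) -> float:
    """Expected device failures on one DIMM over `years` of operation.

    FIT is the number of failures per 10^9 device-hours.
    """
    device_hours = devices_per_dimm * years * HOURS_PER_YEAR
    return fit_per_device * device_hours / 1e9


def break_even_discount(base_fit: float, cheap_fit: float, devices_per_dimm: int,
                        years: float, uncorrected_fraction: float,
                        cost_per_uncorrected_failure: float) -> float:
    """Per-DIMM price discount at which the less reliable DIMM matches the baseline.

    Hypothetical assumption: only a fraction of failures escapes the protection
    technique, and each escaped failure costs a fixed amount (repair, downtime).
    """
    extra_failures = (expected_failures(cheap_fit, devices_per_dimm, years)
                      - expected_failures(base_fit, devices_per_dimm, years))
    return extra_failures * uncorrected_fraction * cost_per_uncorrected_failure


if __name__ == "__main__":
    # Placeholder inputs, not values taken from the paper.
    discount = break_even_discount(base_fit=50.0, cheap_fit=5000.0,
                                   devices_per_dimm=18, years=3.0,
                                   uncorrected_fraction=0.01,
                                   cost_per_uncorrected_failure=100.0)
    print(f"Less reliable DIMM breaks even at a discount of ${discount:.2f} per DIMM")
```

In this toy model, a stronger protection technique lowers `uncorrected_fraction`, which shrinks the discount needed to justify less reliable DRAM. This is consistent in spirit with the abstract's observation about Chipkill.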

Published In

MICRO-48: Proceedings of the 48th International Symposium on Microarchitecture
December 2015
787 pages
ISBN:9781450340342
DOI:10.1145/2830772
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. DRAM
  2. co-running services
  3. datacenters
  4. online and offline services
  5. reliability
  6. total cost of ownership

Qualifiers

  • Research-article

Conference

MICRO-48
Acceptance Rates

MICRO-48 Paper Acceptance Rate 61 of 283 submissions, 22%;
Overall Acceptance Rate 484 of 2,242 submissions, 22%

Cited By

  • (2024) Runtime Tests for Memory Error Handlers of In-Memory Key Value Stores Using MemFI. IEICE Transactions on Information and Systems E107.D:11 (1408-1421). DOI: 10.1587/transinf.2024EDP7019. Online publication date: 1-Nov-2024
  • (2023) Structural Coding: A Low-Cost Scheme to Protect CNNs from Large-Granularity Memory Faults. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (1-17). DOI: 10.1145/3581784.3607084. Online publication date: 12-Nov-2023
  • (2023) Redundant Array of Independent Memory Devices. IEEE Computer Architecture Letters 22:2 (181-184). DOI: 10.1109/LCA.2023.3334989. Online publication date: Jul-2023
  • (2023) Review of Memory RAS for Data Centers. IEEE Access 11 (124782-124796). DOI: 10.1109/ACCESS.2023.3329984. Online publication date: 2023
  • (2022) On the Evaluation of the Total-Cost-of-Ownership Trade-Offs in Edge vs Cloud Deployments: A Wireless-Denial-of-Service Case Study. IEEE Transactions on Sustainable Computing 7:2 (334-345). DOI: 10.1109/TSUSC.2019.2894018. Online publication date: 1-Apr-2022
  • (2022) Graceful ECC-uncorrectable Error Handling in the Operating System Kernel. 2022 IEEE 33rd International Symposium on Software Reliability Engineering (ISSRE) (109-120). DOI: 10.1109/ISSRE55969.2022.00021. Online publication date: Oct-2022
  • (2022) Hardening In-memory Key-value Stores against ECC-uncorrectable Memory Errors. 2022 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) (509-521). DOI: 10.1109/DSN53405.2022.00057. Online publication date: Jun-2022
  • (2022) Total Cost of Ownership Perspective of Cloud vs Edge Deployments of IoT Applications. Computing at the EDGE (141-161). DOI: 10.1007/978-3-030-74536-3_6. Online publication date: 20-Sep-2022
  • (2021) Exploring the Tradeoff Between Reliability and Performance in HPC Systems. 2021 IEEE High Performance Extreme Computing Conference (HPEC) (1-7). DOI: 10.1109/HPEC49654.2021.9622853. Online publication date: 20-Sep-2021
  • (2019) Cost Minimization Through Load Balancing and Effective Resource Utilization in Cloud-Based Web Services. International Journal of Natural Computing Research 8:2 (51-74). DOI: 10.4018/IJNCR.2019040103. Online publication date: Apr-2019
