DOI:10.1145/1250662.1250720
Article

Configurable isolation: building high availability systems with commodity multi-core processors

Published: 09 June 2007

Abstract

High availability is an increasingly important requirement for enterprise systems, often valued more than performance. Systems designed for high availability typically use redundant hardware for error detection and for continued uptime in the event of a failure. Chip multiprocessors, with their abundance of identical resources such as cores, caches, and interconnection networks, would appear to be ideal building blocks for implementing high availability solutions on chip. However, doing so poses significant challenges with respect to error containment and faulty component replacement, and the rising silicon and transient fault rates expected with future technology scaling exacerbate the problem. This paper proposes a novel chip multiprocessor architecture for high availability systems built from future multi-core processors. The architecture provides configurable isolation for fault containment and component retirement, based upon cost-effective modifications to commodity designs. The design is evaluated with a state-of-the-art industrial fault model, and the proposed architecture is shown to provide effective fault isolation and graceful degradation even when the failure rate is high.

Published In

ISCA '07: Proceedings of the 34th annual international symposium on Computer architecture
June 2007
542 pages
ISBN:9781595937063
DOI:10.1145/1250662
  • General Chair: Dean Tullsen
  • Program Chair: Brad Calder
  • ACM SIGARCH Computer Architecture News, Volume 35, Issue 2
    May 2007
    527 pages
    ISSN:0163-5964
    DOI:10.1145/1273440
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. chip multiprocessors
  2. fault isolation
  3. high availability

Qualifiers

  • Article

Conference

SPAA07
Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Cited By

  • (2024) Homogeneous and Heterogeneous Multicore Systems. International Journal of Innovative Science and Research Technology (IJISRT), pp. 141-149. DOI:10.38124/ijisrt/IJISRT24MAY458. Online publication date: 15-May-2024
  • (2021) ParaDox: Eliminating Voltage Margins via Heterogeneous Fault Tolerance. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 520-532. DOI:10.1109/HPCA51647.2021.00051. Online publication date: Feb-2021
  • (2021) Remaining useful life prediction in embedded systems using an online auto-updated machine learning based modeling. Microelectronics Reliability, vol. 119, article 114071. DOI:10.1016/j.microrel.2021.114071. Online publication date: Apr-2021
  • (2020) Incremental Modeling and Monitoring of Embedded CPU-GPU Chips. Processes, 8:6, article 678. DOI:10.3390/pr8060678. Online publication date: 9-Jun-2020
  • (2019) Incorporating Core-to-Core Correlation to Improve Partially Good Yield Models. IEEE Transactions on Semiconductor Manufacturing, 32:4, pp. 538-543. DOI:10.1109/TSM.2019.2940835. Online publication date: Nov-2019
  • (2019) Phantasy. IEEE Transactions on Computers, 68:2, pp. 225-238. DOI:10.1109/TC.2018.2865943. Online publication date: 1-Feb-2019
  • (2019) Reliable flight control system architecture for agile airborne platforms: an asymmetric multiprocessing approach. The Aeronautical Journal, pp. 1-23. DOI:10.1017/aer.2019.30. Online publication date: 3-Jun-2019
  • (2018) Error correlation prediction in lockstep processors for safety-critical systems. Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, pp. 737-748. DOI:10.1109/MICRO.2018.00065. Online publication date: 20-Oct-2018
  • (2018) Modelling processor reliability using LLVM compiler fault injection. 2018 IEEE Aerospace Conference, pp. 1-10. DOI:10.1109/AERO.2018.8396489. Online publication date: Mar-2018
  • (2017) A resilient scheduler for dataflow execution. 2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), pp. 1-4. DOI:10.1109/DFT.2017.8244460. Online publication date: Oct-2017
