A fault-tolerant architecture for parallel applications in tiled-CMPs

The Journal of Supercomputing

Abstract

Hardware reliability is now considered a first-class issue alongside performance and energy efficiency. Continued technology scaling and the accompanying supply-voltage reductions, together with temperature fluctuations, increase the susceptibility of architectures to errors.

With the development of CMPs (chip multiprocessors), interest in parallel applications has increased. Previous proposals for fault detection and recovery have mainly relied on redundant execution across different cores. RMT (Redundant Multi-Threading) is a family of techniques based on SMT (Simultaneous Multi-Threading) processors in which two independent threads (master and slave), fed with the same inputs, redundantly execute the same instructions so that faults can be detected by comparing their outputs. In this paper, we study the under-explored architectural support that RMT techniques need to reliably execute shared-memory applications in tiled-CMPs.
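The detection principle behind RMT, two executions fed with the same inputs whose outputs are compared, can be illustrated with a small software sketch. This is a toy model only, not the REPAS hardware mechanism: the function name `run_redundant`, the `fault_in_slave_at` parameter, and the single-bit flip used to emulate a transient fault are all assumptions introduced for illustration.

```python
from typing import Callable, Iterable, Optional


def run_redundant(
    compute: Callable[[int], int],
    inputs: Iterable[int],
    fault_in_slave_at: Optional[int] = None,
) -> Optional[int]:
    """Execute `compute` twice per input and compare the two results.

    Returns the index of the first step whose outputs disagree,
    or None if every pair of outputs matches.
    `fault_in_slave_at` (hypothetical) injects a single-bit flip into
    the slave's result at that step to emulate a transient fault.
    """
    for step, value in enumerate(inputs):
        master_out = compute(value)  # master execution
        slave_out = compute(value)   # redundant (slave) execution, same input

        if step == fault_in_slave_at:
            slave_out ^= 1           # emulate a soft error: flip the lowest bit

        if master_out != slave_out:  # output comparison detects the fault
            return step
    return None


def square_plus_one(x: int) -> int:
    return x * x + 1


# Fault-free run: no mismatch is reported.
clean = run_redundant(square_plus_one, [1, 2, 3])

# Faulty run: the flipped bit at step 1 is caught by the comparison.
faulty = run_redundant(square_plus_one, [1, 2, 3], fault_in_slave_at=1)
```

The sketch covers detection only. Recovery, and the handling of shared-memory accesses that makes parallel applications harder to protect, are outside what it models.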

First, we show how atomic operations induce serialization points between master and slave threads, increasing execution time by 35% for several parallel scientific and multimedia benchmarks. To address this issue, we introduce REPAS (Reliable Execution of Parallel ApplicationS in tiled-CMPs), a novel RMT mechanism that provides reliable execution of shared-memory applications in environments prone to transient faults. The REPAS architecture needs little extra hardware, since the redundant execution is performed within 2-way SMT cores in which most of the hardware is shared. Experimental results show that, relative to a non-redundant system, REPAS provides fault tolerance against soft errors with a lower execution-time overhead than previous proposals (around 25%, including the cost of redundancy) while using fewer hardware resources. Additionally, we show that REPAS tolerates high fault ratios with negligible impact on performance (less than 2% for a fault ratio of 100 faults per million cycles).


Author information

Corresponding author

Correspondence to Daniel Sánchez.

About this article

Cite this article

Sánchez, D., Aragón, J.L. & García, J.M. A fault-tolerant architecture for parallel applications in tiled-CMPs. J Supercomput 61, 997–1023 (2012). https://doi.org/10.1007/s11227-011-0670-9

  • DOI: https://doi.org/10.1007/s11227-011-0670-9
