Using Dynamic Task Level Redundancy for OpenMP Fault Tolerance

  • Conference paper
Architecture of Computing Systems – ARCS 2012 (ARCS 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7179)

Abstract

Building fault-tolerant applications and systems is one of today’s most important research topics. Fault tolerance is becoming increasingly essential for shared-memory parallel programs and multi/many-core architectures, owing to shrinking transistor sizes and the growing number of failures. Very few techniques for fault-tolerant OpenMP programs have been studied so far. The existing work relies on checkpoint and recovery or on static thread-level redundancy. However, these approaches may suffer from scalability issues when the number of cores increases or when the workload is unbalanced. To overcome these issues, we present in this paper a dynamic task-level redundancy technique for fault-tolerant OpenMP applications. Our method dynamically applies Triple Modular Redundancy to OpenMP tasks through a dedicated runtime and uses majority voting to guarantee correct results. We evaluated this flexible fault-tolerant OpenMP approach for performance and fault coverage; it showed small overhead with good error detection and recovery rates.
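The general idea of Triple Modular Redundancy with majority voting can be illustrated with a minimal sketch. It is not the authors' OpenMP runtime, which is not shown on this page; it uses a Python thread pool in place of OpenMP tasks, and the names run_with_tmr, majority_vote, and FlakySquare are hypothetical helpers introduced only for this illustration. The sketch also assumes that task results are hashable and can be compared for equality.

```python
import itertools
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def majority_vote(results):
    """Return the value that at least two of the replicas agree on."""
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        # All replicas disagree: the error is detected but cannot be masked.
        raise RuntimeError("no majority among replicas")
    return value


def run_with_tmr(task, *args, pool):
    """Submit three replicas of `task` to `pool` and vote on their results."""
    futures = [pool.submit(task, *args) for _ in range(3)]
    return majority_vote([f.result() for f in futures])


class FlakySquare:
    """Hypothetical task: squares its input, but corrupts the first call.

    This stands in for a transient fault hitting one replica.
    """

    def __init__(self):
        self._calls = itertools.count()
        self._lock = threading.Lock()

    def __call__(self, x):
        with self._lock:
            call_index = next(self._calls)
        corruption = 1 if call_index == 0 else 0
        return x * x + corruption


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=3) as pool:
        # One replica returns a wrong value; the vote masks it.
        print(run_with_tmr(FlakySquare(), 7, pool=pool))
```

With three replicas, a single faulty result is outvoted by the two correct ones. If all three replicas disagree, the voter can only report the error, which is why the sketch raises an exception in that case.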

Editor information

Andreas Herkersdorf, Kay Römer, Uwe Brinkschulte

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tahan, O., Shawky, M. (2012). Using Dynamic Task Level Redundancy for OpenMP Fault Tolerance. In: Herkersdorf, A., Römer, K., Brinkschulte, U. (eds) Architecture of Computing Systems – ARCS 2012. ARCS 2012. Lecture Notes in Computer Science, vol 7179. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28293-5_3

  • DOI: https://doi.org/10.1007/978-3-642-28293-5_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28292-8

  • Online ISBN: 978-3-642-28293-5

  • eBook Packages: Computer Science, Computer Science (R0)
