Analysis of Task Offloading for Accelerators

  • Conference paper
High Performance Embedded Architectures and Compilers (HiPEAC 2010)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5952)

Abstract

In response to the emergence of heterogeneous multicore and accelerator-based architectures, we have proposed syntactic extensions to C, in the form of pragmas based on OpenMP, that make it easier for programmers to offload parts of their applications to auxiliary processors. Offloaded tasks can be made more profitable using a simple blocking strategy, and the runtime system is used to better overlap computation and communication while moving data to and from the accelerators.

To demonstrate the feasibility and usefulness of our proposal, we target the IBM Cell architecture. We evaluate the performance of the whole system using HPCC STREAM Triad and several NAS benchmarks, and present a detailed performance breakdown at the level of parallel regions. We also classify the parallel regions according to their suitability for execution on accelerators. Overall, our performance is better than that obtained with the IBM compiler for the Cell processor.
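
As an illustration only, the blocking strategy and the overlap of computation with communication mentioned in the abstract can be sketched in plain C on the Triad kernel, a[i] = b[i] + q * c[i]. This is not the pragma syntax or the runtime proposed in the paper, neither of which is shown on this page. The names triad_blocked, dma_get, dma_put and the block size BLOCK are hypothetical, and the transfers are modeled with a synchronous memcpy, so the sketch shows the double-buffered structure rather than real overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical block size in elements, chosen so that all buffers would
   fit in a small accelerator-local memory. */
#define BLOCK 256

/* Stand-ins for asynchronous transfers (assumption: a real accelerator
   runtime would start the transfer and return at once; here we simply
   copy synchronously). */
static void dma_get(double *local, const double *global, size_t n) {
    memcpy(local, global, n * sizeof *local);
}

static void dma_put(double *global, const double *local, size_t n) {
    memcpy(global, local, n * sizeof *local);
}

/* Triad computed block by block with two sets of local buffers:
   while block k is being computed, block k+1 is already being fetched. */
void triad_blocked(double *a, const double *b, const double *c,
                   double q, size_t n) {
    double lb[2][BLOCK], lc[2][BLOCK], la[2][BLOCK];
    size_t nblocks = (n + BLOCK - 1) / BLOCK;

    assert(a != NULL && b != NULL && c != NULL);
    if (nblocks == 0)
        return;

    /* Fetch the first block before entering the loop. */
    size_t first = n < BLOCK ? n : BLOCK;
    dma_get(lb[0], b, first);
    dma_get(lc[0], c, first);

    for (size_t k = 0; k < nblocks; k++) {
        int cur = (int)(k & 1);   /* buffers holding block k   */
        int nxt = cur ^ 1;        /* buffers for block k + 1   */
        size_t off = k * BLOCK;
        size_t len = (n - off < BLOCK) ? n - off : BLOCK;

        /* Issue the fetch of the next block into the other buffers. */
        if (k + 1 < nblocks) {
            size_t noff = off + BLOCK;
            size_t nlen = (n - noff < BLOCK) ? n - noff : BLOCK;
            dma_get(lb[nxt], b + noff, nlen);
            dma_get(lc[nxt], c + noff, nlen);
        }

        /* Compute on the current block while the next one is in flight. */
        for (size_t i = 0; i < len; i++)
            la[cur][i] = lb[cur][i] + q * lc[cur][i];

        /* Write the finished block back to main memory. */
        dma_put(a + off, la[cur], len);
    }
}
```

Calling triad_blocked(a, b, c, q, n) on three arrays of n doubles gives the same result as the plain loop. With truly asynchronous transfers, the fetch of block k+1 would proceed while block k is computed, which is where the overlap would come from.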

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ferrer, R., Beltran, V., Gonzàlez, M., Martorell, X., Ayguadé, E. (2010). Analysis of Task Offloading for Accelerators. In: Patt, Y.N., Foglia, P., Duesterwald, E., Faraboschi, P., Martorell, X. (eds) High Performance Embedded Architectures and Compilers. HiPEAC 2010. Lecture Notes in Computer Science, vol 5952. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11515-8_24

  • DOI: https://doi.org/10.1007/978-3-642-11515-8_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-11514-1

  • Online ISBN: 978-3-642-11515-8

  • eBook Packages: Computer Science, Computer Science (R0)
