Abstract
To address forthcoming heterogeneous multicore and accelerator-based architectures, we have proposed syntactic extensions to C in the form of pragmas, based on OpenMP, that make it easier for programmers to offload parts of their applications to the auxiliary processors. Offloaded tasks can be made more profitable using a simple blocking strategy, and the runtime system is used to better support the overlap of computation and communication while moving data to and from accelerators.
To demonstrate the feasibility and usefulness of our proposal, we target the IBM Cell architecture. We evaluate the performance of the whole system using HPCC STREAM Triad and several NAS benchmarks, and present a detailed performance breakdown at the level of parallel regions. We also classify the parallel regions according to their suitability for execution on accelerators. Overall, our results are better than those obtained with the IBM compiler for the Cell processor.
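As a rough illustration of the blocking and overlap ideas mentioned above, the following C sketch computes a STREAM-style Triad (a[i] = b[i] + scalar * c[i]) block by block with two alternating input buffers. It is only a sketch under stated assumptions: it does not use the pragma syntax or the runtime system proposed in the paper, the names triad_blocked, fetch_block, store_block and MAX_BLOCK are hypothetical, and plain memcpy calls stand in for the asynchronous DMA transfers that an accelerator such as the Cell would issue.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity of one local buffer, in elements. */
#define MAX_BLOCK 1024

/* Stand-in for an asynchronous transfer into accelerator-local memory.
   On real hardware this would be started early and waited on later;
   here it is a synchronous memcpy (an assumption for illustration). */
static void fetch_block(float *local, const float *global, size_t n)
{
    memcpy(local, global, n * sizeof *local);
}

/* Stand-in for writing a finished block back to main memory. */
static void store_block(float *global, const float *local, size_t n)
{
    memcpy(global, local, n * sizeof *global);
}

/* Triad a[i] = b[i] + scalar * c[i], processed in blocks of at most
   `block` elements, with double-buffered inputs. */
void triad_blocked(float *a, const float *b, const float *c,
                   float scalar, size_t n, size_t block)
{
    float buf_b[2][MAX_BLOCK];
    float buf_c[2][MAX_BLOCK];
    float buf_a[MAX_BLOCK];

    if (n == 0)
        return;
    if (block == 0 || block > MAX_BLOCK)
        block = MAX_BLOCK;              /* clamp to the buffer capacity */

    size_t start = 0;
    size_t len = (n < block) ? n : block;
    int cur = 0;

    /* Bring in the first block before the loop starts. */
    fetch_block(buf_b[cur], b, len);
    fetch_block(buf_c[cur], c, len);

    while (len > 0) {
        size_t next_start = start + len;
        size_t remaining = n - next_start;
        size_t next_len = (remaining < block) ? remaining : block;
        int nxt = 1 - cur;

        /* Request the next block into the other buffer. With a real
           asynchronous transfer, this would overlap with the compute
           loop below. */
        if (next_len > 0) {
            fetch_block(buf_b[nxt], b + next_start, next_len);
            fetch_block(buf_c[nxt], c + next_start, next_len);
        }

        /* Compute on the block that is already local. */
        for (size_t i = 0; i < len; i++)
            buf_a[i] = buf_b[cur][i] + scalar * buf_c[cur][i];

        store_block(a + start, buf_a, len);

        start = next_start;
        len = next_len;
        cur = nxt;
    }
}
```

A caller would invoke, for example, triad_blocked(a, b, c, 3.0f, n, 256). The block size is the tuning knob in such a scheme: larger blocks amortize the per-transfer cost, while smaller ones leave room in local memory for the second buffer that makes the overlap possible.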
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ferrer, R., Beltran, V., Gonzàlez, M., Martorell, X., Ayguadé, E. (2010). Analysis of Task Offloading for Accelerators. In: Patt, Y.N., Foglia, P., Duesterwald, E., Faraboschi, P., Martorell, X. (eds) High Performance Embedded Architectures and Compilers. HiPEAC 2010. Lecture Notes in Computer Science, vol 5952. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-11515-8_24
DOI: https://doi.org/10.1007/978-3-642-11515-8_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-11514-1
Online ISBN: 978-3-642-11515-8
eBook Packages: Computer Science, Computer Science (R0)