Abstract
Orchestrating data transfers between the CPU and a coprocessor manually is cumbersome, particularly for multi-dimensional arrays and other multi-level pointer data structures common in scientific computations. This paper describes a system that includes both compile-time and runtime solutions to this problem, with the overarching goal of improving programmer productivity while maintaining performance.
We find that the standard linearization method performs poorly for non-uniform dimensions on the coprocessor because of redundant data transfers and the suppression of important compiler optimizations such as vectorization. The key contribution of this paper is a novel heap linearization approach, referred to as partial linearization with pointer reset, which avoids modifying memory accesses and thereby preserves vectorization.
We implement partial linearization with pointer reset as the compile-time solution, while the runtime solution is implemented as an enhancement to the MYO library. We evaluate our approach on multiple C benchmarks. Experimental results demonstrate that our best compile-time solution runs 2.5x-5x faster than the original runtime solution, and that CPU-MIC code using it achieves a 1.5x-2.5x speedup over the 16-thread CPU version.
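To illustrate the idea behind partial linearization with pointer reset, the following is a minimal C sketch (hypothetical; the function name and layout are assumptions, not the paper's implementation). A ragged 2D array is copied into one contiguous heap block so it can move between host and device in a single transfer, and the row pointers are then reset to point into that block, so accesses of the form a[i][j] keep their original syntax and the compiler can still vectorize inner loops.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of partial linearization with pointer reset for a ragged
 * 2D array (rows of varying length). One block holds the pointer
 * table followed by all row data, so a single transfer suffices. */
double **linearize_2d(double **a, const int *ncols, int nrows) {
    size_t total = 0;
    for (int i = 0; i < nrows; i++)
        total += (size_t)ncols[i];

    char *block = malloc(nrows * sizeof(double *) + total * sizeof(double));
    double **rows = (double **)block;
    double *data  = (double *)(block + nrows * sizeof(double *));

    for (int i = 0; i < nrows; i++) {
        memcpy(data, a[i], ncols[i] * sizeof(double));
        rows[i] = data;   /* pointer reset: row i now lives inside the block */
        data += ncols[i];
    }
    return rows;          /* rows[i][j] works unchanged on the new layout */
}
```

Because the linearized copy is one allocation, the runtime can move it with a single bulk transfer, while unmodified a[i][j] accesses leave inner loops amenable to vectorization.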
Acknowledgements
We would like to thank Ravindra Ganapathi from Intel for guiding us through the MYO library.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ren, B., Ravi, N., Yang, Y., Feng, M., Agrawal, G., Chakradhar, S. (2016). Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors. In: Shen, X., Mueller, F., Tuck, J. (eds) Languages and Compilers for Parallel Computing. LCPC 2015. Lecture Notes in Computer Science, vol 9519. Springer, Cham. https://doi.org/10.1007/978-3-319-29778-1_11
DOI: https://doi.org/10.1007/978-3-319-29778-1_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29777-4
Online ISBN: 978-3-319-29778-1