Abstract
The rapid development of accelerator architectures and applications has made heterogeneous computing the norm in high-performance computing. The cost of moving large volumes of data to accelerators is a major bottleneck, both for application performance and for developer productivity. Memory management remains a manual task performed tediously by expert programmers. In this paper, we develop a compiler analysis to automate memory management for heterogeneous computing. We propose an optimization framework that casts the detection and removal of redundant data movements as a partial redundancy elimination (PRE) problem and applies the lazy code motion technique to optimize these data movements. We chose OpenMP as the underlying parallel programming model and implemented our optimization framework in the LLVM toolchain. We evaluated it on ten benchmarks, obtaining a geometric mean speedup of 2.3\(\times \) and reducing the total bytes transferred between the host and GPU by 50% on average.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Barua, P., Zhao, J., Sarkar, V. (2020). OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science(), vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_13
DOI: https://doi.org/10.1007/978-3-030-57675-2_13
Print ISBN: 978-3-030-57674-5
Online ISBN: 978-3-030-57675-2