Abstract
The rapid development of accelerator architectures and applications has made heterogeneous computing the norm in high-performance computing. The cost of moving large volumes of data to accelerators is a major bottleneck, both for application performance and for developer productivity. Memory management remains a manual task performed tediously by expert programmers. In this paper, we develop a compiler analysis to automate memory management for heterogeneous computing. We propose an optimization framework that casts the detection and removal of redundant data movements as a partial redundancy elimination (PRE) problem and applies the lazy code motion technique to optimize these data movements. We chose OpenMP as the underlying parallel programming model and implemented our optimization framework in the LLVM toolchain. We evaluated it on ten benchmarks, obtaining a geometric mean speedup of 2.3\(\times \) and reducing the total bytes transferred between the host and GPU by 50% on average.
Copyright information
© 2020 Springer Nature Switzerland AG
Cite this paper
Barua, P., Zhao, J., Sarkar, V. (2020). OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing. In: Malawski, M., Rzadca, K. (eds) Euro-Par 2020: Parallel Processing. Euro-Par 2020. Lecture Notes in Computer Science(), vol 12247. Springer, Cham. https://doi.org/10.1007/978-3-030-57675-2_13
DOI: https://doi.org/10.1007/978-3-030-57675-2_13
Print ISBN: 978-3-030-57674-5
Online ISBN: 978-3-030-57675-2