
(When) Do Multiple Passes Save Energy?

  • Conference paper
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13227)


Abstract

Energy cost remains a significant barrier on all modern computing platforms. The common wisdom has been to focus on speed alone through heuristics like “race-to-sleep,” a strategy based on the observation that the time-dependent components of total energy tend to dominate. Among the speed-optimal implementations or transformations of a program, however, there is still a range of choices that can further reduce energy. One of them is to execute a program with “multiple passes,” which reduces data accesses while retaining speed optimality and has been shown to be effective for stencil computations on CPUs. Building on that prior success, we attempt to extend the strategy to a suite of computational kernels on both CPU and GPU platforms. We find that the approach does not appear to generalize well, owing to practical limitations of the hardware on the systems we studied. Despite this negative result, we characterize what it would take for multiple passes to be profitable and use that analysis to explain why it appears to be out of reach on current systems.
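Although this page carries only the abstract, the trade-off it describes, fewer trips to main memory in exchange for some redundant work while keeping the operation count essentially unchanged, is easiest to picture on a stencil. The sketch below is a rough illustration of one way a “multiple-pass” execution can be realized for a 1-D Jacobi stencil (ghost-zone temporal blocking); the kernel, the constants BT and TILE, and the blocking scheme are our own assumptions for illustration, not the authors' formulation or measured configuration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N = 1 << 20,   /* problem size (hypothetical) */
       T = 64,        /* total time steps */
       BT = 8,        /* time steps fused per pass over memory */
       TILE = 4096 }; /* spatial tile size, assumed to fit in cache */

/* One 3-point Jacobi update on global indices [lo, hi); ends are copied. */
static void step(const float *in, float *out, int lo, int hi, int n) {
    for (int i = lo; i < hi; i++)
        out[i] = (i == 0 || i == n - 1)
                     ? in[i]
                     : 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

/* Baseline: one full sweep over the whole array per time step. */
static void jacobi_single_pass(float *a, float *b, int n, int steps) {
    for (int t = 0; t < steps; t++) {
        step(a, b, 0, n, n);
        float *tmp = a; a = b; b = tmp;                       /* swap roles */
    }
    if (steps & 1) memcpy(b, a, (size_t)n * sizeof(float));   /* result -> caller's a */
}

/* "Multiple passes": fuse BT time steps over cache-sized tiles with overlapping
 * halos (ghost-zone tiling). The main arrays are traversed roughly steps/BT
 * times instead of steps times, at the cost of redundantly recomputing a halo
 * of up to BT cells on each side of every tile. */
static void jacobi_multi_pass(float *a, float *b, int n, int steps) {
    float *t0 = malloc((TILE + 2 * BT) * sizeof(float));
    float *t1 = malloc((TILE + 2 * BT) * sizeof(float));
    int passes = 0;

    for (int t = 0; t < steps; t += BT, passes++) {
        int bt = (steps - t < BT) ? steps - t : BT;           /* steps fused this pass */
        for (int start = 0; start < n; start += TILE) {
            int end = (start + TILE < n) ? start + TILE : n;
            int lo  = (start - bt > 0) ? start - bt : 0;      /* left halo  */
            int hi  = (end + bt < n) ? end + bt : n;          /* right halo */
            int w   = hi - lo;
            memcpy(t0, a + lo, (size_t)w * sizeof(float));    /* load tile + halo */

            for (int s = 0; s < bt; s++) {                    /* bt in-cache steps */
                for (int i = 0; i < w; i++) {
                    int g = lo + i;                           /* global index */
                    t1[i] = (g == 0 || g == n - 1 || i == 0 || i == w - 1)
                                ? t0[i]
                                : 0.25f * t0[i - 1] + 0.5f * t0[i] + 0.25f * t0[i + 1];
                }
                float *tmp = t0; t0 = t1; t1 = tmp;
            }
            /* write back only the cells this tile owns (they are exact) */
            memcpy(b + start, t0 + (start - lo), (size_t)(end - start) * sizeof(float));
        }
        float *tmp = a; a = b; b = tmp;                       /* pass output -> next input */
    }
    free(t0); free(t1);
    if (passes & 1) memcpy(b, a, (size_t)n * sizeof(float));  /* result -> caller's a */
}

int main(void) {
    float *a1 = malloc(N * sizeof(float)), *b1 = malloc(N * sizeof(float));
    float *a2 = malloc(N * sizeof(float)), *b2 = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) a1[i] = a2[i] = (float)(i % 17);

    jacobi_single_pass(a1, b1, N, T);
    jacobi_multi_pass(a2, b2, N, T);

    float maxdiff = 0.0f;                       /* the two variants should agree */
    for (int i = 0; i < N; i++) {
        float d = a1[i] > a2[i] ? a1[i] - a2[i] : a2[i] - a1[i];
        if (d > maxdiff) maxdiff = d;
    }
    printf("max |single - multi| = %g\n", maxdiff);
    free(a1); free(b1); free(a2); free(b2);
    return 0;
}
```

In this sketch the baseline streams the whole array through memory once per time step, while the blocked variant traverses the main arrays only once per BT steps, recomputing a few halo cells per tile in exchange. Whether trading extra arithmetic for reduced memory traffic in this way actually lowers energy on real CPUs and GPUs is the question the paper examines.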





Author information

Corresponding author

Correspondence to Louis Narmour.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Narmour, L., Yuki, T., Rajopadhye, S. (2022). (When) Do Multiple Passes Save Energy?. In: Orailoglu, A., Jung, M., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2021. Lecture Notes in Computer Science, vol 13227. Springer, Cham. https://doi.org/10.1007/978-3-031-04580-6_30


  • DOI: https://doi.org/10.1007/978-3-031-04580-6_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04579-0

  • Online ISBN: 978-3-031-04580-6

  • eBook Packages: Computer Science, Computer Science (R0)
