
(When) Do Multiple Passes Save Energy?

  • Conference paper
Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS 2021)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13227)


Abstract

Energy cost remains a significant barrier on all modern computing platforms. The common wisdom has been to focus on speed alone through heuristics like “race-to-sleep,” a strategy based on the observation that the time-dependent components of total energy tend to dominate. Among the speed-optimal implementations or transformations of a program, however, there is still a range of choices that can further reduce energy. One of them is to execute a program with “multiple passes,” which reduces data accesses while retaining speed optimality and has been shown to be effective for stencil computations on CPUs. Building on that prior success, we attempt to extend the strategy to a suite of computational kernels on both CPU and GPU platforms. We find that the approach does not appear to generalize well, owing to practical limitations of the hardware on the systems we studied. Despite this negative result, we characterize what it would take for multiple passes to be profitable and use that analysis to explain why it appears to be out of reach on current systems.
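Although this page carries only the abstract, the trade-off it describes, fewer trips to main memory in exchange for some redundant work while keeping the operation count essentially unchanged, is easiest to picture on a stencil. The sketch below is a rough illustration of one way a “multiple-pass” execution can be realized for a 1-D Jacobi stencil (ghost-zone temporal blocking); the kernel, the constants BT and TILE, and the blocking scheme are our own assumptions for illustration, not the authors' formulation or measured configuration.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { N = 1 << 20,   /* problem size (hypothetical) */
       T = 64,        /* total time steps */
       BT = 8,        /* time steps fused per pass over memory */
       TILE = 4096 }; /* spatial tile size, assumed to fit in cache */

/* One 3-point Jacobi update on global indices [lo, hi); ends are copied. */
static void step(const float *in, float *out, int lo, int hi, int n) {
    for (int i = lo; i < hi; i++)
        out[i] = (i == 0 || i == n - 1)
                     ? in[i]
                     : 0.25f * in[i - 1] + 0.5f * in[i] + 0.25f * in[i + 1];
}

/* Baseline: one full sweep over the whole array per time step. */
static void jacobi_single_pass(float *a, float *b, int n, int steps) {
    for (int t = 0; t < steps; t++) {
        step(a, b, 0, n, n);
        float *tmp = a; a = b; b = tmp;                       /* swap roles */
    }
    if (steps & 1) memcpy(b, a, (size_t)n * sizeof(float));   /* result -> caller's a */
}

/* "Multiple passes": fuse BT time steps over cache-sized tiles with overlapping
 * halos (ghost-zone tiling). The main arrays are traversed roughly steps/BT
 * times instead of steps times, at the cost of redundantly recomputing a halo
 * of up to BT cells on each side of every tile. */
static void jacobi_multi_pass(float *a, float *b, int n, int steps) {
    float *t0 = malloc((TILE + 2 * BT) * sizeof(float));
    float *t1 = malloc((TILE + 2 * BT) * sizeof(float));
    int passes = 0;

    for (int t = 0; t < steps; t += BT, passes++) {
        int bt = (steps - t < BT) ? steps - t : BT;           /* steps fused this pass */
        for (int start = 0; start < n; start += TILE) {
            int end = (start + TILE < n) ? start + TILE : n;
            int lo  = (start - bt > 0) ? start - bt : 0;      /* left halo  */
            int hi  = (end + bt < n) ? end + bt : n;          /* right halo */
            int w   = hi - lo;
            memcpy(t0, a + lo, (size_t)w * sizeof(float));    /* load tile + halo */

            for (int s = 0; s < bt; s++) {                    /* bt in-cache steps */
                for (int i = 0; i < w; i++) {
                    int g = lo + i;                           /* global index */
                    t1[i] = (g == 0 || g == n - 1 || i == 0 || i == w - 1)
                                ? t0[i]
                                : 0.25f * t0[i - 1] + 0.5f * t0[i] + 0.25f * t0[i + 1];
                }
                float *tmp = t0; t0 = t1; t1 = tmp;
            }
            /* write back only the cells this tile owns (they are exact) */
            memcpy(b + start, t0 + (start - lo), (size_t)(end - start) * sizeof(float));
        }
        float *tmp = a; a = b; b = tmp;                       /* pass output -> next input */
    }
    free(t0); free(t1);
    if (passes & 1) memcpy(b, a, (size_t)n * sizeof(float));  /* result -> caller's a */
}

int main(void) {
    float *a1 = malloc(N * sizeof(float)), *b1 = malloc(N * sizeof(float));
    float *a2 = malloc(N * sizeof(float)), *b2 = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) a1[i] = a2[i] = (float)(i % 17);

    jacobi_single_pass(a1, b1, N, T);
    jacobi_multi_pass(a2, b2, N, T);

    float maxdiff = 0.0f;                       /* the two variants should agree */
    for (int i = 0; i < N; i++) {
        float d = a1[i] > a2[i] ? a1[i] - a2[i] : a2[i] - a1[i];
        if (d > maxdiff) maxdiff = d;
    }
    printf("max |single - multi| = %g\n", maxdiff);
    free(a1); free(b1); free(a2); free(b2);
    return 0;
}
```

In this sketch the baseline streams the whole array through memory once per time step, while the blocked variant traverses the main arrays only once per BT steps, recomputing a few halo cells per tile in exchange. Whether trading extra arithmetic for reduced memory traffic in this way actually lowers energy on real CPUs and GPUs is the question the paper examines.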





Author information

Corresponding author

Correspondence to Louis Narmour.


Copyright information

© 2022 Springer Nature Switzerland AG

About this paper


Cite this paper

Narmour, L., Yuki, T., Rajopadhye, S. (2022). (When) Do Multiple Passes Save Energy?. In: Orailoglu, A., Jung, M., Reichenbach, M. (eds) Embedded Computer Systems: Architectures, Modeling, and Simulation. SAMOS 2021. Lecture Notes in Computer Science, vol 13227. Springer, Cham. https://doi.org/10.1007/978-3-031-04580-6_30


  • DOI: https://doi.org/10.1007/978-3-031-04580-6_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-04579-0

  • Online ISBN: 978-3-031-04580-6

  • eBook Packages: Computer Science, Computer Science (R0)
