Compile Time Modeling of Off-Chip Memory Bandwidth for Parallel Loops

  • Conference paper
Languages and Compilers for Parallel Computing (LCPC 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8664)

Abstract

In this paper, we present a statistical model to predict the off-chip memory bandwidth required by a parallel loop during its execution. It is a compile-time modeling technique that derives the correlations between memory bandwidth requirement and data access patterns of multithreaded applications. This model could be used by the compiler and performance tools to predict when the sustainable memory bandwidth of the system will be reached by the application during execution, and to determine an optimal number of threads that should be configured to execute a specific parallel loop according to its memory reference patterns. Awareness of the performance impact of oversubscribed memory bandwidth can also help programmers to take into account the additional latency caused by the contention, and to minimize the overhead by tuning the memory access behavior of applications. We evaluated this model in terms of both technical accuracy and prediction accuracy by comparing the modeling results with the measured results. The evaluation demonstrates its accuracy in both system bandwidth modeling and application bandwidth modeling.
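The thread-count decision described in the abstract can be illustrated with a deliberately simple sketch. This is not the paper's statistical model, whose details are not given on this page. It assumes, purely for illustration, that a loop's bandwidth demand scales linearly with the number of threads; the function name and all parameters are hypothetical.

```python
import math


def max_threads_before_saturation(per_thread_gbps: float,
                                  sustainable_gbps: float,
                                  max_threads: int) -> int:
    """Return the largest thread count whose aggregate bandwidth demand
    stays within the sustainable memory bandwidth of the system.

    Assumption (illustrative only): demand grows linearly with the
    number of threads, so n threads need n * per_thread_gbps.
    """
    if per_thread_gbps <= 0:
        # A loop with no off-chip traffic is never bandwidth-bound.
        return max_threads
    fit = math.floor(sustainable_gbps / per_thread_gbps)
    # Always run at least one thread, and never exceed the available cores.
    return max(1, min(max_threads, fit))


# Hypothetical numbers: 2.5 GB/s per thread, 20 GB/s sustainable, 16 cores.
print(max_threads_before_saturation(2.5, 20.0, 16))
```

A real predictor would replace the linear assumption with a per-loop estimate derived from the loop's memory reference patterns, and would take the sustainable bandwidth from a measurement of the target system.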

Acknowledgement

This work was supported in part by the National Science Foundation's Computer Systems Research program under Award No. CCF-0833201 and by the Department of Energy under Award Agreement No. DE-FC02-12ER26099. The evaluation platform used for this work was supported by the National Science Foundation's Computer Systems Research program under Award No. CNS-0833201 and CRI-0958464.

Author information

Corresponding author

Correspondence to Munara Tolubaeva.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Tolubaeva, M., Yan, Y., Chapman, B. (2014). Compile Time Modeling of Off-Chip Memory Bandwidth for Parallel Loops. In: Cașcaval, C., Montesinos, P. (eds) Languages and Compilers for Parallel Computing. LCPC 2013. Lecture Notes in Computer Science, vol 8664. Springer, Cham. https://doi.org/10.1007/978-3-319-09967-5_17

  • DOI: https://doi.org/10.1007/978-3-319-09967-5_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09966-8

  • Online ISBN: 978-3-319-09967-5

  • eBook Packages: Computer Science, Computer Science (R0)
