Reducing the Time to Tune Parallel Dense Linear Algebra Routines with Partial Execution and Performance Modeling

  • Conference paper
Parallel Processing and Applied Mathematics (PPAM 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7203)

Abstract

We present a modeling framework to accurately predict the time to run dense linear algebra calculations. We report the framework’s accuracy in a number of varied computational environments such as shared memory multicore systems, clusters, and large supercomputing installations with tens of thousands of cores. We also test the accuracy for various algorithms, each of which has different scaling properties and a different tolerance to low-bandwidth/high-latency interconnects. The predictive accuracy is on the order of the measurement accuracy, which makes the method suitable for both dedicated and non-dedicated environments. We also present a practical application of our model to reduce the time required to tune and optimize large parallel runs whose time is dominated by linear algebra computations. We show practical examples of how to apply the methodology to avoid common pitfalls and reduce the influence of measurement errors and inherent performance variability.
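The abstract does not spell out the form of the model, so the following is only a minimal sketch of the general idea of predicting a large run from short ones, not the authors' framework. It assumes that run time follows t(n) = a·n³ + b·n², which is plausible for dense factorizations whose operation count grows as n³, and fits a and b by least squares to timings of small problems. The names `fit_cubic_model` and `predict_time`, and the synthetic timings in the usage example, are hypothetical.

```python
def fit_cubic_model(sizes, times):
    """Least-squares fit of t(n) = a*n**3 + b*n**2 (an assumed model form).

    Solves the 2x2 normal equations directly, so no external library is needed.
    """
    s66 = sum(n**6 for n in sizes)
    s55 = sum(n**5 for n in sizes)
    s44 = sum(n**4 for n in sizes)
    r3 = sum(t * n**3 for n, t in zip(sizes, times))
    r2 = sum(t * n**2 for n, t in zip(sizes, times))
    det = s66 * s44 - s55 * s55
    if det == 0:
        raise ValueError("need at least two distinct problem sizes")
    a = (r3 * s44 - r2 * s55) / det
    b = (s66 * r2 - s55 * r3) / det
    return a, b


def predict_time(model, n):
    """Evaluate the fitted model at problem size n."""
    a, b = model
    return a * n**3 + b * n**2


if __name__ == "__main__":
    # Synthetic timings (seconds) generated from known coefficients,
    # standing in for measurements of short runs.
    sizes = [1000, 2000, 3000, 4000]
    times = [1e-9 * n**3 + 1e-6 * n**2 for n in sizes]
    model = fit_cubic_model(sizes, times)
    print(predict_time(model, 20000))
```

With real measurements the timings would be noisy, so repeating each small run and fitting to a robust summary such as the minimum or median is one way to limit the influence of performance variability.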

This research was supported by DARPA through ORNL subcontract 4000075916 as well as by NSF through award number 1038814. We would also like to thank Patrick Worley from ORNL for facilitating the large-scale runs on Jaguar’s Cray XT4 partition.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luszczek, P., Dongarra, J. (2012). Reducing the Time to Tune Parallel Dense Linear Algebra Routines with Partial Execution and Performance Modeling. In: Wyrzykowski, R., Dongarra, J., Karczewski, K., Waśniewski, J. (eds) Parallel Processing and Applied Mathematics. PPAM 2011. Lecture Notes in Computer Science, vol 7203. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-31464-3_74

  • DOI: https://doi.org/10.1007/978-3-642-31464-3_74

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-31463-6

  • Online ISBN: 978-3-642-31464-3

  • eBook Packages: Computer Science, Computer Science (R0)
