research-article

Effective performance measurement and analysis of multithreaded applications

Authors:
Nathan R. Tallent

Rice University, Houston, TX, USA

Rice University, Houston, TX, USA
View Profile

,
John M. Mellor-Crummey

Rice University, Houston, TX, USA

Rice University, Houston, TX, USA
View Profile

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programmingFebruary 2009Pages 229–240https://doi.org/10.1145/1504176.1504210

Published:14 February 2009Publication History

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 229–240

ABSTRACT

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead -- when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.

References

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. HPCToolkit: Tools for performance analysis of optimized parallel programs. Technical Report TR08-06, Rice University, 2008.Google Scholar
G. Ammons, T. Ball, and J. R. Larus. Exploiting hardware performance counters with flow and context sensitive profiling. In SIGPLAN Conference on Programming Language Design and Implementation, pages 85--96, New York, NY, USA, 1997. ACM Press. Google ScholarDigital Library
T. E. Anderson and E. D. Lazowska. Quartz: a tool for tuning parallel program performance. SIGMETRICS Perform. Eval. Rev., 18(1):115--125, 1990. Google ScholarDigital Library
Apple Computer. Shark. http://developer.apple.com/tools/sharkoptimize.html.Google Scholar
W. Binder. Portable and accurate sampling profiling for Java. Softw. Pract. Exper., 36(6):615--650, 2006. Google ScholarDigital Library
C. P. Breshears. Using Intel Thread Profiler for Win32 threads: Philosophy and theory. http://software.intel.com/en-us/articles/using-intel-thread-profiler-for-win32-.. threads-philosophy-and-theory, August 2007.Google Scholar
D. R. Butenhof. Programming with POSIX threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997. Google ScholarDigital Library
M. E. Crovella and T. J. LeBlanc. Parallel performance using lost cycles analysis. In Supercomputing '94: Proceedings of the 1994 conference on Supercomputing, pages 600--609, Los Alamitos, CA, USA, 1994. IEEE Computer Society Press. Google ScholarDigital Library
S. Eyerman, L. Eeckhout, T. Karkhanis, and J. E. Smith. A performance counter architecture for computing accurate CPI components. SIGPLAN Not., 41(11):175--184, 2006. Google ScholarDigital Library
M. Frigo, C. E. Leiserson, and K. H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 212--223, Montreal, Quebec, Canada, June 1998. Proceedings published ACM SIGPLAN Notices, Vol. 33, No. 5, May, 1998. Google ScholarDigital Library
N. Froyd, J. Mellor-Crummey, and R. Fowler. Low-overhead call path profiling of unmodified, optimized code. In ICS '05: Proceedings of the 19th annual International Conference on Supercomputing, pages 81--90, New York, NY, USA, 2005. ACM Press. Google ScholarDigital Library
R. J. Hall. Call path profiling. In ICSE '92: Proceedings of the 14th international conference on Software engineering, pages 296--306, New York, NY, USA, 1992. ACM Press. Google ScholarDigital Library
Intel Corporation. Intel performance tuning utility. Linked from http://whatif.intel.com/.Google Scholar
Intel Corporation. Intel thread profiler. http://www.intel.com/software/products/tpwin.Google Scholar
Intel Corporation. Intel VTune performance analyzers. http://www.intel.com/software/products/vtune/.Google Scholar
M. Itzkowitz, O. Mazurov, N. Copty, and Y. Lin. An OpenMP runtime API for profiling. http://www.compunity.org/futures/omp-api.html.Google Scholar
D. Levinthal. Execution-based cycle accounting on Intel Core 2 Duo processors. http://www.devx.com/go-parallel/Link/33315.Google Scholar
J. Levon al. OProfile. http://oprofile.sourceforge.net/.Google Scholar
M. Monchiero, R. Canal, and A. Gonzalez. Power/performance/thermal design-space exploration for multicore architectures. IEEE Transactions on Parallel and Distributed Systems, 19(5):666--681, May 2008. Google ScholarDigital Library
D. Mosberger-Tang. libunwind. http://www.nongnu.org/libunwind/.Google Scholar
T. Moseley, D. A. Connors, D. Grunwald, and R. Peri. Identifying potential parallelism via loop-centric profiling. In CF '07: Proceedings of the 4th international conference on Computing frontiers, pages 143--152, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
OpenMP Architecture Review Board. OpenMP application program interface, version 3.0. http://www.openmp.org/mp-documents/spec30.pdf, May 2008.Google Scholar
J. Reinders. Intel Threading Building Blocks. O'Reilly, Sebastopol, CA, 2007. Google ScholarDigital Library
Rice University. HPCToolkit performance tools. http://hpctoolkit.org.Google Scholar
T. Yasue, T. Suganuma, H. Komatsu, and T. Nakatani. An efficient online path profiling framework for Java just-in-time compilers. In PACT '03: Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques, page 148, Washington, DC, USA, 2003. IEEE Computer Society. Google ScholarDigital Library
X. Zhuang, M. J. Serrano, H. W. Cain, and J.-D. Choi. Accurate, efficient, and adaptive calling context profiling. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 263--271, New York, NY, USA, 2006. ACM. Google ScholarDigital Library

Index Terms

Effective performance measurement and analysis of multithreaded applications
1. General and reference
  1. Cross-computing tools and techniques

Recommendations

Effective performance measurement and analysis of multithreaded applications
PPoPP '09

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three ...
Read More
Analyzing lock contention in multithreaded applications
PPoPP '10: Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify ...
Read More
Analyzing lock contention in multithreaded applications
PPoPP '10

Many programs exploit shared-memory parallelism using multithreading. Threaded codes typically use locks to coordinate access to shared data. In many cases, contention for locks reduces parallel efficiency and hurts scalability. Being able to quantify ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2009
322 pages
ISBN:9781605583976
DOI:10.1145/1504176
General Chair:
Daniel Reed
Microsoft Research, USA
,
Program Chair:
Vivek Sarkar
Rice University, USA
ACM SIGPLAN Notices Volume 44, Issue 4
PPoPP '09
April 2009
294 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1594835
Issue’s Table of Contents
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 February 2009
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
call path profiling
hpctoolkit
multithreaded programming models
performance analysis
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate230of1,014submissions,23%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 101
  Total Citations
  View Citations
- 1,207
  Total Downloads
- Downloads (Last 12 months)42
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Effective performance measurement and analysis of multithreaded applications

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming

ABSTRACT

References

Cited By

Index Terms

Recommendations

Effective performance measurement and analysis of multithreaded applications

Analyzing lock contention in multithreaded applications

Analyzing lock contention in multithreaded applications