DOI: 10.1145/1504176.1504210
research-article

Effective performance measurement and analysis of multithreaded applications

Published: 14 February 2009

ABSTRACT

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead -- when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
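
The diagnosis rule stated in the abstract (raise concurrency where idleness is high, lower it where overhead is high, and reconsider the parallelization where both are high) can be illustrated with a minimal sketch. This is not the HPCToolkit implementation: the `ContextMetrics` record, the `diagnose` function, and the 25% cutoff for "high" are hypothetical names and values chosen only for illustration, and the sketch assumes that work, idleness, and overhead have already been attributed to a calling context as sample counts.

```python
from dataclasses import dataclass

# Hypothetical cutoff for what counts as a "high" fraction; not from the paper.
HIGH = 0.25


@dataclass
class ContextMetrics:
    """Sample counts attributed to one calling context (illustrative only)."""
    work: float      # samples spent executing the user's computation
    idleness: float  # samples of parallel idleness charged to this context
    overhead: float  # samples spent on parallel overhead (e.g. scheduling)


def diagnose(m: ContextMetrics, high: float = HIGH) -> str:
    """Map idleness and overhead fractions to the abstract's three cases."""
    total = m.work + m.idleness + m.overhead
    if total == 0:
        return "no samples"
    idle_frac = m.idleness / total
    over_frac = m.overhead / total
    if idle_frac > high and over_frac > high:
        return "reconsider parallelization"  # both high
    if idle_frac > high:
        return "increase concurrency"        # threads are stalled
    if over_frac > high:
        return "decrease concurrency"        # too much non-user work
    return "ok"
```

For example, under these assumptions `diagnose(ContextMetrics(work=50, idleness=40, overhead=10))` falls into the "increase concurrency" case, because 40% of the samples in that context are idleness while only 10% are overhead.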

          Published in

          PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
          February 2009
          322 pages
          ISBN: 9781605583976
          DOI: 10.1145/1504176

          ACM SIGPLAN Notices, Volume 44, Issue 4
          PPoPP '09
          April 2009
          294 pages
          ISSN: 0362-1340
          EISSN: 1558-1160
          DOI: 10.1145/1594835

          Copyright © 2009 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Acceptance Rates

          Overall Acceptance Rate: 230 of 1,014 submissions, 23%
