DOI: 10.1145/2503210.2503297
research-article

A data-centric profiler for parallel programs

Published: 17 November 2013

ABSTRACT

It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
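
The attribution step described in the abstract can be illustrated with a minimal sketch. It assumes each hardware sample carries an instruction address, the data address it touched, and a measured latency in cycles, and that the address range of each variable is already known. The function name `attribute_latency`, the tuple layouts, and the `<unknown>` bucket are hypothetical choices for illustration; this is not HPCToolkit's implementation or API.

```python
from bisect import bisect_right
from collections import defaultdict


def attribute_latency(samples, variables):
    """Aggregate sampled latency per variable and per instruction.

    samples:   iterable of (instruction_address, data_address, latency_cycles)
    variables: list of (name, start_address, size_bytes), assumed non-overlapping
    Both layouts are assumptions made for this sketch.
    """
    # Sort variable ranges by start address so each lookup is a binary search.
    ranges = sorted((start, start + size, name) for name, start, size in variables)
    starts = [r[0] for r in ranges]

    by_variable = defaultdict(int)
    by_instruction = defaultdict(int)

    for ip, addr, latency in samples:
        # Code-centric view: charge the latency to the sampled instruction.
        by_instruction[ip] += latency

        # Data-centric view: find the range whose start is the largest
        # one not exceeding the data address, then check the upper bound.
        i = bisect_right(starts, addr) - 1
        if i >= 0 and addr < ranges[i][1]:
            by_variable[ranges[i][2]] += latency
        else:
            by_variable["<unknown>"] += latency

    return dict(by_variable), dict(by_instruction)


# Hypothetical variables and samples, invented for the example.
variables = [("a", 0x1000, 0x100), ("b", 0x2000, 0x80)]
samples = [
    (0x400010, 0x1008, 200),
    (0x400010, 0x2040, 50),
    (0x400020, 0x1010, 30),
    (0x400020, 0x3000, 5),
]
by_variable, by_instruction = attribute_latency(samples, variables)
```

Keeping both tables shows why the two views complement each other: the per-instruction totals say where latency is incurred, while the per-variable totals say which data is responsible.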


Published in

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp
Program Chair: Satoshi Matsuoka

Copyright © 2013 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SC '13 Paper Acceptance Rate: 91 of 449 submissions, 20%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
