ABSTRACT
It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
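As a rough illustration of what attributing latency metrics to both variables and instructions can look like, the sketch below aggregates sampled memory accesses in two ways: by the instruction that issued them and by the variable whose address range contains the accessed address. It is not HPCToolkit's implementation. The allocation table, the sample tuples, and the `attribute` function are all hypothetical, and a real tool would obtain samples from hardware counters rather than from a hard-coded list.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical allocation table: (start address, size in bytes, variable name).
ALLOCATIONS = [
    (0x1000, 0x400, "grid"),
    (0x2000, 0x100, "flux"),
]

# Hypothetical samples: (instruction pointer, effective address, latency in cycles).
SAMPLES = [
    (0x400A10, 0x1008, 300),
    (0x400A10, 0x13F8, 250),
    (0x400B24, 0x2010, 40),
    (0x400C30, 0x9000, 120),  # falls outside every known allocation
]


def attribute(samples, allocations):
    """Sum sampled latency per variable and per instruction (illustrative only)."""
    ranges = sorted(allocations)
    starts = [start for start, _, _ in ranges]
    by_variable = defaultdict(int)
    by_instruction = defaultdict(int)
    for ip, addr, latency in samples:
        # Code-centric view: charge the latency to the sampled instruction.
        by_instruction[ip] += latency
        # Data-centric view: find the last allocation starting at or below addr.
        i = bisect_right(starts, addr) - 1
        name = "<unknown>"
        if i >= 0:
            start, size, var = ranges[i]
            if addr < start + size:
                name = var
        by_variable[name] += latency
    return dict(by_variable), dict(by_instruction)


by_variable, by_instruction = attribute(SAMPLES, ALLOCATIONS)
```

Keeping both views side by side shows which variables account for the most sampled latency and which instructions touch them, which is the kind of pairing the abstract describes.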