research-article

A data-centric profiler for parallel programs

Authors:
Xu Liu, Rice University, Houston, TX
John Mellor-Crummey, Rice University, Houston, TX

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013, Article No. 28, Pages 1–12
https://doi.org/10.1145/2503210.2503297

Published: 17 November 2013

ABSTRACT

It is difficult to manually identify opportunities for enhancing data locality. To address this problem, we extended the HPCToolkit performance tools to support data-centric profiling of scalable parallel programs. Our tool uses hardware counters to directly measure memory access latency and attributes latency metrics to both variables and instructions. Different hardware counters provide insight into different aspects of data locality (or lack thereof). Unlike prior tools for data-centric analysis, our tool employs scalable measurement, analysis, and presentation methods that enable it to analyze the memory access behavior of scalable parallel programs with low runtime and space overhead. We demonstrate the utility of HPCToolkit's new data-centric analysis capabilities with case studies of five well-known benchmarks. In each benchmark, we identify performance bottlenecks caused by poor data locality and demonstrate non-trivial performance optimizations enabled by this guidance.
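
To make the idea of attributing latency to both variables and instructions concrete, here is a minimal sketch, not HPCToolkit's implementation. It assumes each hardware sample carries an instruction pointer, the effective data address, and a latency in cycles, and that variable address ranges are already known. All names, address ranges, and sample values below are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical variable address ranges: (start, end_exclusive, name).
# A real profiler would derive these from symbol tables and allocation tracking.
VARIABLES = [
    (0x1000, 0x2000, "grid"),
    (0x3000, 0x3800, "flux"),
]

def build_index(variables):
    """Sort ranges by start address so lookups can use binary search."""
    ranges = sorted(variables)
    starts = [r[0] for r in ranges]
    return starts, ranges

def lookup(index, address):
    """Return the name of the variable containing address, or '<unknown>'."""
    starts, ranges = index
    i = bisect_right(starts, address) - 1
    if i >= 0:
        _start, end, name = ranges[i]
        if address < end:
            return name
    return "<unknown>"

def attribute(samples, variables):
    """Sum sampled latency per variable and per (variable, instruction) pair."""
    index = build_index(variables)
    by_variable = defaultdict(int)
    by_site = defaultdict(int)
    for ip, address, latency in samples:
        name = lookup(index, address)
        by_variable[name] += latency
        by_site[(name, ip)] += latency
    return dict(by_variable), dict(by_site)

# Hypothetical samples: (instruction pointer, data address, latency in cycles).
SAMPLES = [
    (0x400100, 0x1010, 300),
    (0x400100, 0x1800, 250),
    (0x400200, 0x3004, 40),
    (0x400300, 0x9000, 10),
]

by_variable, by_site = attribute(SAMPLES, VARIABLES)
```

Ranking `by_variable` by total latency points at the data structures with the poorest locality, while `by_site` shows which instructions touch them.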

Index Terms

A data-centric profiler for parallel programs
1. General and reference
  1. Cross-computing tools and techniques

Published in
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN: 9781450323789
DOI: 10.1145/2503210
General Chair: William Gropp, University of Illinois at Urbana-Champaign, Urbana, Illinois
Program Chair: Satoshi Matsuoka, Tokyo Institute of Technology, Tokyo, Japan
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data locality
data-centric profiling
scalable profiler
Acceptance Rates
SC '13 paper acceptance rate: 91 of 449 submissions, 20%
Overall acceptance rate: 1,516 of 6,373 submissions, 24%