ABSTRACT
Program performance optimization is usually based solely on measurements of the execution behavior of code segments, obtained with hardware performance counters. However, memory access patterns are critical performance-limiting factors on today's multicore chips, where performance is highly memory bound. Diagnoses and optimization choices based only on code-segment measurements are therefore incomplete, because they do not incorporate knowledge of memory access patterns and behaviors. This paper presents MACPO, a low-overhead tool that captures memory traces and computes metrics for the memory access behavior of source-level (C, C++, Fortran) data structures. It also presents a complete process for integrating code-segment-based and memory-access-pattern measurements and analyses for performance optimization, specifically targeting multicore chips and the multichip nodes of clusters. MACPO explicitly targets the measurements and metrics important to performance optimization for multicore chips, and it uses more realistic cache models for computing latency metrics than those used by previous tools. The effectiveness of adding memory access behavior characteristics of data structures to performance optimization was evaluated on subsets of the ASCI, NAS, and Rodinia parallel benchmarks and on one application program from a domain not represented in these benchmarks. Adding memory behavior characteristics enabled easier diagnosis of bottlenecks and more accurate selection of appropriate optimizations than code-centric behavior measurements alone. The performance gains ranged from a few percent to 38 percent.
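To make the idea of a per-data-structure memory metric concrete, the following is a minimal sketch, not MACPO's implementation. It assumes a trace of hypothetical `(variable, address)` pairs and a 64-byte cache line, and computes reuse distance (the number of distinct cache lines touched between two consecutive accesses to the same line) as one plausible example of such a metric. The function name, trace format, and line size are all illustrative assumptions.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache line size in bytes (illustrative)


def reuse_distances(trace, line_size=LINE_SIZE):
    """Group reuse distances by source-level variable.

    `trace` is an iterable of (variable_name, byte_address) pairs; this
    format is a hypothetical stand-in for a real memory trace.
    """
    stack = []  # cache lines ordered from least to most recently used
    per_var = defaultdict(list)
    for var, addr in trace:
        line = addr // line_size
        if line in stack:
            idx = stack.index(line)
            # Distinct lines touched since the previous access to this line.
            per_var[var].append(len(stack) - 1 - idx)
            stack.pop(idx)
        # First accesses have no finite reuse distance and are not recorded.
        stack.append(line)
    return dict(per_var)


if __name__ == "__main__":
    sample = [("a", 0), ("b", 64), ("b", 128), ("a", 8)]
    print(reuse_distances(sample))
```

A long reuse distance for a given variable suggests that its data is likely to have been evicted from cache before it is touched again, which is the kind of data-structure-level signal that code-segment counters alone do not expose. The linear-time stack search keeps the sketch short; a production tool would need a faster structure and sampling to keep overhead low.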
Index Terms
- Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics