research-article

Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics

Authors:

James Browne

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Pages 147 - 156

https://doi.org/10.1145/2370816.2370838

Published: 19 September 2012

Abstract

Program performance optimization is usually based solely on measurements of the execution behavior of code segments using hardware performance counters. However, memory access patterns are critical performance-limiting factors for today's multicore chips, where performance is highly memory bound. Diagnoses and selections of optimizations based only on measurements of the execution behavior of code segments are therefore incomplete, because they do not incorporate knowledge of memory access patterns and behaviors. This paper presents a low-overhead tool (MACPO) that captures memory traces and computes metrics for the memory access behavior of source-level (C, C++, Fortran) data structures. It also presents a complete process for integrating code-segment-based and memory-access-pattern measurements and analyses for performance optimization, specifically targeting multicore chips and multichip nodes of clusters. MACPO explicitly targets the measurements and metrics important to performance optimization for multicore chips, and it uses more realistic cache models for computing latency metrics than those used by previous tools. The effectiveness of adding memory access behavior characteristics of data structures to performance optimization was evaluated on subsets of the ASCI, NAS and Rodinia parallel benchmarks and on one application program from a domain not represented in these benchmarks. Adding memory behavior characteristics enabled easier diagnoses of bottlenecks and more accurate selection of appropriate optimizations than code-centric behavior measurements alone. The performance gains ranged from a few percent to 38 percent.
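To make the idea of a data-structure-level metric concrete, the following is a minimal sketch of one such metric, reuse distance, computed per data structure from a memory trace. It is an illustration only and not MACPO's implementation: the trace format (pairs of structure name and byte address), the function names and the 64-byte cache line size are all assumptions introduced here.

```python
from collections import defaultdict

CACHE_LINE_BYTES = 64  # assumed line size, not a value taken from the paper


def reuse_distances(trace, line_bytes=CACHE_LINE_BYTES):
    """Return {structure_name: [reuse distance, ...]}.

    `trace` is an iterable of (structure_name, byte_address) pairs, a
    hypothetical format chosen for this sketch. The reuse distance of an
    access is the number of distinct cache lines touched since the previous
    access to the same line. First-time accesses have no distance and are
    skipped.
    """
    stack = []  # cache lines ordered from least to most recently used
    result = defaultdict(list)
    for name, address in trace:
        line = address // line_bytes
        if line in stack:
            position = stack.index(line)
            # Every line stored after `position` was touched since the
            # last access to `line`.
            result[name].append(len(stack) - position - 1)
            stack.pop(position)
        stack.append(line)
    return dict(result)


def mean_reuse_distance(trace, line_bytes=CACHE_LINE_BYTES):
    """Average reuse distance per data structure."""
    distances = reuse_distances(trace, line_bytes)
    return {name: sum(d) / len(d) for name, d in distances.items()}
```

A structure with a large mean reuse distance is more likely to have been evicted from cache between accesses, which is the kind of signal that code-segment counters alone do not attribute to a specific array or object. The linear scan of the list keeps the sketch short; it is not suited to long traces.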

Published In

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

September 2012

512 pages

ISBN: 9781450311823

DOI: 10.1145/2370816

General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)

Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

IFIP WG 10.3: IFIP WG 10.3
SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE CS TCPP: IEEE Computer Society Technical Committee on Parallel Processing
IEEE CS TCAA: IEEE CS technical committee on architectural acoustics

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

PACT '12: International Conference on Parallel Architectures and Compilation Techniques

September 19 - 23, 2012

Minneapolis, Minnesota, USA

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
