skip to main content
10.1145/2370816.2370838acmconferencesArticle/Chapter ViewAbstractPublication PagespactConference Proceedingsconference-collections
research-article

Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics

Published:19 September 2012Publication History

ABSTRACT

Program performance optimization is usually based solely on measurements of execution behavior of code segments using hardware performance counters. However, memory access patterns are critical performance limiting factors for today's multicore chips where performance is highly memory bound. Therefore diagnoses and selection of optimizations based only on measurements of the execution behavior of code segments are incomplete because they do not incorporate knowledge of memory access patterns and behaviors. This paper presents a low-overhead tool (MACPO) that captures memory traces and computes metrics for the memory access behavior of source-level (C, C++, Fortran) data structures. It also presents a complete process for integrating code segment-based and memory access pattern measurements and analyses for performance optimization specifically targeting multicore chips and multichip nodes of clusters. MACPO explicitly targets the measurement and metrics important to performance optimization for multicore chips. MACPO uses more realistic cache models for computation of latency metrics than those used by previous tools. Evaluation of the effectiveness of adding memory access behavior characteristics of data structures to performance optimization was done on subsets of the ASCI, NAS and Rodina parallel benchmarks and one application program from a domain not represented in these benchmarks. Adding memory behavior characteristics enabled easier diagnoses of bottlenecks and more accurate selection of appropriate optimizations than with only code centric behavior measurements. The performance gains ranged from a few percent to 38 percent.

References

  1. AMD Barcelona Processor Cache Architecture. http://developer.amd.com/documentation/articles/pages/8142007173.aspx.Google ScholarGoogle Scholar
  2. GCC 4.6.2 manual. http://gcc.gnu.org/onlinedocs/gcc-4.6.2/gcc/.Google ScholarGoogle Scholar
  3. Intel C Compiler Manual. http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/copts/common_options/option_fp_lcase.htm.Google ScholarGoogle Scholar
  4. Linux support for NUMA hardware. http://lse.sourceforge.net/numa/faq/.Google ScholarGoogle Scholar
  5. Longhorn User Guide. http://www.tacc.utexas.edu/user-services/user-guides/.Google ScholarGoogle Scholar
  6. PerfExpert. http://www.tacc.utexas.edu/perfexpert.Google ScholarGoogle Scholar
  7. Ranger User Guide. http://www.tacc.utexas.edu/user-services/user-guides/.Google ScholarGoogle Scholar
  8. The ASCI Sweep3D Benchmark Code. DOE Accelerated Strategic Computing Initiative. http://www.c3.lanl.gov/pal/software/sweep3d/sweep3d_readme.html.Google ScholarGoogle Scholar
  9. ThreadSpotter. http://www.roguewave.com/.Google ScholarGoogle Scholar
  10. D. H. Bailey, E. Barszcz, L. Dagum, and H. D. Simon. NAS Parallel Benchmark Results 3-94. Proceedings of the Scalable High Performance Computing Conference, pages 386--393, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. K. Beyls and E. D'Hollander. Discovery of locality-improving refactorings by reuse path analysis. High Performance Computing and Communications, pages 220--229, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. K. Beyls and E. H. D'Hollander. Refactoring for Data Locality. Computer, 42(2):62--71, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. Burtscher, B. D. Kim, J. Diamond, J. Mccalpin, L. Koesterke, and J. Browne. PerfExpert : An Easy-to-Use Performance Diagnosis Tool for HPC Applications. In Computer, pages 1--11. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron. Rodinia: A benchmark suite for heterogeneous computing. 2009 IEEE International Symposium on Workload Characterization IISWC, 2009(c):44--54, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos. ProfileMe: hardware support for instruction-level profiling on out-of-order processors. Proceedings of 30th Annual International Symposium on Microarchitecture, pages 292--302, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Ding and Y. Zhong. Predicting whole-program locality through reuse distance analysis. In Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Intel. Intel Processor Identification and the CPUID Instruction. Journal On The Theory Of Ordered Sets And Its Applications, (August), 2009.Google ScholarGoogle Scholar
  18. M. Itzkowitz, B. J. N. Wylie, C. Aoki, and N. Kosche. Memory profiling using hardware counters. In In Supercomputing Conference (SC, pages 17--30, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Y. Jiang, E. Zhang, K. Tian, and X. Shen. Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors? In Compiler Construction, pages 264--282. Springer, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis & transformation. International Symposium on Code Generation and Optimization 2004 CGO 2004, (c):75--86, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. K. Lawton. Bochs IA-32 Emulator Project, 2004.Google ScholarGoogle Scholar
  22. X. Liu and J. Mellor-Crummey. Pinpointing Data Locality Problems Using Data-centric Analysis. CGO, pages 171--180, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. G. Marin. Scalable cross-architecture predictions of memory hierarchy response for scientific applications. In In Proceedings of the Symposium of the Las Alamos Computer Science Institute, Sante Fe, 2005.Google ScholarGoogle Scholar
  24. V. J. Reddi, A. Settle, D. A. Connors, and R. S. Cohn. PIN: a binary instrumentation tool for computer architecture research and education. Computer Architecture, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. D. L. Schuff, M. Kulkarni, V. S. Pai, and W. Lafayette. Accelerating Multicore Reuse Distance Analysis with Sampling and Parallelization. Measurement, pages 53--63, 2010.Google ScholarGoogle Scholar
  26. D. L. Schuff, B. S. Parsons, and V. S. Pai. Multicore-aware reuse distance analysis. Measurement, page 8 pp., 2010.Google ScholarGoogle Scholar
  27. O. A. Sopeju, M. Burtscher, A. Rane, and J. Browne. AutoSCOPE : Automatic Suggestions for Code Optimizations using PerfExpert. International Conference on Parallel and Distributed Processing Techniques and Applications, 2011.Google ScholarGoogle Scholar
  28. S. Vlaovic and E. S. Davidson. TAXI: Trace Analysis for X86 Interpretation. PhD thesis, University of Michigan, 2002.Google ScholarGoogle Scholar
  29. J. Weinberg and A. Snavely. Chameleon: A framework for observing, understanding, and imitating the memory behavior of applications. In PARA08: Workshop on State-of-the-Art in Scientific and Parallel Computing, Trondheim, Norway. Citeseer, 2008.Google ScholarGoogle Scholar

Index Terms

  1. Enhancing performance optimization of multicore chips and multichip nodes with data structure metrics

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
          September 2012
          512 pages
          ISBN:9781450311823
          DOI:10.1145/2370816

          Copyright © 2012 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 19 September 2012

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate121of471submissions,26%

          Upcoming Conference

          PACT '24
          International Conference on Parallel Architectures and Compilation Techniques
          October 14 - 16, 2024
          Southern California , CA , USA

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader