DOI: 10.1145/2749246.2749256
short-paper

Cache Line Aware Optimizations for ccNUMA Systems

Published: 15 June 2015

Abstract

Current shared memory systems utilize complex memory hierarchies to maintain scalability when increasing the number of processing units. Although hardware designers aim to hide this complexity from the programmer, ignoring the detailed architectural characteristics can harm performance significantly. We propose to expose the block-based design of caches in parallel computers to middleware designers to allow semi-automatic performance tuning with the systematic translation from algorithms to an analytic performance model. For this, we design a simple interface for cache line aware (CLa) optimization, a translation methodology, and a full performance model for cache line transfers in ccNUMA systems. Algorithms developed using CLa design perform up to 14x better than vendor and open-source libraries, and 2x better than existing ccNUMA optimizations.
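
To make the "block-based design of caches" concrete, here is a minimal sketch of the basic bookkeeping behind any cache line aware reasoning: counting how many cache lines a buffer spans, and padding a size to a whole number of lines. This is not the paper's CLa interface or performance model. The function names and the 64-byte line size are assumptions chosen for illustration only.

```python
# Assumption: 64-byte cache lines (common on x86); the abstract does not state a size.
CACHE_LINE_BYTES = 64


def cache_lines_touched(start_addr: int, num_bytes: int,
                        line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Count distinct cache lines covered by [start_addr, start_addr + num_bytes)."""
    if num_bytes <= 0:
        return 0
    first_line = start_addr // line_bytes
    last_line = (start_addr + num_bytes - 1) // line_bytes
    return last_line - first_line + 1


def padded_size(num_bytes: int, line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Round a size up to a whole number of cache lines (ceiling division)."""
    return -(-num_bytes // line_bytes) * line_bytes
```

An 8-byte value starting at offset 60 straddles two lines, so accessing it can cost two line transfers instead of one. Padding per-thread data to `padded_size(...)` is one common way to keep unrelated data from sharing a line.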

Cited By

  • (2022) MOESI-prime. Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 670-684. DOI: 10.1145/3470496.3527427. Online publication date: 18-Jun-2022
  • (2020) Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 32-41. DOI: 10.1109/IPDPS47924.2020.00014. Online publication date: May-2020
  • (2016) Machine-aware atomic broadcast trees for multicores. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, pages 33-48. DOI: 10.5555/3026877.3026881. Online publication date: 2-Nov-2016

Published In

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing
June 2015
296 pages
ISBN: 9781450335508
DOI: 10.1145/2749246
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cache coherence
  2. multi-cores
  3. performance modeling

Qualifiers

  • Short-paper

Funding Sources

  • FEDER funds of the EU
  • Ministry of Economy and Competitiveness of Spain

Conference

HPDC'15
Acceptance Rates

HPDC '15 Paper Acceptance Rate: 19 of 116 submissions, 16%
Overall Acceptance Rate: 166 of 966 submissions, 17%

Article Metrics

  • Downloads (last 12 months): 7
  • Downloads (last 6 weeks): 1
Download counts reflect activity up to 18 Feb 2025.
