short-paper

Cache Line Aware Optimizations for ccNUMA Systems

Authors:

Torsten Hoefler

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Pages 85 - 88

https://doi.org/10.1145/2749246.2749256

Published: 15 June 2015

Abstract

Current shared memory systems utilize complex memory hierarchies to maintain scalability when increasing the number of processing units. Although hardware designers aim to hide this complexity from the programmer, ignoring the detailed architectural characteristics can harm performance significantly. We propose to expose the block-based design of caches in parallel computers to middleware designers to allow semi-automatic performance tuning with the systematic translation from algorithms to an analytic performance model. For this, we design a simple interface for cache line aware (CLa) optimization, a translation methodology, and a full performance model for cache line transfers in ccNUMA systems. Algorithms developed using CLa design perform up to 14x better than vendor and open-source libraries, and 2x better than existing ccNUMA optimizations.

Cited By

Loughlin K, Saroiu S, Wolman A, Manerkar Y, Kasikci B, Salapura V, Zahran M, Chong F, Tang L (2022) MOESI-prime. Proceedings of the 49th Annual International Symposium on Computer Architecture, 10.1145/3470496.3527427 (670-684). Online publication date: 18-Jun-2022
https://dl.acm.org/doi/10.1145/3470496.3527427
Hashmi J, Xu S, Ramesh B, Bayatpour M, Subramoni H, Panda D (2020) Machine-agnostic and Communication-aware Designs for MPI on Emerging Architectures. 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 10.1109/IPDPS47924.2020.00014 (32-41). Online publication date: May-2020
https://doi.org/10.1109/IPDPS47924.2020.00014
Kaestle S, Achermann R, Haecki R, Hoffmann M, Ramos S, Roscoe T, Keeton K, Roscoe T (2016) Machine-aware atomic broadcast trees for multicores. Proceedings of the 12th USENIX conference on Operating Systems Design and Implementation, 10.5555/3026877.3026881 (33-48). Online publication date: 2-Nov-2016
https://dl.acm.org/doi/10.5555/3026877.3026881

Index Terms

Cache Line Aware Optimizations for ccNUMA Systems
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published In

HPDC '15: Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

June 2015

296 pages

ISBN:9781450335508

DOI:10.1145/2749246

General Chair: Thilo Kielmann (VU University Amsterdam, The Netherlands)
Program Chairs: Dean Hildebrand (IBM Research Almaden), Michela Taufer (University of Delaware)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

FEDER funds of the EU
Ministry of Economy and Competitiveness of Spain

Conference

HPDC'15: The 24th International Symposium on High-Performance Parallel and Distributed Computing

June 15 - 19, 2015

Portland, Oregon, USA

Acceptance Rates

HPDC '15 Paper Acceptance Rate: 19 of 116 submissions, 16%

Overall Acceptance Rate: 166 of 966 submissions, 17%

Bibliometrics

Total Citations: 3
Total Downloads: 188
Downloads (last 12 months): 7
Downloads (last 6 weeks): 1

Reflects downloads up to 18 Feb 2025
