Analyzing memory access intensity in parallel programs on multicore

Authors:

Ahmed H. Sameh

ICS '08: Proceedings of the 22nd annual international conference on Supercomputing

Pages 359 - 367

https://doi.org/10.1145/1375527.1375579

Published: 07 June 2008

Abstract

As the shared memory bus becomes a major performance bottleneck for many numerical applications on multicore chips, understanding how increased on-chip parallelism strains memory bandwidth, and hence affects the efficiency of parallel codes, becomes a critical issue. This paper introduces the notion of memory access intensity to facilitate quantitative analysis of a program's memory behavior on multicores that employ state-of-the-art prefetching hardware. Three numerical solvers for large-scale sparse linear systems are used to demonstrate the estimation of memory access intensity and its effect on program performance.

Index Terms

Analyzing memory access intensity in parallel programs on multicore
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance

Published In

ICS '08: Proceedings of the 22nd annual international conference on Supercomputing

June 2008

390 pages

ISBN:9781605581583

DOI:10.1145/1375527

General Chairs: Theo Papatheodorou, University of Patras, Greece; Utpal Banerjee, Intel (retired), USA

Program Chairs: Avi Mendelson, Intel, Israel; Kyle Gallivan, Florida State University, USA

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

ICS08

Sponsor:

ICS08: International Conference on Supercomputing

June 7 - 12, 2008

Island of Kos, Greece

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Article Metrics

Total Citations: 24
Total Downloads: 851
Downloads (last 12 months): 1
Downloads (last 6 weeks): 0

Reflects downloads up to 25 Feb 2025

Cited By

Manguoğlu M, Polizzi E, Sameh A (2020). Parallel Hybrid Sparse Linear System Solvers. Parallel Algorithms in Computational Science and Engineering, pp. 95-120. Online publication date: 7-Jul-2020.
https://doi.org/10.1007/978-3-030-43736-7_4
Li L, Mayer A, Fanucci L, Teich J (2016). Trace-based analysis methodology of program flash contention in embedded multicore systems. Proceedings of the 2016 Conference on Design, Automation & Test in Europe, pp. 199-204. Online publication date: 14-Mar-2016.
https://dl.acm.org/doi/10.5555/2971808.2971852
Kumar R, Muknahallipatna S, McInroy J (2016). An Approach to Parallelization of SIFT Algorithm on GPUs for Real-Time Applications. Journal of Computer and Communications 4(17), pp. 18-50. Online publication date: 2016.
https://doi.org/10.4236/jcc.2016.417002
Li L, Fussenegger M, Cichon G (2016). A Data Locality and Memory Contention Analysis Method in Embedded NUMA Multi-core Systems. 2016 IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC), pp. 85-92. Online publication date: Sep-2016.
https://doi.org/10.1109/MCSoC.2016.15
Otoom M, Paul J (2015). Multiprocessor Capacity Metric and Analysis. IEEE Transactions on Computers 64(11), pp. 3181-3196. Online publication date: 1-Nov-2015.
https://dl.acm.org/doi/10.1109/TC.2015.2389831
Balkir A, Oktay H, Foster I (2015). Estimating graph distance and centrality on shared nothing architectures. Concurrency and Computation: Practice & Experience 27(14), pp. 3587-3613. Online publication date: 25-Sep-2015.
https://dl.acm.org/doi/10.1002/cpe.3354
Wang Y, Jia Z, Chen R, Wang M, Liu D, Shao Z (2014). Loop scheduling with memory access reduction subject to register constraints for DSP applications. Software: Practice & Experience 44(8), pp. 999-1026. Online publication date: 1-Aug-2014.
https://dl.acm.org/doi/10.1002/spe.2186
Chen C, Wu Y, Zuckerman S, Gao G (2013). Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, pp. 1607-1617. Online publication date: 20-May-2013.
https://dl.acm.org/doi/10.1109/IPDPSW.2013.47
Kuck D (2012). Computational Capacity-Based Codesign of Computer Systems. High-Performance Scientific Computing, pp. 45-73. Online publication date: 2012.
https://doi.org/10.1007/978-1-4471-2437-5_2
Bosson M, Grudinin S, Redon S (2012). Block-adaptive quantum mechanics: An adaptive divide-and-conquer approach to interactive quantum chemistry. Journal of Computational Chemistry 34(6), pp. 492-504. Online publication date: 29-Oct-2012.
https://doi.org/10.1002/jcc.23157
