DOI: 10.1145/1989493.1989506

Parallelism and data movement characterization of contemporary application classes

Published: 04 June 2011

Abstract

This paper presents a framework for characterizing the distribution of fine-grained parallelism, data movement, and communication-minimizing code partitions. Understanding the spectrum of parallelism available in applications, and how much data movement might result if that parallelism is exploited, is essential to the hardware design process, because these properties will limit the performance scaling of future computing systems. The framework is applied to 26 applications and kernels, classified according to their dominant components in the Berkeley dwarf/computational motif classification.
The distributions of instruction-level parallelism (ILP) and thread-level parallelism (TLP) over execution time are studied, and it is shown that, although mean ILP is high, the available ILP is significantly lower for most of the execution. The results from this framework are complemented by hardware performance counter data from two RISC platforms (IBM Power7 and Freescale P2020) and one CISC platform (Intel Atom D510), spanning a broad range of real machine characteristics. Employing a combination of these new techniques, and building on previous proposals, it is demonstrated that available ideal-case parallelism and data movement show only limited similarity within and across the dwarf classes.
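
The ILP characterization summarized above is in the spirit of the idealized dynamic-dependence limit studies the paper builds on (e.g., Austin and Sohi [3]; Wall [29]): per-window ideal ILP can be estimated as the number of instructions in a window divided by the length of its dataflow critical path. The sketch below illustrates that general technique only; the trace format ((destination, sources) tuples), the window size, unit latency, ideal renaming, and independently scored windows are simplifying assumptions for illustration, not the authors' actual framework.

from collections import defaultdict

def ilp_distribution(trace, window=10000):
    """Per-window ideal ILP: instructions executed / dataflow critical path.

    trace: iterable of (dest, srcs) pairs, where dest is the location an
    instruction writes and srcs are the locations it reads. Only true (RAW)
    dependences constrain issue, i.e. ideal renaming and perfect branch
    prediction are assumed; every instruction has unit latency.
    """
    ilps = []
    ready = defaultdict(int)          # location -> cycle its latest producer completes
    count, critical_path = 0, 0
    for dest, srcs in trace:
        issue = max((ready[s] for s in srcs), default=0)   # wait for all inputs
        finish = issue + 1
        ready[dest] = finish
        critical_path = max(critical_path, finish)
        count += 1
        if count == window:           # close this window; windows are scored independently
            ilps.append(count / critical_path)
            ready.clear()
            count, critical_path = 0, 0
    if count:
        ilps.append(count / critical_path)
    return ilps

For example, ilp_distribution([('r1', []), ('r2', ['r1']), ('r3', ['r1'])], window=3) returns [1.5]: three instructions complete along a two-cycle critical path.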

References

[1]
G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, AFIPS '67 (Spring), pages 483--485. ACM, 1967.
[2]
K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, and K. A. Yelick. The landscape of parallel computing research: a view from Berkeley. Technical report, University of California at Berkeley, December 2006.
[3]
T. M. Austin and G. S. Sohi. Dynamic dependency analysis of ordinary programs. SIGARCH Comput. Archit. News, 20(2):342--351, 1992.
[4]
S. E. Breach, T. N. Vijaykumar, and G. S. Sohi. Multiscalar processors. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 414--425, Los Alamitos, CA, USA, 1995.
[5]
M. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. August. Revisiting the sequential programming model for multi-core. In MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 69--84, Washington, DC, USA, 2007.
[6]
J. A. Brown and D. M. Tullsen. The shared-thread multiprocessor. In ICS '08: Proceedings of the 22nd annual international conference on Supercomputing, pages 73--82. ACM, 2008.
[7]
D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Comput. Archit. News, 25(3):13--25, 1997.
[8]
D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth limitations of future microprocessors. In Proceedings of the 23rd annual international symposium on Computer architecture, ISCA '96, pages 78--89. ACM, 1996.
[9]
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to algorithms, third edition, 2009.
[10]
J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008.
[11]
M. J. Flynn. Toward more efficient computer organizations. In AFIPS '72 (Spring): Proceedings of the May 16-18, 1972, spring joint computer conference, pages 1211--1217. ACM, 1972.
[12]
M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench: A free, commercially representative embedded benchmark suite. In WWC '01: Proceedings of the IEEE International Workshop on Workload Characterization, pages 3--14, Washington, DC, USA, 2001. IEEE Computer Society.
[13]
Y. He, C. E. Leiserson, and W. M. Leiserson. The Cilkview scalability analyzer. In Proceedings of the 22nd ACM symposium on Parallelism in algorithms and architectures, SPAA '10, pages 145--156. ACM, 2010.
[14]
J. L. Hennessy and D. A. Patterson. Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., 2002.
[15]
M. Iyer, C. Ashok, J. Stone, N. Vachharajani, D. A. Connors, and M. Vachharajani. Finding parallelism for future EPIC machines. In Proceedings of the 4th Workshop on Explicitly Parallel Instruction Computing Techniques, 2005.
[16]
R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM's next-generation server processor. IEEE Micro, 30:7--15, 2010.
[17]
K. Kennedy and U. Kremer. Automatic data layout for distributed-memory machines. ACM Trans. Program. Lang. Syst., 20(4):869--916, 1998.
[18]
M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. SIGARCH Comput. Archit. News, 20(2):46--57, 1992.
[19]
A. Nakajima, R. Kobayashi, H. Ando, and T. Shimada. Limits of thread-level parallelism in non-numerical programs. In IPSJ Transactions on Advanced Computing Systems, pages 12--20, 2006.
[20]
G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic thread extraction with decoupled software pipelining. In MICRO 38: Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture, pages 105--118, Washington, DC, USA, 2005.
[21]
M. A. Postiff, D. A. Greene, G. S. Tyson, and T. N. Mudge. The limits of instruction level parallelism in SPEC95 applications. SIGARCH Comput. Archit. News, 27(1):31--34, 1999.
[22]
C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In HPCA '07: Proceedings of the 13th International Symposium on High-Performance Computer Architecture, pages 13--24, 2007.
[23]
E. Riseman and C. Foster. The inhibition of potential parallelism by conditional jumps. IEEE Transactions on Computers, 21:1405--1411, 1972.
[24]
K. Scott and J. Davidson. Exploring the limits of sub-word level parallelism. In PACT '00: Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques, page 81, Washington, DC, USA, 2000. IEEE Computer Society.
[25]
R. Simar and R. Tatge. How TI adopted VLIW in digital signal processors. IEEE Solid-State Circuits Magazine, 1(3):10--14, Summer 2009.
[26]
K. B. Theobald, G. R. Gao, and L. J. Hendren. On the limits of program parallelism and its smoothability. In Proceedings of the 25th annual international symposium on Microarchitecture, MICRO 25, pages 10--19, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.
[27]
G. S. Tjaden and M. J. Flynn. Detection and parallel execution of independent instructions. IEEE Trans. Comput., 19(10):889--895, 1970.
[28]
J. S. Vetter and F. Mueller. Communication characteristics of large-scale scientific applications for contemporary cluster architectures. J. Parallel Distrib. Comput., 63(9):853--865, 2003.
[29]
D. Wall. Limits of instruction-level parallelism. In ASPLOS-IV: Proceedings of the fourth international conference on Architectural support for programming languages and operating systems, pages 176--188. ACM, 1991.



        Published In

        SPAA '11: Proceedings of the twenty-third annual ACM symposium on Parallelism in algorithms and architectures
        June 2011
        404 pages
        ISBN:9781450307437
        DOI:10.1145/1989493
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        In-Cooperation

        • EATCS: European Association for Theoretical Computer Science

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. basic-block-level parallelism
2. Berkeley computational motifs
        3. data movement
        4. instruction-level parallelism

        Qualifiers

        • Research-article

        Conference

        SPAA '11

        Acceptance Rates

        Overall Acceptance Rate 447 of 1,461 submissions, 31%


        Cited By

• (2023) At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads. ACM Transactions on Architecture and Code Optimization, 20(4):1-26. DOI: 10.1145/3629520. Online publication date: 25-Oct-2023.
• (2020) A System for Generating Non-Uniform Random Variates using Graphene Field-Effect Transistors. 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 101-108. DOI: 10.1109/ASAP49362.2020.00026. Online publication date: Jul-2020.
• (2018) AIWC: OpenCL-Based Architecture-Independent Workload Characterization. 2018 IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), pages 81-91. DOI: 10.1109/LLVM-HPC.2018.8639381. Online publication date: Nov-2018.
• (2018) A Review of Near-Memory Computing Architectures: Opportunities and Challenges. 2018 21st Euromicro Conference on Digital System Design (DSD), pages 608-617. DOI: 10.1109/DSD.2018.00106. Online publication date: Aug-2018.
• (2018) Introduction. Thread and Data Mapping for Multicore Systems, pages 1-8. DOI: 10.1007/978-3-319-91074-1_1. Online publication date: 5-Jul-2018.
• (2017) The End of Moore's Law. Computing in Science and Engineering, 19(2):41-50. DOI: 10.1109/MCSE.2017.29. Online publication date: 1-Mar-2017.
• (2014) MIPT: Rapid exploration and evaluation for migrating sequential algorithms to multiprocessing systems with multi-port memories. 2014 International Conference on High Performance Computing & Simulation (HPCS), pages 776-783. DOI: 10.1109/HPCSim.2014.6903767. Online publication date: Jul-2014.
• (2012) DOME. Proceedings of the 2012 workshop on High-Performance Computing for Astronomy, pages 1-4. DOI: 10.1145/2286976.2286978. Online publication date: 18-Jun-2012.
• (2011) Quantitative analysis of parallelism and data movement properties across the Berkeley computational motifs. Proceedings of the 8th ACM International Conference on Computing Frontiers, pages 1-2. DOI: 10.1145/2016604.2016625. Online publication date: 3-May-2011.
