Reusing Data Reorganization for Efficient SIMD Parallelization of Adaptive Irregular Applications

Authors:

Gagan Agrawal

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 16, Pages 1 - 10

https://doi.org/10.1145/2925426.2926285

Published: 01 June 2016

Abstract

Applying SIMD parallelization to irregular applications with non-contiguous, data-dependent memory accesses is challenging. An application whose indirect access pattern stays static across iterations can be accelerated by data transformations, but such techniques are no longer feasible when the indirect access patterns change over time. In this paper, we propose an indexing method that facilitates the reuse of data reorganization for efficient SIMD parallelization of dynamic irregular applications. We first apply this indexing approach to a class of vertex-centric graph algorithms in which the set of active vertices varies over the execution; here, the indexing method helps maintain the set of active edges. Next, we focus on unstructured particle interaction applications in which the edges change adaptively, and present an incremental indexing method. In our experimental evaluation, the speedups achieved by utilizing SIMD range from 3.04× to 7.19× for graph applications and from 2.54× to 4.43× for molecular dynamics.
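The idea of reusing a reorganization while the active set changes can be illustrated with a small sketch. This is not the paper's implementation: the conflict-free grouping, the `(group, lane)` index, and every name below (`SIMD_WIDTH`, `group_conflict_free`, `build_index`, `active_masks`, `sssp`) are assumptions made for illustration, and plain Python loops stand in for masked SIMD operations. Edges are reorganized once into groups whose lanes write distinct destinations; each iteration then only looks up which lanes belong to active vertices instead of regrouping the edges.

```python
# Illustrative sketch only. The grouping strategy and all names are
# assumptions, not the data structures used in the paper.

SIMD_WIDTH = 4  # hypothetical number of SIMD lanes


def group_conflict_free(edges, width=SIMD_WIDTH):
    """One-time reorganization: pack edge ids into groups of at most
    `width` edges so that no two edges in a group share a destination."""
    groups = []
    for edge_id, (_, dst) in enumerate(edges):
        for group in groups:
            if len(group) < width and all(edges[e][1] != dst for e in group):
                group.append(edge_id)
                break
        else:
            # No existing group can take this edge, so open a new one.
            groups.append([edge_id])
    return groups


def build_index(edges, num_vertices, groups):
    """Map each source vertex to the (group, lane) slots of its out-edges."""
    index = [[] for _ in range(num_vertices)]
    for g, group in enumerate(groups):
        for lane, edge_id in enumerate(group):
            index[edges[edge_id][0]].append((g, lane))
    return index


def active_masks(index, active_vertices, groups):
    """Per-iteration step: mark the lanes of active edges. The grouping
    itself is reused unchanged."""
    masks = {}
    for v in active_vertices:
        for g, lane in index[v]:
            masks.setdefault(g, [False] * len(groups[g]))[lane] = True
    return masks


def sssp(edges, weights, num_vertices, source):
    """Vertex-centric shortest paths; only out-edges of active vertices
    are relaxed in each iteration."""
    groups = group_conflict_free(edges)
    index = build_index(edges, num_vertices, groups)
    dist = [float("inf")] * num_vertices
    dist[source] = 0.0
    active = {source}
    while active:
        next_active = set()
        for g, mask in active_masks(index, active, groups).items():
            # One group models one masked SIMD operation: its lanes
            # write distinct destinations, so they do not conflict.
            for lane, edge_id in enumerate(groups[g]):
                if mask[lane]:
                    src, dst = edges[edge_id]
                    candidate = dist[src] + weights[edge_id]
                    if candidate < dist[dst]:
                        dist[dst] = candidate
                        next_active.add(dst)
        active = next_active
    return dist
```

In this sketch the cost of `group_conflict_free` is paid once, while the per-iteration work in `active_masks` is proportional to the number of out-edges of the active vertices. A real SIMD version would replace the inner lane loop with masked gather and scatter instructions.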

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
