Graph-Morphing: Exploiting Hidden Parallelism of Non-Stencil Computation in High-Level Synthesis

Authors:
Yu Zou, University of Central Florida
Mingjie Lin, University of Central Florida

DAC '19: Proceedings of the 56th Annual Design Automation Conference 2019, June 2019, Article No.: 124, Pages 1–6. https://doi.org/10.1145/3316781.3317834

Published: 02 June 2019

ABSTRACT

Non-stencil kernels with irregular memory access patterns pose unique challenges to achieving high computing performance and hardware efficiency in FPGA high-level synthesis. We present a versatile and systematic approach, termed Graph-Morphing, for constructing a reconfigurable computing engine specifically optimized for non-stencil kernel computing. Graph-Morphing improves performance by fragmenting operations across loop iterations and then rescheduling computation and data to maximize overall performance. In experiments, Graph-Morphing achieves a 2x to 13x performance improvement, albeit with significantly more hardware usage. Graph-Morphing thereby proposes a new research direction for accelerating non-stencil kernel computing.
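
To make the problem setting concrete, the following is a minimal sketch of why rescheduling loop iterations can help a kernel with irregular memory accesses. It is not the Graph-Morphing algorithm from the paper; it is a simple greedy packing, and the function names, the gather-style kernel `y[i] = x[idx[i]]`, and the cyclic banking rule (`address % num_banks`) are all assumptions chosen for illustration.

```python
def schedule_conflict_free(idx, num_banks):
    """Greedily pack loop iterations into cycles so that no two iterations
    in the same cycle read from the same memory bank.

    Assumes cyclic banking (bank = address % num_banks); this is an
    illustrative choice, not a detail taken from the paper.
    """
    cycles = []  # list of (iteration ids, set of banks already in use)
    for it, addr in enumerate(idx):
        bank = addr % num_banks
        for iters, busy in cycles:
            if bank not in busy:
                iters.append(it)
                busy.add(bank)
                break
        else:
            # Every existing cycle already uses this bank: open a new cycle.
            cycles.append(([it], {bank}))
    return [iters for iters, _ in cycles]


def gather_in_order(x, idx):
    """Reference result: the irregular gather executed in program order."""
    return [x[a] for a in idx]


def gather_scheduled(x, idx, schedule):
    """Same gather, executed cycle by cycle in the rescheduled order."""
    y = [None] * len(idx)
    for cycle in schedule:
        for it in cycle:
            y[it] = x[idx[it]]
    return y
```

Because each iteration of this hypothetical gather writes a distinct output element, reordering iterations does not change the result; kernels with true cross-iteration dependencies would need additional care that this sketch does not attempt.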

Published in

DAC '19: Proceedings of the 56th Annual Design Automation Conference 2019
June 2019
1378 pages
ISBN: 9781450367257
DOI: 10.1145/3316781

Copyright © 2019 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Edge-Centric Graph
Loop with Non-Uniform Dependencies
Memory Parallelism
Non-Stencil Kernel
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate: 1,770 of 5,499 submissions, 32%