Toward a Microarchitecture for Efficient Execution of Irregular Applications

Authors:

John D. Leidel,

Brody Williams,

Yong Chen

ACM Transactions on Parallel Computing (TOPC), Volume 7, Issue 4

Article No.: 26, Pages 1 - 24

https://doi.org/10.1145/3418082

Published: 27 September 2020

Abstract

Modern processor designs are poorly suited to the irregular memory access patterns common in data-intensive computing, a workload class of growing importance. Applications and algorithms whose memory requests lack spatial and temporal locality suffer high cache miss rates, which in turn cause high latency and low memory bandwidth. To address these penalties, we introduce the GoblinCore-64 (GC64) architecture and a unique memory hierarchy explicitly designed to extract memory performance from irregular access patterns. GC64 provides pressure-driven, hardware-managed concurrency control that minimizes pipeline stalls and lowers context-switch latency. We also introduce a novel memory coalescing model that improves memory system performance by aggregating requests. We evaluated our approach on a series of 24 benchmarks; the results show a reduction in memory requests of nearly 50% and a performance acceleration of up to 14.6×.

Cited By

Khojasteh H, Tabatabaei H. (2023). A Survey on the Proposed Architectures for Efficient Execution of Irregular Applications Using Pipeline Parallelism. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), 2080-2087. Online publication date: 24-Jul-2023.
https://doi.org/10.1109/CSCE60160.2023.00342
Hansen Z, Williams B, Leidel J, Wang X, Chen Y. (2021). DMM-GAPBS: Adapting the GAP Benchmark Suite to a Distributed Memory Model. 2021 IEEE High Performance Extreme Computing Conference (HPEC), 1-8. Online publication date: 20-Sep-2021.
https://doi.org/10.1109/HPEC49654.2021.9622817

Index Terms

Toward a Microarchitecture for Efficient Execution of Irregular Applications
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Processors and memory architectures
2. Hardware
  1. Emerging technologies
    1. Analysis and design of emerging devices and systems
      1. Emerging architectures

Published In

ACM Transactions on Parallel Computing, Volume 7, Issue 4

Special Issue on Innovations in Systems for Irregular Applications, Part 2

December 2020

179 pages

ISSN:2329-4949

EISSN:2329-4957

DOI:10.1145/3426879

Editor:
David A. Bader
New Jersey Institute of Technology, USA

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2020

Accepted: 01 April 2020

Revised: 01 March 2020

Received: 01 November 2018

Published in TOPC Volume 7, Issue 4

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Bibliometrics

Total citations: 2
Total downloads: 185
Downloads (last 12 months): 22
Downloads (last 6 weeks): 3

Reflects downloads up to 14 Feb 2025
