Toward a Microarchitecture for Efficient Execution of Irregular Applications

Published: 27 September 2020

Abstract

Efficient data-intensive computing is increasingly important, yet modern processor designs are not well suited to the irregular memory access patterns common in data-intensive algorithms. Applications and algorithms that lack spatial and temporal locality in their memory requests suffer high latency and low memory bandwidth because of their high cache miss rates. To address these performance penalties, we introduce the GoblinCore-64 (GC64) architecture and a unique memory hierarchy, both explicitly designed to extract memory performance from irregular memory access patterns. GC64 provides pressure-driven, hardware-managed concurrency control to minimize pipeline stalls and lower the latency of context switches. We also introduce a novel memory coalescing model that improves memory system performance by aggregating requests. We evaluated our approach on a series of 24 benchmarks; the results show a reduction of nearly 50% in memory requests and a speedup of up to 14.6×.
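The idea of request aggregation can be illustrated with a small software sketch. This is not the GC64 hardware design or the coalescing model described in the article; it only shows the general technique of merging individual memory requests that fall into the same aligned block. The 64-byte block size and the names `coalesce` and `request_reduction` are assumptions chosen for illustration.

```python
from collections import OrderedDict

# Assumed coalescing granularity in bytes (illustrative, not taken from the article).
BLOCK_BYTES = 64


def coalesce(addresses, block_bytes=BLOCK_BYTES):
    """Group byte addresses by the aligned block that contains them.

    Returns an ordered mapping: block base address -> original addresses.
    Each key stands for one aggregated request to the memory system.
    """
    blocks = OrderedDict()
    for addr in addresses:
        base = addr - (addr % block_bytes)  # align down to the block boundary
        blocks.setdefault(base, []).append(addr)
    return blocks


def request_reduction(addresses, block_bytes=BLOCK_BYTES):
    """Fraction of requests eliminated by merging same-block accesses."""
    merged = coalesce(addresses, block_bytes)
    return 1.0 - len(merged) / len(addresses)


if __name__ == "__main__":
    # Hypothetical irregular access stream (byte addresses).
    stream = [0, 8, 64, 72, 4096, 16]
    print(list(coalesce(stream).keys()))  # block bases actually requested
    print(request_reduction(stream))      # share of requests saved
```

In this toy stream, six accesses touch three distinct 64-byte blocks, so half of the requests are eliminated. A real coalescer would also have to bound the time window in which requests may be merged and respect ordering constraints, which this sketch ignores.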

Cited By

  • (2023) A Survey on the Proposed Architectures for Efficient Execution of Irregular Applications Using Pipeline Parallelism. 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), 2080-2087. DOI: 10.1109/CSCE60160.2023.00342. Online publication date: 24-Jul-2023
  • (2021) DMM-GAPBS: Adapting the GAP Benchmark Suite to a Distributed Memory Model. 2021 IEEE High Performance Extreme Computing Conference (HPEC), 1-8. DOI: 10.1109/HPEC49654.2021.9622817. Online publication date: 20-Sep-2021

Published In

ACM Transactions on Parallel Computing, Volume 7, Issue 4
Special Issue on Innovations in Systems for Irregular Applications, Part 2
December 2020
179 pages
ISSN:2329-4949
EISSN:2329-4957
DOI:10.1145/3426879
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 September 2020
Accepted: 01 April 2020
Revised: 01 March 2020
Received: 01 November 2018
Published in TOPC Volume 7, Issue 4

Author Tags

  1. Data-intensive computing
  2. context switching
  3. irregular algorithms
  4. thread concurrency

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • National Science Foundation
