DOI:10.1145/3357526.3357550
research-article

PIMS: a lightweight processing-in-memory accelerator for stencil computations

Published: 30 September 2019

Abstract

Stencil computation is a classic computational kernel present in many high-performance scientific applications, such as image processing and partial differential equation (PDE) solvers. A stencil computation sweeps over a multi-dimensional grid and repeatedly updates the value at each point using the values of its neighboring points. Stencil computations often employ large datasets that exceed cache capacity, leading to excessive accesses to the memory subsystem. As such, 3D stencil computations on large grid sizes are memory-bound.
In this paper we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation, using three different grid sizes with six categories of orders, indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.
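
To make the access pattern described in the abstract concrete, the following is a minimal sketch of one kind of stencil sweep. It is not the PIMS design or the paper's benchmark code: the 7-point averaging rule, the nested-list grid layout, and the names `jacobi_sweep_3d` and `run_stencil` are assumptions chosen for illustration only.

```python
def jacobi_sweep_3d(grid):
    """Apply one 7-point averaging sweep to the interior of a 3D grid.

    `grid` is a nested list indexed as grid[z][y][x]. Boundary points are
    copied unchanged, and the input grid is not modified.
    The averaging rule is an illustrative assumption, not the paper's kernel.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Write into a separate output grid so every update reads old values only.
    out = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                # Each update reads the point itself plus its six face neighbors.
                total = (
                    grid[z][y][x]
                    + grid[z][y][x - 1] + grid[z][y][x + 1]
                    + grid[z][y - 1][x] + grid[z][y + 1][x]
                    + grid[z - 1][y][x] + grid[z + 1][y][x]
                )
                out[z][y][x] = total / 7.0
    return out


def run_stencil(grid, steps):
    """Repeat the sweep `steps` times, as an iterative solver would."""
    for _ in range(steps):
        grid = jacobi_sweep_3d(grid)
    return grid


if __name__ == "__main__":
    n = 5
    grid = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    grid[2][2][2] = 7.0  # single non-zero point at the center
    result = run_stencil(grid, 1)
    print(result[2][2][2], result[2][2][1])
```

In this sketch, every interior point is read once for its own update and again for each of its six neighbors' updates. When the grid does not fit in cache, those repeated reads can turn into repeated trips to memory, which is the kind of redundant traffic the abstract refers to.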

Published In

MEMSYS '19: Proceedings of the International Symposium on Memory Systems
September 2019
517 pages
ISBN:9781450372060
DOI:10.1145/3357526
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. high performance computing
  2. hybrid memory cube
  3. processing-in-memory
  4. stencil computation

Qualifiers

  • Research-article

Conference

MEMSYS '19: The International Symposium on Memory Systems
September 30 - October 3, 2019
Washington, District of Columbia, USA

Article Metrics

  • Downloads (Last 12 months): 30
  • Downloads (Last 6 weeks): 3
Reflects downloads up to 14 Feb 2025

Cited By

  • (2023) SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation. Proceedings of the 37th International Conference on Supercomputing, 463-476. DOI:10.1145/3577193.3593719. Online publication date: 21-Jun-2023
  • (2023) Dedicated Instruction Set for Pattern-Based Data Transfers: An Experimental Validation on Systems Containing In-Memory Computing Units. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:11, 3757-3767. DOI:10.1109/TCAD.2023.3258346. Online publication date: Nov-2023
  • (2023) Casper: Accelerating Stencil Computations Using Near-Cache Processing. IEEE Access 11, 22136-22154. DOI:10.1109/ACCESS.2023.3252002. Online publication date: 2023
  • (2022) HybriDS. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, 321-332. DOI:10.1145/3490148.3538591. Online publication date: 11-Jul-2022
  • (2021) TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 268-281. DOI:10.1145/3466752.3480080. Online publication date: 18-Oct-2021
  • (2021) DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access 9, 134457-134502. DOI:10.1109/ACCESS.2021.3110993. Online publication date: 2021
