PIMS: a lightweight processing-in-memory accelerator for stencil computations

Authors:

Antonino Tumeo,

Brody Williams,

John D. Leidel,

Yong Chen

MEMSYS '19: Proceedings of the International Symposium on Memory Systems

Pages 41 - 52

https://doi.org/10.1145/3357526.3357550

Published: 30 September 2019

Abstract

Stencil computation is a classic computational kernel present in many high-performance scientific applications, such as image processing and partial differential equation (PDE) solvers. A stencil computation sweeps over a multi-dimensional grid and repeatedly updates the value at each point using the values of its neighboring points. Stencil computations often employ large datasets that exceed cache capacity, leading to excessive accesses to the memory subsystem. As such, 3D stencil computations on large grid sizes are memory-bound.
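To make the sweep described above concrete, the following is a minimal sketch of a 3D 7-point averaging stencil in plain Python. It is an illustration only, not the kernel or the hardware design evaluated in the paper: the function names, the equal 1/7 weights, and the copy-through boundary condition are all assumptions chosen for brevity.

```python
def jacobi_sweep_7pt(grid):
    """Return a new grid after one 7-point averaging sweep.

    Assumption: boundary points are copied unchanged, and every
    neighbor carries the same weight (1/7). Real solvers differ.
    """
    nz, ny, nx = len(grid), len(grid[0]), len(grid[0][0])
    # Start from a copy so that boundary values carry over unchanged.
    new = [[row[:] for row in plane] for plane in grid]
    for z in range(1, nz - 1):
        for y in range(1, ny - 1):
            for x in range(1, nx - 1):
                # Each update reads the point itself plus its six face neighbors.
                total = (grid[z][y][x]
                         + grid[z - 1][y][x] + grid[z + 1][y][x]
                         + grid[z][y - 1][x] + grid[z][y + 1][x]
                         + grid[z][y][x - 1] + grid[z][y][x + 1])
                new[z][y][x] = total / 7.0
    return new


def run_stencil(grid, steps):
    """Apply the sweep repeatedly, as an iterative solver would."""
    for _ in range(steps):
        grid = jacobi_sweep_7pt(grid)
    return grid


# Usage: a 4x4x4 grid of zeros with a single nonzero interior point.
grid = [[[0.0] * 4 for _ in range(4)] for _ in range(4)]
grid[1][1][1] = 7.0
result = run_stencil(grid, 1)
```

In this sketch every interior value is read once for its own update and once for each of its six neighbors' updates. When the grid is too large for those repeated reads to be served from cache, they become repeated trips to memory, which is the kind of traffic the abstract refers to.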

In this paper, we present PIMS, an in-memory accelerator for stencil computations. PIMS, implemented in the logic layer of a 3D-stacked memory, exploits the high bandwidth provided by through-silicon vias to reduce redundant memory traffic. Our comprehensive evaluation, using three different grid sizes with six categories of orders, indicates that the proposed architecture reduces data movement by 48.25% on average and reduces bank conflicts by up to 65.55%.

Index Terms

PIMS: a lightweight processing-in-memory accelerator for stencil computations
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Special purpose systems
2. Hardware
  1. Emerging technologies
    1. Memory and dense storage

Published In

MEMSYS '19: Proceedings of the International Symposium on Memory Systems

September 2019

517 pages

ISBN:9781450372060

DOI:10.1145/3357526

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

MEMSYS '19

MEMSYS '19: The International Symposium on Memory Systems

September 30 - October 3, 2019

Washington, District of Columbia, USA

Cited By

Singh G, Khodamoradi A, Denolf K, Lo J, Gomez-Luna J, Melber J, Bisca A, Corporaal H, Mutlu O, Gallivan K, Nikolopoulos D, Beivide R, Gallopoulos E. (2023). SPARTA: Spatial Acceleration for Efficient and Scalable Horizontal Diffusion Weather Stencil Computation. Proceedings of the 37th International Conference on Supercomputing, 463-476. Online publication date: 21-Jun-2023.
https://dl.acm.org/doi/10.1145/3577193.3593719

Mambu K, Charles H, Kooli M. (2023). Dedicated Instruction Set for Pattern-Based Data Transfers: An Experimental Validation on Systems Containing In-Memory Computing Units. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 42:11, 3757-3767. Online publication date: Nov-2023.
https://doi.org/10.1109/TCAD.2023.3258346

Denzler A, Oliveira G, Hajinazar N, Bera R, Singh G, Gómez-Luna J, Mutlu O. (2023). Casper: Accelerating Stencil Computations Using Near-Cache Processing. IEEE Access 11, 22136-22154. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3252002

Choe J, Crotty A, Moreshet T, Herlihy M, Bahar R, Agrawal K, Lee I. (2022). HybriDS. Proceedings of the 34th ACM Symposium on Parallelism in Algorithms and Architectures, 321-332. Online publication date: 11-Jul-2022.
https://dl.acm.org/doi/10.1145/3490148.3538591

Park J, Kim B, Yun S, Lee E, Rhu M, Ahn J. (2021). TRiM: Enhancing Processor-Memory Interfaces with Scalable Tensor Reduction in Memory. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 268-281. Online publication date: 18-Oct-2021.
https://dl.acm.org/doi/10.1145/3466752.3480080

Oliveira G, Gomez-Luna J, Orosa L, Ghose S, Vijaykumar N, Fernandez I, Sadrosadati M, Mutlu O. (2021). DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access 9, 134457-134502. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3110993
