DOI: 10.1145/3243176.3243188
PACT Conference Proceedings — Research Article

In-DRAM near-data approximate acceleration for GPUs

Published: 01 November 2018 Publication History

Abstract

GPUs are bottlenecked by off-chip communication bandwidth and its energy cost, which makes near-data acceleration particularly attractive for them. Integrating accelerators within DRAM can mitigate these bottlenecks and additionally expose the accelerators to DRAM's higher internal bandwidth. Such integration is challenging, however: it requires low-overhead accelerators that still support a diverse set of applications. To enable the integration, this work leverages the approximability of GPU applications and utilizes the neural transformation, which converts diverse regions of code mainly into Multiply-Accumulate (MAC) operations. Furthermore, to preserve the SIMT execution model of GPUs, we propose a novel approximate MAC unit with a significantly smaller area overhead than a precise unit. Building on these components, this work introduces AxRam, a novel DRAM architecture that integrates several approximate MAC units. AxRam offers this integration without increasing the memory column pitch or modifying the internal architecture of the DRAM banks. Our results with 10 GPGPU benchmarks show that, on average, AxRam provides a 2.6× speedup and 13.3× energy reduction over a baseline GPU with no acceleration. These benefits are achieved while reducing overall DRAM system power by 26%, at an area cost of merely 2.1%.
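The neural transformation the abstract refers to replaces an approximable region of code with a small multilayer perceptron, so that the region's work reduces to chains of multiply-accumulate operations plus activations. A minimal sketch of that reduction, assuming illustrative function names and placeholder weights (the paper learns the weights offline; nothing here is the paper's implementation):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def mac_layer(inputs, weights, biases):
    """One fully connected layer: each output neuron is a dot product
    evaluated as a chain of multiply-accumulate (MAC) operations."""
    outputs = []
    for w_row, b in zip(weights, biases):
        acc = b
        for x, w in zip(inputs, w_row):
            acc += x * w  # the MAC primitive that in-DRAM units would accelerate
        outputs.append(sigmoid(acc))
    return outputs

def neural_region(x, y):
    """A two-layer stand-in for a transformed code region: its entire
    computation is MAC chains and activations (weights are placeholders)."""
    hidden = mac_layer([x, y], [[0.5, -0.25], [0.1, 0.3]], [0.0, 0.1])
    return mac_layer(hidden, [[1.0, -1.0]], [0.0])[0]
```

The point of the sketch is structural: once a region is in this form, the hardware only needs to supply fast MAC units and an activation lookup, which is what makes a low-overhead in-DRAM integration plausible.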



Published In

PACT '18: Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques
November 2018, 494 pages
ISBN: 9781450359863
DOI: 10.1145/3243176

In-Cooperation

  • IFIP WG 10.3
  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States



Acceptance Rates

Overall acceptance rate: 121 of 471 submissions (26%)


Cited By

  • (2024) 3DL-PIM: A Look-Up Table Oriented Programmable Processing in Memory Architecture Based on the 3-D Stacked Memory for Data-Intensive Applications. IEEE Transactions on Emerging Topics in Computing 12(1), 60-72. DOI: 10.1109/TETC.2023.3293140
  • (2024) NeRF-PIM: PIM Hardware-Software Co-Design of Neural Rendering Networks. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(11), 3900-3912. DOI: 10.1109/TCAD.2024.3443712
  • (2024) RecPIM: Efficient In-Memory Processing for Personalized Recommendation Inference Using Near-Bank Architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 43(10), 2854-2867. DOI: 10.1109/TCAD.2024.3386117
  • (2023) MPU: Memory-centric SIMT Processor via In-DRAM Near-bank Computing. ACM Transactions on Architecture and Code Optimization 20(3), 1-26. DOI: 10.1145/3603113
  • (2023) A Survey of Memory-Centric Energy Efficient Computer Architecture. IEEE Transactions on Parallel and Distributed Systems 34(10), 2657-2670. DOI: 10.1109/TPDS.2023.3297595
  • (2023) Architecture-Aware Currying. 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT), 250-264. DOI: 10.1109/PACT58117.2023.00029
  • (2023) SecDDR: Enabling Low-Cost Secure Memories by Protecting the DDR Interface. 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 14-27. DOI: 10.1109/DSN58367.2023.00016
  • (2022) Near LLC versus near main memory processing. 14th Workshop on General Purpose Processing Using GPU, 1-6. DOI: 10.1145/3530390.3532726
  • (2022) INSPIRE. 49th Annual International Symposium on Computer Architecture (ISCA), 102-115. DOI: 10.1145/3470496.3527433
  • (2022) Irrelevant Data Traffic in Modern Low Power GPU Architectures. IEEE International Conference on Networking, Architecture and Storage (NAS), 1-7. DOI: 10.1109/NAS55553.2022.9925321
