
Enabling Near-Data Accelerators Adoption Through Investigation of Datapath Solutions

International Journal of Parallel Programming

Abstract

Processing-in-Memory (PIM), or Near-Data Accelerators (NDAs), have recently been revisited to mitigate the memory and power walls, driven mainly by the maturity of 3D-stacking manufacturing technology and by the growing demand for bandwidth and parallel data access in emerging, processing-hungry applications. However, since these designs are naturally decoupled from the main processor, at least three open issues must be tackled before PIM can be adopted: how to offload instructions from the host to the NDAs, given that many units can be placed throughout the memory; how to keep caches coherent between the host and the NDAs; and how to handle the internal communication among NDA units, which must exchange data to be fully exploited. In this work, we present an efficient design that addresses these challenges. Based on hybrid host-accelerator code, which provides fine-grained control, our design allows transparent offloading of NDA instructions directly from the host processor. Moreover, it proposes a data coherence protocol comprising an inclusion-policy-agnostic cache coherence mechanism that transparently shares data between the host processor and the NDA units, and a protocol for communication between different NDA units. The proposed mechanism allows full exploitation of the evaluated state-of-the-art design, achieving a speedup of up to 14.6× over an AVX architecture on the PolyBench suite, while spending, on average, 82% of the total time on processing and only 18% on the cache coherence and communication protocols.
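For illustration only, the sketch below shows what a hybrid host/NDA code path of the kind the abstract describes might look like in C: the host first writes back the operands so the in-memory units see up-to-date data, then issues the offloaded operation. The nda_flush_range and nda_vadd calls are hypothetical placeholders (stubbed with plain host code so the example compiles and runs); they are not the interface proposed in the paper.

```c
/*
 * Conceptual sketch only: illustrates a hybrid host/NDA code path
 * (transparent offloading plus a coherence step). nda_flush_range()
 * and nda_vadd() are hypothetical placeholders, stubbed with host
 * code so the example runs; they are not the paper's interface.
 */
#include <stdio.h>
#include <stddef.h>

/* Placeholder for a cache write-back: NDA units read DRAM directly,
 * so dirty host cache lines covering the operands must be written back
 * before offloading. A real system would issue clflush/clwb or rely on
 * a hardware coherence mechanism such as the one the paper proposes. */
static void nda_flush_range(const void *addr, size_t bytes)
{
    (void)addr;
    (void)bytes; /* no-op stub */
}

/* Placeholder for an offloaded vector addition. In a real design this
 * would be lowered to NDA instructions executed by functional units
 * inside the 3D-stacked memory; a plain loop stands in for it here. */
static void nda_vadd(double *dst, const double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Hybrid host/NDA kernel: the host prepares coherence, then offloads. */
static void vadd_offload(double *dst, const double *a, const double *b, size_t n)
{
    nda_flush_range(a, n * sizeof *a);   /* make host writes visible to the NDA */
    nda_flush_range(b, n * sizeof *b);
    nda_vadd(dst, a, b, n);              /* computation happens near the data */
}

int main(void)
{
    double a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, dst[4];
    vadd_offload(dst, a, b, 4);
    printf("%.0f %.0f %.0f %.0f\n", dst[0], dst[1], dst[2], dst[3]);
    return 0;
}
```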




Author information

Correspondence to Paulo C. Santos.


This study was financed in part by the CAPES (Finance Code 001), CNPq, FAPERGS, and Serrapilheira (Serra-1709-16621).


Cite this article

Santos, P.C., de Lima, J.P.C., de Moura, R.F. et al.: Enabling Near-Data Accelerators Adoption Through Investigation of Datapath Solutions. Int J Parallel Prog 49, 237–252 (2021). https://doi.org/10.1007/s10766-020-00674-y
