
MAHASIM: Machine-Learning Hardware Acceleration Using a Software-Defined Intelligent Memory System

Published in: Journal of Signal Processing Systems

Abstract

As computation in machine-learning applications grows alongside the size of datasets, the energy and performance costs of data movement come to dominate those of compute. This issue is more pronounced in embedded systems with limited resources and energy. Although near-data processing (NDP) has been pursued as an architectural solution, comparatively little attention has been paid to scaling NDP for larger embedded machine-learning applications (e.g., speech and motion processing). We propose machine-learning hardware acceleration using a software-defined intelligent memory system (Mahasim). Mahasim is a scalable NDP-based memory system in which application performance scales with the size of data. The building blocks of Mahasim are programmable memory slices, supported by data partitioning, compute-aware memory allocation, and an independent in-memory execution model. For recurrent neural networks, Mahasim achieves up to 537.95 GFLOPS/W energy efficiency and a 3.9x speedup when the system grows from 2 to 256 memory slices, indicating that Mahasim favors larger problems.
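
To make the data partitioning and independent in-memory execution model concrete, the sketch below emulates a Mahasim-style recurrent step in NumPy: a layer's weight matrices are split row-wise across memory slices, each slice computes its partial matrix-vector products independently, and the partials are concatenated into the next hidden state. The slice count, function names, and the NumPy emulation itself are illustrative assumptions, not the paper's hardware implementation.

    # Hypothetical emulation of Mahasim-style partitioning; the slice count,
    # function names, and NumPy model are illustrative assumptions only.
    import numpy as np

    def partition_rows(W, num_slices):
        """Split a weight matrix row-wise so each memory slice owns a contiguous block."""
        return np.array_split(W, num_slices, axis=0)

    def slice_matvec(W_slice, x):
        """Each slice independently computes its partial product (in-memory execution)."""
        return W_slice @ x

    def rnn_step(W_parts, U_parts, x_t, h_prev, num_slices):
        """One step h_t = tanh(W x_t + U h_prev), assembled from per-slice partials."""
        partials = [slice_matvec(W_parts[i], x_t) + slice_matvec(U_parts[i], h_prev)
                    for i in range(num_slices)]  # would run concurrently across slices
        return np.tanh(np.concatenate(partials))

    # Example: a 1024-unit recurrent layer spread over 8 memory slices.
    num_slices, hidden, inp = 8, 1024, 512
    rng = np.random.default_rng(0)
    W = rng.standard_normal((hidden, inp)).astype(np.float32)
    U = rng.standard_normal((hidden, hidden)).astype(np.float32)
    W_parts, U_parts = partition_rows(W, num_slices), partition_rows(U, num_slices)
    h = np.zeros(hidden, dtype=np.float32)
    x = rng.standard_normal(inp).astype(np.float32)
    h = rnn_step(W_parts, U_parts, x, h, num_slices)
    print(h.shape)  # (1024,)

Because each slice only reads the rows it owns and the recurrent state it is handed, the partial products need no cross-slice communication until the final concatenation, which is why performance can scale with the number of slices.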



Notes

  1. Mahasim is a binary star in the constellation of Auriga.


Author information


Corresponding author

Correspondence to Bahar Asgari.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Asgari, B., Mukhopadhyay, S. & Yalamanchili, S. MAHASIM: Machine-Learning Hardware Acceleration Using a Software-Defined Intelligent Memory System. J Sign Process Syst 93, 659–675 (2021). https://doi.org/10.1007/s11265-019-01505-1


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11265-019-01505-1
