
MAHASIM: Machine-Learning Hardware Acceleration Using a Software-Defined Intelligent Memory System

Published in: Journal of Signal Processing Systems

Abstract

As computation in machine-learning applications grows alongside the size of datasets, the energy and performance costs of data movement come to dominate those of compute. This issue is more pronounced in embedded systems with limited resources and energy. Although near-data processing (NDP) has been pursued as an architectural solution, comparatively little attention has been paid to scaling NDP for larger embedded machine-learning applications (e.g., speech and motion processing). We propose machine-learning hardware acceleration using a software-defined intelligent memory system (Mahasim). Mahasim is a scalable NDP-based memory system in which application performance scales with the size of data. The building blocks of Mahasim are programmable memory slices, supported by data partitioning, compute-aware memory allocation, and an independent in-memory execution model. For recurrent neural networks, Mahasim achieves up to 537.95 GFLOPS/W energy efficiency and a 3.9x speedup when the system grows from 2 to 256 memory slices, indicating that Mahasim favors larger problems.
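
To make the data partitioning and independent in-memory execution model concrete, the sketch below emulates a Mahasim-style recurrent step in NumPy: a layer's weight matrices are split row-wise across memory slices, each slice computes its partial matrix-vector products independently, and the partials are concatenated into the next hidden state. The slice count, function names, and the NumPy emulation itself are illustrative assumptions, not the paper's hardware implementation.

    # Hypothetical emulation of Mahasim-style partitioning; the slice count,
    # function names, and NumPy model are illustrative assumptions only.
    import numpy as np

    def partition_rows(W, num_slices):
        """Split a weight matrix row-wise so each memory slice owns a contiguous block."""
        return np.array_split(W, num_slices, axis=0)

    def slice_matvec(W_slice, x):
        """Each slice independently computes its partial product (in-memory execution)."""
        return W_slice @ x

    def rnn_step(W_parts, U_parts, x_t, h_prev, num_slices):
        """One step h_t = tanh(W x_t + U h_prev), assembled from per-slice partials."""
        partials = [slice_matvec(W_parts[i], x_t) + slice_matvec(U_parts[i], h_prev)
                    for i in range(num_slices)]  # would run concurrently across slices
        return np.tanh(np.concatenate(partials))

    # Example: a 1024-unit recurrent layer spread over 8 memory slices.
    num_slices, hidden, inp = 8, 1024, 512
    rng = np.random.default_rng(0)
    W = rng.standard_normal((hidden, inp)).astype(np.float32)
    U = rng.standard_normal((hidden, hidden)).astype(np.float32)
    W_parts, U_parts = partition_rows(W, num_slices), partition_rows(U, num_slices)
    h = np.zeros(hidden, dtype=np.float32)
    x = rng.standard_normal(inp).astype(np.float32)
    h = rnn_step(W_parts, U_parts, x, h, num_slices)
    print(h.shape)  # (1024,)

Because each slice only reads the rows it owns and the recurrent state it is handed, the partial products need no cross-slice communication until the final concatenation, which is why performance can scale with the number of slices.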



Notes

  1. Mahasim is a binary star in the constellation of Auriga.


Author information


Corresponding author

Correspondence to Bahar Asgari.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Asgari, B., Mukhopadhyay, S. & Yalamanchili, S. MAHASIM: Machine-Learning Hardware Acceleration Using a Software-Defined Intelligent Memory System. J Sign Process Syst 93, 659–675 (2021). https://doi.org/10.1007/s11265-019-01505-1


  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s11265-019-01505-1
