DOI: 10.1145/3620666.3651328 · ASPLOS Conference Proceedings
research-article
Open Access

TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays

Published: 27 April 2024

ABSTRACT

The proliferation of smart IoT devices has given rise to tinyML, which deploys deep neural networks on resource-constrained systems. These systems benefit from custom hardware optimized for low silicon area and high energy efficiency, given tinyML's characteristically small model sizes (50-500 KB) and low target frequencies (1-100 MHz). We introduce a novel custom latch array integrated with a compute memory fabric, achieving 8 μm2/B density and 11 fJ/B read energy, surpassing synthesized implementations by 7x in density and 5x in read energy. This advancement enables dataflows that do not require activation buffers, reducing memory overheads. By optimizing systolic vs. combinational scaling in a 2D compute array and using bit-serial instead of bit-parallel compute, we achieve a 4.8x reduction in area and a 2.3x reduction in multiply-accumulate energy. To study the advantages of the proposed architecture and its system-level performance, we architect tinyForge, a design space exploration framework that obtains Pareto-optimal architectures and compares their trade-offs against traditional approaches. tinyForge comprises (1) a parameterized template for memory hierarchies and compute fabrics, (2) power, area, and latency estimates for hardware components, (3) a dataflow optimizer for efficient workload scheduling, and (4) a genetic algorithm that performs multi-objective optimization to find Pareto-optimal architectures. We evaluate the proposed architecture on all MLPerf Tiny Inference Benchmark workloads and on the BERT-Tiny transformer model, demonstrating its effectiveness in lowering energy per inference while addressing the introduced area overheads. We show the importance of storing all weights on-chip, which reduces energy per inference by 7.5x compared with using off-chip memories. Finally, we demonstrate that custom latch arrays and bit-serial digital compute arrays can reduce energy per inference by up to 1.8x, latency per inference by up to 2.2x, and silicon area by up to 3.7x.
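The abstract contrasts bit-serial with bit-parallel multiply-accumulate. A bit-serial processing element consumes one bit of every operand per cycle, so a 1-bit multiply reduces to an AND gate and the accumulator is shift-weighted by bit position. The following is a minimal functional sketch of that idea, not the paper's latch-array hardware; the function name and 8-bit unsigned operand assumption are illustrative.

```python
def bit_serial_mac(weights, activations, bits=8):
    """Compute dot(weights, activations) the way a bit-serial PE would:
    one activation bit-plane per cycle, partial sums weighted by shifts.
    Assumes unsigned activations that fit in `bits` bits."""
    acc = 0
    for b in range(bits):  # one "cycle" per bit position, LSB first
        # Extract bit-plane b of every activation.
        plane = [(a >> b) & 1 for a in activations]
        # A 1-bit multiply is just a select (AND): sum the chosen weights.
        partial = sum(w for w, bit in zip(weights, plane) if bit)
        # Weight the partial sum by its bit position before accumulating.
        acc += partial << b
    return acc
```

The area saving in hardware comes from this same reduction: each multiplier lane needs only 1-bit logic plus a shifter, traded against `bits` cycles of latency per MAC.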
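tinyForge's genetic algorithm searches for Pareto-optimal architectures over objectives such as energy, area, and latency. The core of any such multi-objective loop is nondominated filtering: a candidate survives only if no other candidate is at least as good in every objective and strictly better in one. This is a generic sketch of that filter, with all objectives minimized; it is not tinyForge's actual optimizer (which the paper states is NSGA-II-style over a parameterized hardware template).

```python
def pareto_front(points):
    """Return the nondominated subset of cost tuples, e.g.
    (energy, area, latency), with every objective minimized.
    A point is dominated if some other point is <= in all
    objectives and < in at least one."""
    front = []
    for p in points:
        dominated = any(
            all(q[i] <= p[i] for i in range(len(p)))
            and any(q[i] < p[i] for i in range(len(p)))
            for q in points
        )
        if not dominated:
            front.append(p)
    return front
```

In a full design space exploration this filter would sit inside the generational loop: evaluate each candidate architecture's (energy, area, latency) via the cost models, keep the front, then mutate and recombine survivors.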

Received 10 August 2023; revised 13 November 2023; accepted 27 February 2024
