ABSTRACT
The proliferation of smart IoT devices has given rise to tinyML, which deploys deep neural networks on resource-constrained systems. These systems benefit from custom hardware optimized for low silicon area and high energy efficiency, matching tinyML's characteristic small model sizes (50-500 KB) and low target frequencies (1-100 MHz). We introduce a novel custom latch array integrated with a compute memory fabric that achieves 8 μm²/B density and 11 fJ/B read energy, surpassing synthesized implementations by 7x in density and 5x in read energy. This advancement enables dataflows that require no activation buffers, reducing memory overheads. By balancing systolic versus combinational scaling in a 2D compute array and using bit-serial instead of bit-parallel compute, we reduce area by 4.8x and multiply-accumulate energy by 2.3x. To study the advantages of the proposed architecture and its system-level performance, we architect tinyForge, a design space exploration framework that obtains Pareto-optimal architectures and compares their trade-offs against traditional approaches. tinyForge comprises (1) a parameterized template for memory hierarchies and compute fabrics, (2) power, area, and latency estimations for hardware components, (3) a dataflow optimizer for efficient workload scheduling, and (4) a genetic algorithm that performs multi-objective optimization to find Pareto-optimal architectures. We evaluate the proposed architecture on all MLPerf Tiny Inference Benchmark workloads and the BERT-Tiny transformer model, demonstrating its effectiveness in lowering energy per inference while containing the introduced area overheads. We show the importance of storing all weights on-chip, which reduces energy per inference by 7.5x compared with using off-chip memories.
Finally, we demonstrate the potential of the custom latch arrays and bit-serial digital compute arrays to reduce energy per inference by up to 1.8x, latency per inference by up to 2.2x, and silicon area by up to 3.7x.
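The bit-serial versus bit-parallel trade-off mentioned above can be illustrated with a minimal sketch (illustrative only, not the paper's implementation): a bit-serial processing element consumes one weight bit per cycle and accumulates shifted partial products, so the per-element multiplier hardware shrinks to roughly an AND gate, a shifter, and an adder, at the cost of one cycle per bit of precision.

```python
def mac_bit_parallel(acc, w, x):
    """One-cycle MAC: needs a full hardware multiplier."""
    return acc + w * x

def mac_bit_serial(acc, w, x, bits=8):
    """bits-cycle MAC: each loop iteration models one clock cycle."""
    for b in range(bits):
        if (w >> b) & 1:       # serialize the weight LSB-first
            acc += x << b      # add the shifted partial product
    return acc

# Both orderings compute the same result: 13 * 7 = 91
assert mac_bit_serial(0, 13, 7) == mac_bit_parallel(0, 13, 7) == 91
```

For tinyML's low target frequencies (1-100 MHz), the extra cycles of the serial form are often an acceptable price for the area and energy savings the abstract reports.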
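The multi-objective optimization step of a design space exploration like tinyForge's can be sketched as follows (a hedged illustration under assumed objectives, not the paper's code): given evaluated design points scored on energy, area, and latency, the Pareto front keeps only the points no other point beats on every objective.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Filter (energy, area, latency) tuples down to the non-dominated set."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical design points: (energy/inference, area, latency/inference)
designs = [(1.0, 4.0, 10.0), (0.8, 5.0, 12.0), (1.2, 4.5, 11.0)]
# The third point is dominated by the first, so the front keeps the other two.
```

A genetic algorithm such as NSGA-II, which the paper's framework category typically builds on, repeatedly applies this kind of non-dominated filtering while mutating and recombining architecture parameters.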
Received 10 August 2023; revised 13 November 2023; accepted 27 February 2024
TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays