ABSTRACT
The proliferation of smart IoT devices has given rise to tinyML, which deploys deep neural networks on resource-constrained systems. These systems benefit from custom hardware optimized for low silicon area and high energy efficiency, matching tinyML's characteristic small model sizes (50-500 KB) and low target frequencies (1-100 MHz). We introduce a novel custom latch array integrated with a compute memory fabric that achieves 8 μm²/B density and 11 fJ/B read energy, surpassing synthesized implementations by 7x in density and 5x in read energy. This advancement enables dataflows that require no activation buffers, reducing memory overheads. By balancing systolic versus combinational scaling in a 2D compute array and using bit-serial instead of bit-parallel compute, we reduce area by 4.8x and multiply-accumulate energy by 2.3x. To study the advantages of the proposed architecture and its system-level performance, we architect tinyForge, a design space exploration framework that obtains Pareto-optimal architectures and compares their trade-offs against traditional approaches. tinyForge comprises (1) a parameterized template for memory hierarchies and compute fabrics, (2) power, area, and latency estimations for hardware components, (3) a dataflow optimizer for efficient workload scheduling, and (4) a genetic algorithm that performs multi-objective optimization to find Pareto-optimal architectures. We evaluate the proposed architecture on all MLPerf Tiny Inference Benchmark workloads and the BERT-Tiny transformer model, demonstrating its effectiveness in lowering energy per inference while containing the introduced area overheads. We show the importance of storing all weights on-chip, which reduces energy per inference by 7.5x compared with using off-chip memories.
Finally, we demonstrate the potential of the custom latch arrays and bit-serial digital compute arrays to reduce energy per inference by up to 1.8x, latency per inference by up to 2.2x, and silicon area by up to 3.7x.
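The bit-serial versus bit-parallel trade-off mentioned above can be illustrated with a minimal sketch (illustrative only, not the paper's implementation): a bit-serial processing element consumes one weight bit per cycle and accumulates shifted partial products, so the per-element multiplier hardware shrinks to roughly an AND gate, a shifter, and an adder, at the cost of one cycle per bit of precision.

```python
def mac_bit_parallel(acc, w, x):
    """One-cycle MAC: needs a full hardware multiplier."""
    return acc + w * x

def mac_bit_serial(acc, w, x, bits=8):
    """bits-cycle MAC: each loop iteration models one clock cycle."""
    for b in range(bits):
        if (w >> b) & 1:       # serialize the weight LSB-first
            acc += x << b      # add the shifted partial product
    return acc

# Both orderings compute the same result: 13 * 7 = 91
assert mac_bit_serial(0, 13, 7) == mac_bit_parallel(0, 13, 7) == 91
```

For tinyML's low target frequencies (1-100 MHz), the extra cycles of the serial form are often an acceptable price for the area and energy savings the abstract reports.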
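The multi-objective optimization step of a design space exploration like tinyForge's can be sketched as follows (a hedged illustration under assumed objectives, not the paper's code): given evaluated design points scored on energy, area, and latency, the Pareto front keeps only the points no other point beats on every objective.

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(points):
    """Filter (energy, area, latency) tuples down to the non-dominated set."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical design points: (energy/inference, area, latency/inference)
designs = [(1.0, 4.0, 10.0), (0.8, 5.0, 12.0), (1.2, 4.5, 11.0)]
# The third point is dominated by the first, so the front keeps the other two.
```

A genetic algorithm such as NSGA-II, which the paper's framework category typically builds on, repeatedly applies this kind of non-dominated filtering while mutating and recombining architecture parameters.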
Received 10 August 2023; revised 13 November 2023; accepted 27 February 2024
TinyForge: A Design Space Exploration to Advance Energy and Silicon Area Trade-offs in tinyML Compute Architectures with Custom Latch Arrays