Abstract
Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications, including machine learning. A Network-on-Interposer (NoI) enables the integration of multiple chiplets in a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures limit computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data-flow pattern, and optimizes inter-chiplet data exchange to extract high performance when multiple types of convolutional neural network (CNN) inference tasks run concurrently. We demonstrate that Floret reduces latency and energy by up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple simultaneous CNN tasks. By exploiting the data-flow awareness of CNN inference tasks, Floret achieves high performance and significant energy savings at a much lower fabrication cost.
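To illustrate why an SFC-based placement suits pipelined CNN inference, the sketch below maps a chain of chiplets along a Hilbert curve (one example SFC; the paper's actual Floret curve construction may differ) and compares the worst-case hop distance between consecutive pipeline stages against a naive row-major placement. All function names here are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch: place a chain of CNN-layer chiplets along a Hilbert
# space-filling curve so consecutive pipeline stages land on physically
# adjacent interposer sites (single-hop NoI links).

def hilbert_d2xy(n, d):
    """Convert Hilbert-curve index d to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant to keep the curve continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def max_hop(order):
    """Worst-case Manhattan (hop) distance between consecutive chiplets."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(order, order[1:]))

n = 8  # 8 x 8 = 64 chiplet sites on the interposer
hilbert = [hilbert_d2xy(n, d) for d in range(n * n)]
row_major = [(d // n, d % n) for d in range(n * n)]

print(max_hop(hilbert))    # 1: every consecutive stage is one NoI hop away
print(max_hop(row_major))  # 8: row wrap-around forces an n-hop detour
```

The Hilbert curve guarantees that consecutive indices are always Manhattan-adjacent while also preserving 2D locality, which is why SFC-ordered mappings keep inter-chiplet traffic for feed-forward CNN data flows on short links.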
Index Terms
- Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks