Abstract
Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications, including machine learning. A Network-on-Interposer (NoI) enables the integration of multiple chiplets in a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures limit computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data-flow pattern, and optimizes inter-chiplet data exchange to extract high performance when multiple types of convolutional neural network (CNN) inference tasks run concurrently. We demonstrate that Floret reduces latency and energy by up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple simultaneous CNN tasks. By exploiting the data-flow awareness of CNN inference tasks, Floret achieves high performance and significant energy savings at a much lower fabrication cost.
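To illustrate why an SFC-based placement suits pipelined CNN inference, the sketch below maps a chain of chiplets along a Hilbert curve (one example SFC; the paper's actual Floret curve construction may differ) and compares the worst-case hop distance between consecutive pipeline stages against a naive row-major placement. All function names here are hypothetical illustrations, not the paper's implementation.

```python
# Hypothetical sketch: place a chain of CNN-layer chiplets along a Hilbert
# space-filling curve so consecutive pipeline stages land on physically
# adjacent interposer sites (single-hop NoI links).

def hilbert_d2xy(n, d):
    """Convert Hilbert-curve index d to (x, y) on an n x n grid (n a power of 2)."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:  # rotate the quadrant to keep the curve continuous
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def max_hop(order):
    """Worst-case Manhattan (hop) distance between consecutive chiplets."""
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(order, order[1:]))

n = 8  # 8 x 8 = 64 chiplet sites on the interposer
hilbert = [hilbert_d2xy(n, d) for d in range(n * n)]
row_major = [(d // n, d % n) for d in range(n * n)]

print(max_hop(hilbert))    # 1: every consecutive stage is one NoI hop away
print(max_hop(row_major))  # 8: row wrap-around forces an n-hop detour
```

The Hilbert curve guarantees that consecutive indices are always Manhattan-adjacent while also preserving 2D locality, which is why SFC-ordered mappings keep inter-chiplet traffic for feed-forward CNN data flows on short links.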
Index Terms
- Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks