Florets for Chiplets: Data Flow-aware High-Performance and Energy-efficient Network-on-Interposer for CNN Inference Tasks

Published: 09 September 2023

Abstract

Recent advances in 2.5D chiplet platforms provide a new avenue for compact scale-out implementations of emerging compute- and data-intensive applications, including machine learning. A Network-on-Interposer (NoI) enables the integration of multiple chiplets on a 2.5D system. While these manycore platforms can deliver high computational throughput and energy efficiency by running multiple specialized tasks concurrently, conventional NoI architectures limit computational throughput due to their inherent multi-hop topologies. In this paper, we propose Floret, a novel NoI architecture based on space-filling curves (SFCs). The Floret architecture leverages suitable task mapping, exploits the data flow pattern, and optimizes the inter-chiplet data exchange to extract high performance for multiple types of convolutional neural network (CNN) inference tasks running concurrently. We demonstrate that the Floret architecture reduces latency and energy by up to 58% and 64%, respectively, compared to state-of-the-art NoI architectures while executing datacenter-scale workloads involving multiple concurrent CNN tasks. Floret achieves high performance and significant energy savings at much lower fabrication cost by exploiting the data-flow awareness of CNN inference tasks.
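The space-filling-curve idea underlying Floret can be illustrated independently of the paper's specific topology: an SFC assigns each chiplet in a 2D grid a 1D index such that consecutive indices map to physically adjacent positions, so pipelined CNN layers placed on consecutive chiplets stay one hop apart. The sketch below uses the classic Hilbert curve as a stand-in (it is not the Floret layout itself, whose construction the paper describes); `d2xy` is the standard distance-to-coordinate conversion.

```python
def d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two; 0 <= d < n*n.
    Standard iterative conversion: at each scale s, extract the quadrant
    bits (rx, ry), rotate the partial coordinate into that quadrant's
    orientation, then offset into the quadrant.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                 # rotate/reflect the sub-square
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


if __name__ == "__main__":
    # Order an 8x8 grid of chiplets along the curve: every pair of
    # consecutive indices is one hop apart (Manhattan distance 1),
    # which is the adjacency property an SFC-based NoI exploits.
    path = [d2xy(8, d) for d in range(64)]
    assert all(abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
               for a, b in zip(path, path[1:]))
```

This one-hop property between consecutive indices is what makes SFC orderings attractive for mapping layer-pipelined workloads onto a chiplet grid, compared with row-major orderings, which break adjacency at every row boundary.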


Published in

ACM Transactions on Embedded Computing Systems, Volume 22, Issue 5s (Special Issue ESWEEK 2023), October 2023, 1394 pages.
ISSN: 1539-9087; EISSN: 1558-3465
DOI: 10.1145/3614235
Editor: Tulika Mitra


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 23 March 2023
• Revised: 2 June 2023
• Accepted: 13 July 2023
• Published: 9 September 2023
