Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures

Published: 25 October 2013

Abstract

Current heterogeneous chip-multiprocessors (CMPs) integrate a GPU architecture on a die. However, the heterogeneity of this architecture inevitably exerts different pressures on shared resource management due to differing characteristics of CPU and GPU cores. We consider how to efficiently share on-chip resources between cores within the heterogeneous system, in particular the on-chip network. Heterogeneous architectures use an on-chip interconnection network to access shared resources such as last-level cache tiles and memory controllers, and this type of on-chip network will have a significant impact on performance.
In this article, we propose a feedback-directed virtual channel partitioning (VCP) mechanism for on-chip routers to effectively share network bandwidth between CPU and GPU cores in a heterogeneous architecture. VCP dedicates a few virtual channels to CPU and GPU applications with separate injection queues. The proposed mechanism balances on-chip network bandwidth for applications running on CPU and GPU cores by adaptively choosing the best partitioning configuration. As a result, our mechanism improves system throughput by 15% over the baseline across 39 heterogeneous workloads.
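
The abstract does not spell out how the best partitioning configuration is chosen, so the following is only a minimal sketch of one plausible feedback loop, not the authors' mechanism. It assumes every virtual channel of a port is assigned to either CPU or GPU traffic, that each candidate split is sampled once, and that a single throughput-like score is fed back per candidate. The names `Partition`, `candidate_partitions`, `choose_partition`, and `toy_throughput`, as well as all numeric parameters, are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Partition:
    """One candidate split of a router port's virtual channels (VCs)."""
    cpu_vcs: int  # VCs dedicated to CPU traffic (assumed model)
    gpu_vcs: int  # VCs dedicated to GPU traffic (assumed model)


def candidate_partitions(total_vcs: int, min_each: int = 1) -> List[Partition]:
    """Enumerate every split that leaves each class at least min_each VCs."""
    return [
        Partition(cpu, total_vcs - cpu)
        for cpu in range(min_each, total_vcs - min_each + 1)
    ]


def choose_partition(
    candidates: List[Partition],
    measure: Callable[[Partition], float],
) -> Partition:
    """Sample each candidate once and keep the one with the best feedback."""
    return max(candidates, key=measure)


def toy_throughput(p: Partition, cpu_demand: float = 1.0,
                   gpu_demand: float = 5.0, per_vc: float = 2.0) -> float:
    """Hypothetical feedback signal: traffic served under a per-VC capacity cap."""
    cpu_served = min(cpu_demand, p.cpu_vcs * per_vc)
    gpu_served = min(gpu_demand, p.gpu_vcs * per_vc)
    return cpu_served + gpu_served


best = choose_partition(candidate_partitions(total_vcs=4), toy_throughput)
print(best)
```

With these made-up demands, the bandwidth-hungry GPU class ends up with three of the four channels. In a real router the score would come from measured performance during a sampling interval rather than from a closed-form function.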

      Published In

      ACM Transactions on Design Automation of Electronic Systems, Volume 18, Issue 4
      Special Section on Networks on Chip: Architecture, Tools, and Methodologies
      October 2013
      380 pages
      ISSN:1084-4309
      EISSN:1557-7309
      DOI:10.1145/2541012
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 25 October 2013
      Accepted: 01 July 2013
      Revised: 01 June 2013
      Received: 01 January 2013
      Published in TODAES Volume 18, Issue 4

      Author Tags

      1. Heterogeneous architecture
      2. on-chip network
      3. quality-of-service

      Qualifiers

      • Research-article
      • Research
      • Refereed

      Cited By

      • (2024) A Survey on Heterogeneous CPU–GPU Architectures and Simulators. Concurrency and Computation: Practice and Experience 37:1. DOI: 10.1002/cpe.8318. Online publication date: 30-Oct-2024
      • (2023) ReDeSIGN: Reuse of Debug Structures for Improvement in Performance Gain of NoC Based MPSoCs. IEEE Transactions on Emerging Topics in Computing 11:2, 432-447. DOI: 10.1109/TETC.2022.3203611. Online publication date: 1-Apr-2023
      • (2023) DPBC-VCP: A Network-On-Chip Prioritization Mechanism Combined with VCP for CPU-GPU Heterogeneous Systems. 2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS), 1927-1934. DOI: 10.1109/ICPADS60453.2023.00264. Online publication date: 17-Dec-2023
      • (2023) A Task-Based Routing Algorithm for Network-on-Chip in Heterogeneous CPU-GPU Architectures. 2023 IEEE International Conference on High Performance Computing & Communications, Data Science & Systems, Smart City & Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), 758-763. DOI: 10.1109/HPCC-DSS-SmartCity-DependSys60770.2023.00110. Online publication date: 17-Dec-2023
      • (2021) ALPHA: A Learning-Enabled High-Performance Network-on-Chip Router Design for Heterogeneous Manycore Architectures. IEEE Transactions on Sustainable Computing 6:2, 274-288. DOI: 10.1109/TSUSC.2020.2981340. Online publication date: 1-Apr-2021
      • (2021) Adapt-NoC: A Flexible Network-on-Chip Design for Heterogeneous Manycore Architectures. 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 723-735. DOI: 10.1109/HPCA51647.2021.00066. Online publication date: Feb-2021
      • (2020) Denial of Service in CPU-GPU Heterogeneous Architectures. 2020 IEEE High Performance Extreme Computing Conference (HPEC), 1-5. DOI: 10.1109/HPEC43674.2020.9286228. Online publication date: 22-Sep-2020
      • (2020) NoC²: An Efficient Interfacing Approach for Heavily-Communicating NoC-Based Systems. IEEE Access 8, 185992-186011. DOI: 10.1109/ACCESS.2020.3030606. Online publication date: 2020
      • (2019) Improving Parallelism of Breadth First Search (BFS) Algorithm for Accelerated Performance on GPUs. 2019 IEEE High Performance Extreme Computing Conference (HPEC), 1-7. DOI: 10.1109/HPEC.2019.8916551. Online publication date: Sep-2019
      • (2019) Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture. 2019 IEEE High Performance Extreme Computing Conference (HPEC), 1-6. DOI: 10.1109/HPEC.2019.8916239. Online publication date: Sep-2019
