Centaur: a hybrid network-on-chip architecture utilizing micro-network fusion

Lee, Junghee; Nicopoulos, Chrysostomos; Lee, Hyung Gyu; Kim, Jongman

doi:10.1007/s10617-014-9131-z

Published: 04 March 2014

Volume 18, pages 121–139, (2014)
Design Automation for Embedded Systems

Junghee Lee¹,
Chrysostomos Nicopoulos²,
Hyung Gyu Lee³ &
Jongman Kim¹

Abstract

The escalating proliferation of multicore chips has accentuated the criticality of the on-chip network. Packet-based networks-on-chip (NoC) have emerged as the de facto interconnect of future chip multi-processors (CMP). On-chip traffic comprises a mixture of data and control messages from the cache coherence protocol. Given the latency-criticality of control messages, in this paper we aim to optimize their delivery times. Instead of treating the on-chip router as a monolithic component, we advocate the introduction of an ultra-low-latency ring-inspired (i.e., utilizing ring primitive building blocks) support micro-network that is optimized for control messages. This \(\upmu \)NoC is fused with a throughput-driven conventional NoC router to form a hybrid architecture, called Centaur, which maintains separate data paths and control logic for the two fused networks. Full-system simulation results from a 64-core CMP indicate that the proposed fused Centaur router improves overall system performance by up to 26%, as compared to a state-of-the-art router implementation. Furthermore, hardware synthesis results using commercial 65 nm libraries indicate that Centaur’s area and power overheads are 9% and 3%, respectively, as compared to a baseline router design. More importantly, the new design does not affect the router’s critical path.

Notes

In Greek mythology, Centaur was a hybrid creature that was part human and part horse. Much like this mythical creature, our proposed router architecture fuses two distinct networks into one entity.
Wind River Systems: http://www.windriver.com/

References

Abad P, Puente V, Gregorio JA (2013) LIGERO: a light but efficient router conceived for cache-coherent chip multiprocessors. ACM Trans Archit Code Optim 9(4):37:1–37:21.
Abad P, Puente V, Gregorio JA, Prieto P (2007) Rotary router: an efficient architecture for CMP interconnection networks. In: Proceedings of the 34th annual international symposium on computer architecture, ISCA ’07, pp 116–125.
Abousamra A, Melhem R, Jones A (2012) Deja vu switching for multiplane NoCs. In: Sixth IEEE/ACM international symposium on networks on chip (NoCS), pp 11–18.
Agarwal N, Krishna T, Peh LS, Jha N (2009) GARNET: A detailed on-chip network model inside a full-system simulator. In: IEEE international symposium on performance analysis of systems and software.
Agarwal N, Peh LS, Jha N (2009) In-network snoop ordering (INSO): snoopy coherence on unordered interconnects. In: Proceedings of the 15th international symposium on high-performance computer architecture, pp 67–78.
Anjan K, Pinkston T, Duato J (1996) Generalized theory for deadlock-free adaptive wormhole routing and its application to Disha Concurrent. In: Proceedings of IPPS ’96. The 10th international parallel processing symposium, pp 815–821.
Balfour J, Dally WJ (2006) Design tradeoffs for tiled CMP on-chip networks. In: Proceedings of the 20th annual international conference on supercomputing, pp 187–198.
Bienia C (2011) Benchmarking modern multiprocessors. Ph.D. Thesis, Princeton University.
Bolotin E, Guz Z, Cidon I, Ginosar R, Kolodny A (2007) The power of priority: NoC based distributed cache coherency. In: Proceedings of the first international symposium on networks-on-chip.
Bourduas S, Zilic Z (2007) A hybrid ring/mesh interconnect for network-on-chip using hierarchical rings for global routing. In: Proceedings of the first international symposium on networks-on-chip, pp 195–204.
Chuang JH, Chao WC (1994) Torus with slotted rings architecture for a cache-coherent multiprocessor. In: Proceedings of the 1994 international conference on parallel and distributed systems, pp 76–81.
Das R, Eachempati S, Mishra A, Narayanan V, Das C (2009) Design and evaluation of a hierarchical on-chip interconnect for next-generation CMPs. In: Proceedings of the 15th international symposium on high-performance computer architecture, pp 175–186.
Das R, Mutlu O, Moscibroda T, Das C (2009) Application-aware prioritization mechanisms for on-chip networks. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture, pp 280–291.
Duato J, Yalamanchili S, Ni L (2003) Interconnection networks. Morgan Kaufmann, San Francisco
Flores A, Aragon J, Acacio M (2010) Heterogeneous interconnects for energy-efficient message management in CMPs. IEEE Trans Comput 59(1):16–28
Gratz P, Kim C, McDonald R, Keckler S, Burger D (2006) Implementation and evaluation of on-chip network architectures. In: Proceedings of international conference on computer design.
Hayenga M, Jerger NE, Lipasti M (2009) SCARAB: a single cycle adaptive routing and bufferless network. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture.
Holliday M, Stumm M (1994) Performance evaluation of hierarchical ring-based shared memory multiprocessors. IEEE Trans Comput 43:52–67
Jerger NDE, Peh LS, Lipasti MH (2008) Circuit-switched coherence. In: Proceedings of the second ACM/IEEE international symposium on networks-on-chip, pp 193–202.
Kim J (2009) Low-cost router microarchitecture for on-chip networks. In: Proceedings of the 42nd annual IEEE/ACM international symposium on microarchitecture, pp 255–266.
Kim J, Nicopoulos C, Park D (2006) A gracefully degrading and energy-efficient modular router architecture for on-chip networks. SIGARCH Comput Archit News 34(2):4–15
Kumar A, Peh LS, Kundu P, Jha NK (2007) Express virtual channels: towards the ideal interconnection fabric. In: Proceedings of the 34th annual international symposium on computer architecture.
Kumar A, Kundu P, Singh A, Peh LS, Jha N (2007) A 4.6Tbits/s 3.6 GHz single-cycle NoC router with a novel switch allocator in 65 nm CMOS. In: Proceedings of the 25th international conference on computer design, pp 63–70.
Martin MMK, Sorin DJ, Beckmann BM, Marty MR, Xu M, Alameldeen AR, Moore KE, Hill MD, Wood DA (2005) Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset. SIGARCH Comput Archit News 33:2005
Matsutani H, Koibuchi M, Amano H, Yoshinaga T (2009) Prediction router: yet another low latency on-chip router architecture. In: Proceedings of the IEEE 15th international symposium on high performance computer architecture, pp 367–378.
Mullins R, West A, Moore S (2004) Low-latency virtual-channel routers for on-chip networks. In: Proceedings of the 31st annual international symposium on computer architecture, p 188.
Mullins R, West A, Moore S (2006) The design and implementation of a low-latency on-chip network. In: Proceedings of Asia and South Pacific conference on design automation, p 6.
Nicopoulos C, Park D, Kim J, Vijaykrishnan N, Yousif M, Das C (2006) ViChaR: a dynamic virtual channel regulator for network-on-chip routers. In: 39th annual IEEE/ACM international symposium on microarchitecture, pp 333–346.
Park C, Badeau R, Biro L, Chang J, Singh T, Vash J, Wang B, Wang T (2010) A 1.2 TB/s on-chip ring interconnect for 45nm 8-core enterprise Xeon processor. In: Proceedings of IEEE international solid-state circuits conference digest of technical papers, pp 180–181.
Peh LS, Dally WJ (2001) A delay model and speculative architecture for pipelined routers. In: Proceedings of the 7th international symposium on high-performance computer architecture, p 255.
Pinkston T (1999) Flexible and efficient routing based on progressive deadlock recovery. IEEE Trans Comput 48(7):649–669
Sibai F (2008) Adapting the hyper-ring interconnect for many-core processors. In: International symposium on parallel and distributed processing with applications, pp 649–654.
Singh A, Dally W, Towles B, Gupta A (2004) Globally adaptive load-balanced routing on tori. Comput Archit Lett 3(1):2
Song YH, Pinkston T (2003) A progressive approach to handling message-dependent deadlock in parallel computer systems. IEEE Trans Parallel Distrib Syst 14(3):259–275
Volos S, Seiculescu C, Grot B, Pour N, Falsafi B, De Micheli G (2012) CCNoC: specializing on-chip interconnects for energy efficiency in cache-coherent servers. In: Sixth IEEE/ACM international symposium on networks on chip (NoCS), pp 67–74.

Author information

Authors and Affiliations

Georgia Institute of Technology, Atlanta, GA, USA
Junghee Lee & Jongman Kim
University of Cyprus, Nicosia, Cyprus
Chrysostomos Nicopoulos
Daegu University, Gyeongsan, South Korea
Hyung Gyu Lee

Corresponding author

Correspondence to Junghee Lee.

Appendix I: Formal proof of protocol-level deadlock avoidance

As stated in Sect. 4.2, progressive recovery mechanisms resolve all types of deadlocks when the following two conditions are met [6, 31]: (1) the recovery network is free from deadlocks, and (2) in each and every deadlock situation, there exists at least one packet that is granted access to the recovery network. The following theorems prove that these two conditions are satisfied in the Centaur architecture and, thus, any protocol-level deadlocks are guaranteed to be broken by the proposed time-out mechanism.

Theorem 1

The deadlock freedom of the DNoC is not affected by the \(\upmu \)NoC.

Proof

Although the inter-router links are shared between the DNoC and the \(\upmu \)NoC, the latter has its own buffers (\(\upmu \)NoC Buffer and Intermediate Buffer). Therefore, the \(\upmu \)NoC does not affect the deadlock freedom of the DNoC, because the packets in the \(\upmu \)NoC do not block any packets in the DNoC. \(\square \)

Theorem 2

The time-out mechanism enables every packet in a deadlock situation to have the opportunity to access the recovery network.

Proof

When a protocol-level deadlock occurs (or is suspected), all the packets in the \(\upmu \)NoC buffers (\(\upmu \)NoC and Intermediate Buffers) are given the chance to escape from the deadlock. As mentioned in Sect. 4.2, even if a packet is not at the head of the buffer, it is also given the same opportunity. \(\square \)

Corollary 1

All dependencies among the various message classes involved in deadlock situations are broken by the time-out mechanism.

Proof

Dependencies among message classes can, indeed, be created in the \(\upmu \)NoC buffers, whereas there are no such dependencies in the DNoC. In the \(\upmu \)NoC, a packet may be blocked by a preceding packet that belongs to a different message class. When the time-out mechanism is triggered, packets are no longer blocked by preceding packets in the \(\upmu \)NoC, because they are forwarded to the DNoC, and the head packet does not block the following packet(s) in the same buffer. \(\square \)
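The behavior described in this proof can be illustrated with a minimal Python sketch. It is not the router's actual logic: the function name timeout_step, the stall counter, the timeout_cycles threshold, and the choice to divert every buffered packet at once are assumptions made purely for illustration. The sketch models one \(\upmu \)NoC buffer as a FIFO in which a blocked head packet stalls its followers until the time-out fires, at which point the packets are handed to the DNoC.

```python
from collections import deque

def timeout_step(buffer, head_can_advance, stalled_cycles, timeout_cycles=8):
    """Advance one cycle of a hypothetical uNoC buffer with a time-out.

    buffer           -- deque of packets in FIFO order (head at the left)
    head_can_advance -- True if the head packet can move on the uNoC this cycle
    stalled_cycles   -- cycles the head has been blocked so far
    timeout_cycles   -- assumed threshold; the real value is design-specific

    Returns (sent_on_unoc, diverted_to_dnoc, new_stalled_cycles).
    """
    if not buffer:
        return [], [], 0
    if head_can_advance:
        # Normal operation: only the head packet may leave the FIFO.
        return [buffer.popleft()], [], 0
    stalled_cycles += 1
    if stalled_cycles >= timeout_cycles:
        # Time-out: every buffered packet, not just the head, is handed to
        # the DNoC, so the head no longer blocks the packets behind it.
        diverted = list(buffer)
        buffer.clear()
        return [], diverted, 0
    return [], [], stalled_cycles
```

In this model, a request at the head of the buffer that cannot advance would otherwise block a response queued behind it; once the threshold is reached, both leave through the DNoC and the cross-class dependency disappears.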

One drawback of this mechanism is that the packet order might be reversed. To preserve packet order, control packets should leave each router in the order of arrival, regardless of which network (\(\upmu \)NoC or DNoC) they arrive through. Packets with the same input-output port mappings should be ordered. As a hardware implementation of packet ordering, we introduce a sequence numbering mechanism within the router. Note that this mechanism is accounted for in the area/power/timing evaluation of Sect. 5.3.

For every pair of input and output ports, two counters are maintained for each DNoC VC (message class) that carries control packets. The ‘Head’ counter tracks the order of arriving packets and the ‘Tail’ counter tracks the order of departing packets. When a control packet arrives at a router, it is stored in the \(\upmu \)NoC Buffer, the Intermediate Buffer, or a VC buffer within the DNoC. Irrespective of the buffer in which it is stored, the packet is assigned a sequence number equal to the current value of ‘Head’, and ‘Head’ is then incremented by one. The packet can leave only when its sequence number matches the ‘Tail’ counter. After the packet departs, ‘Tail’ is incremented by one.
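The Head/Tail bookkeeping can be sketched in a few lines of Python. This is an illustrative software model only, not the hardware implementation: the class name SequenceOrderer, its method names, the port labels, and the use of unbounded integers (a hardware counter would have a fixed width and wrap around) are all assumptions.

```python
from collections import defaultdict

class SequenceOrderer:
    """Hypothetical model of the per-router sequence numbering.

    One Head/Tail counter pair is kept per
    (input port, output port, message class) triple.
    """

    def __init__(self):
        self.head = defaultdict(int)  # next sequence number to assign on arrival
        self.tail = defaultdict(int)  # sequence number allowed to depart next

    def arrive(self, in_port, out_port, vc):
        """Assign a sequence number, regardless of which buffer stores the packet."""
        key = (in_port, out_port, vc)
        seq = self.head[key]
        self.head[key] += 1
        return seq

    def may_depart(self, in_port, out_port, vc, seq):
        """A packet may leave only when its number matches the Tail counter."""
        return seq == self.tail[(in_port, out_port, vc)]

    def depart(self, in_port, out_port, vc, seq):
        """Record a departure and advance the Tail counter."""
        key = (in_port, out_port, vc)
        if seq != self.tail[key]:
            raise ValueError("packet is not next in arrival order")
        self.tail[key] += 1


def demo():
    orderer = SequenceOrderer()
    flow = ("west", "east", 0)           # assumed port names and VC index
    first = orderer.arrive(*flow)        # e.g. stored in the uNoC Buffer
    second = orderer.arrive(*flow)       # e.g. stored in a DNoC VC buffer
    early = orderer.may_depart(*flow, second)  # second is ready first, must wait
    orderer.depart(*flow, first)
    late = orderer.may_depart(*flow, second)   # now it is next in order
    return first, second, early, late
```

Calling demo() shows that the second packet is held back until the first has departed, even if it was stored in a different buffer and became ready earlier. Because the counters are keyed per message class, a packet only ever waits for earlier packets of its own class.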

However, the sequence numbering introduces additional dependencies within a router, which could potentially cause deadlocks. The following theorem proves that no such deadlocks occur.

Theorem 3

Deadlock freedom is not affected by the additional dependencies created by the sequence numbering mechanism.

Proof

Corollary 1 proves that there are no dependencies among message classes when the time-out mechanism is triggered. The additional dependencies caused by the sequence numbering mechanism are only within the same message class (VC). Therefore, the additional dependencies do not cause any packet to be blocked by other message classes. In addition, since the dependencies are created based on the order of arrival (i.e., the dependencies are, essentially, ordered), they do not form any cycles. \(\square \)

About this article

Cite this article

Lee, J., Nicopoulos, C., Lee, H.G. et al. Centaur: a hybrid network-on-chip architecture utilizing micro-network fusion. Des Autom Embed Syst 18, 121–139 (2014). https://doi.org/10.1007/s10617-014-9131-z

Received: 30 June 2013
Accepted: 03 February 2014
Published: 04 March 2014
Issue Date: September 2014
DOI: https://doi.org/10.1007/s10617-014-9131-z
