An Efficient Network-on-Chip Router for Dataflow Architecture

Shen, Xiao-Wei; Ye, Xiao-Chun; Tan, Xu; Wang, Da; Zhang, Lunkai; Li, Wen-Ming; Zhang, Zhi-Min; Fan, Dong-Rui; Sun, Ning-Hui

doi:10.1007/s11390-017-1703-5

An Efficient Network-on-Chip Router for Dataflow Architecture

Regular paper
Published: 11 January 2017

Volume 32, pages 11–25, (2017)
Cite this article

Journal of Computer Science and Technology Aims and scope Submit manuscript

Xiao-Wei Shen^1,2,
Xiao-Chun Ye¹,
Xu Tan^1,2,
Da Wang¹,
Lunkai Zhang³,
Wen-Ming Li¹,
Zhi-Min Zhang¹,
Dong-Rui Fan¹ &
…
Ning-Hui Sun¹

333 Accesses
15 Citations
9 Altmetric
Explore all metrics

Abstract

Dataflow architecture has shown its advantages in many high-performance computing cases. In dataflow computing, a large amount of data are frequently transferred among processing elements through the network-on-chip (NoC). Thus the router design has a significant impact on the performance of dataflow architecture. Common routers are designed for control-flow multi-core architecture and we find they are not suitable for dataflow architecture. In this work, we analyze and extract the features of data transfers in NoCs of dataflow architecture: multiple destinations, high injection rate, and performance sensitive to delay. Based on the three features, we propose a novel and efficient NoC router for dataflow architecture. The proposed router supports multi-destination; thus it can transfer data with multiple destinations in a single transfer. Moreover, the router adopts output buffer to maximize throughput and adopts non-flit packets to minimize transfer delay. Experimental results show that the proposed router can improve the performance of dataflow architecture by 3.6x over a state-of-the-art router.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Parallel Forwarding for Efficient Bandwidth Utilization in Networks-on-Chip

P-NoC: Performance Evaluation and Design Space Exploration of NoCs for Chip Multiprocessor Architecture Using FPGA

Article 19 June 2020

NoC-based hardware software co-design framework for dataflow thread management

Article Open access 11 May 2023

References

Chen T S, Du Z D, Sun N H, Wang J, Wu C Y, Chen Y J, Temam O. DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning. In Proc. the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2014, pp.269-284.
Liu D F, Chen T S, Liu S L, Zhou J H, Zhou S Y, Temam O, Feng X B, Zhou X H, Chen Y J. PuDianNao: A polyvalent machine learning accelerator. In Proc. the 20th International Conference on Architectural Support for Programming Languages and Operating Systems, Mar. 2014, pp.369-381.
Voitsechov D, Etsion Y. Single-graph multiple flows: Energy efficient design alternative for GPGPUs. In Proc. the 41st Int. Symp. Computer Architecture, Jun. 2014, pp.205-216.
Oriato D, Tilbury S, Marrocu M, Pusceddu G. Acceleration of a meteorological limited area model with dataflow engines. In Proc. the Symp. Application Accelerators in High Performance Computing, Jul. 2012, pp.129-132.
Pratas F, Oriato D, Pell O, Mata R A, Sousa L. Accelerating the computation of induced dipoles for molecular mechanics with dataflow engines. In Proc. the 21st Annual Int. Symp. Field-Programmable Custom Computing Machines, Apr. 2013, pp.177-180.
Fu H H, Gan L, Clapp R G, Ruan H B, Pell O, Mencer O, Flynn M, Huang X M, Yang G W. Scaling reverse time migration performance through reconfigurable dataflow engines. IEEE Micro, 2014, 34(1): 30-40.
Theobald K B. EARTH: An efficient architecture for running threads [Ph.D. Thesis]. McGill University, Montreal, Que., Canada, 1999.
Milutinovic V, Salom J, Trifunovic N, Giorgi R. Guide to Dataflow Supercomputing (1st edition). Springer International Publishing, 2015.
Sankaralingam K, Nagarajan R, McDonald R, Desikan R, Drolia S, Govindan M S, Gratz P, Gulati D, Hanson H, Kim C, Liu H M, Ranganathan N, Sethumadhavan S, Sharif S, Shivakumar P, Keckler S W, Burger D. Distributed microarchitectural protocols in the TRIPS prototype processor. In Proc. the 39th Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2006, pp.480-491.
Burger D, Keckler S W, McKinley K S, Dahlin M, John L K, Lin C, Moore C R, Burrill J, McDonald R G, Yoder W. Scaling to the end of silicon with EDGE architectures. Computer, 2004, 37(7): 44-55.
Swanson S, Schwerin A, Mercaldi M, Petersen A, Putnam A, Michelson K, Oskin M, Eggers S J. The WaveScalar architecture. ACM Transactions on Computer Systems, 2007, 25(2): Article No.4.
Roca A, Flich J, Silla F, Duato J. A latency-efficient router architecture for CMP systems. In Proc. the 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, Sept. 2010, pp.165-172.
Michelogiannakis G, Dally W J. Router designs for elastic buffer on-chip networks. In Proc. the Conference on High Performance Computing Networking, Storage and Analysis, Nov. 2009.
Chang Y Y, Huang Y S, Poremba M, Narayanan V K, Xie Y, King C T. TS-Router: On maximizing the quality-of-allocation in the on-chip network. In Proc. the 19th Int. Symp. High Performance Computer Architecture, Feb. 2013, pp.390-399.
Tran A T, Baas B M. Achieving high-performance on-chip networks with shared-buffer routers. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2014, 22(6): 1391-1403.
Poluri P, Louri A. An improved router design for reliable on-chip networks. In Proc. the 28th Int. Parallel and Distributed Processing Symp., May 2014, pp.283-292.
Ben-Itzhak Y, Cidon I, Kolodny A, Shabun M, Shmuel N. Heterogeneous NoC router architecture. IEEE Transactions on Parallel and Distributed Systems, 2015, 26(6): 2479-2492.
Zoni D, Flich J, Fornaciari W. CUTBUF: Buffer management and router design for traffic mixing in VNET-based NoCs. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(6): 1603-1616.
Singh W, Deb S. Energy efficient and congestion-aware router design for future NoCs. In Proc. the 29th Int. Conference on VLSI Design, Jan. 2016, pp.81-85.
Yan P Z, Jiang S X, Sridhar R. A high throughput router with a novel switch allocator for network on chip. In Proc. the 28th International System-on-Chip Conference, Sept. 2015, pp.160-163.
Xu Y, Zhao B, Zhang Y T, Yang J. Simple virtual channel allocation for high throughput and high frequency on-chip routers. In Proc. the 16th Int. Symp. High Performance Computer Architecture, Jan. 2010, pp.1-11.
Soteriou V, Ramanujam R S, Lin B, Peh L S. A high-throughput distributed shared-buffer NoC router. IEEE Computer Architecture Letters, 2009, 8(1): 21-24.
Gu L, Li M, Siegel J. An empirically tuned 2D and 3D FFT library on CUDA GPU. In Proc. the 24th ACM International Conference on Supercomputing, Jun. 2010, pp.305-314.
Zhang Y P, Mueller F. Autogeneration and autotuning of 3D stencil codes on homogeneous and heterogeneous GPU clusters. IEEE Transactions on Parallel and Distributed Systems, 2013, 24(3): 417-427.
Kuzak J, Tomov S, Dongarra J. Autotuning GEMM kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems, 2012, 23(11): 2045-2057.
Hesse R, Nicholls J, Jerger N E. Fine-grained bandwidth adaptivity in networks-on-chip using bidirectional channels. In Proc. the 6th IEEE/ACM Int. Symp. Networks-on-Chip, May 2012, pp.132-141.
Ye X C, Fan D R, Sun N H, Tang S B, Zhang M Z, Zhang H. SimICT: A fast and flexible framework for performance and power evaluation of large-scale architecture. In Proc. the Int. Symp. Low Power Electronics and Design, Sept. 2013, pp.273-278.
Li S, Ahn J H, Strong R D, Brockman J B, Tullsen D M, Jouppi N P. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proc. the 42nd Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2009, pp.469-480.
Solinas M, Badia R M, Bodin F, Cohen A, Evripidou P, Faraboschi P, Fenchner B, Gao G R, Garbade A, Girbal S, Goodman D, Khan B, Koliai S, Li F, Luján M, Morin L, Mendelson A, Navarro N, Pop A, Trancoso P, Ungerer T, Valero M, Weis S, Watson I, Zuckermann S, Giorgi R. The TERAFLUX project: Exploiting the dataflow paradigm in next generation teradevices. In Proc. the Euromicro Conference on Digital System Design, Sept. 2013, pp.272-279.
Carter N P, Agrawal A, Borkar S, Cledat R, David H, Dunning D, Fryman J, Ganev I, Golliver R A, Knauerhase R, Lethin R, Meister B, Mishra A K, Pinfold W R, Teller J, Torrellas J, Vasilache N, Venkatesh G, Xu J P. Runnemede: An architecture for ubiquitous high-performance computing. In Proc. the 19th Int. Symp. High Performance Computer Architecture, Feb. 2013, pp.198-209.
Wei L, Zhou L. An equilibrium partitioning method for multicast traffic in 3D NoC architecture. In Proc. the IFIP/IEEE International Conference on Very Large Scale Integration, Oct. 2015, pp.128-133.
Agrawal M, Chakrabarty K. Test-time optimization in NOC-based manycore SOCs using multicast routing. In Proc. the 32nd IEEE VLSI Test Symposium, Apr. 2014.
Kamali M, Petre L, Sere K, Daneshtalab M. Formal modeling of multicast communication in 3D NoCs. In Proc. the 14th Euromicro Conference on Digital System Design, Aug. 31-Sept. 2, 2011, pp.634-642.
Zhan J, Ouyang J, Ge F, Zhao J S, Xie Y. Hybrid drowsy SRAM and STT-RAM buffer designs for dark-silicon-aware NoC. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2016, 24(10): 3041-3054.
Zhan J, Ouyang J, Ge F, Zhao S, Xie Y. DimNoC: A dim silicon approach towards power-efficient on-chip network. In Proc. the 52nd ACM/EDAC/IEEE Design Automation Conference, Jun. 2015. J. Comput. Sci. & Technol., Jan. 2017, Vol.32, No.1
Zhang L K, Strukov D, Saadeldeen H, Fan D R, Zhang M Z, Franklin D. SpongeDirectory: Flexible sparse directories utilizing multi-level memristors. In Proc. the 23rd Int. Conf. Parallel Architectures and Compilation, Aug. 2014, pp.61-74.
Deng Z X, Zhang L K, Franklin D, Chong F T. Herniated hash tables: Exploiting multi-level phase change memory for in-place data expansion. In Proc. the Int. Symp. Memory Systems, Oct. 2015, pp.247-257.
Zhang M Z, Zhang L K, Jiang L, Liu Z Y, Chong F T. Balancing performance and lifetime of MLC PCM by using a regionretention monitor. In Proc. the 23rd Int. Symp. High Performance Computer Architecture, Feb. 2017. (to be appeared)
LeeH H S, Tyson G S, Farrens M K. Eager writeback-a technique for improving bandwidth utilization. In Proc. the 33rd Annual IEEE/ACM Int. Symp. Microarchitecture, Dec. 2000, pp.11-21.
Zhang L K, Neely B, Franklin D, Strukov D, Xie Y, Chong F T. Mellow writes: Extending lifetime in resistive memories through selective slow write backs. In Proc. the 43rd Int. Symp. Computer Architecture, Jun. 2016, pp.519-531.

Download references

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Xiao-Wei Shen, Xiao-Chun Ye, Xu Tan, Da Wang, Wen-Ming Li, Zhi-Min Zhang, Dong-Rui Fan & Ning-Hui Sun
School of Computer and Control Engineering, University of Chinese Academy of Sciences, Beijing, 100049, China
Xiao-Wei Shen & Xu Tan
Department of Computer Science, The University of Chicago, Chicago, IL, 60637, USA
Lunkai Zhang

Authors

Xiao-Wei Shen
View author publications
You can also search for this author in PubMed Google Scholar
Xiao-Chun Ye
View author publications
You can also search for this author in PubMed Google Scholar
Xu Tan
View author publications
You can also search for this author in PubMed Google Scholar
Da Wang
View author publications
You can also search for this author in PubMed Google Scholar
Lunkai Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Wen-Ming Li
View author publications
You can also search for this author in PubMed Google Scholar
Zhi-Min Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Dong-Rui Fan
View author publications
You can also search for this author in PubMed Google Scholar
Ning-Hui Sun
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Xiao-Chun Ye.

Additional information

Special Section on Dataflow Architecture

This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2015AA01A301, the National Natural Science Foundation of China under Grant No. 61332009, the National HeGaoJi Project of China under Grant No. 2013ZX0102-8001-001-001, and the Beijing Municipal Science and Technology Commission under Grant Nos. Z15010101009 and Z151100003615006.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Shen, XW., Ye, XC., Tan, X. et al. An Efficient Network-on-Chip Router for Dataflow Architecture. J. Comput. Sci. Technol. 32, 11–25 (2017). https://doi.org/10.1007/s11390-017-1703-5

Download citation

Received: 02 September 2016
Revised: 13 December 2016
Published: 11 January 2017
Issue Date: January 2017
DOI: https://doi.org/10.1007/s11390-017-1703-5

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

An Efficient Network-on-Chip Router for Dataflow Architecture

Abstract

Access this article

Similar content being viewed by others

Parallel Forwarding for Efficient Bandwidth Utilization in Networks-on-Chip

P-NoC: Performance Evaluation and Design Space Exploration of NoCs for Chip Multiprocessor Architecture Using FPGA

NoC-based hardware software co-design framework for dataflow thread management

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

An Efficient Network-on-Chip Router for Dataflow Architecture

Abstract

Access this article

Similar content being viewed by others

Parallel Forwarding for Efficient Bandwidth Utilization in Networks-on-Chip

P-NoC: Performance Evaluation and Design Space Exploration of NoCs for Chip Multiprocessor Architecture Using FPGA

NoC-based hardware software co-design framework for dataflow thread management

References

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation