
Customized Network-on-Chip for Message Reduction

  • Conference paper
Algorithms and Architectures for Parallel Processing (ICA3PP 2014)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 8630)


Abstract

This paper proposes a network-on-chip (NoC) design customized for message reduction. It enhances common routers with a special Reduce Processing Unit (RPU) that completes reduce computations hop by hop and adaptively learns the transmission paths of reduction messages. Specifically, for a reduction over a small data set, the data is transmitted directly through the NoC, so the enhanced routers along the transmission path can complete the reduction in place, which both speeds up processing and coalesces messages. An adaptive method for the deterministic routing algorithm is also introduced, enabling these routers to learn transmission paths accurately and further improve processing efficiency. We present the detailed micro-architecture and evaluate its performance, power consumption, and chip area. Test results show that the design improves reduce/all-reduce performance by 2.67 to 11.76 times, while the power and chip-area overheads both remain limited.
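The core idea of hop-by-hop in-network reduction can be illustrated with a small simulation. This is a minimal sketch, not the paper's actual micro-architecture: `hop_by_hop_reduce` and its parameters are hypothetical names, and the model assumes each RPU-enhanced router folds the incoming partial result into its local core's contribution before forwarding, so a single coalesced message travels each hop instead of one message per source.

```python
from operator import add

def hop_by_hop_reduce(path_contributions, op=add, init=0):
    """Simulate in-network reduction along one deterministic NoC route.

    path_contributions: local partial values injected by the cores
    attached to successive routers on the path. At each hop, the
    router's RPU combines the arriving partial with the local value,
    so the message reaching the destination already holds the result.
    """
    partial = init
    for local in path_contributions:
        partial = op(partial, local)  # RPU combines in place at this hop
    return partial

# Four cores along one route each contribute a value; the destination
# receives a single message carrying the full sum.
print(hop_by_hop_reduce([3, 1, 4, 1]))  # 9
```

Because the combine step is just an associative operator applied at each router, the same sketch also covers other reduce operations, e.g. `hop_by_hop_reduce(values, op=max, init=float('-inf'))` for a maximum reduction.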





Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Wang, H., Lu, S., Zhang, Y., Yang, G., Zheng, W. (2014). Customized Network-on-Chip for Message Reduction. In: Sun, Xh., et al. Algorithms and Architectures for Parallel Processing. ICA3PP 2014. Lecture Notes in Computer Science, vol 8630. Springer, Cham. https://doi.org/10.1007/978-3-319-11197-1_41


  • DOI: https://doi.org/10.1007/978-3-319-11197-1_41

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11196-4

  • Online ISBN: 978-3-319-11197-1

  • eBook Packages: Computer Science (R0)
