Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

Fan, Dong-Rui; Yuan, Nan; Zhang, Jun-Chao; Zhou, Yong-Bin; Lin, Wei; Song, Feng-Long; Ye, Xiao-Chun; Huang, He; Yu, Lei; Long, Guo-Ping; Zhang, Hao; Liu, Lei

doi:10.1007/s11390-009-9295-3

Regular Paper
Published: 06 November 2009

Volume 24, pages 1061–1073, (2009)

Journal of Computer Science and Technology

Abstract

Moore’s law will continue to give computer architects more transistors for the foreseeable future; the challenge is how to use them to deliver both efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to address this challenge. On the hardware side, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which together enable highly efficient use of on-chip resources. On the software side, Godson-T provides a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This hardware/software co-design methodology brings high-end computing within reach of mainstream programmers. Experimental evaluations on a cycle-accurate simulator of Godson-T show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.
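To make the phrase "Pthreads-like programming model" concrete, the following is a minimal sketch of the fork, barrier, and join style that such a model implies. It is not Godson-T code: this page does not show the Godson-T runtime API, so the example uses Python's standard `threading` module, whose `Thread` and `Barrier` primitives mirror thread creation, joining, and barrier synchronization in POSIX threads. The function name `parallel_sum` and the strided work split are assumptions chosen for illustration only.

```python
import threading


def parallel_sum(data, num_threads):
    """Sum `data` using `num_threads` workers (hypothetical example)."""
    partial = [0] * num_threads   # one private slot per thread
    result = [0]                  # written only by thread 0 after the barrier
    barrier = threading.Barrier(num_threads)

    def worker(tid):
        # Phase 1: each thread reduces its own strided slice of the input.
        partial[tid] = sum(data[tid::num_threads])
        # Barrier: no thread proceeds until every partial sum is written.
        barrier.wait()
        # Phase 2: one thread combines the partial sums.
        if tid == 0:
            result[0] = sum(partial)

    # Fork: create and start one thread per worker.
    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    # Join: wait for every worker to finish before reading the result.
    for t in threads:
        t.join()
    return result[0]


if __name__ == "__main__":
    print(parallel_sum(list(range(1, 101)), 4))  # prints 5050
```

The barrier is the step that hardware-supported synchronization would be expected to accelerate on a many-core chip; here it is a software barrier and says nothing about Godson-T's actual cost or mechanism.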

Author information

Authors and Affiliations

Key Laboratory of Computer Systems and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
Dong-Rui Fan (Member, CCF, IEEE), Nan Yuan, Jun-Chao Zhang (Member, CCF, ACM), Yong-Bin Zhou, Wei Lin, Feng-Long Song, Xiao-Chun Ye, He Huang, Lei Yu, Guo-Ping Long, Hao Zhang & Lei Liu

Corresponding author

Correspondence to Dong-Rui Fan.

Additional information

Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321600, the National High-Tech Research and Development 863 Program of China under Grant No. 2009AA01Z103, the National Natural Science Foundation of China under Grant No. 60736012, the National Science Fund for Distinguished Young Scholars under Grant No. 60925009, and the Beijing Natural Science Foundation under Grant No. 4092044.

About this article

Cite this article

Fan, DR., Yuan, N., Zhang, JC. et al. Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions. J. Comput. Sci. Technol. 24, 1061–1073 (2009). https://doi.org/10.1007/s11390-009-9295-3

Received: 13 March 2009
Revised: 28 September 2009
Published: 06 November 2009
Issue Date: November 2009
DOI: https://doi.org/10.1007/s11390-009-9295-3
