Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Moore’s law will grant computer architects ever more transistors for the foreseeable future, and the challenge is how to use them to deliver efficient performance and flexible programmability. We propose a many-core architecture, Godson-T, to address this challenge. On the hardware side, Godson-T features a region-based cache coherence protocol, asynchronous data transfer agents, and hardware-supported synchronization mechanisms, which together allow on-chip resources to be used efficiently. On the software side, Godson-T features a highly efficient runtime system, a Pthreads-like programming model, and versatile parallel libraries, which make this many-core design flexibly programmable. This cooperative hardware/software design methodology bridges the gap between high-end computing and mainstream programmers. Experimental evaluations are conducted on a cycle-accurate simulator of Godson-T. The results show that the proposed architecture has good scalability, fast synchronization, high computational efficiency, and flexible programmability.

Author information

Corresponding author

Correspondence to Dong-Rui Fan.

Additional information

Supported by the National Basic Research 973 Program of China under Grant No. 2005CB321600, the National High-Tech Research and Development 863 Program of China under Grant No. 2009AA01Z103, the National Natural Science Foundation of China under Grant No. 60736012, the National Science Fund for Distinguished Young Scholars under Grant No. 60925009, and the Beijing Natural Science Foundation under Grant No. 4092044.

About this article

Cite this article

Fan, DR., Yuan, N., Zhang, JC. et al. Godson-T: An Efficient Many-Core Architecture for Parallel Program Executions. J. Comput. Sci. Technol. 24, 1061–1073 (2009). https://doi.org/10.1007/s11390-009-9295-3

  • DOI: https://doi.org/10.1007/s11390-009-9295-3
