New Methodologies for Parallel Architecture

Journal of Computer Science and Technology

Abstract

Moore's law is expected to keep granting computer architects ever more transistors for the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. This paper systematically presents the achievements of our research project on parallel architecture, which is supported by the National Basic Research 973 Program of China. We summarize the innovative approaches and techniques developed to solve significant problems in parallel architecture design, including architecture-level optimization, compiler and language-supported technologies, reliability, power-performance efficient design, test and verification challenges, and platform building. Two prototype chips, the multi-heavy-core Godson-3 and the many-light-core Godson-T, are described to demonstrate highly scalable and reconfigurable parallel architecture designs. We also present some of our achievements that have appeared in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, etc.



Author information

Corresponding author

Correspondence to Dong-Rui Fan.

Additional information

This work is supported in part by the National Basic Research 973 Program of China under Grant Nos. 2011CB302500 and 2005CB321600, and by the National Natural Science Foundation of China under Grant No. 60921002.

About this article

Cite this article

Fan, DR., Li, XW. & Li, GJ. New Methodologies for Parallel Architecture. J. Comput. Sci. Technol. 26, 578–587 (2011). https://doi.org/10.1007/s11390-011-1158-z

  • DOI: https://doi.org/10.1007/s11390-011-1158-z
