Abstract
Moore's law is expected to keep supplying computer architects with more transistors for the foreseeable future, and parallelism is the key to continued performance scaling in modern microprocessors. This paper systematically presents the achievements of our research project on parallel architecture, which is supported by the National Basic Research 973 Program of China. We summarize the innovative approaches and techniques used to solve significant problems in parallel architecture design, including architecture-level optimization, compiler- and language-supported technologies, reliability, power-performance efficient design, test and verification challenges, and platform building. Two prototype chips, the multi-heavy-core Godson-3 and the many-light-core Godson-T, are described to demonstrate highly scalable and reconfigurable parallel architecture designs. We also present some of our achievements that have appeared in ISCA, MICRO, ISSCC, HPCA, PLDI, PACT, IJCAI, Hot Chips, DATE, IEEE Trans. VLSI, IEEE Micro, IEEE Trans. Computers, and other venues.
References
Hu W, Wang J, Gao X, Chen Y, Liu Q, Li G. Godson-3: A scalable multi-core RISC processor with x86 emulation support. IEEE Micro, 2009, 29(2): 17–29.
Fan D R, Yuan N, Zhang J C et al. Godson-T: An efficient many-core architecture for parallel program executions. Journal of Computer Science and Technology, 2009, 24(6): 1061–1073.
Lv H, Cheng Y, Bai L, Chen M, Fan D, Sun N. P-GAS: Parallelizing a cycle-accurate event-driven many-core processor simulator using parallel discrete event simulation. In Proc. Workshop on Principles of Advanced and Distributed Simulation, Atlanta, USA, May 17–19, 2010, pp.1-8.
Tang D, Bao Y, Hu W, Chen M. DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance. In Proc. Int. Conf. High-Performance Computer Architecture, Bangalore, India, Jan. 9–14, 2010, pp.1-12.
Long G, Franklin D, Biswas S, Ortiz P, Oberg J, Fan D, Chong F T. Minimal multi-threading: Finding and removing redundant instructions in multi-threaded processors. In Proc. IEEE/ACM Int. Symp. Microarchitecture, Atlanta, USA, Dec. 4–8, 2010, pp.337-348.
Chen Y, Hu W, Chen T, Wu R. LReplay: A pending period based deterministic replay scheme. In Proc. Int. Symp. Computer Architecture, Saint-Malo, France, Jun. 19–23, 2010, pp.187-197.
Su M, Chen Y, Gao X. A general method to make multi-clock system deterministic. In Proc. Conf. Design, Automation and Test in Europe, Dresden, Germany, Mar. 8–12, 2010, pp.1480-1485.
Guo Q, Chen T, Chen Y, Zhou Z H, Hu W, Xu Z. Effective and efficient microprocessor design space exploration using unlabeled design configurations. In Proc. Int. Joint Conf. Artificial Intelligence, Spain, 2011. (To appear)
Xu D, Wu C, Yew P C. On mitigating memory bandwidth contention through bandwidth-aware scheduling. In Proc. Int. Conf. Parallel Architectures and Compilation Techniques, Vienna, Austria, Sept. 11–15, 2010, pp.237-247.
Chen L, Liu L, Tang S, Huang L, Jing Z, Xu S, Zhang D, Shou B. Unified parallel C for GPU clusters: Language extensions and compiler implementation. In Proc. the 23rd International Workshop on Languages and Compilers for Parallel Computing, Houston, USA, Oct. 7–9, 2010, pp.151-165.
Wang L, Cui H, Duan Y, Lu F, Feng X, Yew P C. An adaptive task creation strategy for work-stealing scheduling. In Proc. Int. Conf. Code Generation and Optimization, Toronto, Canada, Apr. 24–28, 2010, pp.266-277.
Liu L, Chen L, Wu C Y, Feng X B. Global tiling for communication minimal parallelization on distributed memory systems. In Proc. Int. Euro-Par Conf. Parallel Processing, Klagenfurt, Austria, Aug. 26–29, 2008, pp.382-391.
Chen Y, Huang Y, Eeckhout L, Fursin G, Peng L, Temam O, Wu C. Evaluating iterative optimization across 1000 data sets. In Proc. Conf. Programming Language Design and Implementation, Toronto, Canada, Jun. 5–10, 2010, pp.448-459.
Yu T, Xue J, Huo W, Feng X, Zhang Z. Level by level: Making flow- and context-sensitive pointer analysis scalable for millions of lines of code. In Proc. Int. Conf. Code Generation and Optimization, Toronto, Canada, Apr. 24–28, 2010, pp.218-229.
Wang Z, Wu C, Yew P C. On improving heap memory layout by dynamic pool allocation. In Proc. Int. Conf. Code Generation and Optimization, Toronto, Canada, Apr. 24–28, 2010, pp.92-100.
Li J, Wu C, Hsu W C. An evaluation of misaligned data access handling mechanisms in dynamic binary translation systems. In Proc. Int. Conf. Code Generation and Optimization, Seattle, USA, Mar. 22–25, 2009, pp.180-189.
Lv F, Wang L, Feng X, Li Z, Zhang Z. Exploiting idle register classes for fast spill destination. In Proc. Int. Conf. Supercomputing, Island of Kos, Greece, Jun. 7–12, 2008, pp.319-326.
Zhang L, Han Y, Xu Q, Li X, Li H. On topology reconfiguration for defect-tolerant NoC-based homogeneous manycore systems. IEEE Trans. VLSI Systems, 2009, 17(9): 1173–1186.
Yan G, Liang X, Han Y, Li X. Leveraging the core-level complementary effects of PVT variations to reduce timing emergencies in multi-core processors. In Proc. Int. Symp. Computer Architecture, Saint-Malo, France, Jun. 19–23, 2010, pp.485-496.
Pan S, Hu Y, Li X. IVF: Characterizing the vulnerability of microprocessor structures to intermittent faults. In Proc. Conf. Design, Automation and Test in Europe, Dresden, Germany, Mar. 8–12, 2010, pp.238-243.
Hu W, Wang R, Chen Y, Fan B, Zhong S, Gao X, Qi Z, Yang X. Godson-3B: A 1 GHz 40 W 8-Core 128 GFlops processor in 65 nm CMOS. In Proc. Int. Solid-State Circuits Conference, 2011. (To appear)
Zhang M, Li H, Li X. Path delay test generation toward activation of worst case coupling effects. IEEE Transactions on Very Large Scale Integration Systems, 2010, 18(12): 1–14.
Han Y, Hu Y, Li X, Li H, Chandra A. Embedded test decompressor to reduce the required channels and vector memory of tester for complex processor circuit. IEEE Transactions on Very Large Scale Integration Systems, 2007, 15(5): 531–540.
Wang D, Hu Y, Li H, Li X. The design-for-testability features and test implementation of a giga hertz general purpose microprocessor. Journal of Computer Science and Technology, 2008, 23(6): 1037–1046.
Chen Y, Lv Y, Hu W, Chen T, Shen H, Wang P, Pan H. Fast complete memory consistency verification. In Proc. Int. Symp. High-Performance Computer Architecture, Raleigh, USA, Feb. 14–18, 2009, pp.381-392.
Hu W, Chen Y, Chen T, Qian C, Li L. Linear time memory consistency verification. IEEE Transactions on Computers, 2011. (Accepted)
Li L, Chen T, Chen Y, Li L, Qian C, Hu W. Brief announcement: Program regularization in verifying memory consistency. In Proc. Symp. Parallelism in Algorithms and Architectures, San Jose, USA, Jun. 4–6, 2011. (To appear)
Guo Q, Chen T, Shen H, Chen Y, Wu Y, Hu W. Empirical design bugs prediction for verification. In Proc. Conf. Design, Automation and Test in Europe, Grenoble, France, Mar. 14–18, 2011, pp.1-6.
Zhang T, Lv T, Li X. An abstraction-guided simulation approach using Markov models for microprocessor verification. In Proc. Conf. Design, Automation and Test in Europe, Dresden, Germany, Mar. 8–12, 2010, pp.484-489.
Hu W, Wang J, Gao X, Chen Y. Micro-architecture of Godson-3 multi-core processor. In Proc. Symp. High Performance Chips, Stanford University, USA, Aug. 24–26, 2008.
Gao X, Chen Y J, Wang H D et al. System architecture of Godson-3 multi-core processors. Journal of Computer Science and Technology, 2010, 25(2): 181–191.
Hu W, Chen Y. GS464V: A high-performance low-power XPU with 512-bit vector extension. In Proc. Symp. High Performance Chips, Stanford University, USA, Aug. 22–24, 2010.
Additional information
This work is in part supported by the National Basic Research 973 Program of China under Grant Nos. 2011CB302500, 2005CB321600, and the National Natural Science Foundation of China under Grant No.60921002.
About this article
Cite this article
Fan, DR., Li, XW. & Li, GJ. New Methodologies for Parallel Architecture. J. Comput. Sci. Technol. 26, 578–587 (2011). https://doi.org/10.1007/s11390-011-1158-z