Abstract
The prevailing trend in the design of chip multiprocessors (CMPs) has been to replicate single-core processors. As a result, CMPs typically define an asynchronous computational model, require heavily locality-aware memory allocation, and incur high intercommunication overheads. These properties make parallel programming challenging and error-prone. We introduce our new dual-mode MultiBunched/Threaded Architecture with Chaining (MBTAC) processor core, the main building block of the REPLICA CMP. It supports writing general-purpose parallel programs by natively realizing key concepts of parallel execution. These include cost-efficient synchronization at the machine-instruction level and a uniform shared global memory that simplifies memory allocation for data structures and data movement. MBTAC uses a low-overhead thread-context switching mechanism, and it has a functional unit organization designed for parallel computing that exploits inter-thread instruction-level parallelism and highly efficient multioperations. To evaluate the proposal, we implemented three MBTAC constellations featuring up to 2048 parallel threads on FPGA and compared them with DLX and Intel’s Core i7 processors. The results point toward high performance in communication-intensive problems, simplified parallel programmability, and a regular, implementation-friendly structure.
Notes
This paper is an extended version of the paper [9], with a more detailed description of the MBTAC processor, the inclusion of 16-core MBTAC constellations in the FPGA prototype description and evaluation section, and the associated measurements and results. It also extends the theoretical work [7], which used a weakly implementable development version of the ECLIPSE architecture [5] for the tests.
References
Ahmad M, Hijaz F, Shi Q, Khan O (2015) Crono: a benchmark suite for multithreaded graph algorithms executing on futuristic multicores. In: Workload Characterization (IISWC), 2015 IEEE International Symposium on, pp 44–55
Dietzfelbinger M, Karlin A, Mehlhorn K, Meyer auf der Heide F, Rohnert H, Tarjan RE (1994) Dynamic perfect hashing: upper and lower bounds. SIAM J Comput 23(4):738–761
Engelmann C (1992) Simulationen von PRAM’s, Master’s thesis. Universität des Saarlandes, FB Informatik
Forsell M (1994) Are multiport memories physically feasible? SIGARCH Comput Archit News 22(4):47–54
Forsell M (2002) A scalable high-performance computing solution for networks on chips. IEEE Micro 22(5):46–55
Forsell M (2004) E—a language for thread-level parallel programming on synchronous shared memory NOCs. WSEAS Trans Comput 3(3):807–812
Forsell M (2011) A PRAM-NUMA model of computation for addressing low-TLP workloads. Int J Netw Comput 1(1):21–35
Forsell M (2011) Performance comparison of some shared memory organizations for 2D mesh-like NOCs. Microprocess Microsyst 35(2):274–284
Forsell M, Roivainen J, Leppänen V (2014) Prototyping the MBTAC processor for the REPLICA CMP. In: Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, IPDPSW ’14. IEEE Computer Society, Washington, pp 709–716
Hennessy J, Patterson D (1990) Computer architecture: a quantitative approach. Morgan Kaufmann Publishers Inc., Palo Alto
HiPEAC (2013) The HiPEAC vision for advanced computing in Horizon 2020. http://www.hipeac.net/system/files/hp-roadmap-2013.pdf
Intel (2006) Research at Intel From a Few Cores to Many: A Tera-scale Computing Research Overview. White Paper
Jaja J (1992) Introduction to parallel algorithms. Addison-Wesley, Reading
Keller J, Kessler C, Träff J (2001) Practical PRAM programming. Wiley, New York
Krommydas K, Scogland TRW, Feng W-C (2013) On the programmability and performance of heterogeneous platforms. In: Proceedings of the 2013 International Conference on Parallel and Distributed Systems, ICPADS ’13. IEEE Computer Society, Washington, pp 224–231
Lenoski D, Laudon J, Gharachorloo K, Weber W-D, Gupta A, Hennessy J, Horowitz M, Lam MS (1992) The Stanford Dash multiprocessor. Computer 25(3):63–79
Leppänen V (1996) Studies on the realization of PRAM. Turku Centre for Computer Science, University of Turku, Turku, Finland
Merritt R (2011) Panel: Wall ahead in multicore programming (Multicore Expo). EE Times
Park JJK, Park Y, Mahlke S (2015) Chimera: collaborative preemption for multitasking on a shared GPU. In: Proceedings of ASPLOS
Patterson D (2010) The trouble with multi-core. IEEE Spectr 47(7):28–32
Ranade AG (1991) How to emulate shared memory. J Comput Syst Sci 42(3):307–326
Semiconductor Industry Association (2015) International Technology Roadmap for Semiconductors. http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs/
Sun Microsystems (2005) Throughput computing: changing the economics and ecology of the data center with innovative SPARC Technology. White paper
Vishkin U (2008) Toward realizing a PRAM-on-a-chip vision. In: Proceedings of the 2007 Conference on Parallel Processing, Euro-Par’07. Springer, Berlin, pp 5–6
Vishkin U (2011) Using simple abstraction to reinvent computing for parallelism. Commun ACM 54(1):75–85
Acknowledgements
This work was funded by VTT, Grant 289773 of the Academy of Finland, and the Celtic-Plus Project CONVINcE.
Cite this article
Forsell, M., Roivainen, J. & Leppänen, V. REPLICA MBTAC: multithreaded dual-mode processor. J Supercomput 74, 1911–1933 (2018). https://doi.org/10.1007/s11227-017-2199-z
DOI: https://doi.org/10.1007/s11227-017-2199-z