Resource-bounded multicore emulation using Beefarm

https://doi.org/10.1016/j.micpro.2012.05.015

Abstract

In this article, we present the Beefarm infrastructure for FPGA-based multiprocessor emulation, a popular research topic of recent years in both the FPGA and computer architecture communities. We explain how we modify and extend a MIPS-based open-source soft core, discuss various design tradeoffs made to use the bounded on-chip resources efficiently, and demonstrate superior scalability compared to traditional software instruction set simulators through experimental results from running Software Transactional Memory (STM) benchmarks. Based on our experience, we comment on the pros and cons and the future trends of using hardware-based emulation for multicore research.

Introduction

This article reports on our experience of designing and building an innovative eight core cache-coherent shared-memory multiprocessor system on FPGA called Beefarm, which has been implemented on the BEE3 [1] infrastructure to help investigate support for Transactional Memory [2], [3], [4]. The primary reason for using an FPGA-based simulator is to achieve a significantly faster simulation speed for multicore architecture research compared to the performance of software instruction set simulators. A secondary reason is that a system that uses only the FPGA fabric to model a multicore processor may have a higher degree of fidelity since no functionality is implemented by a magical software routine. Another way to use FPGA-based emulation is to offload infrequent or slow running instructions and I/O operations to a software simulator but retain the core functionality in FPGA hardware [5]. In our work we model the entire multiprocessor system on FPGA logic, although commercial simulator accelerators like Palladium and automated simulator parallelization efforts also take advantage of reconfigurable technology [6].

Recent advances in multicore computer architecture research are being hindered by the inadequate performance of software-based instruction set simulators, which has led many researchers to consider FPGA-based emulation. Although sequential software-based simulators such as SimpleScalar, Simics or M5 are more expressive and mature, and it is relatively easy and fast to make changes to the system in such a high-level environment, little effort has been made to parallelize or accelerate these programs, which turn out to be too slow for the simultaneous simulation of the cores of a typical multiprocessor of the current Chip Multiprocessing (CMP) era. New-generation simulators, like Graphite [7], attack this problem and successfully obtain higher parallelism at the cost of not being cycle-accurate.

The inherent advantages of using today's FPGA systems are clear: multiple hard/soft processor cores, multi-ported SRAM blocks, high-speed DSP units, and an ever-larger configurable fabric of logic cells with each generation, on a process technology that scales faster than that of ASICs. Another advantage of using FPGAs is the set of already-tested and readily-available Intellectual Property (IP) cores. There are various open-source synthesizable Register Transfer Level (RTL) models of the x86, MIPS, PowerPC, SPARC and Alpha architectures. These flexible soft processor cores are excellent starting points for building a credible multicore system for any kind of architectural research. Thanks to emerging open-source communities, IP designs incorporating UART, SD, Floating-Point Unit (FPU), Ethernet or DDR controllers are now easily accessible [8]. Furthermore, RTL models of modern processors have also been developed by chip manufacturers [9], [10], and such designs even tend to span multiple FPGAs, as in the example of the Intel Nehalem.

On-chip Block RAM (BRAM) resources on an FPGA, which can be pre-initialized and include built-in ECC, can be used in many configurations, such as:

  • RAM or SRAM: For implementing on-chip instruction/data caches, direct-mapped or set-associative; cache tags, cache coherence bits, snoop tags, register files, multiple contexts, branch target caches, return address caches, branch history tables, and debug support tables for breakpoint address/value registers, count registers or memory access history.

  • Content Addressable Memory (CAM): For reservation stations, out-of-order instruction issue/retire queues, fully associative TLBs.

  • ROM: Bootloader, look-up tables.

  • Asynchronous FIFO: To buffer data between processors, peripherals or coprocessors [11].
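As a concrete illustration of the first configuration above, the address decomposition behind a BRAM-backed direct-mapped cache can be sketched in C. The cache geometry below (4 KB, 32-byte lines) is a hypothetical example chosen for clarity, not the actual configuration used in our cores:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry for illustration: a 4 KB direct-mapped cache
 * with 32-byte lines, the kind of structure that maps naturally onto
 * a pre-initialized FPGA BRAM block holding tags and data. */
#define LINE_BYTES   32u   /* bytes per cache line */
#define NUM_LINES    128u  /* 4 KB / 32 B          */
#define OFFSET_BITS  5u    /* log2(LINE_BYTES)     */
#define INDEX_BITS   7u    /* log2(NUM_LINES)      */

/* Split a 32-bit physical address into tag, index and offset fields.
 * The index addresses the BRAM word; the stored tag is compared
 * against the address tag to decide hit or miss. */
static void split_address(uint32_t addr,
                          uint32_t *tag, uint32_t *index, uint32_t *offset)
{
    *offset = addr & (LINE_BYTES - 1u);
    *index  = (addr >> OFFSET_BITS) & (NUM_LINES - 1u);
    *tag    = addr >> (OFFSET_BITS + INDEX_BITS);
}
```

With these parameters, address 0x00401A64 decomposes into tag 0x401, line index 83 and byte offset 4; widening the index field or adding ways trades additional BRAM ports and comparator LUTs for hit rate.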

Special on-chip DSP blocks can be cascaded to form large multipliers/dividers or floating-point units. Complete architectural inspection of the memory and processor subsystems can be performed using statistic counters embedded in the FPGAs without any overhead.

Although FPGA-based multiprocessor emulation has received considerable attention in recent years, the experience and tradeoffs of building such an infrastructure from these available resources have not yet been examined. Indeed, most of the emulators developed so far were either (i) written from scratch using higher-level Hardware Description Languages (HDL) such as Bluespec [12], (ii) built around hard cores such as the PowerPC, or (iii) based on proprietary closed-source cores such as the MicroBlaze.

Therefore, in this work we choose a new approach: we take an existing, freely-available uniprocessor MIPS core called Plasma [13] and heavily modify and extend it to build a full multiprocessor system designed for multicore research. To obtain the Honeycomb core, the basic building block of the Beefarm, we designed and implemented two coprocessors, one providing support for virtual memory using a Translation Lookaside Buffer (TLB) and another encapsulating an FPU; we optimized the Plasma to make better use of the resources on our Virtex-5 FPGAs; we modified the memory architecture to enable virtual memory addressing for 4 GB; we implemented extra instructions to better support exceptions and thread synchronization (load-linked and store-conditional); and we developed the BeelibC system library to support the Beefarm system. Additionally, we designed coherent caches and developed a parameterizable system bus that accesses off-chip RAM through a DDR2 memory controller [14]. Finally, we developed a run-time system and compiler tools to support a programming environment rich enough to conduct experiments on Software Transactional Memory (STM) workloads.
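The load-linked/store-conditional pair is the MIPS primitive from which higher-level atomic operations are built. As a hedged sketch (using GCC's portable compare-and-swap builtin in place of raw LL/SC assembly), an atomic counter increment follows the classic retry loop:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative only: on MIPS, this retry loop would be written with the
 * LL (load-linked) and SC (store-conditional) instructions; here the same
 * pattern is expressed with GCC's compare-and-swap builtin, which LL/SC
 * targets typically compile it down to. */
static uint32_t atomic_increment(uint32_t *counter)
{
    uint32_t old, desired;
    do {
        old = *counter;        /* conceptually: LL    t0, 0(a0) */
        desired = old + 1u;    /*               addiu t1, t0, 1 */
    } while (!__atomic_compare_exchange_n(counter, &old, desired,
                                          false, __ATOMIC_SEQ_CST,
                                          __ATOMIC_SEQ_CST));
                               /* conceptually: SC t1, 0(a0); retry on failure */
    return desired;
}
```

The store-conditional fails if another core wrote the line between the LL and the SC, which is exactly the kind of event that coherent caches in a multiprocessor must track.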

A hypothesis we wish to investigate is that an FPGA-based emulator for multicore systems scales better in simulation performance than software-based instruction set simulators. We check this hypothesis using our flexible Beefarm infrastructure with designs ranging from 1 to 8 cores. The results show performance speedups of up to 6× compared to the well-known, cycle-accurate M5 software simulator running on a fast host.

The key contributions of this work are:

  • A description of the Beefarm multiprocessor system on the BEE3 platform, with justifications of our design decisions and extensions, discussions of the tradeoffs, and an analysis of the FPGA resource utilization of our approach.

  • Experimental results for three benchmarks investigating support for Transactional Memory and an analysis of the performance and scalability of software simulators versus our Beefarm system.

  • A description of different strategies for implementing a well-known piece of functionality efficiently. Focusing on floating-point support, we provide experimental results and discuss the tradeoffs of each solution.

  • An experience report on the pros and cons of FPGA-based multicore emulation, identifying specific challenges that need to be overcome to better support this approach in the future.

The next section explains how the Plasma core was modified to design the Honeycomb core, how the Beefarm architecture was implemented on the BEE3 platform, and describes the software stack, specifically with regard to research on Software Transactional Memory (STM). Section 3 compares executions of three STM benchmarks on our platform with the M5 software simulator. Section 4 presents different approaches to implementing floating-point support, and Section 5 describes our experience in building the Beefarm. Section 6 discusses other related research, while Section 7 concludes and describes future work.

Section snippets

The Beefarm system

This section introduces the architectural and implementation details of the Beefarm system, a bus-based multiprocessor version of the popular MIPS R3000 designed for the BEE3 FPGA platform. Our architectural and design decisions not only show the experience of implementing a multicore emulator from a popular soft core design in a modern FPGA platform, but also provide an example of the variety of available resources that are ready to be used in current reconfigurable systems. We are not

Methodology

The multiprocessor system presented in this work was designed to speed up multiprocessor architecture research: to be faster, more reliable and more scalable than software-based simulators. Its primary objective is to execute real applications in less time than popular full-system simulators, although it cannot run as fast as the actual ASIC would. Therefore our tests:

  • Measure the performance of the simulator platform, not the performance of the system simulated. What is relevant is not

Efficient implementations for floating-point support

Most software applications and benchmarks assume that the underlying architecture provides floating-point support. This assumption can discourage the programmer from optimizing resource usage. But in the case of floating-point hardware, a complete set of calculation, comparison and conversion operations can consume 5520 LUTs, resources equivalent to one of our Honeycomb processors (5712 LUTs). Optimizing this kind of unit becomes of paramount importance, especially when the
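One low-cost alternative to a full hardware FPU is fixed-point arithmetic, which maps onto the FPGA's DSP multiplier blocks rather than thousands of LUTs. The Q16.16 format and helper names below are a generic illustrative sketch, not the article's actual implementation:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical Q16.16 fixed-point helpers: 16 integer bits and 16
 * fractional bits stored in an int32_t.  A fixed-point multiply needs
 * one widened integer multiplier (a few cascaded DSP blocks) rather
 * than a complete floating-point datapath. */
typedef int32_t q16_16;
#define Q16_ONE 65536  /* 1.0 in Q16.16 */

static q16_16 q16_from_int(int x)     { return (q16_16)(x * Q16_ONE); }
static double q16_to_double(q16_16 x) { return (double)x / Q16_ONE; }

static q16_16 q16_mul(q16_16 a, q16_16 b)
{
    /* Widen to 64 bits, multiply, then drop the 16 extra fractional bits. */
    return (q16_16)(((int64_t)a * (int64_t)b) >> 16);
}
```

For workloads that tolerate reduced range and precision, this trades the FPU's thousands of LUTs for a handful of DSP slices; the tradeoffs among such strategies are the subject of this section.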

The experience and trade-offs in hardware emulation

Although we achieved good scalability of our simulation speeds with respect to the number of processor cores, we have observed several challenges that still face the architecture researcher who adopts FPGA-based emulation. These include:

Place and route times can be prohibitively long, although newer synthesis tool versions have started to make use of the host multithreading capabilities. In the case of adding a simple counter to the design for observing the occurrence of some event, the

Related work

Some of the recent multicore prototyping proposals (shown in Table 1) implement a full ISA in RTL and require access to large FPGA infrastructures, while others, such as Protoflex, can use single-FPGA boards with SMT-like execution engines for simulator acceleration [5]. These proposals span a wide variety of instruction set architectures, from large SMP cores to small and unconventional cores like the TC5, prototyping shared-memory as well as message-passing schemes. Our work differs from

Conclusions and future work

In this work, we have described a different roadmap for building a full multicore emulator: heavily modifying and extending a readily-available soft processor core. We have justified our design decisions in that the core must be small enough to fit many on a single FPGA while using the on-chip resources appropriately, flexible enough to easily accept changes in the ISA, and mature enough to run system libraries and a well-known STM library. We have presented an 8-core prototype on a modern

Oriol Arcas received his BA and MS in Computer Engineering in 2009 from the Technical University of Catalonia (UPC Barcelona Tech). Currently he is doing his PhD at the Barcelona Supercomputing Centre-Microsoft Research Cambridge joint center. His research is focused on multicore and heterogeneous architectures on reconfigurable hardware.

References (42)

  • J. Davis, C. Thacker, C. Chang, BEE3: Revitalizing computer architecture research, Microsoft...
  • L. Hammond, B.D. Carlstrom, V. Wong, B. Hertzberg, M. Chen, C. Kozyrakis, K. Olukotun, Programming with transactional...
  • K.E. Moore, J. Bobba, M.J. Moravan, M.D. Hill, D.A. Wood, LogTM: Log-based transactional memory, in: HPCA 2006, 2006,...
  • S. Tomic, C. Perfumo, C. Kulkarni, A. Armejach, A. Cristal, O. Unsal, T. Harris, M. Valero, EazyHTM: Eager-lazy...
  • E.S. Chung, E. Nurvitadhi, J.C. Hoe, B. Falsafi, K. Mai, A complexity-effective architecture for accelerating...
  • D.A. Penry, D. Fay, D. Hodgdon, R. Wells, G. Schelle, D.I. August, D. Connors, Exploiting parallelism and structure to...
  • J.E. Miller, H. Kasture, G. Kurian, C. Gruenwald III, N. Beckmann, C. Celio, J. Eastep, A. Agarwal, Graphite: a distributed...
  • OpenCores Website,...
  • P.H. Wang, J.D. Collins, C.T. Weaver, B. Kuttanna, S. Salamian, G.N. Chinya, E. Schuchman, O. Schilling, T. Doil, S....
  • G. Schelle, J. Collins, E. Schuchman, P. Wang, X. Zou, G. Chinya, R. Plate, T. Mattner, F. Olbrich, P. Hammarlund, R....
  • The myriad uses of block RAM,...
  • Bluespec Inc.,...
  • Plasma soft core,...
  • C. Thacker, A DDR2 controller for BEE3, Microsoft Research,...
  • J.-L. Brelet, XAPP201: An overview of multiple CAM designs in Virtex family devices,...
  • M. Vijayaraghavan, Arvind, Bounded dataflow networks and latency-insensitive circuits, in: MEMOCODE’09, pp....
  • Z. Tan, A. Waterman, R. Avizienis, Y. Lee, H. Cook, D. Patterson, K. Asanović, RAMP gold: an FPGA-based architecture...
  • P. Felber, C. Fetzer, T. Riegel, Dynamic performance tuning of word-based software transactional memory, in: PPoPP ’08,...
  • A.-R. Adl-Tabatabai, B.T. Lewis, V. Menon, B.R. Murphy, B. Saha, T. Shpeisman, Compiler and runtime support for...
  • N.L. Binkert et al., The M5 simulator: Modeling networked systems, IEEE Micro (2006)
  • R.P. Weicker, Dhrystone: a synthetic systems programming benchmark, Commun. ACM (1984)


    Nehir Sonmez received a BS degree in Computer Engineering from the LC Smith College of Engineering at Syracuse University in June 2003, and an MS degree in Computer Engineering from Bogazici University in September 2006. Since then, he has been pursuing his PhD in the Computer Architecture department at the Universitat Politècnica de Catalunya, while doing research in Transactional Memory at the Barcelona Supercomputing Center.

    Gokhan Sayilar received his BS degree in Computer Science and Engineering from Sabanci University, Istanbul, Turkey, in June 2011. He is currently pursuing his PhD in Electrical and Computer Engineering at The University of Texas at Austin, TX, USA. His research interests include computer architecture, software/hardware co-design, Multi-Processor Systems on Chip (MPSoC) and digital SoC FPGA design.

    Satnam Singh's research involves finding novel ways to program and use special Lego-like chips called FPGAs. In particular, he is interested in making the circuits on these chips change as they run to adapt to new situations. Satnam Singh completed his PhD at the University of Glasgow in 1991, where he devised a new way to program and analyze digital circuits described in a special functional programming language. He then went on to be an academic at the same university (first in the Electrical Engineering department and then in the Computing Science department) and led several research projects that explored novel ways to exploit FPGA technology for applications like software radio, Adobe Photoshop, high-resolution digital printing, and graphics. In 1998 he moved to San Jose, California, to join Xilinx's research lab, where he developed tools and technology for designing and formally verifying circuits for FPGAs, as well as the actual FPGA chips. In particular, he developed a language called Lava in conjunction with Chalmers University, which allows circuits to be laid out nicely on chips to give high performance and better utilization of silicon resources. In 2004 he joined Microsoft in Redmond, Washington, where he worked on a variety of techniques for producing concurrent and parallel programs, in particular exploring join patterns and Software Transactional Memory. In 2006 he moved to Microsoft's research laboratory in Cambridge, where he works on reconfigurable computing and parallel functional programming.

    Osman S. Unsal is co-leader of the Architectural Support for Programming Models group at the Barcelona Supercomputing Center. Dr. Unsal is also a researcher at the BSC-Microsoft Research Centre. He holds BS, MS, and PhD degrees in electrical and computer engineering from Istanbul Technical University, Brown University, and University of Massachusetts, Amherst, respectively.

    Adrián Cristal received the "licenciatura" in Computer Science from the Universidad de Buenos Aires (FCEN) in 1995 and the PhD degree in Computer Science in 2006 from the Universitat Politècnica de Catalunya (UPC), Spain. From 1992 to 1995 he lectured on Neural Networks and Compiler Design. At UPC, from 2003 to 2006, he lectured on computer organization. Since 2006 he has been a researcher in the Computer Architecture group at BSC, where he is currently co-manager of the "Computer Architecture for Parallel Paradigms" group. His research interests cover the areas of microarchitecture, multicore architectures, and programming models for multicore architectures. He has published around 60 papers on these topics and has participated in several research projects with other universities and industry, in the framework of European Union programmes or in direct collaboration with technology-leading companies.

    Ibrahim Hur received a BS degree in Computer Science and Engineering from Ege University, Turkey, in 1991. He completed an MS in Computer Science in 1995 at Southern Methodist University, Texas, and then entered the Graduate School at The University of Texas at Austin. In 1997 he joined IBM and worked in the Systems and Technology Group. In 2006 he received his PhD from The University of Texas at Austin. After a one-year stay at the Barcelona Supercomputing Center, he is currently employed by Aselsan, Turkey.

    Mateo Valero obtained his Telecommunication Engineering degree from the Technical University of Madrid (UPM) in 1974 and his PhD in Telecommunications from the Technical University of Catalonia (UPC) in 1980. He has been a professor in the Computer Architecture Department at UPC in Barcelona since 1974, and a full professor since 1983. His research interests focus on high-performance architectures. He has published approximately 500 papers, has served in the organization of more than 300 international conferences, and has given more than 300 invited talks. He is the director of the Barcelona Supercomputing Centre, the National Centre of Supercomputing in Spain. Dr. Valero has been honoured with several awards, among them the Eckert-Mauchly Award by the IEEE and the ACM; the IEEE Harry Goode Award; two Spanish national awards, the "Julio Rey Pastor" recognizing research on IT technologies and the "Leonardo Torres Quevedo" recognizing research in engineering, awarded by the Spanish Ministry of Science and Technology and presented by the King of Spain; and the "King Jaime I" award for research, given by the Generalitat Valenciana and presented by the Queen of Spain. He has been named Honorary Doctor by Chalmers University, by the University of Belgrade, by the Universities of Las Palmas de Gran Canaria and Zaragoza in Spain, and by the University of Veracruz in Mexico. He was selected for the "Hall of the Fame" as one of the 25 most influential European researchers in IT during the period 1983–2008. In December 1994, Professor Valero became a founding member of the Royal Spanish Academy of Engineering. In 2005 he was elected Correspondent Academic of the Spanish Royal Academy of Science, and in 2006 a member of the Royal Spanish Academy of Doctors and of the "Academia Europaea", the Academy of Europe. He is a Fellow of the IEEE, a Fellow of the ACM and an Intel Distinguished Research Fellow. In 1998 he won a "Favourite Son" award from his home town, Alfamén (Zaragoza), and in 2006 his native town of Alfamén named its public college after him.

    1 Present address: Department of Electrical and Computer Engineering, University of Texas, Austin.

    2 Affiliated to Google Inc.

    3 Present address: Senior Design Lead, Aselsan, Ankara, Turkey.
