Energy optimization of Application-Specific Instruction-Set Processors by using hardware accelerators in semicustom ICs technology

https://doi.org/10.1016/j.micpro.2011.06.003

Abstract

The increasing complexity of applications with a decreasing time-to-market requirement has created a strong interest in both high-performance and flexible embedded processors with a strong consideration for battery life. Low-power optimizations are therefore often applied toward the development of Application-Specific Instruction-Set Processors (ASIPs). In this paper ASIP accelerators for a typical DSP task are developed and synthesis results from six different cell-based and FPGA architectures are shown.

By carefully analyzing algorithms and implementing appropriate accelerators with logic, it is shown that an increase in design performance is achieved while still reducing energy consumption due to the reduced latency of the task. In addition, we show cases when classic synthesis options can outperform new power optimization features in Xilinx ISE 11.1.

Highlights

► We examine design issues of tightly coupled hardware accelerators in Application-Specific Instruction-Set Processors.
► A true vector processor design is developed using the architecture description language LISA.
► An embedded microprocessor design flow to achieve minimum energy consumption is presented.
► FPGA synthesis options for low-energy microprocessor designs are compared.
► Cell-based ASIC synthesis results with the goal of minimum energy consumption for a DSP task are shown.

Introduction

Microprocessors (μPs) have become one of the most important IP blocks in semi-custom ICs in recent years. Altera, for instance, reported that they sold 10,000 NIOS microprocessor development systems in the first 3 years they were available. Xilinx reported an even larger number of “downloads” of their PicoBlaze and MicroBlaze microprocessors [1], [2], [3], [4].

Given the wide range of embedded systems applications, it should come as no surprise that no single microprocessor can cover all of the requirements for hardware (HW) and software (SW), so customization is necessary. In general, the design of such a microprocessor system is a long, tedious, and error-prone task that typically consists of three design phases: architecture exploration; software design (assembler, linker, loader, profiler); and architecture implementation and verification (RTL generation for FPGA or cell-based ASIC) (see Fig. 1). Architecture Description Languages (ADLs) allow a microprocessor to be modeled not only from the instruction set but also from the architecture description, including pipelining behavior, which keeps the design and the development tools consistent at all levels of the design [5], [6].

These new design tools enable software developers to take their algorithmic expressions straight into custom VLSI hardware without using the traditional HDL design flow [7], [8]. These tools and associated design methodologies are classified collectively as electronic system level (ESL) design, broadly referring to system design and verification methodologies that begin at a higher level of abstraction than the current mainstream hardware description language [7], [8].

ESL tools have existed for some time, and many perceive that these tools are predominantly focused on ASIC design flows. But with ASIC mask charges of $1.5 million in 90-nm and $4 million in 65-nm technology (according to J. Donovan, Vice President at Gartner Dataquest) [9], the number of designs using FPGAs is rapidly increasing. In fact, an increasing number of ESL tool providers (e.g., Celoxica, Codetronix, CoWare, Binachip, Impulse Accelerated, Mimosys, etc.) have dedicated substantial effort towards programmable logic [10], [11]. Cell-based ICs (CBICs) still offer advantages when large quantities, highest speed, or very low power consumption are required [12].

The majority of microprocessors today are employed in embedded systems. This number is not surprising because a typical home today may have a laptop/PC with a high-performance microprocessor but probably dozens of embedded systems including electronic entertainment and household and telecom devices – each of these equipped with one or more embedded processors. A modern car usually has more than 50 microprocessors [13]. Embedded processors are most often developed by relatively small teams within short time-to-market requirements, so processor design automation is clearly a very important issue. Once a model for a new processor is available, existing hardware synthesis tools enable the path to custom low-power VLSI implementation [14], [15].

However, embedded processor designs typically begin at a level of abstraction far beyond the Instruction Set Architecture (ISA). Design choices include algorithm type, arithmetic selection (integer, block floating-point, custom or IEEE floating-point format), overall system architecture (SISD, SIMD, MISD, MIMD, etc. [16]), and VLSI technology (CBIC, gate array, FPGA, standard μP, etc.). Several architecture exploration cycles are needed before the optimum hardware/software architecture for a particular application is found. A reliable comparison of these architectures requires a number of tools for software development and profiling. As noted above, these are normally written manually, which is a time-consuming, inefficient, and error-prone task. With the introduction of so-called ADLs, the design process can be efficient and reliable [6]. The significance of using ADLs for microprocessor development is corroborated by the fact that the world’s largest ASIC tools provider (Synopsys) recently acquired one of the leaders in ADL tools (CoWare Inc.).

In Section 2, we give an overview of ADLs and describe the language used in this study (LISA) in more detail. Section 3 introduces a standard commercial off-the-shelf (COTS) RISC processor with a few minor modifications. In Section 4, modifications to the basic core that are specific to the considered application domain are introduced. To further improve the design, a true vector processor (TVP) is developed in Section 5. In Section 6, results for Xilinx FPGAs and cell-based ASICs are presented. Section 7 concludes our study.

Section snippets

Designing Application-Specific Processors with ADLs

The typical design flow of a microprocessor and the associated tools was introduced in Fig. 1. In the classical approach, we start with an architecture description and then develop the instruction set and architecture. We then write the HDL code for the processor and write, based on this developed architecture, the development tools (e.g., the instruction set simulator (ISS), the C compiler, the assembler, et cetera). While this hand-coded HDL may allow us to obtain an extremely small core-size

LISA 18-Bit ISA RISC processor

Xilinx offers the 8-bit PicoBlaze and the 32-bit MicroBlaze in many architecture variations, but not with other data path widths such as the 16 or 24 bits typical for DSP algorithms. In fact, BDTI’s “Pocket Guide to Processors for DSP” [24] shows that all commercially successful fixed-point PDSPs use 16 or 24 bits. In the following scenario, for example, let us create a 16-bit RISC machine with PD. Since a 16-bit processor fits in the middle between Micro- and PicoBlaze, we will call our RISC

LISA Programmable Digital Signal Processor (PDSP)

From the DWT processor discussed in the last section, we have seen that a large arithmetic count is required for updating memory pointers and for the memory access itself. The multiply-accumulate operation in NanoBlaze requires one MUL, two LDR, and three ADD instructions, which dominate the profile (see column 2 of Table 2).
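The operation count quoted above can be made concrete with a short sketch. The Python below is only an illustration, not NanoBlaze code: the function name `mac_risc` and the counter are hypothetical, and the split of the three ADDs into one accumulation plus two pointer updates is an assumption that is consistent with the text rather than something the paper states.

```python
from collections import Counter


def mac_risc(x, h):
    """Dot product written the way a plain RISC core would step through it.

    Each step is tallied under an assumed mapping to the mnemonics named
    in the text (LDR, MUL, ADD). The breakdown of the three ADDs is an
    assumption made for illustration.
    """
    ops = Counter()
    acc = 0
    px = 0  # pointer into the data array
    ph = 0  # pointer into the coefficient array
    for _ in range(len(h)):
        a = x[px]
        ops["LDR"] += 1  # load data sample
        b = h[ph]
        ops["LDR"] += 1  # load coefficient
        p = a * b
        ops["MUL"] += 1  # multiply
        acc = acc + p
        ops["ADD"] += 1  # accumulate (assumed ADD no. 1)
        px = px + 1
        ops["ADD"] += 1  # data pointer update (assumed ADD no. 2)
        ph = ph + 1
        ops["ADD"] += 1  # coefficient pointer update (assumed ADD no. 3)
    return acc, ops


acc, ops = mac_risc([1, 2, 3, 4], [4, 3, 2, 1])
print(acc, dict(ops))
```

Under this assumed breakdown, only two of the six instructions per tap (the multiply and the accumulation) do the actual signal-processing arithmetic; the rest is addressing and memory access.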

DSP algorithms (e.g., convolution, correlation, FIR and IIR filters, or fast DFTs [30], [31]) typically operate on linear data arrays (i.e., vectors), and post-auto-increments or decrements in the

The LISA true vector processor

General-purpose CPUs could previously be improved by exploiting instruction-level parallelism (ILP), by adding on-chip cache and floating-point units, and through speculative branch execution, improved speed, etc. One particular problem that occurs now is that the logic to track dependencies between all in-flight instructions grows quadratically in the number of instructions [32]. As a result, these improvements have considerably slowed down since 2002, and the use of multiple CPUs on the

LISA processor implementation results

Finally, let us compare all three designs in terms of power/energy consumption along with size, speed, and overall throughput in mega-samples per second (MSPS) for a DWT length-8 example.
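As a rough guide to how the metrics named above relate to one another, the following Python sketch computes throughput and task energy from clock frequency, cycle count, and average power. The function names and all numbers are hypothetical placeholders chosen for illustration; they are not results from the paper.

```python
def throughput_msps(f_clk_mhz, cycles_per_sample):
    """Throughput in mega-samples per second for a given clock rate."""
    return f_clk_mhz / cycles_per_sample


def task_time_us(f_clk_mhz, cycles):
    """Task latency in microseconds: cycles divided by clock rate in MHz."""
    return cycles / f_clk_mhz


def task_energy_nj(power_mw, f_clk_mhz, cycles):
    """Task energy E = P * t; milliwatts times microseconds gives nanojoules."""
    return power_mw * task_time_us(f_clk_mhz, cycles)


# Hypothetical comparison: the accelerated core draws more power but
# finishes the task in far fewer cycles at the same clock rate.
baseline = task_energy_nj(power_mw=50, f_clk_mhz=100, cycles=1000)
accelerated = task_energy_nj(power_mw=60, f_clk_mhz=100, cycles=200)
print(baseline, accelerated, baseline / accelerated)
```

The point of the sketch is that energy depends on both power and latency, so a design with higher power draw can still consume less energy per task if the task completes sooner.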

The processor has been designed using CoWare Processor Designer (PD) 2009.1.0 Win32. The HDL code and CBIC Synopsys synthesis script generated by PD have been used. Circuits have then been synthesized from their VHDL description and optimized for speed, power, and size with the synthesis tool DC from Synopsys

Conclusion

We have presented energy optimizations for a dual-channel 8/8 DWT LISA hardware accelerator design for a reduced instruction set computer (RISC), a programmable digital signal processor (PDSP), and a true vector processor (TVP). The following conclusions can be made:

  • The throughput for a 100-point DWT HW acceleration example improves by a factor of 7.1 for FPGAs and 7.7 for CBICs for a TVP by adding a few custom instructions.

  • HW acceleration allows energy reduction by a factor of 4.2 for

Acknowledgements

The authors acknowledge the support of the Humboldt Foundation, Altera, Xilinx and CoWare/Synopsys Inc. We thank M. Witte and H. Meyr from ISS RWTH Aachen for the review of an early draft of this paper. Many thanks also to Carissa Neff for the grammatical editing of this article. The reviewers provided very valuable feedback that substantially helped to improve the paper. Products and company names used in this article may be trademarks of their respective owners. Any opinions, findings, and

References (34)

  • Altera, Netseminar Nios processor, 2004,...
  • Altera, Delivering RISC processors in an FPGA for $2.00, White Paper,...
  • Xilinx, Microblaze – The Low-cost and Flexible Processing Solution, 2005,...
  • Altera, Adding Hardware Accelerators to Reduce Power in Embedded Systems, 2009,...
  • P. Mishra et al., Processor Description Languages, 2008
  • P. Ienne et al., Customizable Embedded Processors, 2006
  • V. Pedroni, Circuit Design with VHDL, 2004
  • Z. Navabi, Embedded Core Design with FPGAs, 2007
  • J. Donovan, The Truth about 300mm, eETimes, 2002,...
  • C. Rowen, Engineering the Complex SOC, 2004
  • Xilinx, Electronic System Level Design Ecosystem, 2007,...
  • Tensilica, Processor Core Power Specs: A Cautionary Tale,...
  • R. Charette, This Car Runs on Code, IEEE Spectrum,...
  • K. Vivekanandarajah, T. Srikanthan, Custom instruction filter cache synthesis for low-power embedded systems, in: The...
  • T. Glökler et al., Methodical low-power ASIP design space exploration, J. VLSI Signal Process. Syst., 2003
  • T. Parsons, Introduction to Compiler Construction, 1992
  • A. Hoffmann et al., Architecture Exploration for Embedded Processors with LISA, 2002

    Uwe Meyer-Baese received the B.S.E.E., M.S.E.E., and Ph.D. (summa cum laude) degrees from the Darmstadt University of Technology, Darmstadt, Germany, in 1987, 1989, and 1995, respectively. He is currently a Professor in the Electrical and Computer Engineering Department, Florida State University, Tallahassee. In 1994 and 1995, he held a Postdoctoral Position in the “Institute of Brain Research,” Magdeburg, Germany. In 1996 and 1997, he was a Visiting Professor at the University of Florida, Gainesville. From 1998 to 2000, he worked as a Research Scientist in the ASIC industry, where he was responsible for development of high-performance architectures for digital signal processing. During his graduate studies, he worked part-time for TEMIC, Siemens, Bosch, and Blaupunkt. He holds three patents, has published over 80 journal and conference papers, and has supervised more than 60 master's thesis projects in the DSP/FPGA area. He is the author of the best-selling Springer textbook on DSP with FPGAs. Dr. Meyer-Baese was a recipient of the “Habilitation” (venia legendi) by the Darmstadt University of Technology in 2003, the Max-Kade Award in Neuroengineering in 1997, the Humboldt Research Award in 2006, and a FAMU-FSU College of Engineering Teaching Award in 2007.

    Guillermo Botella received the M.A.Sc. degree in Physics in 1998, the M.Sc. degree in Electronic Engineering in 2001 and the Ph.D. degree in 2007, all from the University of Granada. From 2002 to 2005 he was a European research fellow at the Department of Architecture and Computer Technology of the Universidad de Granada and the Vision Research Laboratory at University College London. He then joined the Department of Computer Architecture and Automation of Complutense University of Madrid as Assistant Professor. He was a visiting professor in 2008 and 2009 at the Department of Electrical and Computer Engineering, Florida State University, Tallahassee. He has authored more than 20 technical papers in international journals and conferences. His current research focuses on image, video and signal processing on FPGAs, vision algorithms and design automation for High Level Synthesis.

    Soumak Mookherjee received the BS from Bengal Engineering and Science University in 2001 and the MS from North Carolina State University in 2007, and is now a Ph.D. student and Research Assistant at Florida State University. In the past he has been a programmer analyst at the Active Network and Verusant Technology, and a Software Engineer at Infosys Technologies Ltd. His research interests include image processing, custom DSP algorithms and architectures, and FPGAs.

    Encarnación Castillo received the M.A.Sc. degree and Ph.D. degree in electronic engineering from the University of Granada, Spain, in 2002 and 2008, respectively. From 2003 to 2005 she was a research Fellow at the Department of Electronics and Computer Technology at the University of Granada, where she is now an Assistant Professor. During her research fellowship, she carried out part of her work at the Department of Electrical and Computer Engineering, Florida State University, Tallahassee. Her research interests include IP protection of VLSI and FPGA-based systems, as well as Residue Number System arithmetic, high-performance digital signal processing and VLSI and FPL signal processing systems. She has authored more than 20 technical papers in international journals and conferences.

    Antonio García received the M.A.Sc. degree in Electronic Engineering (obtaining the National Best Academic Record award) in 1995, the M.Sc. degree in Physics (majoring in Electronics) in 1997 and the Ph.D. degree in Electronic Engineering in 1999, all from the University of Granada. He was an Associate Professor at the Department of Computer Engineering of the Universidad Autónoma de Madrid before joining the Department of Electronics and Computer Technology at the University of Granada, where he currently serves as Professor. He has authored more than 70 technical papers in international journals and conferences and serves regularly as reviewer for several IEEE and IEE journals. He is also part of the Program Committee for several international conferences on programmable logic. His current research interests include IP protection of VLSI and FPGA-based systems, low-power and high-performance VLSI signal processing systems, and the combination of digital and analog programmable technologies for smart instrumentation. He is a member of IEEE and a C&S and SP Society member.
