Article

An FPGA-based VLIW processor with custom hardware execution

Authors:
Alex K. Jones (University of Pittsburgh, Pittsburgh, PA)
Raymond Hoare (University of Pittsburgh, Pittsburgh, PA)
Dara Kusic (University of Pittsburgh, Pittsburgh, PA)
Joshua Fazekas (University of Pittsburgh, Pittsburgh, PA)
John Foster (University of Pittsburgh, Pittsburgh, PA)

FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays, February 2005, Pages 107–117. https://doi.org/10.1145/1046192.1046207

Published: 20 February 2005

ABSTRACT

The capability and heterogeneity of FPGA (Field Programmable Gate Array) devices continue to increase with each new line of devices, and programming these devices efficiently is becoming more difficult. Nevertheless, FPGAs continue to be used for algorithms traditionally targeted to embedded DSP microprocessors, such as signal and image processing applications.

This paper presents an architecture that combines VLIW (Very Long Instruction Word) processing with the capability to introduce application-specific customized instructions and complex hardware functions. To support this architecture, a compilation and design automation flow is described for programs written in C.

Several design tradeoffs for the architecture were examined, including the number of VLIW functional units and the register file size. The architecture was implemented on an Altera Stratix II FPGA. The Stratix II device was selected because it offers a large number of high-speed DSP (digital signal processing) blocks that execute multiply-accumulate operations.

We show that our combined VLIW with hardware functions exhibits as much as 230X speedup, and 63X on average, for the computational kernels of a set of benchmarks. This allows for an overall speedup of as much as 30X, and 12X on average, for signal processing benchmarks from MediaBench.
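The gap between the kernel speedups and the overall speedups quoted above is what Amdahl's law would predict when only part of a program is accelerated. The sketch below is an illustration only: the abstract does not state that the authors used this model, the function names are hypothetical, and feeding benchmark averages into a single-program formula is a rough approximation rather than a reproduction of the paper's results.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when only `kernel_fraction`
    of the original run time is accelerated by `kernel_speedup`."""
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


def implied_kernel_fraction(overall: float, kernel_speedup: float) -> float:
    """Invert Amdahl's law: the fraction of run time the kernels would have
    to occupy for `kernel_speedup` to yield the given `overall` speedup.
    Requires kernel_speedup != 1."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / kernel_speedup)


if __name__ == "__main__":
    # Illustrative use of the averages quoted in the abstract (63X kernel,
    # 12X overall); treating averages this way is an assumption.
    fraction = implied_kernel_fraction(overall=12.0, kernel_speedup=63.0)
    print(f"implied kernel share of run time: {fraction:.3f}")
    print(f"round trip overall speedup: {overall_speedup(fraction, 63.0):.2f}")
```

Under this simplified model, a 63X kernel speedup would correspond to a 12X overall speedup if the kernels accounted for roughly 93% of the original run time; the remaining software portion then dominates the accelerated program.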

Reviews

Reviewer: Vassilios A. Chouliaras

This is a very exciting piece of research in the general area of configurable, extensible processors and the software/hardware interface. The authors propose a hybrid architecture, consisting of a parameterized very long instruction word (VLIW) core augmented with custom hardware execution units, as a very potent programmable execution engine. In addition, they have developed the software infrastructure to allow for automatic optimization of C-based applications.

In the introductory section, the authors observe that large-capacity field-programmable gate arrays (FPGAs) with substantial compute and memory resources are becoming commonplace. They correctly point out that the efficient mapping of applications onto such devices is no longer a trivial exercise; in a typical use, software kernels are allocated to the FPGA fabric while the irregular (control) part of the application runs on an embedded processor. This segregation has indeed been recognized by the major FPGA vendors, which include embedded processors on their devices to accommodate both regular and irregular code.

The authors provide a good discussion of past and present behavioral synthesis solutions, and correctly identify such solutions as appropriate for combinational code, not for control-dominated applications. In addition, they provide a very good overview of the literature, both from academia and from industry, on configurable (static) and reconfigurable (dynamic) systems for software acceleration.

To address large, irregular code pieces in a semi-automatic manner, the authors propose a parametric platform to efficiently exploit all parallelism. The platform is a four-wide VLIW-based processor that is binary-compatible with the Altera NIOS II instruction set architecture (ISA). In addition, it supports extending that ISA with custom hardware resources to achieve superlinear speedups. The software infrastructure is based on the well-known Trimaran VLIW research infrastructure.

The authors use an interesting technique to extract computational kernels (hardware functions), which are implemented directly as hardware blocks. These blocks make use of the abundant multiply-accumulate (MAC) units in typical high-performance FPGA devices, such as the Altera Stratix family. The authors discuss their hardware architecture, which is based on a four-wide VLIW with an eight-read-port, four-write-port (8R/4W) 32x32-bit register file, shared among the VLIW processing elements (PEs) and the custom hardware units. They also correctly identify the register file as the performance-limiting resource in an FPGA implementation, and provide substantial microarchitecture performance data.

In the remaining sections, the authors discuss zero-overhead hardware/software switching, the hardware functions, and the software tool chain. They performed design, validation, and FPGA implementation, and achieved 167 megahertz (MHz) on an Altera Stratix, which is an impressive clock speed for a programmable device. Finally, they report on application speedups for both their standalone VLIW engine and their four-wide VLIW augmented with hardware functions. Results range from nine percent to 230 times for kernel acceleration, which is indeed impressive.

Overall, this is a thorough account of the proposed field of research; the authors did their best to disclose as much information as possible in the context of a conference paper. I was very much impressed with the technical ability of all those involved. This is a solid paper on embedded central processing unit (CPU) architecture.

Online Computing Reviews Service
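To make the four-wide VLIW and the 8R/4W register file described in the review more concrete, here is a minimal sketch of packing independent operations into issue bundles. It is not the Trimaran scheduler or the authors' tool flow: `Op`, `schedule`, and `register_ports` are hypothetical names, operations are assumed to be in single-assignment form (each destination written once), every operation is assumed to take one cycle, and each issue slot is assumed to read at most two registers and write one.

```python
from typing import NamedTuple


class Op(NamedTuple):
    dest: str               # register written by this operation
    srcs: tuple[str, ...]   # registers read by this operation


def schedule(ops: list[Op], width: int = 4) -> list[list[Op]]:
    """Greedy packing: place each op in the earliest bundle that comes after
    the bundles producing its sources and still has a free issue slot.
    Assumes single-assignment code, so only read-after-write matters."""
    bundles: list[list[Op]] = []
    ready_at: dict[str, int] = {}  # register -> first bundle that may read it
    for op in ops:
        # Live-in registers (never written by an op) are readable in bundle 0.
        slot = max((ready_at.get(s, 0) for s in op.srcs), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1
        if slot == len(bundles):
            bundles.append([])
        bundles[slot].append(op)
        ready_at[op.dest] = slot + 1
    return bundles


def register_ports(width: int, reads_per_op: int = 2, writes_per_op: int = 1) -> tuple[int, int]:
    """Register file ports needed if every issue slot can be busy each cycle."""
    return width * reads_per_op, width * writes_per_op


if __name__ == "__main__":
    program = [
        Op("t0", ("a", "b")),
        Op("t1", ("c", "d")),
        Op("t2", ("t0", "t1")),  # depends on both operations above
        Op("t3", ("t2", "e")),
    ]
    for cycle, bundle in enumerate(schedule(program)):
        print(cycle, [op.dest for op in bundle])
    print(register_ports(4))
```

Under these assumptions a four-wide machine needs eight read ports and four write ports, which is consistent with the 8R/4W figure in the review, and the port count grows linearly with issue width. That growth is one plausible reading of why the review singles out the register file as the performance-limiting resource.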


Published in
FPGA '05: Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays
February 2005
288 pages
ISBN: 1595930299
DOI: 10.1145/1046192
General Chair: Herman Schmit (Tabula)
Program Chair: Steve Wilton (University of British Columbia)
Copyright © 2005 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
NIOS
VLIW
compiler
kernels
parallelism
synthesis
Qualifiers
- Article

Acceptance Rates
Overall Acceptance Rate: 125 of 627 submissions, 20%
