An effective out-of-order execution control scheme for an embedded floating point coprocessor

https://doi.org/10.1016/S0141-9331(03)00023-1

Abstract

This paper proposes an out-of-order execution control scheme that can be applied effectively to a coprocessor for embedded systems. A floating-point coprocessor generally has multiple pipelines, such as a floating-point adder, a floating-point multiplier, a floating-point divider, and a load/store pipeline. To fully utilize these pipelines, a constraint-based dynamic control scheme is designed for the coprocessor. The scheme is realized through data dependency checking, resource conflict checking, and an exception prediction technique. With this control scheme, the coprocessor can execute its instructions out of order without an extra hardware unit for out-of-order execution control.

Introduction

Embedded systems are the preferred choice of major semiconductor companies and mobile device manufacturers, since these companies need a simple, light, low-power micro-controller rather than a high-performance microprocessor. There are numerous examples of embedded processors on the market, such as the ARM9 and ARM10 architectures from ARM, the SH4 and SH5 architectures from Hitachi, and the VR5000/5500 series from NEC Electronics. These embedded processors have different characteristics suited to specific applications.

The SH4 processor is a two-way superscalar processor for home entertainment game consoles. It has four pipelines: integer, floating-point, load/store, and branch. It supports in-order issue, in-order exception, and out-of-order completion. In particular, it supports 3D graphics operations with a floating-point vector unit (inner-product unit) [1]. The latest VR5500 series processor has about twice the pipeline resources of the SH4 processor. It is a two-way superscalar processor with six pipelines: two integer, two floating-point, one load/store, and one branch. It supports out-of-order issue and out-of-order completion [2]. The ARM9 and ARM10 series processors, on the other hand, target small area and low power consumption rather than high performance. Every ARM core has a general integer pipeline that executes integer, load/store, and branch instructions, but the architecture can be expanded through a coprocessor interface on a single chip. For example, the VFP10 (Vector Floating-point coprocessor) supports single- and double-precision floating-point arithmetic and has a 5-stage load/store pipeline and a 7-stage execution pipeline [3]. An expandable architecture is an important feature in embedded systems that serve various applications.

The current design goal for a micro-controller is to achieve lower power and higher performance within given constraints. Flexible configurations of peripheral devices, such as a floating-point unit (FPU), caches, and other I/O blocks, should be provided to meet the requirements of various applications. A coprocessor such as an FPU is an auxiliary device that may or may not be used, depending on the requirements of the application. The control interface for a coprocessor should therefore be simple and flexible. In addition, a coprocessor for an embedded system should consume little power while maintaining high performance within those constraints, where possible.

Generally, an FPU has multiple data-paths, such as a floating-point adder, a floating-point multiplier (FMUL), and a floating-point divider (FDIV). The latency of a floating-point operation varies with the arithmetic instruction being executed in the execution pipelines [4]. If an in-order execution control scheme is applied in designing an FPU controller, overall performance may degrade considerably because of long-latency instructions such as floating-point division and floating-point square root. An out-of-order execution control scheme is thus essential for high performance. Scoreboarding and Tomasulo's algorithm, which need special hardware blocks such as scoreboard registers and an instruction queue, can support out-of-order execution and completion [5]. Scoreboarding is an instruction issue control mechanism built around a large central scoreboard register, which stores instruction status, functional unit status, and register status. The performance of scoreboarding can be limited by the number of scoreboard entries and by the number and type of functional units. Tomasulo's approach uses distributed reservation stations instead of a central scoreboard, and the reservation stations perform its dependency checking and execution control [6]. However, the design cost and complexity of these techniques are too high for micro-controller applications.
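The cost of long-latency instructions under in-order control can be illustrated with a toy timing model. This is only a sketch: the latency values, the function names, and the sample program are assumptions made for illustration, and the in-order case is modeled in its simplest, fully blocking form rather than as any particular controller.

```python
# Hypothetical latencies in cycles; these are illustrative, not measured values.
LATENCY = {"FADD": 3, "FMUL": 4, "FDIV": 16}


def blocking_in_order_cycles(program):
    """Cycles needed when each instruction starts only after the previous one ends."""
    return sum(LATENCY[op] for op in program)


def overlapped_cycles(program):
    """Cycles needed when one instruction issues per cycle and each
    finishes after its own latency, regardless of program order.

    Assumes the instructions are independent and the units are pipelined.
    """
    return max(issue + LATENCY[op] for issue, op in enumerate(program))


program = ["FDIV", "FADD", "FMUL", "FADD"]
print(blocking_in_order_cycles(program), overlapped_cycles(program))
```

Under these assumed latencies, the blocking model needs 26 cycles for the four instructions, while the overlapped model needs 16, because the short instructions finish in the shadow of the division.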

In this paper, a constraint-based dynamic control scheme is proposed for a coprocessor, built on data dependency checking, resource conflict checking, and exception prediction. Constraint-based dynamic scheduling resembles a reduced form of scoreboarding. Data dependency checking and pipeline resource checking are done with a small amount of information kept in the register file and the instruction decode unit. The scheme saves area because it needs only a few register bits for instruction control to achieve out-of-order execution, instead of a scoreboard or reservation stations. All operands of arithmetic instructions are checked for exceptions at the first stage of the pipeline. If an exception is predicted, the coprocessor executes the checked instruction in order so that the exceptional condition is handled properly; otherwise, the coprocessor executes instructions out of order. With this technique, the coprocessor can execute instructions out of order without extra hardware blocks as long as it does not generate an exception, and it needs no special hardware unit such as a reorder buffer to support precise exceptions.
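One way to picture dependency and resource checking with only a few bits of state is sketched below. It is not the paper's design: the class and method names, the latencies, the single write port, and the choice to model the resource conflict as a clash over the write-back cycle are all assumptions. The sketch keeps one "write pending" bit per register and a set of reserved write-back cycles, and it assumes source operands are read at issue, so write-after-read hazards do not arise.

```python
from dataclasses import dataclass

# Hypothetical latencies (cycles from issue to write-back).
LATENCY = {"FADD": 3, "FMUL": 4, "FDIV": 16}


@dataclass(frozen=True)
class Instr:
    op: str       # "FADD", "FMUL" or "FDIV"
    dst: int      # destination register number
    srcs: tuple   # source register numbers


class IssueControl:
    """Hypothetical issue logic: per-register pending bits plus write-port slots."""

    def __init__(self, num_regs=16):
        self.pending = [False] * num_regs   # True while a write to the register is in flight
        self.wb_reserved = set()            # cycles in which the single write port is taken

    def can_issue(self, ins, cycle):
        raw = any(self.pending[r] for r in ins.srcs)        # a source is not yet written
        waw = self.pending[ins.dst]                         # an older write to dst is in flight
        wb_conflict = (cycle + LATENCY[ins.op]) in self.wb_reserved
        return not (raw or waw or wb_conflict)

    def issue(self, ins, cycle):
        if not self.can_issue(ins, cycle):
            raise RuntimeError("instruction must stall")
        self.pending[ins.dst] = True
        self.wb_reserved.add(cycle + LATENCY[ins.op])

    def write_back(self, ins, cycle):
        self.pending[ins.dst] = False
        self.wb_reserved.discard(cycle)


ctl = IssueControl()
fdiv = Instr("FDIV", dst=2, srcs=(0, 1))
fadd = Instr("FADD", dst=5, srcs=(3, 4))
fmul = Instr("FMUL", dst=6, srcs=(2, 5))   # depends on both earlier results
ctl.issue(fdiv, cycle=0)
ctl.issue(fadd, cycle=1)                   # independent, so it may pass the divide
print(ctl.can_issue(fmul, cycle=2))        # False: r2 and r5 are still pending
```

In this model the independent FADD is issued while the FDIV is still executing, and the dependent FMUL is held back until both pending bits are cleared.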

The proposed constraint-based dynamic control scheme has been used in the design of a floating-point coprocessor for an embedded system [7]. The coprocessor has been implemented with a standard-cell library to save design time and implementation cost. Hard-macro blocks are used for the large conventional blocks: the fraction multiplier, barrel shifter, adder, and subtractor.

The rest of this paper is organized as follows. Section 2 presents the architecture of the coprocessor. Section 3 describes the out-of-order execution control scheme. Section 4 gives some examples of coprocessor instruction execution. Section 5 describes the implementation details. Section 6 presents the conclusions.

The architecture of a coprocessor

A coprocessor is composed of a hardware floating-point arithmetic and logic unit (FALU), an FMUL, and an FDIV as shown in Fig. 1. It has an independent instruction decoder, a load/store unit, a register file and a coprocessor interface unit. It can execute several instructions simultaneously within some constraints and can execute the instructions out of order if an arithmetic exception is not generated by the current instruction. It has a reduced interrupt recovery mechanism with a simple

The out-of-order execution control scheme

This section describes the host-coprocessor interface, the proposed out-of-order execution control scheme, and the support for precise exceptions. A simple and effective host-coprocessor interface unit can support precise exceptions as well as the dynamic control scheme, subject to some constraints. The host processor issues instructions sequentially, but an issued instruction can be executed out of order within the coprocessor.
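Exception prediction from operands can be sketched as a conservative test on single-precision exponent fields. The rule below is an assumption for illustration, not the paper's circuit: it flags an operation whenever overflow, underflow, division by zero, or an invalid operation cannot be ruled out from the exponents alone, it treats subnormal, infinite, and NaN operands as exceptional, and it ignores the inexact condition. The function names are hypothetical.

```python
import struct


def biased_exponent(x):
    """Return the 8-bit biased exponent field of x encoded as an IEEE-754 single."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return (bits >> 23) & 0xFF


def may_raise_exception(op, a, b):
    """Conservatively predict whether 'a op b' might raise an arithmetic exception.

    True means "execute in order"; False means the operation is safe to
    run out of order under the assumptions stated above.
    """
    ea, eb = biased_exponent(a), biased_exponent(b)
    if ea == 255 or eb == 255:            # infinity or NaN operand
        return True
    if (ea == 0 and a != 0.0) or (eb == 0 and b != 0.0):
        return True                       # subnormal operand
    if op == "FDIV" and b == 0.0:         # division by zero (covers 0/0 as well)
        return True
    if a == 0.0 or b == 0.0:              # remaining zero cases give exact results
        return False
    if op in ("FADD", "FSUB"):
        # Overflow needs a top-exponent operand; a tiny result after
        # cancellation needs a very small operand.
        return max(ea, eb) >= 254 or min(ea, eb) < 24
    if op == "FMUL":
        e = ea + eb - 127                 # lower bound of the result's biased exponent
        return e >= 253 or e <= 0
    if op == "FDIV":
        d = ea - eb                       # the quotient lies within one binade of 2**d
        return d >= 127 or d <= -126
    raise ValueError(f"unknown operation: {op}")


for op, a, b in [("FADD", 1.5, 2.25), ("FMUL", 1e30, 1e30), ("FDIV", 1.0, 0.0)]:
    mode = "in order" if may_raise_exception(op, a, b) else "out of order"
    print(op, a, b, "->", mode)
```

A predictor of this kind can give false alarms, which only cost performance, but under the stated assumptions it should not miss an exception that the operand exponents make possible.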

Execution examples

This section describes the execution of some coprocessor instructions using sample code. In the following pipeline diagram, the labels ‘FD’, ‘FEx’, and ‘FW’ denote the ‘Decode’, ‘x-th Execute’, and ‘Write-back’ stages in the coprocessor, respectively.
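A diagram in this notation can be generated with a few lines of code. The sketch below is illustrative only: the sample program, the number of execute stages per instruction, and the function names are assumptions, and it assumes one instruction is decoded per cycle with no stalls.

```python
def pipeline_row(issue_cycle, exec_stages, total_cycles):
    """Return one diagram row: idle cycles, then FD, FE1..FEn, FW, then padding."""
    cells = ["FD"] + [f"FE{i}" for i in range(1, exec_stages + 1)] + ["FW"]
    row = ["."] * issue_cycle + cells
    return row + ["."] * (total_cycles - len(row))


def render(program):
    """Render (name, exec_stages) pairs, decoding one instruction per cycle."""
    total = max(i + n + 2 for i, (_, n) in enumerate(program))
    lines = []
    for i, (name, n) in enumerate(program):
        row = pipeline_row(i, n, total)
        lines.append(f"{name:<5}" + " ".join(f"{c:<4}" for c in row).rstrip())
    return "\n".join(lines)


# Hypothetical stage counts: a long divide followed by two short, independent operations.
program = [("FDIV", 6), ("FADD", 2), ("FMUL", 3)]
print(render(program))
```

With these assumed stage counts, the FADD and FMUL rows reach ‘FW’ before the FDIV row does, which is the out-of-order completion that the diagram is meant to show.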

Implementation

The coprocessor has been designed with a 0.25 μm standard-cell library that has four metal layers. The coprocessor supports 32-bit single-precision floating-point arithmetic instructions: FADD, FSUB, FMUL, FDIV, format conversion (FTOI, ITOF), comparison (FCMP), rounding (FRND), load, store (CLD), etc. The IEEE-754 standard rounding mode and all exception conditions are also supported. For the development of the coprocessor, a behavioral HDL model has been implemented and verified with a simple

Conclusion

In this paper, a floating-point coprocessor for an embedded system, which supports out-of-order execution control through the constraint-based dynamic control scheme, has been designed and verified with a standard-cell library. The proposed scheme relies on data dependency checking at the decode stage, resource conflict checking at the write-back stage, and exception prediction at the first stage of each arithmetic pipeline. It has an advantage in area and design costs because a

References (12)

  • F. Arakawa et al., SH4 RISC multimedia microprocessor, IEEE Micro (1998)
  • User's Manual VR5500™ 64/32-Bit Microprocessor, NEC Electronics, 2002, pp....
  • S. Furber, ARM System-on-Chip Architecture (2000)
  • A.R. Omondi, Computer Arithmetic Systems: Algorithms, Architecture and Implementation (1994)
  • J. Hennessy et al., Computer Architecture: A Quantitative Approach (1996)
  • R.M. Tomasulo, An efficient algorithm for exploiting multiple arithmetic units, IBM Journal of Research and Development (1967)
There are more references available in the full text version of this article.

Cheol-Ho Jeong is a graduate student in the Department of Computer Science at Yonsei University, Korea. He is currently researching computer arithmetic and systems for 3D computer graphics. He received an MS in computer science from Yonsei University.

Woo-Chan Park is a research professor in the Department of Computer Science at Yonsei University, Korea. His research interests include 3D computer graphics accelerator architecture, micro-architecture, and computer arithmetic. He received a PhD in computer science from Yonsei University.

Tack-Don Han is a professor in the Department of Computer Science at Yonsei University, Korea. His research interests include high-performance computer architecture, media system architecture, and wearable computing. He received a PhD in computer engineering from the University of Massachusetts.

Sung-Bong Yang is an associate professor in the Department of Computer Science at Yonsei University, Korea. His research interests include mobile systems, 3D computer graphics, and electronic commerce. He received a PhD in computer science from the University of Oklahoma.

Moon-Key Lee is a professor in the Department of Electrical Engineering at Yonsei University, Korea. His research interests include VLSI design, CAD, and embedded systems. He received a PhD in electrical engineering from Yonsei University and a PhD in electrical engineering from the University of Oklahoma.
