Floating accumulator architecture

https://doi.org/10.1016/j.micpro.2017.04.007

Abstract

Although technology advancement can pack more and more physical registers into processors, the number of architectural registers defined by the instruction set architecture (ISA) remains relatively small on most modern processors. Exposing more architectural registers to compilers and programmers can improve the effectiveness of compiler optimization and the quality of code. However, increasing the number of architectural registers by simply adding extra bits to the register fields of instructions will expand the code size. Therefore, a better way of exposing more ISA registers without significantly expanding the code size is needed. This paper presents a new ISA called Floating Accumulator Architecture (FAA) that can expand the number of ISA registers without increasing the instruction length. Unlike the accumulator architecture, whose accumulator is a fixed, special register, FAA dynamically chooses a register from the general-purpose register file as the accumulator. In other words, the accumulator in FAA is an alias for some register in the register file at any instruction, and the alias relation can be dynamically updated by FAA at any program point. Since the accumulator implicitly stores the result, the destination register field can be omitted from FAA instructions, resulting in a saving of 3 to 5 bits for each instruction. This freed instruction bit space can be utilized in two possible ways: doubling the number of ISA registers of modern 32-bit RISC processors or maintaining the number of ISA registers for 16-bit instructions on embedded processors. This paper presents the result of utilizing the free bit space to double the number of ISA registers from 16 to 32 on ARM processors, and experimental results show that performance can be improved by 7.6% on average for MediaBench benchmarks.

Introduction

Technology advancement has made it easy for modern processors to pack large physical register files. As more physical registers become available to store variables close to the pipeline, performance can be improved. However, the number of architectural registers defined by the instruction set architecture (ISA) remains relatively small on most modern processors, ranging from 8 registers on IA-32 to 32 registers on most RISC processors. Exposing more architectural registers to compilers and programmers can improve the effectiveness of compiler optimization and the quality of code. Unfortunately, increasing the number of architectural registers by simply adding extra bits to the register fields of instructions will expand the code size, as adding one bit to the register field typically causes the instruction size to grow by 2 or more bits due to multiple register fields in the instruction. As a result, modern processors generally contain a higher number of physical registers than are exposed in the ISA.

Instruction encoding space is more restricted on embedded processors, as many of them provide “reduced bit-width” instruction sets which encode the most commonly used instructions using fewer bits [4], [8], [22]. Even if the hardware can support more registers, the number of architectural registers defined by the ISA is much smaller due to the encoding issue. A very good example is the ARM processor with a 32-bit instruction set and a 16-bit instruction set called Thumb [19]. Due to the smaller encoding space, most Thumb instructions can only access 8 registers although all 16 registers are physically present and can contain values. Similarly, the MIPS16 embedded processor also supports such a dual instruction set feature, and its instructions can access only 8 registers out of the 32 general-purpose registers seen by MIPS32 instructions [20]. The 16-bit format is shown to have significant cost-performance advantages over the 32-bit format under typical memory system performance constraints [8], [22]. However, the compromises made in designing the Thumb or MIPS16 instruction set lead to significantly increased instruction counts [14].

Restricting the number of architectural registers typically incurs a performance penalty since compilers and programmers can only utilize the exposed architectural registers. Several approaches have been proposed to increase the number of architectural registers without expanding the code size significantly. For instance, register windows have been designed to provide more registers than allowed in the encoding [18], [21]. Differential encoding is a new register encoding scheme that allows more registers to be addressed in the operand field of instructions than the direct encoding currently being used [22]. Hardware-managed register allocation schemes have even been developed to allocate more physical registers at runtime [23]. Some new instructions have been introduced into the instruction sets that can make use of all registers in all instructions by changing the visible subset of registers at any program point [14]. Another possible approach is to find some underutilized fields in instructions to represent more registers. It might be beneficial to trade conditional execution for more registers on ARM processors, as every ARM instruction carries a 4-bit condition field that specifies the predicate of conditional execution, but the ratios of conditionalized instructions are generally very low [5].

This paper presents a new ISA called floating accumulator architecture (FAA) that can expand the number of ISA registers without widening the instructions. Similar to the traditional accumulator architecture, FAA reduces the instruction width by making the accumulator the default destination of instructions. However, unlike the accumulator architecture, whose accumulator is a fixed, special register, FAA dynamically chooses a register from the general-purpose register file as the accumulator. In other words, the accumulator in FAA can be viewed simply as an alias for some general-purpose register at any instruction, and the alias relation can be dynamically updated by FAA at any program point. Since the accumulator implicitly stores the result, the destination register field can be omitted from FAA instructions, resulting in a saving of 3 to 5 bits for each instruction. There are two possible ways to utilize the free bit space: doubling the number of ISA registers of modern 32-bit RISC processors, or maintaining the number of ISA registers for 16-bit instructions on embedded processors. This paper presents an LLVM [15] implementation that utilizes the free bit space to double the number of ISA registers on ARM processors, and experimental results show that performance can be improved by 7.6% on average for MediaBench benchmarks when the number of ISA registers is extended from 16 to 32.

The main results of this paper are as follows:

  • FAA can generally shorten instructions by 3 to 5 bits compared with general-purpose register (GPR) architectures with the same register file size.

  • FAA can double the number of ISA registers of modern 32-bit RISC processors.

  • FAA can avoid reducing the number of ISA registers for 16-bit instructions on embedded processors.

  • The microarchitecture for FAA is very similar to that of the GPR architecture, as the accumulator in FAA can be viewed simply as an alias to some general-purpose register at any instruction. Therefore, it is very easy to implement FAA based on the GPR microarchitecture.
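As a rough illustration of the alias view described above, the following Python sketch models a register file in which the accumulator is only an index into the general-purpose registers. The class name, the `set_acc` method, and the 16-register file size are assumptions made for this sketch; this text does not specify how FAA encodes an accumulator update.

```python
import operator
from dataclasses import dataclass, field

# ALU operations supported by this toy model (an arbitrary small set).
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


@dataclass
class FloatingAccumulatorMachine:
    """Toy model: the accumulator is an alias (index) into the register file."""
    regs: list = field(default_factory=lambda: [0] * 16)
    acc: int = 0  # index of the register currently aliased as the accumulator

    def set_acc(self, i: int) -> None:
        # Hypothetical alias-update step (A = Ri). How FAA actually encodes
        # this update is not described in this text.
        self.acc = i

    def alu(self, op: str, rj: int, rk: int) -> None:
        # A = Rj op Rk: only the two source registers are named explicitly;
        # the destination is whichever register the accumulator aliases.
        self.regs[self.acc] = OPS[op](self.regs[rj], self.regs[rk])


m = FloatingAccumulatorMachine()
m.regs[1], m.regs[2] = 6, 7
m.set_acc(3)        # accumulator now aliases R3
m.alu("mul", 1, 2)  # R3 = R1 * R2
m.set_acc(4)        # accumulator now aliases R4
m.alu("add", 3, 1)  # R4 = R3 + R1
```

In this model, no storage is added for the accumulator itself: results are written straight into the ordinary register file, which is consistent with the claim that the microarchitecture stays close to that of a GPR design.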

The rest of this paper is organized as follows. Section 2 briefly recaps some well-known instruction set architectures. Section 3 introduces the new ISA and presents the microarchitecture that supports FAA. Section 4 presents an application of FAA to double the number of ARM registers, and Section 5 shows the experimental results. Section 6 surveys the related work, and Section 7 concludes this paper.

Instruction set architectures (ISAs)

ISAs can be generally classified into the following three types

  • Stack architecture

  • Accumulator architecture, and

  • General-purpose register architecture

based on the type of internal storage in the processor [7], [10]. A stack architecture uses an operand stack to execute instructions, and the operands are implicitly on the top of the stack, as shown in Fig. 1(a). An accumulator architecture has a special register called the accumulator, which implicitly stores one operand and the result, while

Floating accumulator architecture (FAA)

The key idea of the floating accumulator architecture (FAA) is to assign one of the general-purpose registers as the accumulator, which will in turn be the implicit destination register of an instruction. For any ALU operation, e.g. Ri = Rj op Rk, Ri must be designated as the accumulator (i.e. A ≡ Ri) and hence the operation will be in fact represented by the instruction A = Rj op Rk. Since the accumulator is implicitly addressed, it does not need to occupy any register field in the

Application: doubling the number of registers on ARM

The advantage of FAA over GPR is that the destination register field can be taken out of instructions, resulting in a saving of 4 bits of the instruction length on each ARM instruction. As a result, an ARM instruction will have a 4-bit free encoding space to represent the extra registers that are available when the number of registers on ARM processors is doubled from 16 to 32 registers.
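The bit budget behind this argument can be checked with a few lines of Python. The sketch assumes a simple three-register instruction form with directly encoded register fields; the helper names are hypothetical, and the actual FAA encoding on ARM is not reproduced here.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed to name one of num_registers registers with direct encoding."""
    return max(1, (num_registers - 1).bit_length())


def operand_bits(num_registers: int, num_fields: int) -> int:
    """Total bits spent on register fields in one instruction."""
    return register_field_bits(num_registers) * num_fields


# GPR form: destination plus two sources are all named explicitly.
gpr_16 = operand_bits(16, 3)
gpr_32 = operand_bits(32, 3)

# FAA form (assumed here): the destination is the implicit accumulator,
# so only the two source fields remain.
faa_32 = operand_bits(32, 2)

print(f"GPR, 16 registers: {gpr_16} bits")
print(f"GPR, 32 registers: {gpr_32} bits")
print(f"FAA, 32 registers: {faa_32} bits")
```

Under these assumptions, two 5-bit source fields need 10 bits, which fits within the 12 bits that three 4-bit fields occupy in the 16-register GPR form, whereas a 32-register GPR form would need 15 bits.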

Experimental results

Experiments have been conducted by running the MediaBench benchmarks [16] on SimpleScalar/ARM, a port of SimpleScalar [3] to ARM available from the University of Michigan. SimpleScalar/ARM simulates the five-stage pipeline of ARM9 and StrongARM processors, such as the ARM9TDMI [1] and the SA-1110 [11], which implement the ARMv4 architecture. Instructions are issued in order by invoking the sim-outorder command with the option -issue:inorder. The configurations of L1 D-cache and I-cache for this

Related work

The physical registers in most modern microprocessors significantly outnumber the architectural registers that are specified by ISAs. There are several commonly used approaches of mapping from architectural registers to physical registers. The ARM processor uses the banked registers feature to hide 20 of all 37 registers in the register file from a program at different times. These banked registers are available only when the processor is in a particular mode [19]. The SPARC processor usually

Conclusions

This paper has presented a new ISA called floating accumulator architecture (FAA) that can expand the number of ISA registers without widening the instructions. FAA reduces the instruction width by making the accumulator the default destination of instructions. The special feature of FAA is that it dynamically chooses a register from the general-purpose register file as the accumulator. The advantage of this design is that the destination register field can be omitted from FAA instructions,

Acknowledgment

This work was supported in part by the Ministry of Science and Technology of Taiwan under Grants MOST 104-2622-8-002-002 and NSC 101-2221-E-011-029-MY3, and sponsored by MediaTek Inc., Hsin-chu, Taiwan.

References (23)

  • ARM Limited. ARM9TDMI Technical Reference Manual (2000)
  • ARM Ltd. Cortex-A9 Technical Reference Manual (2009)
  • T. Austin et al. SimpleScalar: an infrastructure for computer system modeling. IEEE Comput. (2002)
  • J. Bunda et al. 16-bit vs. 32-bit instructions for pipelined microprocessors. Proceedings of the 20th Annual International Symposium on Computer Architecture (1992)
  • H.-H. Chiang et al. Doubling the number of registers on ARM processors. Proceedings of the 16th Workshop on Interaction between Compilers and Computer Architectures (2012)
  • Oracle Corp. Oracle SPARC Architecture 2015 (2015)
  • H.G. Cragon. Computer Architecture and Implementation (2000)
  • A. Halambi et al. A design space exploration framework for reduced bit-width instruction set architecture (rISA) design. Proceedings of the 15th International Symposium on System Synthesis (2002)
  • P. Hammarlund et al. Haswell: the fourth-generation Intel Core processor. IEEE Micro (2014)
  • J.L. Hennessy et al. Computer Architecture: A Quantitative Approach (2012)
  • Intel Corporation. Intel® StrongARM SA-1110 Microprocessor Developer’s Manual (2000)
    Yuan-Shin Hwang is a professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan. He received his Ph.D. and M.S. in computer science in 1998 and 1994 from the University of Maryland at College Park and M.S. and B.S. in electrical engineering from the National Tsing Hua University, Hsinchu, Taiwan in 1989 and 1987, respectively. His research interests include parallel and distributed computing, parallel architectures, parallelizing and optimizing compilers, and programming languages.

    Wei-Che Hsu received the B.S. degree from the Department of Computer Science and Information Engineering, Tamkang University, New Taipei City, Taiwan, in 2012, and the M.S. degree from the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, in 2014. He is currently an engineer in industry.
