Region-based dual bank register allocation for reduced instruction encoding architectures

doi:10.1016/j.micpro.2017.09.005

Microprocessors and Microsystems

Volume 55, November 2017, Pages 26-43

Abstract

In embedded systems, small code size is important due to memory constraints. One technique for achieving small code size is to reduce the instruction encoding from 32 bits to 16 bits, as in the ARM THUMB and MIPS-16 architectures. This half-size encoding shortens the register operand fields, which makes fewer registers available for register allocation and causes more spills, although the invisible registers can be used as spill locations via copies. We propose reorganizing the original register file into dual banks, adding a bank toggle instruction for bank changes and inter-bank copy instructions for moving values between the banks. We also propose an efficient dual-bank register allocation technique based on regions in the code to reduce spills. As a case study, we applied our banked register allocation model to the THUMB architecture. We found that the code size decreases by as much as 8% (5.8% on average) while the performance improves by as much as 11.1% (3.3% on average). Our results indicate that, for an embedded CPU that supports reduced encoding, it is better to organize the register file into dual banks for higher-quality register allocation than to use the invisible registers for spills.

Introduction

Compared to a general-purpose computer system, one of the most serious constraints of an embedded system is its limited memory. Since the memory price often dominates the total price of an embedded system and it is almost impossible to expand the memory once the system is built, embedded software is always constrained by its code size. In addition, using a small instruction memory may reduce the power consumption of an embedded system, lengthening the battery life. This leads optimizing compilers for embedded systems to favor reducing code size over improving performance when the two criteria conflict, although small code size often leads to high performance.

In addition to compiler optimization techniques, hardware techniques for reducing the code size have been introduced, the most popular of which is reducing the instruction encoding. For example, the ARM THUMB [7] and the MIPS-16 [12] have a 16-bit instruction set instead of a 32-bit instruction set. This is achieved by reducing the bit width of the opcode as well as the bit width of the register operands, as depicted in Fig. 1 for the case of THUMB. With shorter instructions, the same computation requires more instructions, so the instruction count increases. Nevertheless, the code size is known to decrease significantly because the instructions are half-sized, although the performance also decreases tangibly [7].
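The trade-off between instruction count and instruction width can be written as a simple break-even condition. The symbols below are introduced here for illustration and are not the paper's notation: N_32 and N_16 are the instruction counts of the same program in the 32-bit and 16-bit encodings, and S_32 and S_16 are the resulting code sizes in bytes, assuming every instruction has the fixed width of its encoding.

```latex
S_{32} = 4\,N_{32}, \qquad S_{16} = 2\,N_{16}
\quad\Longrightarrow\quad
\frac{S_{16}}{S_{32}} = \frac{N_{16}}{2\,N_{32}},
\qquad
S_{16} < S_{32} \iff N_{16} < 2\,N_{32}.
```

Under this assumption, the 16-bit code is smaller as long as the instruction count less than doubles.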

The shortened register operand fields for reduced encoding imply that fewer registers are available for register allocation, which can lead to higher register pressure and more spills, affecting both the code size and the performance negatively. For example, the ARM THUMB instructions have three-bit register operand fields instead of the four-bit fields of the original ARM instructions. So, only eight registers are available, while the processor still has sixteen registers. According to our observation, this limitation leads to more register spills than in the original architecture (see Section 5.2). This raises the question of whether there is a way to use the unavailable registers, and one idea is to employ an architectural mechanism called banked registers.
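The link between operand width, register pressure, and spills can be illustrated with a toy calculation. The sketch below is not taken from the paper: it assumes straight-line code, represents each instruction as a hypothetical (defs, uses) pair of variable-name sets, and assumes that every register addressable by the operand field is allocatable. It computes the peak number of simultaneously live values and a lower bound on how many of them cannot be kept in registers.

```python
def addressable_registers(operand_bits):
    """Number of registers an operand field of the given width can name."""
    return 1 << operand_bits


def max_live(instrs):
    """Peak number of simultaneously live variables in straight-line code.

    instrs is a list of (defs, uses) pairs of sets, in program order.
    Nothing is assumed to be live after the last instruction.
    """
    live = set()
    peak = 0
    for defs, uses in reversed(instrs):
        live -= defs          # a definition ends the live range (scanning backward)
        live |= uses          # a use makes the variable live before this point
        peak = max(peak, len(live))
    return peak


def spill_lower_bound(instrs, num_regs):
    """At the peak point, at least this many values cannot be in registers."""
    return max(0, max_live(instrs) - num_regs)


# Toy program: define ten values, then consume all of them at once.
values = [f"v{i}" for i in range(10)]
program = [({v}, set()) for v in values] + [({"s"}, set(values))]

thumb_regs = addressable_registers(3)   # three-bit operand field
arm_regs = addressable_registers(4)     # four-bit operand field
```

With this toy program, `spill_lower_bound(program, thumb_regs)` is 2 while `spill_lower_bound(program, arm_regs)` is 0, which mirrors the observation that the same code spills more when only eight registers are visible.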

Generally, banked registers refer to a register file grouped into several banks, and they have been used for various purposes in diverse contexts [5], [6], [8], [9], [14], [16], [17], [18], [19], [20], [21], [23], [24], [26], [38] (see Section 6). In our context of reduced encoding architectures, we reorganize the original register file into dual banks and allow only one bank to be active at a time, switching banks with a bank change instruction. This makes all of the original registers available for register allocation, including those otherwise unavailable, thus reducing spills. This idea is also applicable to CPUs with an originally compact encoding (8 or 16 bits) such as the Motorola 68HC12 [28], when we want to double the number of registers without compromising the instruction encoding, by reorganizing the extended register file into dual banks [20].
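A minimal sketch of such a dual-bank register file is shown below. It is a toy model under stated assumptions, not the paper's b-THUMB design: the class name, the method names, and the direction of the inter-bank copy (from the active bank to the inactive bank) are all hypothetical. The point it illustrates is that a three-bit operand plus one bit of bank state is enough to reach all sixteen physical registers.

```python
class DualBankRegisterFile:
    """Toy model: 16 physical registers viewed as two banks of 8.

    Hypothetical illustration only; names and semantics are assumptions.
    """

    BANK_SIZE = 8  # registers nameable by a three-bit operand field

    def __init__(self):
        self.regs = [0] * (2 * self.BANK_SIZE)
        self.active = 0  # index of the currently active bank (0 or 1)

    def _physical(self, operand, bank):
        """Map (bank, three-bit operand) to a physical register index."""
        if not 0 <= operand < self.BANK_SIZE:
            raise ValueError("operand does not fit in a three-bit field")
        return bank * self.BANK_SIZE + operand

    def read(self, operand):
        return self.regs[self._physical(operand, self.active)]

    def write(self, operand, value):
        self.regs[self._physical(operand, self.active)] = value

    def toggle(self):
        """Bank change: make the other bank the active one."""
        self.active ^= 1

    def copy_to_other_bank(self, src, dst):
        """Inter-bank copy: active-bank src into inactive-bank dst."""
        other = self.active ^ 1
        self.regs[self._physical(dst, other)] = self.read(src)


rf = DualBankRegisterFile()
rf.write(3, 42)               # register 3 of bank 0
rf.copy_to_other_bank(3, 5)   # now also in register 5 of bank 1
rf.toggle()                   # bank 1 becomes active
value_in_bank1 = rf.read(5)
```

After the toggle, operand 5 names a different physical register than it did before, so a value that is needed on both sides of a bank change has to be moved explicitly with an inter-bank copy.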

To allocate banked registers, we need to partition the code into two regions, one for each register bank, and allocate registers separately from each bank. If there are variables live across regions, inter-bank copies should be inserted appropriately. The most important issue is how to partition the code efficiently so as to reduce the register pressure, hence the spills, while minimizing the bank changes and inter-bank copies. We propose an efficient heuristic for code partitioning and an elaborate region-based banked register allocation technique. Unlike previous techniques, our goal is to reduce the code size as well as to increase the performance, so we try to reduce the bank change overhead while partitioning the code aggressively beyond basic blocks. In a case study with the THUMB, we obtained competitive results for both the code size and the performance, and the technique is generally applicable to other reduced encoding architectures.
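The overhead that a given partition introduces can be estimated with a short sketch. This is not the paper's partitioning heuristic: it only evaluates a partition that is already given, it assumes straight-line code, and the function name and the (defs, uses) instruction representation are hypothetical. It counts the bank changes and lists the inter-bank copies needed when a value defined in one bank is used in the other.

```python
def bank_change_overhead(instrs, banks):
    """Count bank toggles and inter-bank copies for a given partition.

    instrs: list of (defs, uses) pairs of variable-name sets, in program order.
    banks:  list of bank ids (0 or 1), one per instruction.
    Returns (toggles, copies); each copy is (index, var), where index is the
    first instruction of the run of same-bank instructions that needs var.
    Variables used without an earlier definition are ignored (assumed inputs).
    """
    toggles = 0
    copies = []
    holders = {}      # variable -> set of banks that currently hold its value
    run_start = 0
    for i, ((defs, uses), bank) in enumerate(zip(instrs, banks)):
        if i > 0 and bank != banks[i - 1]:
            toggles += 1          # a bank change instruction is needed here
            run_start = i
        for var in sorted(uses):
            if var in holders and bank not in holders[var]:
                copies.append((run_start, var))
                holders[var].add(bank)   # after the copy, both banks hold it
        for var in defs:
            holders[var] = {bank}        # a new definition lives in one bank
    return toggles, copies


program = [
    ({"a"}, set()),        # bank 0
    ({"b"}, {"a"}),        # bank 0
    ({"c"}, {"a"}),        # bank 1: "a" must be copied across
    ({"d"}, {"c"}),        # bank 1
    ({"e"}, {"b", "d"}),   # bank 0: "d" must be copied back, "b" is already there
]
partition = [0, 0, 1, 1, 0]
overhead = bank_change_overhead(program, partition)
```

For this partition the sketch reports two bank changes and two inter-bank copies, `(2, [(2, "a"), (4, "d")])`. A partitioning heuristic would weigh such overhead against the spills avoided by having a second bank available.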

The contributions of this paper are as follows. We propose a banked register file for half-sized encoding architectures to utilize otherwise invisible registers, and an efficient banked register allocation technique that reduces spills and bank change overhead. Our results provide a useful insight for embedded CPU design: if a CPU is to provide a half-sized encoding feature as well, it is desirable to organize the register file into dual banks for banked register allocation, rather than using the inaccessible registers as spill locations as in the THUMB.

The rest of the paper is organized as follows. In Section 2, we will show how to adopt banked registers in a reduced instruction encoding architecture. Section 3 briefly describes the ARM THUMB architecture and shows our architectural change with banked registers. In Section 4, we will explain the details of our banked register allocation technique. Section 5 reports our experimental results and Section 6 describes the related work. We summarize the paper in Section 7.

Banked register allocation for reduced encoding architectures

In this section, we illustrate the benefit of banked register allocation with a simple example. We also present our banked register model for reduced encoding architectures and give some intuition for banked register allocation.

Banked registers for the ARM THUMB architecture

In this section, we describe how to apply a banked register model to a given reduced encoding architecture, using the ARM THUMB as an example. We first summarize the original THUMB architecture, followed by our proposal for restructuring its register file into a banked register file. The focus of this paper is not the architectural modification of the THUMB but the banked register allocation model, and we use the THUMB as the target of a case study. As such, we do not deal with its

Region-based banked register allocation

Previous sections described our banked register extension for the THUMB and some intuition for region-based banked register allocation. In this section, we describe our register allocation technique in detail with an example. Basically, banked register allocation requires partitioning the code into two regions and allocating registers for each from a different register bank, with inter-bank copies added for live ranges that span regions. A good partitioning is essential for good register allocation.

For our

Experimental results

In this section, we evaluate the proposed banked register model and the banked register allocation technique on the target b-THUMB architecture. We first describe the experimental environment. Then, we present the code size and the performance results, with other relevant data useful for understanding the results.

Related work

Banked registers have been used for various purposes in diverse contexts. We first describe previous work that employs banked registers for short encoding, thus directly comparable to ours. We then describe related work in different contexts.

Summary and future work

Code size is important in many embedded systems for reducing memory cost, power consumption, and I-cache pressure. Reduced encoding architectures are a popular hardware solution for achieving small code size. Unfortunately, reduced encoding for register operand fields makes fewer registers available for register allocation, leading to more spills and affecting the code size and the performance negatively, although those invisible registers can be used as spill locations to mitigate the performance

Acknowledgements

Soo-Mook Moon was supported in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2017R1A2B2005562).

Je-Hyung Lee received the B.S. degree from the School of Electronics Engineering at Kyungpook National University, Korea in February 1999, and the M.S. and Ph.D. degrees from the School of EE&CS at Seoul National University, Korea in February 2001 and August 2009, respectively. He is currently a senior engineer at Samsung Electronics. He works on compilers for mobile platforms, and his research interests include compiler and VM performance optimizations in LLVM, GCC and JavaScript engines.

References (38)

MediaBench....
MiBench....
Advanced RISC Machines Ltd. ARM7TDMI Technical Reference Manual, rev 4 edition,...
P. Briggs, K.D. Cooper, and L. Torczon. Improvements to graph coloring register allocation. ACM Trans. Programming...
J.L. Ayala, M. Lopez-Vallejo, and A. Veidenbaum. A compiler-assisted banked register file architecture. In Workshop on...
J.-L. Cruz, A. Gonzalez, M. Valero, and N. Topham. Multiple-banked register file architectures. In Proceedings of the...
ARM. Improving ARM Code Density and Performance....
J. Hiser, S. Carr, and P. Sweany. Register assignment for software pipelining with partitioned register banks. In...
Intel Corporation. MCS51 Microcontroller Family User's Manual,...
R. Johnson et al. Finding Regions Fast: Single Entry Single Exit and Control Regions in Linear Time. Technical Report (1993)
K.D. Cooper and T.J. Harvey. Compiler-Controlled Memory. In Proceedings of the 8th International Conference on...
K. Kissell. MIPS16: High-Density MIPS for the Embedded Market. Silicon Graphics MIPS Group,...
T. Kiyohara et al. Register connection: a new approach to adding registers into instruction set architectures. In 20th...
M. Kondo and H. Nakamura. A small, fast and low-power register file by bit-partitioning. In 11th International...
A. Krishnaswamy and R. Gupta. Dynamic coalescing for 16-bit instructions. ACM Trans. Embedded Comput. Syst.(TECS),...
A. Krishnaswamy and R. Gupta. Efficient use of invisible registers in thumb code. In Proceedings of the 38th IEEE/ACM...
J.-H. Lee, J. Park, and S.-M. Moon. Securing more registers with reduced instruction encoding architectures. In...
M. Naik and J. Palsberg. Compiling with code-size constraints. ACM Trans. Embedded Comput. Syst., 3(1):163–181,...
J. Park, J.-H. Lee, and S.-M. Moon. Register allocation for banked register file. In Proceedings of the ACM SIGPLAN...

Soo-Mook Moon received his Ph.D. from the University of Maryland, College Park, in 1993. During 1992–1993, he worked at the IBM Thomas J. Watson Research Center, where he developed the IBM VLIW compiler. During 1993–1994, he was a software design engineer at the Hewlett-Packard Company in the California Language Lab, where he contributed to the development of an optimizing compiler for the PA-RISC CPUs. Since 1994, he has been on the faculty of Seoul National University in the School of Electrical Engineering and Computer Science, where he is now a full professor.

Jinpyo Park received his Ph.D. from Seoul National University in 2003. He is now working at Samsung Electronics. His area of interest is SoC design, especially low-power SoC design, bus architecture, and memory subsystems.

^☆: This is a revised and extended version of a paper published in the Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Aug 2007 [17]. The major difference from the conference paper is that we focus on performance improvement as well as code size reduction of [17], by newly introducing inter-bank copies and by improving the quality of register allocation with better allocation. We also expanded the evaluation. This work was performed while Je-Hyung Lee and Jinpyo Park were at Seoul National University.
