Region-based dual bank register allocation for reduced instruction encoding architectures☆
Introduction
Compared with a general-purpose computer system, one of the most serious constraints of an embedded system is its limited memory. Since the memory price often dominates the total price of an embedded system, and it is almost impossible to expand the memory once the system is built, embedded software is always constrained by its code size. In addition, a smaller instruction memory may reduce the power consumption of an embedded system, lengthening battery life. This leads optimizing compilers for embedded systems to favor reducing code size over improving performance when the two criteria conflict, although small code size often leads to high performance.
In addition to compiler optimization techniques, hardware techniques for reducing the code size have been introduced, the most popular of which is reducing the instruction encoding. For example, the ARM THUMB [7] and the MIPS-16 [12] have a 16-bit instruction set instead of a 32-bit instruction set. This is achieved by reducing the bit width of the opcode as well as the bit width of register operands, as depicted in Fig. 1 for the case of the THUMB. With shorter instructions, the same computation requires more instructions, so the instruction count increases. Nevertheless, the code size is known to decrease significantly because the instructions are half-sized, although the performance also decreases tangibly [7].
The shortened register operand fields for reduced encoding imply that fewer registers are available for register allocation, which can lead to higher register pressure and more spills, affecting both the code size and the performance negatively. For example, the ARM THUMB instructions have three-bit register operand fields instead of the four-bit fields of the original ARM instructions. So, only eight registers are available, while the processor still has sixteen registers. According to our observation, this limitation leads to more register spills than in the original architecture (see Section 5.2). This raises the question of whether the unavailable registers can be put to use, and one idea is to employ an architectural mechanism called banked registers.
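The link between operand field width and spills can be illustrated with a minimal sketch. It assumes an idealized straight-line code fragment described only by live ranges and ignores registers reserved for special purposes; the function names `max_pressure` and `excess_values` are hypothetical and are not taken from the paper.

```python
def max_pressure(live_ranges):
    """Return the largest number of simultaneously live values.

    live_ranges: list of (start, end) program points, end exclusive.
    """
    events = []
    for start, end in live_ranges:
        events.append((start, 1))   # value becomes live
        events.append((end, -1))    # value dies
    # At the same program point, process deaths (-1) before births (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    live = peak = 0
    for _, delta in events:
        live += delta
        peak = max(peak, live)
    return peak


def excess_values(live_ranges, num_regs):
    """Number of values that cannot be held in registers at the peak."""
    return max(0, max_pressure(live_ranges) - num_regs)


# Ten values that are all live at the same time (illustrative data only).
ranges = [(0, 10)] * 10
regs_3bit = 2 ** 3   # registers addressable by a three-bit field
regs_4bit = 2 ** 4   # registers addressable by a four-bit field
```

With these illustrative ranges, `excess_values(ranges, regs_3bit)` is 2 while `excess_values(ranges, regs_4bit)` is 0, which is the kind of gap that spill code has to cover.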
Generally, banked registers are a register file grouped into several banks, which have been used for various purposes in diverse contexts [5], [6], [8], [9], [14], [16], [17], [18], [19], [20], [21], [23], [24], [26], [38] (see Section 6). In our context of reduced encoding architectures, we restructure the original register file into dual banks and allow only one bank to be active at a time, with a bank change instruction selecting the active bank. This makes all of the original registers available for register allocation, including those otherwise unavailable, thus reducing spills. This idea is also applicable to CPUs with an originally compact (8- or 16-bit) encoding, such as the Motorola 68HC12 [28], when we want to double the number of registers without compromising the instruction encoding, by organizing the extended register file into dual banks [20].
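The dual-bank idea can be pictured with a toy model. The sketch below assumes sixteen physical registers, three-bit operand fields, and a bank change instruction that simply toggles the active bank; the class `DualBankRegisterFile` and its method names are hypothetical, and the actual bank change semantics of the proposed architecture may differ.

```python
class DualBankRegisterFile:
    """Toy model: 16 physical registers seen through 3-bit operand fields."""

    BANK_SIZE = 8  # 2**3 registers addressable by a three-bit field

    def __init__(self):
        self.physical = [0] * (2 * self.BANK_SIZE)
        self.active_bank = 0

    def change_bank(self):
        """Hypothetical bank change instruction: toggle the active bank."""
        self.active_bank ^= 1

    def _index(self, field):
        # Map a 3-bit operand field to a physical register of the active bank.
        if not 0 <= field < self.BANK_SIZE:
            raise ValueError("operand field must fit in three bits")
        return self.active_bank * self.BANK_SIZE + field

    def read(self, field):
        return self.physical[self._index(field)]

    def write(self, field, value):
        self.physical[self._index(field)] = value


rf = DualBankRegisterFile()
rf.write(0, 11)      # operand field 0 in bank 0
rf.change_bank()
rf.write(0, 22)      # the same operand field now names a bank 1 register
rf.change_bank()     # back to bank 0; its value is still there
```

In this model the instruction encoding never grows: the same three-bit field reaches all sixteen registers, at the price of executing a bank change whenever the other half is needed.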
To allocate banked registers, we need to partition the code into two regions, one for each register bank, and allocate registers separately from each bank. If there are variables live across regions, inter-bank copies should be inserted appropriately. The most important issue is how to partition the code efficiently so as to reduce the register pressure, and hence the spills, while minimizing the bank changes and inter-bank copies. We propose an efficient heuristic for code partitioning and an elaborate region-based banked register allocation technique. Unlike previous techniques, our goal is to reduce the code size as well as to increase the performance, so we try to reduce the bank change overhead while partitioning the code aggressively beyond basic blocks. We obtained competitive results for both the code size and the performance in a case study with the THUMB, yet the technique is generally applicable to other reduced encoding architectures.
The contributions of this paper are as follows. We propose a banked register file for half-sized encoding architectures to utilize otherwise invisible registers, and an efficient banked register allocation technique that reduces spills and bank change overhead. Our results provide a useful insight for embedded CPU design: if one wants to build a CPU that also provides a half-sized encoding feature, it would be desirable to organize the register file into dual banks for banked register allocation, rather than using the inaccessible registers as spill locations as in the THUMB.
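The trade-off in partitioning can be made concrete with a small sketch that, for a given assignment of instructions to banks, counts the bank changes and the variables that would need inter-bank copies. It is only an illustration under simplifying assumptions: the code is a straight-line sequence rather than a control flow graph, a variable is treated as crossing regions when it is referenced in both banks, and the function `partition_overhead` is hypothetical. It is not the partitioning heuristic proposed in this paper.

```python
def partition_overhead(instrs, bank_of):
    """Estimate the overhead of a bank assignment for straight-line code.

    instrs:  list of (defs, uses) pairs, each a set of variable names,
             in program order.
    bank_of: list of 0/1 giving the bank assigned to each instruction.
    Returns (bank_changes, crossing_vars).
    """
    # A bank change is needed wherever adjacent instructions differ in bank.
    bank_changes = sum(1 for a, b in zip(bank_of, bank_of[1:]) if a != b)

    # Record which banks reference each variable.
    banks_touching = {}
    for (defs, uses), bank in zip(instrs, bank_of):
        for var in defs | uses:
            banks_touching.setdefault(var, set()).add(bank)

    # Variables referenced in both banks are candidates for inter-bank copies.
    crossing = sorted(v for v, banks in banks_touching.items()
                      if len(banks) == 2)
    return bank_changes, crossing


# Illustrative four-instruction sequence (made-up data).
instrs = [
    ({"a"}, set()),         # a = ...
    ({"b"}, {"a"}),         # b = f(a)
    ({"c"}, {"b"}),         # c = g(b)
    ({"d"}, {"a", "c"}),    # d = h(a, c)
]
split = partition_overhead(instrs, [0, 0, 1, 1])
single = partition_overhead(instrs, [0, 0, 0, 0])
```

Here `split` is `(1, ['a', 'b'])` and `single` is `(0, [])`: splitting the sequence frees registers in each region but costs one bank change and two crossing variables, which is the balance a partitioning heuristic has to strike.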
The rest of the paper is organized as follows. In Section 2, we will show how to adopt banked registers in a reduced instruction encoding architecture. Section 3 briefly describes the ARM THUMB architecture and shows our architectural change with banked registers. In Section 4, we will explain the details of our banked register allocation technique. Section 5 reports our experimental results and Section 6 describes the related work. We summarize the paper in Section 7.
Section snippets
Banked register allocation for reduced encoding architectures
In this section, we illustrate the benefit of banked register allocation with a simple example. We also present our banked register model for reduced encoding architectures and some intuition for banked register allocation.
Banked registers for the ARM THUMB architecture
In this section, we describe how to apply a banked register model to a given reduced encoding architecture, using the ARM THUMB as an example. We first summarize the original THUMB architecture, followed by our proposal for restructuring its register file into a banked register file. The focus of this paper is not the architectural modification of the THUMB but the banked register allocation model, and we use the THUMB as the target of the case study. As such, we do not deal with its
Region-based banked register allocation
Previous sections described our banked register extension for the THUMB and some intuition for region-based banked register allocation. In this section, we describe our register allocation technique in detail with an example. Basically, banked register allocation requires partitioning the code into two regions and allocating each using a different register bank, with inter-bank copies added for live ranges that cross regions. Good partitioning is needed for good register allocation.
For our
Experimental results
In this section, we evaluate the proposed banked register model and the banked register allocation technique on the target b-THUMB architecture. We first describe the experimental environment. Then, we present the code size and the performance results, with other relevant data useful for understanding the results.
Related work
Banked registers have been used for various purposes in diverse contexts. We first describe previous work that employs banked registers for short encoding, thus directly comparable to ours. We then describe related work in different contexts.
Summary and future work
Code size is important in many embedded systems for reducing memory cost, power consumption, and I-cache pressure. A reduced encoding architecture is one popular hardware solution for achieving small code size. Unfortunately, reduced encoding of register operand fields makes fewer registers available for register allocation, leading to more spills and affecting the code size and the performance negatively, although those invisible registers can be used as spill locations to mitigate the performance
Acknowledgements
Soo-Mook Moon was supported in part by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2017R1A2B2005562).
Je-Hyung Lee received the B.S. degree in School of Electronics Engineering from Kyungpook National University, Korea in February 1999, and the M.S. and Ph.D. degrees in School of EE&CS from Seoul National University, Korea in February 2001 and August 2009, respectively. He is currently a senior engineer in Samsung Electronics. He is working on compilers in mobile platforms and his research interests include compiler and VM performance optimizations in LLVM, GCC and JavaScript Engines.
References (38)
- MediaBench....
- MiBench....
- Advanced RISC Machines Ltd. ARM7TDMI Technical Reference Manual, rev 4 edition,...
- P. Briggs, K.D. Cooper, and L. Torczon. Improvements to graph coloring register allocation. ACM Trans. Programming...
- J.L. Ayala, M. Lopez-Vallejo, and A. Veidenbaum. A compiler-assisted banked register file architecture. In Workshop on...
- J.-L. Cruz, A. Gonzalez, M. Valero, and N. Topham. Multiple-banked register file architectures. In Proceedings of the...
- ARM. Improving ARM Code Density and Performance....
- J. Hiser, S. Carr, and P. Sweany. Register assignment for software pipelining with partitioned register banks. In...
- Intel Corporation. MCS51 Microcontroller Family User's Manual,...
- et al. Finding Regions Fast: Single Entry Single Exit and Control Regions in Linear Time. Technical Report (1993)
Soo-Mook Moon received his Ph.D at the University of Maryland, College Park, in 1993. During 1992–1993, he worked at IBM Thomas J. Watson Research Center where he developed the IBM VLIW compiler. During 1993–1994, he was a software design engineer at the Hewlett-Packard Company in California Language Lab where he contributed to the development of an optimizing compiler for the PA-RISC CPUs. Since 1994, he has been with the faculty of the Seoul National University in the School of Electrical Engineering and Computer Science where he is now a full professor.
Jinpyo Park received his Ph.D at Seoul National University in 2003. He is now working at Samsung Electronics. His area of interest is SoC design, especially low power SoC design, bus architecture, and memory subsystem.
☆ This is a revised and extended version of a paper published in the Proceedings of the 13th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA), Aug 2007 [17]. The major difference from the conference paper is that we focus on performance improvement as well as code size reduction of [17], by newly introducing inter-bank copies and by improving the quality of register allocation with better allocation. We also expanded the evaluation. This work was performed while Je-Hyung Lee and Jinpyo Park were at Seoul National University.