An effective and efficient code generation algorithm for uniform loops on non-orthogonal DSP architecture
Introduction
Most scientific and digital signal processing (DSP) applications, such as image processing and weather forecasting, are iterative and are usually represented by uniform nested loops (Hsu and Jeang, 1993, Kung, 1988, Madisetti, 1995). A digital signal processor (DSP) is a special-purpose microprocessor designed to achieve high performance in DSP applications (Eyer and Bier, 2000). To meet the stringent speed and power requirements of embedded applications, DSPs commonly employ non-orthogonal architectures, which are typically characterized by irregular data paths, heterogeneous register sets, and multiple memory banks (Cho et al., 2002). In the data path, such an architecture has multiple small register files dedicated to different sets of function units instead of a large number of centralized homogeneous registers. In addition, the parallel access enabled by multi-bank memory helps exploit the higher memory bandwidth, but it raises the problem of how to partition variables among the memory banks (Cho et al., 2002, Lee and Chen, 2004, Leupers and Kotte, 2001, Saghir et al., 1994, Saghir et al., 1996, Shiue, 2001, Sudarsanam and Malik, 2000, Wang and Hu, 2004, Zhuge et al., 2001). Therefore, adequate compiler support is essential to harvest the benefits of this non-orthogonal architecture (Lapsley et al., 1996, Madisetti, 1995).
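To make the variable partitioning problem concrete, the following is a minimal sketch, assuming a machine with two memory banks named X and Y and a list of variable pairs that the compiler would like to load in the same cycle. The function names, the greedy heuristic, and the example data are hypothetical illustrations and are not the partitioning method of this paper or of the cited studies.

```python
def partition_variables(variables, parallel_pairs):
    """Greedily assign each variable to bank 'X' or 'Y' (illustrative heuristic only)."""
    # Interference graph: an edge means "these two variables should be
    # loadable in the same cycle", so they should sit in different banks.
    neighbors = {v: set() for v in variables}
    for a, b in parallel_pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    bank = {}
    # Place the most constrained variables first; ties are broken by name.
    for v in sorted(variables, key=lambda name: (-len(neighbors[name]), name)):
        in_x = sum(1 for n in neighbors[v] if bank.get(n) == "X")
        in_y = sum(1 for n in neighbors[v] if bank.get(n) == "Y")
        # Choose the bank opposite to most of the already placed neighbors.
        bank[v] = "Y" if in_x > in_y else "X"
    return bank


def count_conflicts(bank, parallel_pairs):
    """Number of desired parallel loads that end up in the same bank."""
    return sum(1 for a, b in parallel_pairs if bank[a] == bank[b])


# Hypothetical example: three pairs of operands that should be fetched together.
pairs = [("a", "b"), ("c", "d"), ("a", "d")]
bank = partition_variables(["a", "b", "c", "d"], pairs)
print(bank, count_conflicts(bank, pairs))
```

Each pair left in the same bank costs an extra memory access cycle on such a machine, which is why the order of bank assignment and scheduling matters for the final schedule length.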
Many researchers have designed code generation algorithms for specific DSP architectures in order to exploit their features fully. A complete code generation process for a non-orthogonal architecture must include several phases, such as intermediate representation, code compaction, instruction scheduling, memory bank assignment (or variable partitioning), and accumulator/register assignment (Sudarsanam and Malik, 2000). In our previous study, we proposed two scheduling methods for multi-bank memory architectures that cover all phases except accumulator/register assignment (Lee and Chen, 2004). We then proposed a code generation algorithm that contains all of the above phases (Lee and Chen, 2005). Our evaluation showed that, although the algorithm proposed in Lee and Chen (2005) is relatively efficient and effective, it is not scalable and was designed specifically for one embedded DSP, the Motorola DSP56000. Therefore, we want to extend it into a more general algorithm that is suitable for various DSPs with similar architectural features.
Because of the strict resource constraints of the non-orthogonal DSP architecture, accumulator/register spills occur very often. If more spill codes are added to the final schedule, the schedule may become longer, and executing the additional instructions consumes more power. Hence, in addition to increasing instruction-level parallelism, avoiding excessive spill codes is an important issue in designing a code generation algorithm for non-orthogonal DSP architecture. Moreover, although an effective code generation algorithm can yield shorter schedules with fewer spill codes, increasing the number of resources is a more direct way to achieve the same goal. Therefore, in this paper, we propose an effective code generation method and study in depth how the number of resources influences the scheduling result.
To carry out these studies, we need a parameterized architecture that models a scalable non-orthogonal DSP. Many parameterized architecture models have been developed to support advanced compiler and architecture research (MESCAL, OptimoDE, ORC, Tensilica, Trimaran). However, none of them can faithfully represent the irregularity of non-orthogonal DSP architecture, especially its two main features: multiple memory banks and heterogeneous register sets. Thus, we define a hypothetical machine model that extends the Motorola DSP56000 with additional resources. Our proposed method is named Rotation Scheduling with Spill Codes Avoiding (RSSA). It is extended from our previous study (Lee and Chen, 2005), and its scheduling goal is to achieve a shorter schedule length and to avoid generating spill codes as far as possible. RSSA mainly contains five parts with the following features. First, it contains a procedure that generates uncompacted codes directly from a high-level language. In most related methods, this step is not included and is left to existing tools (Cho et al., 2002, Sudarsanam and Malik, 2000, Shiue, 2001). Next, memory bank assignment is performed before code compaction, as in Lee and Chen (2005). With this order, memory accesses are scheduled with knowledge of the variable partitioning, which avoids extra cycles for fetching variables. Then, RSSA schedules ALU instructions and memory load instructions separately, in different parts. This strategy leaves registers unfilled while accumulator spills are being handled, so they can temporarily hold overwritten ALU results. Compared with storing overwritten ALU results to memory and reloading them, this mechanism requires fewer spill codes to resolve accumulator spills. It can be shown that RSSA obtains scheduling results with minimum length and fewer spill codes than related work.
The reason is that it first generates the schedule without considering resource constraints and lengthens the schedule only when required.
After introducing the general algorithm, we select several multi-dimensional data flow graphs (MDFGs) representing DSP applications for evaluation. Two metrics, schedule length and instruction count, are used together to evaluate performance. The evaluation results show that, on the Motorola DSP56000 architecture, our method obtains shorter schedules and fewer spill codes than related studies, which indicates that it is effective on both metrics. In addition, we further analyze the effectiveness of RSSA itself. When the target architecture has more than one data ALU, RSSA can produce a schedule whose length is equal to or less than the critical path of the given MDFG. With only one data ALU, it can still produce a schedule whose length equals the number of ALU instructions in the given MDFG. As for the instruction count, RSSA generates very few spill codes. Moreover, these additional spill codes are compacted with regular codes as far as possible, which avoids lengthening the final schedule and reduces power consumption. The proposed method is then evaluated on various target architectures to study how the number of resources influences the scheduling result. The results show that the accumulator is the most critical resource in non-orthogonal DSP architecture, because increasing the number of accumulators is necessary to improve performance on both evaluation metrics. According to our evaluation results, a target architecture with more than four accumulators can keep most ALU results and eliminate almost all spill codes. Using more input registers or memory banks can also slightly reduce the instruction count. However, implementing additional memory banks and their associated data buses incurs heavy hardware costs. Thus, in view of cost-performance, we recommend using additional accumulators to reduce the instruction count.
As for exploiting instruction-level parallelism, we conclude that the numbers of data ALUs and accumulators must be increased together. If only more data ALUs are added, accumulator spills occur much more frequently and incur many spill codes. The evaluation results also show that two data ALUs in the target architecture are sufficient, since RSSA generates the shortest schedules for most MDFGs. Finally, we compare the efficiency of our method with that of some previous work. An analysis of the execution complexity of each method's scheduling phases shows that RSSA is the most efficient one.
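The two reference lengths used in the evaluation summary above, the critical path of the graph and the number of ALU instructions, can be computed as in the following minimal sketch. It assumes that every node is a unit-time ALU instruction and that only zero-delay (intra-iteration) edges are passed in; the function names and the example graph are hypothetical and are not taken from the paper.

```python
def critical_path_length(nodes, edges):
    """Longest chain of unit-time nodes over zero-delay edges (must form a DAG)."""
    successors = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    # Topological traversal; depth[n] is the longest chain ending at n.
    ready = [n for n in nodes if indegree[n] == 0]
    depth = {n: 1 for n in nodes}
    visited = 0
    while ready:
        u = ready.pop()
        visited += 1
        for v in successors[u]:
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if visited != len(nodes):
        raise ValueError("zero-delay edges must not form a cycle")
    return max(depth.values(), default=0)


def reference_length(nodes, edges, num_alus):
    """Reference length quoted in the text for a given number of data ALUs."""
    if num_alus == 1:
        return len(nodes)  # one ALU instruction per control step
    return critical_path_length(nodes, edges)


# Hypothetical loop body with five ALU instructions.
nodes = ["A", "B", "C", "D", "E"]
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "C")]
print(reference_length(nodes, edges, 1), reference_length(nodes, edges, 2))
```

A schedule shorter than the critical path is possible only because retiming moves delays across nodes and thereby shortens the zero-delay chains of the loop body.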
The remainder of this paper is organized as follows. Section 2 surveys the fundamental background and related studies, and also presents our hypothetical target architecture and design motivations. Section 3 introduces the detailed principles and algorithms of the proposed method. Section 4 contains our preliminary performance evaluations with brief descriptions. Finally, Section 5 presents conclusions and plans for future work.
Section snippets
Fundamental background
In this section, we describe some fundamentals, such as the program model, the retiming technique, and the target machine model. After surveying related studies, our design motivations are introduced.
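As a minimal sketch of the retiming technique on an MDFG, the code below assumes the common convention in which each edge u → v carries a delay vector d(e), each node has a retiming vector r, and the retimed delay is d(e) + r(u) − r(v). The convention, the function name, and the two-node example are assumptions for illustration; the paper's own definitions are given in its Section 2.

```python
def retime(edges, r):
    """Apply a retiming to an MDFG.

    edges: dict mapping (u, v) to a delay vector (tuple of ints)
    r:     dict mapping a node to its retiming vector (missing nodes mean zero)
    """
    if not edges:
        return {}
    dims = len(next(iter(edges.values())))
    zero = (0,) * dims
    retimed = {}
    for (u, v), delay in edges.items():
        ru = r.get(u, zero)
        rv = r.get(v, zero)
        # Assumed convention: d_r(u -> v) = d(u -> v) + r(u) - r(v)
        retimed[(u, v)] = tuple(d + a - b for d, a, b in zip(delay, ru, rv))
    return retimed


# Hypothetical two-node loop: A feeds B in the same iteration,
# and B feeds A one iteration later in both loop dimensions.
mdfg = {("A", "B"): (0, 0), ("B", "A"): (1, 1)}
retimed = retime(mdfg, {"A": (1, 0)})
print(retimed)
```

In this example the zero-delay edge from A to B acquires a nonzero delay, so A and B no longer depend on each other within one iteration, while the total delay around the cycle is unchanged.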
Rotation scheduling with spill codes avoiding (RSSA)
In this section, we introduce our proposed method named Rotation Scheduling with Spill Codes Avoiding (RSSA). Section 3.1 contains some basic assumptions and scheduling principles. Detailed steps of RSSA are introduced in Section 3.2.
Preliminary performance evaluation
In Section 4.1, we evaluate the proposed RSSA and compare it with previous work using several selected MDFGs. RSSA is then evaluated on various hypothetical architectures in Section 4.2 to study the influence of different numbers of resources. After presenting the evaluation results, we briefly summarize the effectiveness and efficiency of our method in Section 4.3.
Conclusions and future work
In this paper, we propose a code generation algorithm, Rotation Scheduling with Spill Codes Avoiding (RSSA), to schedule uniform loops on non-orthogonal DSP architecture. It is extended from our previous study, and its scheduling goal is to achieve a shorter schedule length and to avoid generating spill codes as far as possible. RSSA mainly has the following features: generating uncompacted codes directly from a high-level language, performing variable partitioning before code compaction, separately
References (27)
- Cho, J., Paek, Y., Whalley, D., 2002. Efficient register and memory assignment for non-orthogonal architectures via...
- Daveau, J.M., Thery, T., Lepley, T., Santana, M., 2004. A retargetable register allocation framework for embedded...
- et al. The evolution of DSP processors. IEEE Signal Processing Magazine (2000).
- Hsu, Y.C., Jeang, Y.L., 1993. Pipeline Scheduling Techniques in High-Level Synthesis. In: Proceedings of 6th Annual...
- Kessler, C., Bednarski, A., 2002. Optimal integrated code generation for clustered VLIW architectures. In: Proceedings...
- VLSI Array Processors (1988).
- The parallel execution of DO loops. Communications of the ACM SIGPLAN (1974).
- et al. DSP Processor Fundamentals: Architectures and Features (1996).
- Lee, Y.H., Chen, C., 2004. Efficient variable partitioning and scheduling methods of multiple memory modules for DSP....
- Lee, Y.H., Chen, C., 2005. An efficient code generation algorithm for non-orthogonal DSP architecture. In: Proceedings...
- Retiming synchronous circuitry. Algorithmica.
- VLSI Digital Signal Processors: An Introduction to Rapid Prototyping and Design Synthesis.
Cheng Chen is a professor in the Department of Computer Science and Information Engineering at National Chiao Tung University, Taiwan, ROC. He received his B.S. degree from the Tatung Institute of Technology, Taiwan, ROC in 1969 and his M.S. degree from National Chiao Tung University, Taiwan, ROC in 1971, both in electrical engineering. Since 1972, he has been on the faculty of National Chiao Tung University, Taiwan, ROC. From 1980 to 1987, he was a visiting scholar at the University of Illinois at Urbana Champaign. During 1987 and 1988, he served as the chairman of the Department of Computer Science and Information Engineering at National Chiao Tung University. From 1988 to 1989, he was a visiting scholar at Carnegie Mellon University (CMU). Between 1990 and 1994, he served as the deputy director of the Microelectronics and Information Systems Research Center (MISC) at National Chiao Tung University. His current research interests include computer architecture, parallel processing system design, parallelizing compiler techniques, and high performance video server design.
Yi-Hsuan Lee is a Ph.D. candidate in Computer Science and Information Engineering at National Chiao Tung University, Taiwan, ROC. She received her B.S. degree in Computer Science and Information Engineering from National Chiao Tung University, Taiwan, ROC in 1999. Her current research interests include computer architecture, parallelizing compiler techniques, task scheduling for heterogeneous systems, scheduling problems in DSP architecture, and low-power scheduling techniques in embedded systems.