An approach to build cycle accurate full system VLIW simulation platform

doi:10.1016/j.simpat.2016.06.006

Simulation Modelling Practice and Theory

Volume 67, September 2016, Pages 14-28

Abstract

Very long instruction word (VLIW) architecture is widely used in the design of digital signal processors (DSPs) and application-specific processors because of its hardware simplicity and high efficiency. Some heterogeneous systems also use VLIW-style accelerators to achieve high computing performance and power efficiency. However, there are few widely accepted simulators that can model a VLIW architecture cycle-accurately or simulate an entire heterogeneous system with VLIW accelerators. In this paper we present an approach to building a cycle-accurate full-system VLIW simulation platform. The basic idea is to analyze the Petri Net modeling used in traditional cycle-accurate simulation and adjust it to match the VLIW architecture. The adjustments reconstruct and optimize the colored tokens, places and arcs of the Petri Net to adapt it to VLIW characteristics. Following this approach, and building on the InOrder simulator in the open-source simulator framework Gem5, we build a heterogeneous multicore full-system simulator for the MaPU (Mathematical Processor Unit) chip, which is composed of a functionally accurate ARM simulator and a cycle-accurate accelerator simulator. To evaluate the performance and accuracy of our simulator, ‘Gem5-MaPU’, we compare the results of a set of DSP benchmarks executed on both the simulator and the RTL model. The results show that our simulator is about 1000 times faster than the RTL model while the cycle error is reduced to less than 5%. With its high accuracy and good speedup over RTL simulation, the cycle-accurate simulator is an efficient and flexible tool for the study and development of VLIW-related architectures, for example in hardware-software co-design and performance evaluation.

Introduction

Very long instruction word (VLIW) architecture has been studied for several decades. A VLIW machine has long machine instructions, an orthogonal instruction set and a high degree of parallelism [1]. Compared with general-purpose processor architectures such as complex instruction set computing (CISC) and reduced instruction set computing (RISC), which rely on complex superscalar designs, VLIW has distinct advantages. It uses simple control logic and a structured hardware design to exploit instruction level parallelism (ILP), achieving high computing performance as well as reasonable power efficiency [2]. These features have made it successful in digital signal processing (DSP) and application-specific integrated circuit (ASIC) design, for applications such as digital audio/video signal processing, pattern recognition and cellphone communication. A typical example is Texas Instruments’ VelociTI architecture DSPs such as the TMS320C6x family [3]. VLIW architecture is also used in building GPUs: AMD has adopted a VLIW structure in its GPU products, and many studies have examined them, such as [4], [5], [6].

However, VLIW architecture does have some limitations. For applications whose algorithms are not regular or structured, VLIW processors are less efficient and waste considerable computing resources. For this reason, a common approach is to use a VLIW accelerator within a heterogeneous computing system. For example, mobile cellphones use VLIW DSP cores such as TI’s TMS320C6x and ADI’s TigerSHARC to meet communication and power efficiency requirements [7], [8]. Many studies use CPU-GPU architectures to accelerate linear algebra algorithms and dense matrix computation [9], [10], [11]. Heterogeneous systems take advantage of multiple architectures, and their rapid growth greatly extends the use of VLIW architecture.

VLIW architecture is often used in the development of DSPs, ASICs, and other power-sensitive designs. Such chip developments usually have long cycles and high costs, which makes software simulators important. Simulation is helpful both in manufacturing and in academic studies [12]. Software simulators enable fast and low-cost analysis of systems. A good simulator helps throughout chip development: comparing alternative designs, supporting software–hardware co-design, shortening the time to market, analyzing system performance and finding bottlenecks. For fast product development, precise and flexible simulators are very important.

An instruction set simulator (ISS) is commonly used in hardware simulation. An ISS provides a software environment that reads microprocessor instructions and simulates their execution on the hardware [13]. An ISS usually exposes more information than real hardware: because it imitates the function of the hardware and generates related results such as memory and register values, it offers an easier and more detailed way to observe this information. There are basically two kinds of ISS: instruction-accurate and cycle-accurate. Instruction-accurate simulators are generally fast and are mainly used for developing software and tracing function-related information, while cycle-accurate simulators are slower but give a much more detailed simulation of the target system. For purposes such as software performance estimation and real-time system analysis, cycle-accurate simulation is necessary. There are many general-purpose simulators, including SimpleScalar [14], [15], SimOS [16], Gem5 [17], IBM’s Mambo and AMD’s SimNow. Many of these commercial or open-source simulators support multiple simulation models and different target processors.
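The difference between the two kinds of ISS can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-operation ISA, the register and memory layout, and the per-operation latencies are hypothetical and are not taken from Gem5, MaPU or any real processor. The timing model simply adds up latencies and ignores pipelining, which a real cycle-accurate simulator would have to model.

```python
# Hypothetical per-operation latencies in cycles (illustrative values only).
LATENCY = {"load": 2, "add": 1, "mul": 3}


def run(program, cycle_accurate=False):
    """Execute a list of (op, dst, a, b) tuples on a tiny register machine.

    Both modes compute the same register values. Only the cycle-accurate
    mode also accumulates a cycle count; otherwise cycles is None.
    """
    regs = {"r0": 0, "r1": 0, "r2": 0, "r3": 0}
    memory = {0: 5, 1: 7}
    instructions = 0
    cycles = 0
    for op, dst, a, b in program:
        if op == "load":
            regs[dst] = memory[a]          # a is a memory address
        elif op == "add":
            regs[dst] = regs[a] + regs[b]  # a and b are register names
        elif op == "mul":
            regs[dst] = regs[a] * regs[b]
        else:
            raise ValueError(f"unknown operation: {op}")
        instructions += 1
        if cycle_accurate:
            cycles += LATENCY[op]          # simple non-pipelined timing
    return regs, instructions, (cycles if cycle_accurate else None)


program = [
    ("load", "r0", 0, None),
    ("load", "r1", 1, None),
    ("mul", "r2", "r0", "r1"),
    ("add", "r3", "r2", "r0"),
]

regs, count, cycles = run(program, cycle_accurate=True)
print(regs["r3"], count, cycles)
```

In instruction-accurate mode the same call returns identical register values but no timing, which is why that mode suits software development while performance estimation needs the cycle count.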

However, few cycle-accurate simulators support VLIW processors. Most widely used simulators, such as SimpleScalar and Gem5, model many kinds of superscalar architectures such as x86, MIPS and SPARC. Apart from a few VLIW simulators written for specific commercial products such as TI’s DSP chips [18], most related research and simulators provide only instruction-accurate modeling and simulation, such as Simple-VLIW [19], [20], or serve educational purposes, like the VLIW-DLX simulator [21]. Some simulators use functionally accurate simulation to estimate cycle-accurate information. None of them meets the requirements of the MaPU [22], whose high-performance accelerator consists of two types of heterogeneous VLIW units with a precisely cycle-accurate programming model. The MaPU’s high-performance VLIW accelerator depends on a specially designed, precisely cycle-accurate pipeline to implement all kinds of algorithms, which poses quite a challenge to its software simulator in both cycle accuracy and simulation speed.

Petri Nets are well suited to describing a processor architecture and its running behavior. There has been much related research adapting Petri Nets or extending their features for various simulation purposes, such as pipeline modeling [23] and dynamic systems [24]. In this paper, we present an approach to building a cycle-accurate VLIW simulator based on the open-source simulator Gem5. The main contribution of this paper is to analyze the Petri Nets of the Gem5 simulator, identify the features that do not match VLIW architecture, and adjust the model to build a VLIW simulation engine. A multi-pipeline model is also supported, to meet the requirement of simulating the MaPU’s whole heterogeneous architecture.

The rest of the paper is organized as follows. Section 2 describes some basic concepts of Petri Nets and the optimizations to Petri Nets made by Gem5 in the InOrderCPU model. Section 3 analyzes the problems of the InOrderCPU model when applied to VLIW architecture, and introduces our approach and modifications. Section 4 describes some key points of our simulator, mainly the design of and considerations behind some modules. Section 5 shows the simulation results on a set of DSP benchmarks, analyzing simulation accuracy and speed. Finally, Section 6 summarizes the paper and describes future work.

Section snippets

Basic concept of Petri Nets

Petri Nets [23], [25] have been proposed as an important mathematical tool for hardware systems modeling by Misunas [26], Ramchandani [27], Agerwala [28], etc. Events are key elements of a Petri Net model. There is a set of possible events in the Petri Net model of a hardware system. Each event has pre-conditions and post-conditions. The occurrence of an event requires its pre-conditions to be true, and it drives its post-condition events to occur. The basic idea of describing a hardware system
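The vocabulary of events, pre-conditions and post-conditions can be illustrated with a minimal place/transition sketch. It is a generic textbook-style Petri Net, not the colored Petri Net model used by Gem5 or by this paper; the class, the place names and the two-stage fetch/decode example are all hypothetical.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net with unit-weight arcs."""

    def __init__(self, marking):
        self.marking = Counter(marking)   # place name -> token count
        self.transitions = {}             # event name -> (inputs, outputs)

    def add_transition(self, name, inputs, outputs):
        # inputs are the pre-condition places, outputs the post-condition places
        self.transitions[name] = (Counter(inputs), Counter(outputs))

    def enabled(self, name):
        """An event may occur only when every pre-condition holds a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= n for p, n in inputs.items())

    def fire(self, name):
        """Consume tokens from pre-conditions, produce them in post-conditions."""
        if not self.enabled(name):
            raise RuntimeError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        self.marking.subtract(inputs)
        self.marking.update(outputs)


# Hypothetical two-stage pipeline: fetch must occur before decode can.
net = PetriNet({"fetch_ready": 1})
net.add_transition("fetch", ["fetch_ready"], ["instr_fetched"])
net.add_transition("decode", ["instr_fetched"], ["instr_decoded"])

print(net.enabled("decode"))   # decode is blocked until fetch has fired
net.fire("fetch")
net.fire("decode")
print(net.marking["instr_decoded"])
```

Firing "fetch" makes the pre-condition of "decode" true, which is the sense in which one event drives its post-condition events to occur.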

Engine designs for VLIW modeling

VLIW architecture is designed to exploit instruction level parallelism (ILP). A VLIW processor has multiple function units whose instructions can be combined into one long instruction without conflict. The execution of these long instructions is a predetermined process, dispatched and scheduled by the compiler before run time. Because the execution order is decided entirely by the compiler instead of the processor’s hardware, VLIW processors usually have better computing performance with less
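Combining the instructions of several function units into one long instruction without conflict can be sketched as follows. This is only an illustration of the general idea under stated assumptions: the four unit names, the textual operations and the make_bundle helper are hypothetical and do not describe the MaPU instruction format or the authors' compiler.

```python
# Hypothetical function units of a 4-slot machine (names are assumptions).
UNITS = ("alu", "mul", "mem", "branch")


def make_bundle(ops):
    """Combine (unit, operation) pairs into one long instruction.

    Each function unit owns one slot. Slots left unused are filled with
    "nop". Two operations that target the same unit conflict, so a
    compiler would have to place them in different long instructions.
    """
    bundle = {unit: "nop" for unit in UNITS}
    for unit, operation in ops:
        if unit not in bundle:
            raise ValueError(f"unknown function unit: {unit}")
        if bundle[unit] != "nop":
            raise ValueError(f"slot conflict on function unit: {unit}")
        bundle[unit] = operation
    return bundle


# Two independent operations share one long instruction and issue together.
bundle = make_bundle([("alu", "add r1, r2, r3"), ("mem", "ld r4, [r5]")])
print(bundle)

# Two ALU operations cannot share a bundle; the conflict is caught at
# "compile time" here, not by run-time hardware.
try:
    make_bundle([("alu", "add r1, r2, r3"), ("alu", "sub r6, r7, r8")])
except ValueError as err:
    print(err)
```

Because the check happens when the bundle is built, the hardware that executes it needs no dependency or conflict detection logic for the slots, which is the simplification the paragraph above refers to.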

Gem5 based implementation

Based on the modeling engine introduced in Section 3, we implement a VLIW simulator on top of Gem5. This section introduces the simulator implementation, mainly the VLIW-related ISA description system. We implement the VLIW simulator for MaPU, supporting the entire instruction sets and two heterogeneous pipelines with different ISAs.

Experiments and evaluation

Gem5-MaPU supports the two different types of VLIW structure inside the MaPU processor: the scalar processing unit (SPU) and the microcode processing unit (MPU). The SPU is a 4-slot VLIW processor with a MIPS-like instruction set and is usually used for simple computing and for configuring the MPU, while the MPU is a 14-slot VLIW processor designed for data-dense computing. The MPU runs the data-dense DSP core algorithms and is the main subject of discussion. The SPU is discussed with separate experiments in
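The evaluation compares the simulator against the RTL model in terms of cycle error and speed. A minimal sketch of how such figures could be computed is given below. The definitions are assumptions (relative cycle error with the RTL count as reference, and speedup as a ratio of wall-clock times), since the exact formulas are not given in this excerpt, and the input numbers are made up for illustration and are not results from the paper.

```python
def cycle_error(sim_cycles, rtl_cycles):
    """Relative cycle error of the simulator, using the RTL count as reference."""
    return abs(sim_cycles - rtl_cycles) / rtl_cycles


def speedup(rtl_seconds, sim_seconds):
    """How many times faster the simulator runs than the RTL simulation."""
    return rtl_seconds / sim_seconds


# Made-up example values, NOT measurements from the paper.
error = cycle_error(sim_cycles=10300, rtl_cycles=10000)
ratio = speedup(rtl_seconds=2000.0, sim_seconds=2.0)
print(f"cycle error: {error:.1%}, speedup: {ratio:.0f}x")
```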

Conclusion

This paper presents an approach to building a cycle-accurate VLIW processor simulator. The basic idea is to analyze Petri Net modeling and adjust it to match VLIW architecture. Following this approach, we build a heterogeneous VLIW simulator for MaPU based on Gem5. The simulator has a VLIW modeling engine and many optimizations. Several DSP algorithm benchmarks, in the form of comparisons between software simulation and MaPU’s RTL modeling, show the good accuracy and accelerating ratio of our

References (32)

R. David et al.
Petri nets for modeling of dynamic systems: A survey
Automatica
(1994)
M. Lam
Software pipelining: An effective scheduling technique for VLIW machines
ACM Sigplan Notices
(1988)
M. Soliman
A VLIW architecture for executing multi-scalar/vector instructions on unified datapath
Proceedings of the 2013 Saudi International Electronics, Communications and Photonics Conference (SIECPC)
(2013)
N. Seshan
High velocity processing [Texas Instruments VLIW DSP architecture]
IEEE Signal Process. Mag.
(1998)
Y. Zhang et al.
Performance and power analysis of ATI GPU: A statistical approach
Proceedings of the 2011 Sixth IEEE International Conference on Networking, Architecture and Storage (NAS)
(2011)
R. Ubal et al.
Multi2sim: a simulation framework for CPU-GPU computing
Proceedings of the Twenty First International Conference on Parallel Architectures and Compilation Techniques
(2012)
R. Taylor et al.
A micro-benchmark suite for AMD GPUs
Proceedings of the 2010 Thirty Ninth International Conference on Parallel Processing Workshops (ICPPW)
(2010)
A. Gatherer et al.
DSP-based architectures for mobile communications: past, present and future
IEEE Commun. Mag.
(2000)
S. Kaxiras et al.
Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads
Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems
(2001)
S. Mittal et al.
A survey of CPU-GPU heterogeneous computing techniques
ACM Computing Surveys (CSUR)
(2015)
N. Nakasato
A fast GEMM implementation on the Cypress GPU
ACM SIGMETRICS Perf. Eval. Rev.
(2011)
V. Volkov et al.
Benchmarking GPUs to tune dense linear algebra
Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2008
(2008)
F. Hosseinpour et al.
Importance of simulation in manufacturing
World Acad. Sci. Eng. Technol.
(2009)
J. Liu et al.
Software timing analysis using HW/SW cosimulation and instruction set simulator
Proceedings of the Sixth International Workshop on Hardware/Software Codesign
(1998)
T. Austin et al.
SimpleScalar: An infrastructure for computer system modeling
Computer
(2002)
D. Burger et al.
The SimpleScalar tool set, version 2.0
ACM SIGARCH Comput. Archit. News
(1997)
