# A Post-Silicon Trace Analysis Approach for System-on-Chip Protocol Debug

Yuting Cao, Hao Zheng Computer Science & Engineering U. South Florida Tampa, FL 33620 {cao2, haozheng}@usf.edu

## ABSTRACT

Reconstructing system-level behavior from silicon traces is a critical problem in post-silicon validation of System-on-Chip designs. Current industrial practice in this area is primarily manual, depending on collaborative insights of the architects, designers, and validators. This paper presents a trace analysis approach that exploits architectural models of the system-level protocols to reconstruct design behavior from partially observed silicon traces in the presence of ambiguous and noisy data. The output of the approach is a set of all potential interpretations of a system's internal executions abstracted to the system-level protocols. To support the trace analysis approach, a companion trace signal selection framework guided by system-level protocols is also presented, and its impacts on the complexity and accuracy of the analysis approach are discussed. That approach and the framework have been evaluated on a multi-core system-onchip prototype that implements a set of common industrial system-level protocols.

## 1. INTRODUCTION

Post-silicon validation makes use of pre-production silicon integrated circuit (IC) to ensure that the fabricated system works as desired under actual operating conditions with real software. It is a critical component of the design validation life-cycle for modern microprocessors and system-onchip (SoC) designs. Unfortunately, it is also highly complex, performed under aggressive schedules and accounting for more than 50% of the overall design validation cost [12].

An SoC design is often composed of a large number of predesigned hardware or software blocks (often referred to as "intellectual properties" or "IPs") that coordinate through complex protocols to implement system-level behavior [6]. An execution trace of a system typically involves activities from the CPU, audio controller, display controller, wireless radio antenna, etc., reflecting the interleaved execution of a potentially large number of communication protocols. As SoCs integrate more IPs, the interactions among the IPs are increasingly more complex. Moreover, modern interconnects are highly concurrent allowing multiple transactions to be processed simultaneously for scalability and performance. They are an important source of design errors. On the other hand, observability limitations allow only a small number of participating signals to be actually traced during silicon execution. Furthermore, electrical perturbations cause silicon data to be noisy, lossy, and ambiguous. It is non-trivial during post-silicon debug to identify all participating protocols and pinpoint the interleavings that result

Sandip Ray, Jin Yang Strategic CAD Lab Intel Hillsboro, OR {sandip.ray, jin.yang}@intel.com

in an observed trace.

Previous work [15] proposed a method for correlating silicon traces with system-level protocol specifications. The idea was to reconstruct protocol execution scenarios from a partially observed silicon trace, which provide abstract views of system internal executions to facilitate post-silicon SoC debug. While that work showed promising results, it has a number of deficiencies precluding its applicability in practice. First, there was no way to qualify or rank the quality of protocol execution scenarios generated by the reconstruction procedure. Under poor observability condition, it was possible for the algorithm to generate hundreds or thousands of potential protocol execution scenarios consistent with a partially observed trace. Without a metric to rank the quality of these reconstructions, the debugger is faced with the unenviable task of wading through these potential scenarios to infer what may actually have happened in a specific silicon execution. Moreover, based on past experiences, interleavings of different protocol executions are a major source of functional bugs. Since the method developed in [15] does not capture orderings among different protocol executions, the results obtained with that method offer little help for bug localization and root causing.

This paper addresses the above deficiencies by introducing an optimized trace analysis approach. Central to this optimized approach is a new formulation of protocol execution scenarios that comprehends ordering relations among protocol executions. Quantitative metrics are also developed so that the quality of the results derived by and the efficiency of the analysis approach can be measured. Trace signal selections can have great impacts on the complexity and accuracy of the trace analysis. Therefore, a companion trace signal selection framework is proposed. This framework is communication-centric, and guided by system-level protocols. Its objective is to facilitate the trace analysis to produce high quality interpretations of observed silicon traces efficiently. Various trace signal selection strategies are evaluated and analyzed based on their impacts on the trace analysis approach applied to a non-trivial multi-core SoC model that implements a number of common industrial systemlevel protocols.

## 2. FLOW SPECIFICATION

An SoC model as shown in Figure 1 is used to illustrate and experiment the work described in this paper. It consists of two CPUs (CPU\_X), each with a private Data Cache (Cache\_X), a graphics engine (GFX), a power management unit (PMU), a system memory, and three peripheral blocks:



Figure 1: The block diagram of the simple SoC model.

an audio control unit (Audio), a UART controller (UART), and a USB controller (USB). All these blocks are connected through an interconnect fabric (Bus).

System operations are realized by executions performed in various blocks that are coordinated by system-level protocols. These protocols are typically specified in architecture documents as message flow diagrams, where the words "protocol" and "flow" are used interchangeably. In this paper, as in [15], system flows are formalized using Labeled Petri-nets (LPNs). Figure 2 shows a memory write protocol initiated from a CPU CPU\_X in LPN where  $X \in \{0, 1\}$  and X' = 1 - X. An LPN is a tuple  $(P, T, E, L, s_0)$  where P is a finite set of *places*, T is a finite set of *transitions*, E is a finite set of events, and  $L: T \to E$  is a labeling function that maps each transition  $t \in T$  to an event  $e \in E$ . For each transition  $t \in T$ , its preset, denoted as  $\bullet t \subseteq P$ , is the set of places connected to t, and its postset, denoted as  $t \bullet \subseteq P$ , is the set of places that t is connected to. A state  $s \subseteq P$  of a LPN is a subset of places marked with tokens. There are two special states associated with each LPN;  $s_0 \subseteq P$  which is the set of initially marked places, also referred to as the *initial state*, and the end state  $s_{end}$  which is the set of places not going to any transitions.

A transition t can be executed in a state s if  $\bullet t \subseteq s$ . Executing t causes the labeled event to be emitted, and leads to a new state  $s' = (s - \bullet t) \cup t \bullet$ . Therefore, executing an LPN leads to a sequence of events. Execution of a LPN completes if its  $s_{end}$  is reached.

For example, in Figure 2,  $t_1$  can be executed in  $s_0 = \{p_1\}$ . Event (CPU\_X : Cache\_X : wr\_req) is emitted after  $t_1$  is executed, and the LPN state becomes  $\{p_2\}$ . The end state is  $s_{end} = \{p_9\}$ .

A flow specification may also contain multiple branches describing different ways a system can execute such flow. For example, the flow shown in Figure 2 has three branches covering the cases where the cache (snoop) operation is hit or miss.

# 3. POST-SILICON TRACE ANALYSIS

#### 3.1 Previous Work

This section recaps the previous approach in [15]. The objective of the trace analysis is to reconstruct design internal behavior *wrt* given system-level flow specifications  $\mathbf{F}$  from a partially observed silicon trace on a small number of hardware signals. The off-chip analysis includes two broad phases: (1) trace abstraction, which maps a silicon



Figure 2: LPN formalization of a CPU write protocol. Each LPN transition is labeled with an event (src,dest, cmd) where cmd is a command sent from a source component src to a destination component dest. The places without outgoing edges are *terminals*, which indicate termination of protocols represented by the LPNs.

trace into a sequence of flow events, higher-level architectural constructs including *e.g.*, messages, operations, etc, and (2) trace interpretation, which infers possible flow execution scenarios that are compliant with the abstracted event sequence.

To illustrate the basic idea, consider the system flow in Figure 2, which we call  $F_1$ . Suppose that the following flow execution trace is abstracted from an observed silicon trace by executing a design that implements  $F_1$ .

$$t_1 t_2 t_1 t_2 t_3 t_3 t_4 t_4 t_5 t_6 t_5 t_6 \dots$$

Here the flow events are referred to by their transition names in the LPN. The first four events result in the following flow execution scenario

$$\{(F_{1,1}, \{p_3\}), (F_{1,2}, \{p_3\})\}.$$
 (1)

A flow execution scenario is defined as a set of flow instances and their respective current states after some events are processed [15]. It can be viewed as an abstraction of system states wrt system flows. The above execution scenario indicates that the sequence of the first four events is a result from executing those two flow instances of  $F_1$  from their initial states to the shown states. For the first event  $t_3$ , it may be a result from executing  $F_{1,1}$  or  $F_{1,2}$ , but exactly which one is unknown due to limited observability. Both possible cases are considered, and two execution scenarios below are derived as a result from interpreting  $t_3$ .

$$\{ (F_{1,1}, \{p_4\}), (F_{1,2}, \{p_3\}) \} \{ (F_{1,1}, \{p_3\}), (F_{1,2}, \{p_4\}) \}.$$
 (2)

After handling the next event  $t_3$ , the above two execution scenarios are reduced to the one as shown below.

$$\{(F_{1,1}, \{p_4\}), (F_{1,2}, \{p_4\})\}.$$

After the remaining six events are handled, the following execution scenario is derived.

$$\{(F_{1,1}, \{p_7\}), (F_{1,2}, \{p_7\})\}$$

As another example, now suppose that the design with a bug generates the flow trace below.

$$t_1 t_2 t_1 t_2 t_3 t_3 t_4 t_4 t_5 t_6 t_5 t_{11} \dots$$

This sequence is almost the same as the previous one except that the last event is  $t_{11}$ : (Cache\_X:CPU\_X:rd\_resp) instead of  $t_6$ : (Mem:Bus:rd\_resp) in the previous trace.  $t_{11}$  is an event used in a different flow specification describing a CPU memory read protocol. Analyzing the trace right before  $t_{11}$  leads to the execution scenarios below.

$$\{ (F_{1,1}, \{p_7\}), (F_{1,2}, \{p_6\}) \} \{ (F_{1,1}, \{p_6\}), (F_{1,2}, \{p_7\}) \}.$$
 (3)

However,  $t_{11}$  cannot be a result from executing either flow instances in both scenarios, which indicates a noncompliance of the design implementation with respect to the given flow specification. Such an event is referred to as being *inconsistent*. In this case, the algorithm halts, and returns  $t_{11}$ and the derived flow execution scenarios as shown in (3) for debugger to examine further.

#### **3.2 Flow Execution Scenarios**

The trace analysis approach in [15] does not capture orderings among flow instances for execution scenarios. However, from a debugger's point of view, communication protocols can be related. For example, a firmware loading protocol always happens before a firmware execution protocol. If a firmware execution protocol is found to happen before a firmware loading protocol, that possibly indicates an error in the system implementing such protocols. Such properties cannot be checked by the previous approach.

To address that problem, this paper presents a new definition of flow execution scenarios as

$$\{(F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j}) \mid F_i \in \mathbf{F}\}$$

where  $start_{i,j}$  and  $end_{i,j}$  are two indices representing relative time when  $F_{i,j}$  is initiated and completed. The ordering relations can be derived by comparing their start and end indices. For example, for two flow instances in an execution scenario,  $(F_{u,v}, s_{u,v}, start_{u,v}, end_{u,v})$  and  $(F_{x,y}, s_{x,y}, start_{x,y}, end_{x,y})$ ,  $F_{u,v}$  is initiated before  $F_{x,y}$  if  $start_{u,v} < start_{x,y}$ , or  $F_{x,y}$  is initiated after  $F_{u,v}$  is completed if  $end_{u,v} < start_{x,y}$ . The ordering relations can provide more accurate information for understanding system execution under limited observability. Section 3.3 explains how  $start_{i,j}$  and  $end_{i,j}$  are decided during the trace analysis.

In order to support the new definition of flow execution scenarios, the trace abstraction, which maps an observed silicon trace to a linear sequence of flow events as in [15], is also generalized. A SoC design can be viewed as a group of IP blocks networked by an on-chip interconnect fabric. These blocks communicate with each other through communication links, each of which implements a protocol, such as ARM AXI, over a set of wires. The approach presented in this paper is communication centric in that it works on silicon traces on a selected number of wires of a selected number of communication links for observation. Suppose that there are n communication links, and some wires from each link are selected for observation. A silicon trace is assumed to be a sequence of  $\alpha_0, \alpha_1, \ldots$  such that each  $\alpha_i$  is a vector defined as

$$\alpha_i = \langle \alpha_{0,i}, \dots \alpha_{n,i} \rangle$$

where  $\alpha_{k,i}$  is a state on link k in step i.

If all wires of a link are observable, then a state on that link can be uniquely mapped to a flow event of the same link. Under limited observability, a state on a link is typically mapped to a set of flow events. Therefore, a silicon trace is abstracted to a sequence  $\vec{E_0}, \vec{E_1}, \ldots$  where

$$\vec{E_i} = \langle E_{0,i}, \dots, E_{n,i} \rangle \tag{4}$$

is a vector of sets of flow events abstracted from  $\alpha_i$ , and each  $E_{k,i}$  in  $\vec{E_i}$  is a set of flow events abstracted from state  $\alpha_{k,i}$  in  $\alpha_i$ . No temporal orderings exist among all events in  $\vec{E_i}$ . On the other hand, for two events,  $e_i \in \vec{E_i}$  and  $e_j \in \vec{E_j}$ such that i < j, then  $e_i$  happens before  $e_j$ .

Based on different levels of information captured, this paper classifies flow execution scenarios as follows.

- *Type-1* execution scenarios capture the number of instances of each flow specification *initiated* from a silicon trace, and their relative orderings of initiations.
- *Type-2* execution scenarios, on top of what is captured by Type-1 scenarios, capture *completion* of each flow instance. This additional information can be used to identify potential problems if there is any flow instance that is not completed. Furthermore, Type-2 execution scenarios capture the relative orderings among all flow instances as described above.
- *Type-3* execution scenarios, on top of what is captured by Type-2 scenarios, capture information on execution paths followed by individual flow instances. This information can provide a means to debuggers to have a detailed examination on *how* each flow instance is executed.

These different execution scenarios can be used to provide different views of system execution, from coarse-grained to more detailed ones, at different stages of debug.

#### 3.3 Algorithms

Algorithm 1 shows the top-level procedure for detecting internal flow executions based on a partially observed silicon trace, and checks the compliance *wrt* a given flow specification. It takes as inputs  $\mathbf{F}$ , a set of system level flow specifications, and a signal trace  $\rho$ , which is assumed to be a sequence of states on a set of observable trace signals, and each state is uniquely indexed starting from 0.

This algorithm scans trace  $\rho$  starting from index h initialized to 0, extracts all possible flow events from  $\rho$  at index has described in section 3.2 (line 6), and maps each of those extracted flow events to update already detected execution scenarios (line 11). The algorithm terminates if one of two conditions holds. If an inconsistence is encountered, the set Algorithm 1: Check-Compliance( $\mathbf{F}, \rho$ )

| 1        | /* <b>F</b> : a set of flow specification */                                                                                                                                                                                                                                       |  |  |  |  |  |  |  |  |  |  |  |
|----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|--|--|--|--|--|--|--|
| <b>2</b> | 2 /* $\rho$ : a partially observed silicon trace */                                                                                                                                                                                                                                |  |  |  |  |  |  |  |  |  |  |  |
| 3        | $\mathcal{M} \leftarrow \{\emptyset\}$                                                                                                                                                                                                                                             |  |  |  |  |  |  |  |  |  |  |  |
| 4        | $h \leftarrow 0$                                                                                                                                                                                                                                                                   |  |  |  |  |  |  |  |  |  |  |  |
| 5        | while $h \leq  \rho  \operatorname{\mathbf{do}}$                                                                                                                                                                                                                                   |  |  |  |  |  |  |  |  |  |  |  |
| 6        | $ \begin{vmatrix} \vec{E} \leftarrow abstract(\rho, h) \\ \textbf{foreach} \ E_i \ of \ \vec{E} \ \textbf{do} \\ \\   \ inconsistent \leftarrow true \\ \textbf{foreach} \ e \in E_i \ \textbf{do} \\ \\   \ \textbf{foreach} \ scen \in \mathcal{M} \ \textbf{do} \end{vmatrix} $ |  |  |  |  |  |  |  |  |  |  |  |
| 7        | $ \begin{array}{c} -0 \\ \text{ille } h \leq  \rho  \ \text{do} \\ \vec{E} \leftarrow abstract(\rho, h) \\ \text{foreach } E_i \ of \ \vec{E} \ \text{do} \\ \\ \hline inconsistent \leftarrow true \\ \text{foreach } e \in E_i \ \text{do} \end{array} $                         |  |  |  |  |  |  |  |  |  |  |  |
| 8        | $inconsistent \leftarrow true$                                                                                                                                                                                                                                                     |  |  |  |  |  |  |  |  |  |  |  |
| 9        | foreach $e \in E_i$ do                                                                                                                                                                                                                                                             |  |  |  |  |  |  |  |  |  |  |  |
| 10       | $\textbf{for each } scen \in \mathcal{M} \textbf{ do}$                                                                                                                                                                                                                             |  |  |  |  |  |  |  |  |  |  |  |
| 11       | $Scens \leftarrow analysis(\mathbf{F}, scen, e, h)$                                                                                                                                                                                                                                |  |  |  |  |  |  |  |  |  |  |  |
| 12       | if $Scens \neq \emptyset$ then                                                                                                                                                                                                                                                     |  |  |  |  |  |  |  |  |  |  |  |
| 13       | $inconsistent \leftarrow false$                                                                                                                                                                                                                                                    |  |  |  |  |  |  |  |  |  |  |  |
| 14       | $          \mathcal{M} \leftarrow (\mathcal{M} - scen) \cup Scens $                                                                                                                                                                                                                |  |  |  |  |  |  |  |  |  |  |  |
| 15       | if $inconsistent = true$ then                                                                                                                                                                                                                                                      |  |  |  |  |  |  |  |  |  |  |  |
| 16       | <b>return</b> $(\mathcal{M}, h, i)$                                                                                                                                                                                                                                                |  |  |  |  |  |  |  |  |  |  |  |
| 17       | $h \leftarrow h + 1$                                                                                                                                                                                                                                                               |  |  |  |  |  |  |  |  |  |  |  |
| 18       | return $(\mathcal{M}, -1, -1)$                                                                                                                                                                                                                                                     |  |  |  |  |  |  |  |  |  |  |  |
| _        |                                                                                                                                                                                                                                                                                    |  |  |  |  |  |  |  |  |  |  |  |

of detected partial execution scenarios along with two indices h and i are returned (line 16). Index h provides temporal information on when the inconsistency occurs, while i provides spatial information on which communication link an inconsistent event is transmitted. If no inconsistency is found, the set of all execution scenarios compliant with the observed trace is returned (line 18) when index h is larger than the length of the trace.

Algorithm 2 takes the specification  $\mathbf{F}$ , an execution scenario *scen*, a flow event e, and index h of the trace where e is extracted, and it produces a set of execution scenarios  $\mathcal{R}$  consistent with e. This algorithm performs two tasks. In the first task (lines 5-12), the algorithm checks every flow instance to decide if e can be accepted. If such an instance is found (line 7), then it is updated with the new state as the result of e (line 9). Furthermore, if e causes the flow instance to complete, its index  $end_{i,j}$  is set to h (line 10-11), indicating the completion of that instance due to event e at step h of the trace. In task 2, all possibilities where e can initiate a new flow instance are considered (line 14-20). If a new instance can be initiated, its  $start_{i,j}$  is set to h, indicating the initiation of that instance due to a signal event at step h of the trace.

#### **3.4** On the Complexity and Accuracy

Due to the limited observability, reconstructing system level executions from an observed silicon trace is an imprecise process. The large number of execution scenarios typically derived during the analysis would take large amounts of runtime and memory to process and to store, thus making it less efficient. This is referred to as the *complexity* problem of the trace analysis. After the analysis is done, a large number of derived execution scenarios make it difficult to understand the analysis results, thus being less helpful for debugging. Obviously, a single flow execution scenario derived at the end of the trace analysis provides much more precise information for debug than ten candidate flow execution scenarios. This is referred to as the *accuracy* problem of the trace analysis.

| Algorithm 2: $Analysis(\mathbf{F}, scen, e, h)$                               |  |  |  |  |  |  |  |  |
|-------------------------------------------------------------------------------|--|--|--|--|--|--|--|--|
| $\frac{1}{1} / * scen = \{ (F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j}) \} * /$ |  |  |  |  |  |  |  |  |
| 2 /* e is a flow event abstracted from silicon trace at                       |  |  |  |  |  |  |  |  |
| index $h^*/$                                                                  |  |  |  |  |  |  |  |  |
| з $\mathcal{R}=\emptyset$                                                     |  |  |  |  |  |  |  |  |
| 4 /* Check if $e$ can change state any existing flow                          |  |  |  |  |  |  |  |  |
| instances of scen $*/$                                                        |  |  |  |  |  |  |  |  |
| 5 foreach $(F_{i,j}, s_{i,j}, start_{i,j}, end_{i,j}) \in scen$ do            |  |  |  |  |  |  |  |  |
| $6     s_{i,j}' \leftarrow accept(F_{i,j}, s_{i,j}, e)$                       |  |  |  |  |  |  |  |  |
| 7 if $s'_{i,j} \neq \emptyset$ then                                           |  |  |  |  |  |  |  |  |
| <b>8</b> Let <i>scen'</i> be a copy of <i>scen</i>                            |  |  |  |  |  |  |  |  |
| 9 Replace $s_{i,j}$ of $scen'$ with $s'_{i,j}$                                |  |  |  |  |  |  |  |  |
| 10 if $s'_{i,j} = F_i \cdot s_{end}$ then                                     |  |  |  |  |  |  |  |  |
| 11 Update $end_{i,j}$ of $scen'$ with $h$                                     |  |  |  |  |  |  |  |  |
| 12 $\mathcal{R} \leftarrow \mathcal{R} \cup scen'$                            |  |  |  |  |  |  |  |  |
| 13 /* Check if $e$ can extend <i>scen</i> by initiating new flow              |  |  |  |  |  |  |  |  |
| instances */                                                                  |  |  |  |  |  |  |  |  |
| 14 foreach $F_i \in \mathbf{F}$ do                                            |  |  |  |  |  |  |  |  |
| <b>15</b> create a new instance $F_{i,h}$                                     |  |  |  |  |  |  |  |  |
| 16 $s'_{i,h} \leftarrow accept(F_{i,h}, F_i.s_0, e)$                          |  |  |  |  |  |  |  |  |
| 17 if $s'_{i,h} \neq \emptyset$ then                                          |  |  |  |  |  |  |  |  |
| <b>18</b> Let <i>scen'</i> be a copy of <i>scen</i>                           |  |  |  |  |  |  |  |  |
| 19 $scen' \leftarrow scen' \cup (F_{i,h}, s'_{i,h}, h, -1)$                   |  |  |  |  |  |  |  |  |
| 20 $\mathcal{R} \leftarrow \mathcal{R} \cup scen'$                            |  |  |  |  |  |  |  |  |
| 21 return $\mathcal{R}$                                                       |  |  |  |  |  |  |  |  |
|                                                                               |  |  |  |  |  |  |  |  |

The contributing factors to the complexity and accuracy problems are explained below.

- 1. A signal event mapped to a set of flow events Due to the limited observability, a signal event of an observed silicon trace is often interpreted as a number of different flow events, which typically leads to derivation of a number of different execution scenarios. This situation is exacerbated by the fact that silicon traces are often very long, which could lead to excessively large numbers of possible execution scenarios derived during or at the end of the analysis.
- 2. A flow event mapped to different temporal flow instances – Temporal flow instances refer to the flow instances activated by the same component, e.g. read/write flows activated by CPU\_0. If several temporal instances of some flows are activated by a component, mapping flow events to those flow instances can be ambiguous. For example, suppose that an execution scenario includes two instances of the flow as shown in Figure 2 activated by CPU\_0, one in state  $\{p_2\}$ , and the other one in state  $\{p_8\}$ . An instance of flow event (Cache\_0 : CPU\_0 : wr\_resp) can be mapped to either flow instance leading to two new execution scenarios from the current one.
- 3. A flow event mapped to flow instances activated by different components – This situation can happen when flow instances that share some common events are activated by different components. For example, suppose an execution scenario has two instances of the flow as shown in Figure 2, one activated by CPU\_0 and the other one by CPU\_1, and both are in state  $\{p_6\}$ . A flow event (Mem : Bus : rd\_resp) can be mapped to either one of these two instances, leading to two new

execution scenarios derived from the current one.

The above issues can be mitigated by good signal selections to be discussed in the following section. In order to evaluate the impacts of different trace signal selections on the complexity and accuracy of the trace analysis, this paper introduces two quantitative metrics. The complexity is measured by the *peak* count of flow execution scenarios encountered during the analysis process, *i.e.*, the largest size of  $\mathcal{M}$  encountered during the execution of Algorithm 1. The accuracy is measured by the *final* count of flow execution scenarios derived at the end of the analysis process, *i.e.*, the size of  $\mathcal{M}$  returned on either line 16 or 18 of Algorithm 1.

#### 4. TRACE SIGNAL SELECTION

Trace signal selection is a critical step in post-silicon debug. It includes two different efforts: pre-silicon and postsilicon. During pre-silicon selection, a few thousand signals among a vast number of internal signals are tapped for observation. All necessary signals must be selected at this stage, otherwise, expensive re-design along with silicon re-spin are required. During post-silicon debug, a small subset of those tapped signals are routed to the chip interface for tracing during system execution.

Previous work such as [4] is typically applied to gate level design models, and the quality of the results is evaluated by the commonly used state restoration ratio. However, it is difficult to scale those methods to large and complex SoC designs. More importantly, signals selected at the gate level are often irrelevant to system-level functionalities. There is an attempt to raise the abstraction level for trace signal selection to the register transfer level (RTL) guided by assertions [11], however that work does not consider system level functionalities either. In [3], a system level protocol guided approach is proposed. It is similar to our work in that both are based on system level protocols. However, the selection techniques developed in [3] are simple and irrelevant to understanding silicon traces at the system level, and the evaluation was performed on an abstract transaction level model.

This section introduces a framework shown in Figure 3 for trace signal selection guided by system-level protocols. Due to the page limit, this paper only considers the pre-silicon trace signal selection. Since the pre-silicon selection needs to support all types of execution scenarios, it is sufficient to consider only Type-3 scenarios as they supersede Type-1 or -2 scenarios.

## 4.1 System Level Selection

During the system level selection, different subsets of flow events for observation are selected from given flow specifications. Then, those results are passed to the more refined bit level selection.

To support Type-3 scenarios, the start and end events of all flow specifications must be selected. If a flow specification has multiple branches, additional events may need to be selected so that the branch followed by a flow instance during system execution can be captured. Figure 4 shows two examples of different branching structures for flows.

• In Figure 4(a), each branch ends with an unique event. There is no need to select additional events as observing different end events can clearly identify the branch followed during system execution.



Figure 3: A framework of trace signal selection.



Figure 4: Examples of flow structures.

• In Figure 4(b), branches split and then join, and the flow ends with a common event. In this case, an unique event needs to be selected for each branch.

The flow shown in Figure 2 has three branches with a structure similar to Figure 4(b). Its start and end events are  $\{t_1, t_8, t_9, t_{10}\}$ . Note that  $t_8, t_9$ , and  $t_{10}$  actually refer to the same event. There is no choice for the right branch as  $t_{10}$  must be selected. To identify the left two branches from the right one, either  $t_2$  or  $t_3$  needs to be selected. Similarly, one of events in  $\{t_4, t_5, t_6, t_7\}$  needs to be selected in order to identify the left branch from the middle one. Therefore, all possible event selections for that flow are

$$\{t_1, t_8, t_9, t_{10}\} \times \{t_2, t_3\} \times \{t_4, t_5, t_6, t_7\}.$$

Among possibly large number of event selections, there are two types of events that have interesting characteristics. One type includes events that are unique to specific flows (ref. unique events), while the other type includes events shared by multiple flows (ref. shared events). This section considers their impacts on the complexity and accuracy of the trace analysis and the signal selection. For the complexity and accuracy of the trace analysis, only issues #2 and #3 in section 3.4 are considered in this section. Issue #1 is relevant to the bit level selection.

It is important to select events that can be mapped to smallest number of flow instances during the trace analysis in order to reduce the complexity and improve the accuracy. Refer to the flow shown in Figure 2. Events  $t_2$  and  $t_3$  are used only in that flow. During the trace analysis, they are just mapped to the instances of that particular flow. Issue #2 can be addressed in that instances of different flows initiated by the same component are ignored for those events. Issue #3 is addressed in that instances of any flows initiated by different components are ignored for them.  $t_4$  and  $t_7$  have a similar characteristic.

On the other hand, events  $t_5$  and  $t_6$  are used in many different flows of different components. Those flows can be read/write flows of CPU\_0 or CPU\_1. During the trace analysis, if there are multiple instances of such flows, it is impossible to know which of those flow instances cause those events to be generated. Therefore, the analysis algorithm has to map those events to those flow instances in all possible ways. That can cause huge negative impacts on the complexity and accuracy of the trace analysis.

In terms of trace signal selection, those two types of events can lead to different results. If unique events are selected. then the total number of events selected can be large, and as a result, a large number of trace signals need to be selected in order to observe those events. On the other hand, the total number of events can be smaller if shared events are selected. That leads to a smaller number of trace signals that need to be selected. The negative impacts of selecting shared events can also be mitigated if certain implementation details are available. Next section gives more discussions on that point.

#### 4.2 Bit Level Selection

The bit level selection takes as inputs the set of event selections produced in the previous step and an RTL model that implements the system flow specifications, and performs two tasks for each event selection:

- 1. Evaluate its quality *wrt* the three issues discussed in Section 3.4;
- 2. Choose one selection, and generate a set of candidate trace signals that implement the selected events.

The ultimate goal of the bit level selection is to produce a reduced set candidate trace signals optimized for the trace analysis approach. Since the bit level selection depends on implementation specifics, this section can only discuss some general guidelines and tradeoffs. Note that flow specifications are typically independent of memory address and data information. Therefore, the address and data bits included in event implementations can be generally ignored.

Signals that implement the Cmd field of flow events are selected based on their respective distinguishing power. Given a set of flow events E and a set of signals W that implement E, the distinguishing power of  $W_i \subseteq W$ , is defined by E can be partitioned wrt  $W_i$ . A finer partition means higher distinguishing power. For example, suppose two flow events on link (cpu0, DCache0) implemented by eight signals  $b_7 \dots b_0$ with the following encodings.

```
(cpu0:DCache0:wr_req) 0100 0000
(cpu0:DCache0:rd_req) 1000 0000
```

Under these encodings, signals  $b5...b_0$  have zero distinguishing power.  $b_7$  and  $b_6$  have the equal power, therefore selecting either one would be fine. Selecting signals with

high distinguishing power helps to address issue #1 as discussed in Section 3.4.

RTL models may contain additional implementation information that can help to address issue #2 and #3. For example, memory operations may be executed out-of-order. In this case, CPUs usually assign unique sequence IDs to flow instances to maintain data and control dependency in the original programs. If sequence IDs are available, selecting signals implementing them can help address issue #2.

If the on-chip interconnect needs to handle events from different components in a system, the events are usually assigned with tags to identify their originating components. Selecting tags can affect how events are selected. Refer to Figure 2 for the following discussion.

- 1. If unique events such as  $t_4$  or  $t_7$  are selected, observing tags is not needed.
- 2. Shared events  $t_5$  or  $t_6$  are selected along with tags.

For option 2, tags can help to map events to the flow instances with the same tags during the trace analysis, thus addressing issue #3. Even though additional signals for tags are selected, the total number of events may be smaller if the shared events are used in many different flows, therefore resulting in reduced signals for observation overall.

The following discussion illustrates yet another example of how implementation information can allow different events to be selected. Refer to Figure 2. That flow contains two branching places,  $p_2$  and  $p_4$ . When a flow instance reaches  $p_2$ , which branch to take next depends on whether the cache operation is hit or miss. Similarly, which branch to take at  $p_4$  depends on whether the cache snoop operation is hit or miss. If these two status signals are available and included for observation, there is no need to select branch events. Observing start/end events plus those status signals are sufficient to identify branches followed by a flow instance during system execution.

#### 5. EXPERIMENTAL RESULTS

To the best of our knowledge, this work is the first to present a systematic approach to post-silicon trace analysis guided by system level protocols. We are not able to find any similar previous work where ours can be evaluated and compared with. The closest work to ours is [15]. However, our work is more general and developed with practical considerations. Additionally, the work in [15] is discussed and evaluated based on an abstract transaction-level model while our approach is evaluated on a RTL model.

#### 5.1 The Model

The ideas and techniques presented in this paper are evaluated on a multi-core SoC prototype, as shown in Figure 1, which implements a number of common industrial systemlevel protocols including cache coherence and power management. This prototype is a cycle- and pin-accurate RTL model written in VHDL. Even though this model is simple compared to real SoC designs, it is much more sophisticated than the gate-level benchmark suites typically considered as targets for post-silicon analysis [10, 4, 5].

Since the proposed trace analysis approach is communication centric, the focus of this model is the implementation of system-level protocols. The CPUs are treated as a test environment where software programs are simulated in VHDL to trigger various protocols. Therefore, there is no instruction cache as no instructions are involved when the CPUs are simulated. The peripheral blocks, GFX, PMU, Audio, *etc*, are also described as abstract models that generate events to initiate flows or to respond incoming requests.

More details of some system-level protocols implemented in our model can be found in [3]. They include downstream read/write protocols for each CPU, upstream read/write for the peripheral blocks, and system power management protocols, which are abstracted from real industrial protocols. These system-level protocols are supported by inter-block communication protocols based on the ARM AXI4-lite [1]. A total of 16 flows are implemented for this prototype.

A flow event is generated from a source and consumed by a destination by messages transmitted over that link. In our model, each message is organized as follows.

 $\langle Val(1), Cmd(8), Tag(8), Sid(8), Addr(32), Data(32) \rangle$ 

The meanings of the message fields are given below. The numbers following the individual fields indicate their respective widths. Note that not all fields are used on all links. That model has over four thousand single bit signals.

- Val indicates validity of a message.
- Cmd carries operations to be performed by the target block.
- Tag is used by Bus to identify the original sources of messages from different blocks that go to the same destination, *e.g.* memory wr\_req from Bus in response to wr\_req from both CPUs.
- Sid is an unique number generated by a component to represent sequencing information of flows initiated by the same component.
- Addr carries the memory address at the target block where Cmd is applied.
- Data carries data to a target or from a source. Its width can vary depending on the links where a message is sent. On the links between Cache to Bus, the width is equal to the size of the cache block, which is 64 bytes. For all the other links, the width is 32 bits.

#### 5.2 Experiment Setup

Test Environment The prototype is simulated in a random test environment where CPUs, GFX, and other peripheral blocks are programmed to randomly select a flow to initiate in each clock cycle. The contents of Cmd, Addr, and Data in each activated flow are set randomly. Additionally, CPUs can activate power management protocols non-deterministically. Each of these blocks activates a total of 100 flow instances during entire simulation.

**Trace Signal Selection** In the experiments, different selections of trace signals are produced as discussed in section 4, and their impacts on the complexity and accuracy of the trace analysis approach are evaluated. The list below explains the selections at the system level while information on the bit level selection is given in Table 1.

- S1 All events of all flow specifications, and all signals implementing each event are selected. This selection offers full observation, and provides a baseline for comparing with other selections.
- S2 The start and end events of all protocols are selected. Furthermore, for each branch in each flow, one unique event is selected.

- S3 The start and end events of all protocols are selected. Furthermore, for each branch in each flow, a highly shared event is selected.
- S4 The start and end events of all protocols are selected. Instead of selecting events for branches in each flow, signals whose states control the flow branching are selected.

At the bit level, the Addr and Data fields are not considered. On the other hand, the Val bit is always selected so that valid messages can be identified from observed traces. For selections S2, S3, and S4, experiments are performed to evaluate all combinations of Cmd, Tag and Sid fields.

#### **5.3 Result Analysis**

In Table 1, a  $\checkmark$  means that all signals implementing a particular field for all selected events in selection  $S_X$  are traced. Otherwise, all those signals are not traced. Third row (# Bits) shows the total numbers of single-bit signals are traced for different selections. As discussed in section 4, systemlevel selection may choose events unique to particular flows or events shared by multiple flows. From the table, we can see that selecting shared events leads to a smaller number of trace signals (S3) compared with selecting unique events (S2). However, if status signals controlling flow branching are selected without selecting any branch events, S4 leads to the smallest trace signal selection.

From the table, it is quite obvious that not selecting Cmd or Sid has severe impacts on the trace analysis as explained in issues #1 and #2 in section 3.4. On the other hand, not selecting Tag has negative impacts, but not as severe. The trace analysis can still finish even though it takes more time and memory. Next, compare the results obtained by selecting Cmd and Sid but no Tag under S2-S4. The results with S4 are much better than S2 or S3. This is due to that no branch events are selected for S4, therefore, issues #2 and #3 are avoided. Combined with the benefit of reduced trace signals, S4 appears to be the best option. On the other hand, not selecting any branch events may cause difficulty in understanding flow execution if a branch is long and a system execution fails to reach the end of that branch.

In the above discussion, selections of Cmd, Tag and Sid are applied to all events as the result of the system-level selection. A finer selection can be used to reduce trace signals if unique events and shared events are considered separately. For unique events, the sources where they are generated are known from flows, therefore Tags need not be traced. Shared events may be results of flow instances initiated by different components, therefore tracing Tags are necessary. On the other hand, tracing or not tracing Cmds has little impact on the trace analysis. These points are supported by the results shown in columns under "U S". Under S2, compare the results under "U S" against those with all three fields selected. We can see that the runtime performance and the complexity and accuracy of the trace analysis are similar while the trace signals are reduced with the finer selection. Comparing the results under "U S" against those obtained with only Cmd and Sid selected, the complexity is significantly dropped. The same conclusion can be drawn for S3 and S4.

From the above discussion, it is necessary to trace signals implementing Cmd and Sid whenever possible, and trace as many signals implementing Tag as allowed to reduce complexity of the trace analysis even more. If Tag or Sid is not

Table 1: Runtime Results of Trace analysis with different trace signal selections. Runtime is in seconds and memory usage is in MB. – indicates the results are not available due to the 10 minute time limit exceeded.

| System<br>level<br>selection | S1            | S2                    |                  |        |                               |       | S3                                      |        |        |             |       | S4                                      |         |                  |         |       |
|------------------------------|---------------|-----------------------|------------------|--------|-------------------------------|-------|-----------------------------------------|--------|--------|-------------|-------|-----------------------------------------|---------|------------------|---------|-------|
| Cmd<br>Tag<br>Sid            | $\rightarrow$ | <b>&gt; &gt; &gt;</b> | <b>&gt; &gt;</b> | ✓<br>✓ | <ul><li>✓</li><li>✓</li></ul> |       | <ul><li>✓</li><li>✓</li><li>✓</li></ul> | ✓<br>✓ | ✓<br>✓ | <b>&gt;</b> |       | <ul><li>✓</li><li>✓</li><li>✓</li></ul> | ✓<br>✓  | <b>&gt; &gt;</b> | ✓<br>✓  |       |
| # Bits                       | 870           | 545                   | 401              | 401    | 401                           | 401   | 495                                     | 367    | 367    | 367         | 367   | 378                                     | 258     | 258              | 258     | 258   |
| # scen<br>(Final)            | 1             | 1                     | _                | 1      | _                             | 1     | 1                                       | _      | 1      | -           | 1     | 1                                       | _       | 1                | _       | 1     |
| # scen<br>(Max)              | 1             | 1                     | > 1M             | 5184   | 110k                          | 1     | 1                                       | > 4M   | 5184   | 221K        | 1     | 1                                       | > 1M    | 8                | > 8M    | 1     |
| Time                         | 1.628         | 1.475                 | 600              | 3.679  | 600                           | 1.464 | 1.444                                   | 600    | 3.812  | 600         | 1.426 | 1.430                                   | 1.411   | 1.424            | 600     | 1.419 |
| Mem                          | 0.516         | 1.10                  | > 2GB            | 4.2    | 66                            | 1.124 | 1.11                                    | > 5GB  | 4.2    | 101         | 1.1   | 0.504                                   | > 2  GB | 0.58             | > 5  GB | 1.116 |

part of the design, we recommend to add DFx circuitry in order to trace such information. In the above experiments, the final execution scenarios under different signal selection, if available, contain the correct number of flow instances initiated, and the orderings among the flow instances, as generated by the test environment, are correctly captured.

#### 6. RELATED WORK

Our work is closely related to communication-centric and transaction based debug. An early pioneering work is described by Goossens *et al.* [9, 14, 8], which advocates the focus on observing activities on the interconnect network among IP blocks, and mapping these activities to transactions for better correlation between computations and communications. A similar transaction-based debug approach is presented by Gharebhagi and Fujita [7]. It proposes an automated extraction of state machines at transaction level from high level design models. From an observed failure trace, it tries to derive a set of feasible transaction traces that lead to the observed failure state. However, this approach requires manual inputs and may not be able to derive such traces.

Singerman *et al.* [13] deploys a central repository of system events and simple transactions defined by architects and IP designers. It spans across a wide spectrum of the post-silicon validation including DFx instrumentation, test generation, coverage, and debug. Also, Abarbanel *et al.* [2] propose a model at a higher-level of abstraction, *flows*, is proposed. Flows are used to specify more sophisticated cross-IP transactions such as power management, security, etc, and to facilitate reuse of the efforts of the architectural analysis to check HW/SW implementations.

## 7. CONCLUSION

An improved trace analysis approach for post-silicon debug is presented where observed raw silicon traces are interpreted *wrt* system flow specifications. In this approach, a new formulation of flow execution scenarios is described where more diverse information among flows can be captured and represented. A trace signal selection framework is also described in support of the proposed trace analysis approach. Some observations on trace signal selections and their impacts on the accuracy and efficiency of the trace analysis are discussed. Experiments on a non-trivial SoC prototype reveal insights on impacts of different signal selections on the complexity and accuracy of the trace analysis. In the future, we plan to perform more extensive and in-depth study on trace signal selections guided by system flow specifications.

## 8. REFERENCES

- [1] Amba axi and ace protocol specification. http://www.arm.com.
- [2] Y. Abarbanel, E. Singerman, and M. Y. Vardi. Validation of soc firmware-hardware flows: Challenges and solution directions. In *Proceedings of DAC'14*, pages 2:1–2:4, 2014.
- [3] M. Amrein. System-level trace signal selection for post-silicon debug using linear programming. Master's thesis, Univ. of Illinois Urbana-Champaign, May 2015.
- [4] K. Basu and P. Mishra. Efficient trace signal selection for post silicon validation and debug. In VLSI Design (VLSI Design), pages 352–357. IEEE, 2011.
- [5] D. Chatterjee, C. McCarter, and V. Bertacco. Simulation-based signal selection for state restoration in silicon debug. In *ICCAD*, pages 595–601. IEEE, 2011.
- [6] H. D. Foster. Trends in functional verification: A 2014 industry study. In DAC, pages 48:1–48:6, 2015.
- [7] A. M. Gharehbaghi and M. Fujita. Transaction-based post-silicon debug of many-core system-on-chips. In *ISQED*, pages 702–708, 2012.
- [8] K. Goossens, B. Vermeulen, and A. B. Nejad. A high-level debug environment for communication-centric debug. In *Proceedings of DATE'09*, pages 202–207, 2009.
- [9] K. Goossens, B. Vermeulen, R. v. Steeden, and M. Bennebroek. Transaction-based communication-centric debug. In *Proceedings of NOCS'07*, pages 95–106, 2007.
- [10] H. F. Ko and N. Nicolici. Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug. *IEEE TCAD*, 28(2):285–297, 2009.
- [11] S. Ma, D. Pal, R. Jiang, S. Ray, and S. Vasudevan. Can't see the forest for the trees: State restoration's limitations in post-silicon trace signal selection. ICCAD '15, pages 1–8, Piscataway, NJ, USA, 2015. IEEE Press.
- [12] P. Patra. On the cusp of a validation wall. *IEEE Des. Test*, 24(2):193–196, Mar. 2007.
- [13] E. Singerman, Y. Abarbanel, and S. Baartmans. Transaction based pre-to-post silicon validation. In *Proceedings of DAC'11*, pages 564–568, 2011.
- [14] B. Vermeulen and K. Goossens. A noc monitoring infrastructure for communication-centric debug of

embedded multi-processor socs. In  $\it VLSI-DAT$ '09, pages 183–186, 2009.

 [15] H. Zheng, Y. Cao, S. Ray, and J. Yang. Protocol-guided analysis of post-silicon traces under limited observability. In *Proceedings of ISQED'16*, pages 301–306, March 2016.

# APPENDIX

# A. CPU READ/WRITE DOWNSTREAM PRO-TOCOL

#### X={ 0, 1} X'=1-X Target={ Memory, USB, UART, AUDIO, GFX} CMD ={read, write }

# $X = \{ 0, 1 \}$

X'=1-X Target={ Memory, USB, UART, AUDIO, GFX} CMD ={read, write }



Figure 5: LPN formalization of a CPU read/write protocol.

CPU downstream request with cache hit





CPU downstream request with snoop hit



CPU downstream request with snoop miss



Figure 6: MSQ of CPU downstream read/write protocol

## **B. UPSTREAM READ/WRITE PROTOCOL**

Initiator={ GFX, USB, AUDIO, UART}
Target={ Memory, USB, UART, AUDIO, GFX}
Note that a peripheral can't initialize a read flow to read
itself

Initiator={ GFX, USB, AUDIO, UART} Target={ Memory, USB, UART, AUDIO, GFX} Note that a peripheral can't initialize a read flow to read itself



Figure 7: LPN formalization of a upstream read protocol.

Upstream request with Cache\_0 snoop hit



Upstream request with Cache\_1 snoop hit



Upstream request with both Cache snoop miss



Figure 8: MSQ of upstream read protocol

Upstream request with Cache\_0 snoop hit

Initiator={ GFX, AUDIO} Target={ Memory, USB, UART, AUDIO, GFX} Note that a peripheral can't initialize a write flow to write itself

Initiator={ GFX, AUDIO}

Target={ Memory, USB, UART, AUDIO, GFX}

Note that a peripheral can't initialize a write flow to write itself



Figure 9: LPN formalization of a upstream write protocol.



Upstream request with Cache\_1 snoop hit



Upstream request with both Cache snoop miss



Figure 10: MSQ of upstream write protocol

# C. CPU WRITE BACK PROTOCOL

# D. CPU POWER ON/OFF PROTOCOL

X={1,0} Target={ Memory, USB, UART, AUDIO, GFX}



Figure 11: LPN formalization of a CPU write back protocol.



Figure 12: MSQ of power write back protocol

CMD={ pwr\_on, pwr\_off} X={1,0} Target={ Memory, USB, UART, AUDIO, GFX}



Figure 13: LPN formalization of a CPU power on/off protocol.



Figure 14: MSQ of power on/off protocol