# Layering the monitoring action for improved flexibility and overhead control: work-in-progress

Giacomo Valente<sup>1</sup>, Tiziana Fanni<sup>2</sup>, Carlo Sau<sup>3</sup>, Francesco Di Battista<sup>1</sup>

<sup>1</sup>Università degli Studi dell'Aquila, <sup>2</sup>Università degli Studi di Sassari, <sup>3</sup>Università degli Studi di Cagliari

email address: giacomo.valente@univaq.it, tfanni@uniss.it, carlo.sau@diee.unica.it

*Abstract*—With the diffusion of complex heterogeneous platforms and the need to characterize them, system monitoring has gained increasing interest. This work proposes a framework to build custom and modular monitoring systems that are flexible enough to cope with the heterogeneity of modern platforms while offering a predictable HW/SW impact.

Index Terms—SW monitoring, HW Monitoring, self-awareness, monitoring layer

# I. CONTEXT AND OBJECTIVES

Embedded platforms are evolving toward heterogeneous architectures that include general-purpose and special-purpose processors implemented on chips with both dedicated and reconfigurable logic. This evolution has been driven mainly by the need for different functionalities while trading off among non-functional requirements (e.g., timing, energy, cost) [1], [2]. In this context, the demand for system characterization techniques is increasing. Simulation is not always an acceptable solution, since fine granularity requires complex models and tends to slow down the application with respect to its actual running time [3]; this motivates the use of runtime monitoring systems [4].

However, the evolution toward heterogeneous architectures also affects monitor design, since it (i) elicits fresh monitoring ReQuireMents (RQMs), and (ii) requires traditional RQMs (e.g., monitoring of power dissipation or of execution time) to be satisfied on the new platforms. As a result, there are consolidated solutions dedicated to specific areas, targeting specific platform components and RQMs. The drawback of this approach emerges when these solutions have to be adapted to monitor new platforms (flexibility) or evolved to satisfy fresh RQMs (applicability). Furthermore, the need to satisfy multiple RQMs leads to the adoption of multiple different monitoring systems, whose HW/SW impact on system resources is difficult to predict (predictability).

Many academic and industrial solutions in the literature address these issues. The works in [5], [6], [7], [3], and [8] all offer a way to compose a monitoring system modularly, starting from the basic events to be monitored and building the monitor to capture and store them. The works in [9] and [10] both allow a monitoring system to be designed and introduced during a High-Level Synthesis (HLS) flow. Finally, [11] and [12] start from a simulation of the system and exploit data-mining techniques to identify the signals to be monitored for given RQMs. However, both the HLS-based and the data-mining-based approaches produce monitors that are difficult to combine with SW tasks, because monitor creation does not take microprocessor architectures into account, which limits their flexibility. Modular monitoring solutions are a promising approach, but the available ones lack applicability to different RQMs, being typically focused on specific purposes, i.e., debugging [5] or timing performance [3], [6]. Only the work in [8] targets both performance- and debugging-oriented monitoring over heterogeneous architectures. Among industrial solutions, two modular approaches can be mentioned: ARM CoreSight [13] and the AXI Performance Monitor [14]. The former is not flexible enough to be used on custom accelerators, while the latter only targets bus interconnections.

The main contribution of this work-in-progress paper is a HW monitoring layer. It is part of a larger project that aims to provide a framework for building custom hardware monitoring systems that, unlike state-of-the-art approaches, offers flexibility, applicability, and predictability for the generated monitors. The HW monitoring layer provides these features by allowing monitors to be built in a modular way. The final framework will target heterogeneous architectures with reconfigurable accelerators implemented on FPGA; it will take as input a model of the platform description and the RQMs, and will produce an HDL description of the monitored platform as output. The resulting monitored platform will be equipped with monitors satisfying the given RQMs, with a well-defined monitoring overhead in terms of area, PoWeR (PWR), and SoftWare OVerhead (SWOV). The proposed framework is similar to [6] and [5]; in addition to targeting hardware accelerators (which could also be addressed by extending those previous works), it allows resource utilization to be optimized by sharing event instances.

# II. THE HW MONITORING LAYER

As a starting point for the construction of the HW monitoring layer, we considered the generic online monitoring process proposed by Kornaros et al. [4]. The authors state that a monitoring process has five phases: event trigger, data capture, filtering, decision, and reaction. Accordingly, we divided the construction into two main phases: Ph.1) identification of the places for event triggers, so that the monitoring action is both applicable to fresh RQMs and flexible; Ph.2) construction of a HW layer that implements the monitoring action. In Ph.1, RQMs are covered in a general way (thus enabling applicability). We considered the six classes of RQMs for a monitoring action [4]: *Monitor for DeBuG* (MDBG), *PerFormance* (MPF), *Power/Energy/Temperature* (PET), *Quality of Service* (QoS), *Fault Tolerance/Reliability* (FT), and *Security* (Sec). Then, to guarantee flexibility, we associated those RQMs with a general reference platform for embedded systems, identifying the places for event triggers (see Fig. 1). Analysing the output of Ph.1, we noticed that multiple RQMs can share the same trigger location.

Table I. Test results. Int., DataM, and Core refer to event triggers placed, according to the proposed HW layer, in the Interconnection, Data Manager, and Core, respectively (1E32 is one 32-bit EVMON, 1T64 is one 64-bit TMON, P is programmability). For comparison, LUT and FF figures are also given for a setup in which [14] monitors Int. and [15] monitors DataM and Core.

| ID | RQM            | Int.      | DataM  | Core   | LUT    | FF      | PWR [mW] | SWOV [µs] | LUT [14] [15] | FF [14] [15] |
|----|----------------|-----------|--------|--------|--------|---------|----------|-----------|---------------|--------------|
| Y0 | -              | -         | -      | -      | 3397   | 2864    | 24       | 10.981    | -             | -            |
| Y1 | RQM1+RQM2+RQM3 | (1E32)(P) | (1T64) | (2E10) | +9.45% | +13.2%  | +8.33%   | +39.68%   | +82.16%       | +166.3%      |
| Y2 | RQM2+RQM3+RQM4 | -         | (1T64) | (2E10) | +3.39% | +8.66%  | +8.33%   | +37.33%   | +68.11%       | +40.12%      |
| Y3 | RQM1+RQM4+RQM5 | (1E32)(P) | (1T64) | -      | +8.98% | +12.33% | +12.5%   | +37.36%   | +82.16%       | +166.3%      |
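The observation that multiple RQMs can share the same trigger location can be illustrated with a small Python sketch. The mapping from the RQM labels to the Int., DataM, and Core locations used here is a hypothetical assumption, chosen only to be loosely consistent with the placements listed in the test results table; the paper does not state this mapping, and the names `RQM_TRIGGERS`, `shared_triggers`, and `trigger_savings` are illustrative.

```python
from collections import defaultdict

# Hypothetical RQM -> trigger-location mapping (an assumption for
# illustration, not taken from the paper).
RQM_TRIGGERS = {
    "RQM1": {"Int."},
    "RQM2": {"DataM"},
    "RQM3": {"Core"},
    "RQM4": {"DataM"},
    "RQM5": {"DataM"},
}


def shared_triggers(rqms):
    """Group the requested RQMs by the trigger location they rely on."""
    by_location = defaultdict(list)
    for rqm in rqms:
        for location in sorted(RQM_TRIGGERS[rqm]):
            by_location[location].append(rqm)
    return dict(by_location)


def trigger_savings(rqms):
    """Return (trigger count without sharing, trigger count with sharing)."""
    unshared = sum(len(RQM_TRIGGERS[rqm]) for rqm in rqms)
    shared = len(shared_triggers(rqms))
    return unshared, shared


if __name__ == "__main__":
    y2 = ["RQM2", "RQM3", "RQM4"]
    print(shared_triggers(y2))
    print(trigger_savings(y2))
```

Under this assumed mapping, RQM2 and RQM4 fall on the same DataM location, so the three RQMs need two trigger locations instead of three.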



Fig. 1. Heterogeneous reference platform with places for performance event triggers (blue) and for debug ones (red).

Sharing the event triggers among multiple RQMs makes it possible, in turn, to share the collected events. In the proposed HW layer, built in Ph.2 (see Fig. 2), an adapter block samples the i-th event instance and sends it to the data capture and filtering blocks. For these two phases, a customizable number of nucleus blocks is provided to selectively capture different data according to the RQMs (enabling applicability). Each nucleus can aggregate events, in the form of event instances coming from the adapter, by means of multiple event monitors (EVMON) and time monitors (TMON). The aggregation essentially reflects the function that maps events to metrics (e.g., the assertion of start and done signals is used to measure an execution time). Nucleus data are then sent to a global monitor interface (GMI), which forwards the information to a global monitor (GM) connected at the same level as the reference platform actors; the GM implements the decision-making stage and triggers a reaction by means of an interrupt controller. If the target architecture changes, only the adapter and the nucleus need to be modified, enabling flexibility.

To illustrate the benefits of the proposed solution in generating monitoring systems, we defined some RQMs for an embedded application executed on a heterogeneous platform implemented, using Xilinx Vivado 2017.4, on a Zynq-7000 XC7Z020 [16]. The platform has an AXI system bus connecting a dual-core ARM processor, an external DRAM memory, and a hardware accelerator generated with the MDC suite<sup>1</sup> [17]. The application, running on the ARM processor, prepares some inputs in the external DRAM and triggers a DMA to transfer them to the accelerator. The accelerator performs multiply-and-accumulate operations (constituting the HW task) and stores the result back in the DRAM through a new DMA-mediated transfer.

The considered RQMs are the following: RQM1 (MDBG - data transfer fault detection on the accelerator), RQM2 (MPF - execution time of the HW task), RQM3 (MDBG - accelerator computation fault detection), RQM4 (MDBG - watchdog for the HW task), RQM5 (MPF - throughput of data processed by HW task). Table I reports the area, PWR (dynamic power of the FPGA fabric only), and SWOV impact of the different monitoring solutions depending on the considered input RQMs (Y1, Y2, and Y3). A comparison, in terms of resources only, with commercial monitoring solutions ([14] and [15]) shows that the proposed solution satisfies the different RQMs with a lower LUT and FF overhead.
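The way a nucleus maps events to metrics can be sketched in software. The following Python model is a behavioural illustration only, not the authors' HDL: the class names, the saturating-counter behaviour, the per-cycle sampling interface, and the `run_trace` helper are all assumptions. It shows a TMON measuring the cycles between start and done (as in RQM2), the same TMON reused as a watchdog (as in RQM4), and an EVMON counting data events (as could serve RQM5).

```python
from dataclasses import dataclass


@dataclass
class EventMonitor:
    """EVMON sketch: counts event assertions, saturating at the counter width."""
    width: int = 32
    count: int = 0

    def sample(self, asserted: bool) -> None:
        if asserted and self.count < (1 << self.width) - 1:
            self.count += 1


@dataclass
class TimeMonitor:
    """TMON sketch: counts clock cycles elapsed from a start to a done event."""
    width: int = 64
    cycles: int = 0
    running: bool = False

    def sample(self, start: bool, done: bool) -> None:
        if start:
            # A start event resets the counter and opens the measurement.
            self.cycles = 0
            self.running = True
        elif self.running:
            if self.cycles < (1 << self.width) - 1:
                self.cycles += 1
            if done:
                # A done event closes the measurement; cycles holds the result.
                self.running = False

    def expired(self, timeout: int) -> bool:
        """Watchdog check: the task is still running after `timeout` cycles."""
        return self.running and self.cycles >= timeout


def run_trace(trace, timeout):
    """Feed one (start, done, data_valid) sample per clock cycle.

    Returns (execution cycles, data events counted, watchdog fired).
    """
    tmon = TimeMonitor()
    evmon = EventMonitor()
    watchdog_fired = False
    for start, done, data_valid in trace:
        tmon.sample(start, done)
        evmon.sample(data_valid)
        watchdog_fired = watchdog_fired or tmon.expired(timeout)
    return tmon.cycles, evmon.count, watchdog_fired


if __name__ == "__main__":
    # start at cycle 0, three data events, done at cycle 4
    trace = [
        (True, False, False),
        (False, False, True),
        (False, False, True),
        (False, False, True),
        (False, True, False),
    ]
    print(run_trace(trace, timeout=10))  # (4, 3, False)
```

In this model one TMON instance serves both the execution-time and the watchdog requirement, which mirrors the event-instance sharing described above; a throughput figure could be derived by dividing the EVMON count by the TMON cycle count.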



## ACKNOWLEDGMENT

This work is part of the FitOptiVis project [18], funded by the ECSEL Joint Undertaking under grant number H2020-ECSEL-2017-2-783162, and of the Comp4Drones project No. 826610, ECSEL-JU 2018.

## REFERENCES

- [1] G. Valente *et al.*, "Dynamic partial reconfiguration profitability for real-time systems," *IEEE Embedded Systems Letters*, pp. 1–1, 2020.
- [2] C. Sau *et al.*, "Challenging the best HEVC fractional pixel FPGA interpolators with reconfigurable and multifrequency approximate computing," *IEEE Embedded Systems Letters*, vol. 9, no. 3, pp. 65–68, 2017.
- [3] N. C. Doyle *et al.*, "Performance impacts and limitations of hardware memory access trace collection," in *Conf. Design, Automation & Test in Europe*, 2017, pp. 506–511.
- [4] G. Kornaros and D. Pnevmatikatos, "A survey and taxonomy of on-chip monitoring of multicore systems-on-chip," *ACM Trans. Des. Autom. Electron. Syst.*, vol. 18, no. 2, Apr. 2013.
- [5] M. Seo and R. Lysecky, "Non-intrusive in-situ requirements monitoring of embedded system," *ACM Trans. Des. Autom. Electron. Syst.*, vol. 23, no. 5, Aug. 2018. [Online]. Available: https://doi.org/10.1145/3206213
- [6] G. Valente *et al.*, "A flexible profiling sub-system for reconfigurable logic architectures," in *Conf. on Parallel, Distributed, and Network-Based Processing*, 2016.
- [7] A. Moro, F. Federici, G. Valente, L. Pomante, M. Faccio, and V. Muttillo, "Hardware performance sniffers for embedded systems profiling," in *2015 12th International Workshop on Intelligent Solutions in Embedded Systems (WISES)*, 2015, pp. 29–34.
- [8] T. Fanni *et al.*, "Run-time performance monitoring of heterogenous HW/SW platforms using PAPI," in *Workshop on FPGAs for Software Programmers*, 2019.
- [9] J. Goeders and S. J. E. Wilton, "Signal-tracing techniques for in-system FPGA debugging of high-level synthesis circuits," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 36, no. 1, pp. 83–96, 2017.
- [10] M. B. Hammouda *et al.*, "A unified design flow to automatically generate on-chip monitors during high-level synthesis of hardware accelerators," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 36, no. 3, pp. 384–397, 2017.
- [11] M. N. *et al.*, "A design-time method for building cost-effective run-time power monitoring," *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, vol. 36, no. 7, pp. 1153–1166, 2017.
- [12] D. Zoni *et al.*, "PowerTap: All-digital power meter modeling for run-time power monitoring," *Microprocessors and Microsystems*, vol. 63, pp. 128–139, 2018.
- [13] ARM, "White Paper: CoreSight Technical Introduction, A quickstart for designers. Document Number: ARM-EPM-039795," 2013-08.
- [14] Xilinx, "AXI Performance Monitor v5.0, PG037," 2017-10-4.
- [15] Xilinx, "System Integrated Logic Analyzer v1.0, PG261," 2017-06-7.
- [16] Xilinx. (2020-06) Zynq-7000 SoC. [Online]. Available: https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
- [17] C. Sau *et al.*, "Automated design flow for multi-functional dataflow-based platforms," *Journal of Signal Processing Systems*, vol. 85, pp. 143–165, 2016.
- [18] Z. Al-Ars *et al.*, "The FitOptiVis ECSEL project: highly efficient distributed embedded image/video processing in cyber-physical systems," in *Conf. on Computing Frontiers*, 2019, pp. 333–338.

<sup>1</sup>Multi-Dataflow Composer: https://github.com/mdc-suite/mdc