
1 Introduction

Embedded multicore systems are widely used in areas such as networking, automobiles, and robotics. Some of these systems even compete with HPC platforms [1] and promise to deliver high GFLOPS/Watt. Since parallelism has become a major driving force in computing, microprocessor vendors concentrate on integrating accelerators together with the central processing unit (CPU) on the same platform. The current trend is that such platforms are heterogeneous in nature, i.e., the operating systems and memory spaces of their processing units usually differ [2] from those of traditional platforms. For example, Qualcomm's heterogeneous processor Snapdragon 810 integrates an ARM Cortex CPU and an Adreno 430 GPU on the same chip. Such an integration produces hardware platforms that satisfy the requirements of embedded systems regarding performance, flexibility, and energy consumption. Unfortunately, the software used to port applications to such systems is often still immature and typically fine-tuned for a specific platform. As a result, maintaining a single code base across multiple platforms is not feasible, which is a major concern.

Programming models designed for high performance computing (HPC) platforms are not necessarily the best choice for embedded multicore systems, especially when these systems have limited resources such as a small number of cores or limited memory. Among these models, OpenMP [3] is a high-level, directive-based model that has recently been extended to support accelerators. However, embedded systems still differ from traditional accelerators such as GPUs, so OpenMP in its current state is not yet well suited for embedded devices. An alternative is OpenCL [4], which is known for its portability, but OpenCL involves a steep learning curve, making it challenging to adopt on a given hardware platform.

To tackle these challenges, a group of leading-edge companies formed the Multicore Association (MCA) to address the programming challenges of heterogeneous embedded multicore platforms. MCA's primary objective is the definition of a set of open specifications and application program interfaces (APIs) to facilitate multicore product development. MCA offers industry-standard APIs for data sharing among different types of cores (the Multicore Resource Management API, MRAPI), for inter-core communication (the Multicore Communication API, MCAPI), and for task management (the Multicore Task Management API, MTAPI). Since the APIs are system agnostic, they facilitate the development of portable code, thus making it feasible to run the same application on more than one hardware platform.

This paper makes the following contributions:

  • Creates a light-weight, task-based portable software stack to target resource-constrained heterogeneous embedded multicore systems using MTAPI

  • Assesses the software stack through evaluation case studies on an embedded platform equipped with ARM processors and GPUs

  • Showcases two open source MTAPI implementations for programmers to use

Note: This paper does not aim to compare and contrast the two implementations. Instead, the goal is to discuss how they can be used by software developers to port applications to heterogeneous multicore systems.

The rest of the paper is organized as follows: Sect. 2 discusses the state of the art and Sect. 3 gives an overview of MTAPI. The design and implementation strategies of our runtime library (RTL) are given in Sect. 4. Section 5 discusses the experimental results and Sect. 6 presents conclusions along with some ideas for future work.

2 Related Work

In this section, we discuss some state-of-the-art parallel programming models for heterogeneous multicore systems from the task parallelism perspective.

OpenMP has been widely used in HPC for exploiting shared-memory parallelism [5]; only recent advancements in the standard added support for heterogeneous systems. OpenMP 3.1 ratified tasking, and task parallelism for multicore SoCs was implemented by several frameworks [6,7,8] deployed on shared-memory systems. OpenMP 4.0 extended tasks to support task dependencies, which was evaluated in [9], again using traditional shared-memory architectures.

Other task-based efforts include Intel's TBB [10], which treats operations as tasks and assigns them to multiple cores through a runtime library. Like most such frameworks, however, TBB targets desktop or server applications and is not designed for low-footprint embedded and heterogeneous systems.

Cilk [11] is a set of C language extensions developed by MIT for multithreaded parallel computing. While Cilk simplifies task-parallel applications, it only supports shared-memory environments, which limits its applicability to homogeneous systems.

OpenCL [4] is a standard designed for data-parallel processing and used to program CPUs, GPUs, DSPs, FPGAs, etc. Although the standard can target multiple platforms, its steep learning curve makes it hard to adopt.

OmpSs (OpenMP SuperScalar) [12] is a task-based programming model that exploits parallelism based on pragma annotations. OmpSs has been extended to many-core processors with accelerators, such as systems with multiple GPUs. However, OmpSs needs special compiler support, which limits its usability for embedded, heterogeneous systems.

StarPU [13] is a tasking API that allows developers to design applications for heterogeneous environments. StarPU's runtime schedules the tasks and the corresponding data transfers among the CPU and GPU accelerators. However, the necessary extension plug-in for GCC puts constraints on deploying StarPU to embedded systems with limited resources or bare-metal devices.

As discussed above, there are many approaches that explore task parallelism. However, they may not be best suited for embedded platforms, which, unlike traditional platforms, have scarce resources and sometimes do not even run an OS. Additionally, many embedded systems are subject to real-time constraints and forbid dynamic memory allocation during operation, a requirement the discussed approaches ignore.

Our prior work in [14] uses MCAPI to establish communication through well-pipelined DMA protocols between the Freescale P4080's Power Architecture cores and its specialized RegEx Pattern Matching Engine (PME) accelerator. We also created an abstraction layer for easy programmability by translating OpenMP to MRAPI [15, 16]. Designed and implemented in ANSI C, MCAPI and MRAPI do not require specific compiler support or user-defined language extensions.

3 MTAPI Overview

Figure 1 gives an overview of MTAPI. Applications can be developed by directly calling the MTAPI interface or via further abstraction layers such as OpenMP (a translation from OpenMP to MTAPI is described in [17]). MTAPI can be implemented on most operating systems or even bare metal thanks to its simple design and minimal dependencies.

Fig. 1. MTAPI framework

In the following, we describe the main concepts of MTAPI.

Node: An MTAPI node is an independent unit of execution. A node can be a process, a thread, a thread pool, a general-purpose processor, or an accelerator.

Job and Action: A job is an abstraction representing the work and is implemented by one or more actions. For example, a job can be implemented by one action on the CPU and another action on the GPU. The MTAPI system binds tasks to the most suitable actions during runtime.

Task: An MTAPI task is an instance of a job together with its data environment. Tasks are very light-weight with fine granularity which allows creating, scheduling, and executing numerous tasks in parallel. A task can be offloaded to a neighboring node other than its origin node depending on the dynamic action binding. Therefore, optimized and efficient scheduling algorithms are desired for task management on heterogeneous multicore platforms.

Queue: A queue is defined by the MTAPI specification to guarantee sequential execution of tasks.

Group: MTAPI groups are defined for synchronization purposes. A group is similar to a barrier in other task models. Tasks attached to the same group must complete before execution continues past a call to mtapi_group_wait.
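To make these concepts concrete, the following minimal C sketch registers one action for a job, starts several tasks in a group, and waits for their completion. The domain, node, and job IDs are illustrative placeholders, status checks are omitted for brevity, and the ID and attribute constants (MTAPI_TASK_ID_NONE, MTAPI_GROUP_ID_NONE, the MTAPI_DEFAULT_* attributes) follow the MTAPI C reference; treat it as a sketch rather than an excerpt from either implementation evaluated later.

#include <mtapi.h>
#include <stdio.h>

/* Illustrative IDs; a real application chooses its own. */
#define EXAMPLE_DOMAIN 1
#define EXAMPLE_NODE   1
#define EXAMPLE_JOB    1

/* Action function: the work behind the job. It squares the integer
 * passed via 'args' and writes the result to 'result_buffer'. */
static void square_action(
    const void* args, mtapi_size_t args_size,
    void* result_buffer, mtapi_size_t result_buffer_size,
    const void* node_local_data, mtapi_size_t node_local_data_size,
    mtapi_task_context_t* task_context) {
  int x = *(const int*)args;
  *(int*)result_buffer = x * x;
}

int main(void) {
  mtapi_status_t status;
  mtapi_info_t info;

  /* Bring up this node (status checks omitted for brevity). */
  mtapi_initialize(EXAMPLE_DOMAIN, EXAMPLE_NODE,
                   MTAPI_DEFAULT_NODE_ATTRIBUTES, &info, &status);

  /* Register the action as the local implementation of the job. */
  mtapi_action_hndl_t action = mtapi_action_create(
      EXAMPLE_JOB, square_action, MTAPI_NULL, 0,
      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_job_hndl_t job = mtapi_job_get(EXAMPLE_JOB, EXAMPLE_DOMAIN, &status);

  /* Start four tasks in one group and wait for all of them. */
  mtapi_group_hndl_t group =
      mtapi_group_create(MTAPI_GROUP_ID_NONE, MTAPI_NULL, &status);
  int in[4] = {1, 2, 3, 4}, out[4];
  for (int i = 0; i < 4; ++i) {
    mtapi_task_start(MTAPI_TASK_ID_NONE, job, &in[i], sizeof(int),
                     &out[i], sizeof(int), MTAPI_DEFAULT_TASK_ATTRIBUTES,
                     group, &status);
  }
  mtapi_group_wait_all(group, MTAPI_INFINITE, &status);

  for (int i = 0; i < 4; ++i) printf("%d^2 = %d\n", in[i], out[i]);

  mtapi_action_delete(action, MTAPI_INFINITE, &status);
  mtapi_finalize(&status);
  return 0;
}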

Related work on MTAPI includes an implementation by the European Space Agency (ESA) [18] for the LEON4 processor, a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture. Wallentowitz et al. [19] presented a baseline implementation and plans for deploying MCAPI and MTAPI on tiled many-core SoCs.

This project is a collaboration with Siemens, who created their own industry-grade MTAPI implementation as part of a larger open source project called Embedded Multicore Building Blocks (EMB\(^2\)) [20]. EMB\(^2\) has been specifically designed for embedded systems and the typical requirements that accompany them, such as predictable memory consumption, which is essential for safety-critical applications, and real-time capability. For the latter, the library supports task priorities and affinities, and the scheduling strategy can be optimized for non-functional requirements such as minimal latency and fairness.

Besides the task scheduler, EMB\(^2\) provides parallel algorithms like loops and reductions, concurrent data structures, and high-level patterns for implementing stream processing applications. These building blocks are largely implemented in a non-blocking (lock-free) fashion, thus preventing frequently encountered pitfalls like lock contention, deadlocks, and priority inversion. As another advantage in real-time systems, the algorithms and data structures give certain progress guarantees [21].

We evaluate both implementations and demonstrate the usability and applicability of MTAPI for heterogeneous embedded multicore platforms.

4 MTAPI Design and Usage

4.1 Job Scheduling and Actions

As mentioned earlier, MTAPI decomposes computations into multiple tasks, schedules them among the available processing units, and combines the results after synchronization. Here a task is defined as a light-weight operation that describes the job to be done. During task creation, however, the task does not yet know with which action it will be associated. MTAPI provides a dynamic binding policy between tasks and actions, which allows jobs to be scheduled on more than one hardware type; the scheduler handles the load balancing. Depending on where the associated action is located, a task is marked either as a local task or as a remote task: if the task is assigned to an action residing on the same node, it is a local task; otherwise it is a remote task. Figure 2 gives an example of the relationship between tasks and actions. In the example, tasks a, b, and d are assigned to actions a, b, and d, respectively, on remote nodes other than node 1, thus becoming remote tasks. On the other hand, task c is associated with action c on node 1, making it a local task. Each node consists of different processors.

Fig. 2. MTAPI Job and Action

The MTAPI RTL defines an abstract interface for thread control including thread creation, termination, and synchronization using mutexes or semaphores. MTAPI kernel developers may implement this interface with particular thread libraries for the target platform, thus making MTAPI portable across a wide range of architectures. This portable and flexible approach is one of the appealing factors of MTAPI.
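As a purely hypothetical illustration (the names and layout below are not the actual UH-MTAPI interface), such an abstraction can be pictured as a small table of function pointers that each port binds to its native primitives, e.g. POSIX threads on Linux or a custom scheduler and locks on bare metal:

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical portability layer: each port fills this table with its
 * native threading and synchronization primitives. */
typedef struct {
  int (*thread_create)(void** handle, void* (*entry)(void*), void* arg);
  int (*thread_join)(void* handle);
  int (*mutex_lock)(void* mutex);
  int (*mutex_unlock)(void* mutex);
} thread_ops_t;

/* Example binding to POSIX threads. */
static int px_create(void** h, void* (*entry)(void*), void* arg) {
  pthread_t* t = (pthread_t*)malloc(sizeof(pthread_t));
  *h = t;
  return pthread_create(t, NULL, entry, arg);
}
static int px_join(void* h)   { return pthread_join(*(pthread_t*)h, NULL); }
static int px_lock(void* m)   { return pthread_mutex_lock((pthread_mutex_t*)m); }
static int px_unlock(void* m) { return pthread_mutex_unlock((pthread_mutex_t*)m); }

static const thread_ops_t posix_ops = { px_create, px_join, px_lock, px_unlock };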

Listing 1.1 demonstrates a matrix multiplication program using MTAPI. In this code, two action functions implement the matrix multiplication job: ActionFunction_GPU is implemented as a CUDA kernel, while ActionFunction_CPU is implemented as a sequential CPU kernel. After defining the two action functions, we initialize the MTAPI environment by attaching both actions to the same matrix multiplication job. We then create three tasks; arg_GPU and arg_CPU are pointers to the matrix data. These tasks are assigned to the different actions for execution, so the GPU and the CPU carry out the computation in parallel.

Listing 1.1. Matrix multiplication using MTAPI with CPU and GPU actions
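The original listing is not reproduced here; the following C sketch illustrates its structure under stated assumptions: the IDs, the argument struct, and matmul_seq are placeholders, the GPU action calls the sequential routine as a stand-in for the actual CUDA kernel launch so that the sketch stays self-contained, and only two of the three tasks are shown.

#include <mtapi.h>

#define MM_DOMAIN 1   /* illustrative IDs */
#define MM_NODE   1
#define MM_JOB    1

typedef struct { const float *a, *b; float *c; int n; } mm_args_t;

/* Plain sequential kernel used by the CPU action. */
static void matmul_seq(const float* a, const float* b, float* c, int n) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      float s = 0.0f;
      for (int k = 0; k < n; ++k) s += a[i * n + k] * b[k * n + j];
      c[i * n + j] = s;
    }
}

static void ActionFunction_CPU(const void* args, mtapi_size_t args_size,
    void* res, mtapi_size_t res_size, const void* nld, mtapi_size_t nld_size,
    mtapi_task_context_t* ctx) {
  const mm_args_t* m = (const mm_args_t*)args;
  matmul_seq(m->a, m->b, m->c, m->n);
}

static void ActionFunction_GPU(const void* args, mtapi_size_t args_size,
    void* res, mtapi_size_t res_size, const void* nld, mtapi_size_t nld_size,
    mtapi_task_context_t* ctx) {
  const mm_args_t* m = (const mm_args_t*)args;
  /* In the original listing this launches a CUDA kernel (built with nvcc);
   * the sequential routine stands in here to keep the sketch self-contained. */
  matmul_seq(m->a, m->b, m->c, m->n);
}

void run_matmul(mm_args_t* arg_CPU, mm_args_t* arg_GPU) {
  mtapi_status_t status;
  mtapi_info_t info;
  mtapi_initialize(MM_DOMAIN, MM_NODE, MTAPI_DEFAULT_NODE_ATTRIBUTES,
                   &info, &status);

  /* Both actions implement the same job; the runtime binds each task
   * to one of them at execution time. */
  mtapi_action_create(MM_JOB, ActionFunction_CPU, MTAPI_NULL, 0,
                      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_action_create(MM_JOB, ActionFunction_GPU, MTAPI_NULL, 0,
                      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_job_hndl_t job = mtapi_job_get(MM_JOB, MM_DOMAIN, &status);

  mtapi_task_hndl_t t1 = mtapi_task_start(MTAPI_TASK_ID_NONE, job,
      arg_GPU, sizeof(*arg_GPU), MTAPI_NULL, 0,
      MTAPI_DEFAULT_TASK_ATTRIBUTES, MTAPI_GROUP_NONE, &status);
  mtapi_task_hndl_t t2 = mtapi_task_start(MTAPI_TASK_ID_NONE, job,
      arg_CPU, sizeof(*arg_CPU), MTAPI_NULL, 0,
      MTAPI_DEFAULT_TASK_ATTRIBUTES, MTAPI_GROUP_NONE, &status);

  mtapi_task_wait(t1, MTAPI_INFINITE, &status);
  mtapi_task_wait(t2, MTAPI_INFINITE, &status);
  mtapi_finalize(&status);
}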

4.2 Inter-Node Communication

Essentially, each node has one receiver thread and one sender thread. These threads initialize the MCAPI environment and create MCAPI endpoints for message passing through MCAPI function calls; together they form the MCAPI communication layer between the nodes within a domain. Technically, the data and information transported between MTAPI nodes are packed into MCAPI messages, which MCAPI then transports across nodes for task load balancing, information updates, and synchronization. Each message contains the domain ID, node ID, and port ID. Once a message is created, it is inserted into a centralized message queue on the node, where it waits for the sender to initiate the communication. Every message is assigned a priority: high-priority messages, such as action updates, are inserted at the head of the message queue, while low-priority messages, such as load-balancing messages, are inserted at the tail. The sender wraps each MTAPI message into an MCAPI message, according to its type, and sends it to its destination node. The receiver thread keeps listening to its neighboring nodes to check whether an MCAPI message has been sent to this node. Upon receipt of an MCAPI message, the receiver decodes it and creates an MTAPI message carrying the necessary information. The receiver then pushes the newly created MTAPI message into the message queue, where it waits for the next processing cycle of the sender thread, and continues listening to its neighboring nodes. In the UH-MTAPI design, a priority scheduler manages this centralized message queue and keeps the messages sorted by priority.
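The following hypothetical sketch illustrates the message layout and the priority-ordered enqueue described above; the field names, message types, and queue structure are illustrative and not UH-MTAPI's actual data structures. The sender thread pops messages from the head of this queue, wraps each one into an MCAPI message, and sends it to the endpoint of the destination node.

#include <stddef.h>

/* Illustrative message types and layout. */
typedef enum { MSG_ACTION_UPDATE, MSG_LOAD_BALANCE, MSG_SYNC } msg_type_t;

typedef struct msg {
  msg_type_t  type;
  unsigned    domain_id, node_id, port_id;  /* MCAPI addressing information */
  unsigned    priority;                     /* 0 = highest priority */
  size_t      payload_size;
  const void* payload;
  struct msg* next;
} msg_t;

typedef struct { msg_t* head; msg_t* tail; } msg_queue_t;

/* High-priority messages (e.g. action updates) are inserted at the head of
 * the centralized queue; low-priority ones (e.g. load balancing) at the tail. */
static void enqueue(msg_queue_t* q, msg_t* m) {
  m->next = NULL;
  if (m->priority == 0) {                     /* urgent: insert at head */
    m->next = q->head;
    q->head = m;
    if (q->tail == NULL) q->tail = m;
  } else {                                    /* default: append at tail */
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
  }
}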

5 Performance Evaluation

In this section, we evaluate the Siemens MTAPI implementation and UH-MTAPI. We select applications from BOTS [7] and the Rodinia benchmarks [22] to demonstrate their performance. The benchmarks are executed on NVIDIA's Jetson TK1 embedded development platform [23] with a Tegra K1 processor, which integrates a 4-Plus-1 quad-core ARM Cortex-A15 CPU and a Kepler GPU with 192 cores. We use the GCC OpenMP implementation shipped with the board by NVIDIA as a reference for comparison.

SparseLU Benchmark: The SparseLU factorization benchmark from BOTS computes an LU factorization of a sparse matrix. A sparse matrix contains submatrix blocks that may not be allocated, and the unallocated blocks lead to load imbalance. Thus, task parallelism performs better than work-sharing directives such as OpenMP's parallel for. In the SparseLU factorization, tasks are created only for the allocated submatrix blocks to reduce the overhead caused by this imbalance.

The sparse matrix contains \(50\times 50\) submatrices, where each submatrix has size \(100\times 100\) on both hardware platforms. We collect multiple metrics such as execution time, matrix size, and number of threads. The execution time for calculating the speed-up is measured on the CPU for the core part of the computation, excluding I/O and initial setup.
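A simplified sketch of this task-creation pattern is shown below for a single factorization phase; the block layout, the argument struct, and the helper names are assumptions, and the real BOTS port additionally handles the dependencies between its lu0, fwd, bdiv, and bmod phases. The MTAPI setup (job, group) follows the sketch in Sect. 3.

#include <mtapi.h>
#include <stddef.h>

#define NB 50   /* number of submatrix blocks per dimension */

typedef struct { float* block; int bs; } block_args_t;   /* illustrative */

/* Start one MTAPI task per allocated submatrix block; unallocated (NULL)
 * blocks are skipped, which avoids the imbalance of a plain parallel loop. */
static void process_blocks(float* blocks[NB][NB], int bs,
                           mtapi_job_hndl_t job, mtapi_group_hndl_t group) {
  static block_args_t args[NB][NB];         /* arguments must outlive the tasks */
  mtapi_status_t status;
  for (int i = 0; i < NB; ++i) {
    for (int j = 0; j < NB; ++j) {
      if (blocks[i][j] == NULL) continue;   /* no task for unallocated blocks */
      args[i][j].block = blocks[i][j];
      args[i][j].bs = bs;
      mtapi_task_start(MTAPI_TASK_ID_NONE, job,
                       &args[i][j], sizeof(block_args_t), MTAPI_NULL, 0,
                       MTAPI_DEFAULT_TASK_ATTRIBUTES, group, &status);
    }
  }
  mtapi_group_wait_all(group, MTAPI_INFINITE, &status);  /* per-phase barrier */
}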

Figure 3(a) shows the speed-up using different implementations. The UH-MTAPI implementation demonstrates comparable performance with the Siemens MTAPI implementation as well as GCC’s OpenMP version. Both Siemens and UH-MTAPI implementations achieve a roughly linear speed-up which indicates their scalability on multicore processors.

Heartwall Benchmark: The Heartwall tracking benchmark is an application from Rodinia [22] that tracks the changing shape of a mouse's heart wall. We reorganized it by splitting the loop parallelism into tasks, where each task handles a chunk of the image data. The image-processing procedures are encapsulated in an action function that processes the data associated with the corresponding tasks. Figure 3(b) shows the speed-up over a single thread. We observe that the task parallelism provided by UH-MTAPI matches the performance of the data parallelism offered by OpenMP's parallel for and of the Siemens MTAPI implementation. However, none of the three versions meets the expectation of linear speed-up as the number of threads increases.

Fig. 3. Speed-up for the SparseLU and Heartwall benchmarks with OpenMP, Siemens MTAPI, and UH-MTAPI on the NVIDIA Tegra TK1 board

Matrix-Matrix Multiplication: This benchmark (multiplication of dense matrices) is relatively compute-intensive; the complexity of a traditional multiplication of two square matrices is \(\mathcal {O}(n^3)\). Although matrix multiplication can be implemented using work-sharing directives such as OpenMP's parallel for, the computation takes a long time due to the limited number of CPU threads and poor data locality. In contrast, heterogeneous systems with accelerators such as GPUs are a good fit for such algorithms, specifically because their architecture with a large number of processing units allows many threads to run concurrently. Additionally, GPU matrix-matrix multiplication algorithms are potentially more cache friendly than CPU algorithms [24]. We implemented different types of action functions targeting the different processing units: the CPU action is implemented in C++ while the GPU action relies on CUDA [25]. Moreover, we designed four different approaches to execute the benchmark, and an additional approach was used by Siemens to achieve maximum performance using both CPU and GPU:

  • ARM-Seq: Sequential implementation on the ARM CPU.

  • MTAPI-CPU: MTAPI implementation with a single action for the ARM CPU.

  • MTAPI-CPU-GPU: MTAPI implementation with actions for both CPU and GPU.

  • MTAPI-GPU: MTAPI implementation with a single action for the GPU.

  • MTAPI-CPU-GPU-Opt: Same as MTAPI-CPU-GPU, but with work item sizes tailored to the particular needs of the respective computation units and with the copying of data to the GPU overlapped with computation.

Figure 4(a) shows the normalized execution times for matrix sizes 128, 256, 512, and 1024 for UH-MTAPI; Fig. 4(b) shows the results for Siemens MTAPI. We observe that the ARM action has comparable performance with the GPU action for matrices with sizes less than 128.

Fig. 4. Normalized execution time for matrix multiplication with UH-MTAPI and Siemens MTAPI on the NVIDIA Tegra TK1 board

The data copied between the CPU and the GPU poses a major communication overhead. However, as the matrix size increases, the data-copying time becomes negligible relative to the computation, which is why the GPU action outperforms the CPU action. A simple distribution of the work to both processing units did not yield a speedup: the CPU action is far slower than the GPU action, and with equally sized work items the GPU finishes while the CPU is still calculating. For this reason, the optimized version uses bigger work items for the GPU and smaller ones for the CPU. Moreover, data is transferred asynchronously, thus hiding the transfer time behind computation. This technique results in a speedup in all tested cases, but the contribution of the CPU shrinks with increasing matrix size, as expected.
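The following host-side sketch outlines this overlap, assuming row-major \(n\times n\) matrices in pinned host memory (allocated with cudaMallocHost), preallocated device buffers dA, dB, and dC, and a separately compiled kernel wrapper launch_matmul_gpu; the split factor gpu_share and all names are illustrative rather than the code used in the experiments.

#include <cuda_runtime.h>

/* Assumed wrapper around the CUDA matrix-multiplication kernel computing the
 * first 'rows' rows of C on the given stream; implemented separately with nvcc. */
void launch_matmul_gpu(const float* dA, const float* dB, float* dC,
                       int n, int rows, cudaStream_t stream);

/* Sequential CPU kernel for the remaining rows (row_begin .. n-1). */
static void matmul_cpu_rows(const float* a, const float* b, float* c,
                            int n, int row_begin) {
  for (int i = row_begin; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      float s = 0.0f;
      for (int k = 0; k < n; ++k) s += a[i * n + k] * b[k * n + j];
      c[i * n + j] = s;
    }
}

/* gpu_share > 0.5 reflects that equally sized work items would leave the GPU
 * idle while the CPU is still calculating. */
void matmul_overlapped(const float* a, const float* b, float* c, int n,
                       float* dA, float* dB, float* dC, float gpu_share) {
  int rows_gpu = (int)(gpu_share * n);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  /* Enqueue the copies and the kernel asynchronously ... */
  cudaMemcpyAsync(dA, a, (size_t)rows_gpu * n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(dB, b, (size_t)n * n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  launch_matmul_gpu(dA, dB, dC, n, rows_gpu, stream);
  cudaMemcpyAsync(c, dC, (size_t)rows_gpu * n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  /* ... and compute the CPU's (smaller) share while the GPU works. */
  matmul_cpu_rows(a, b, c, n, rows_gpu);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}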

6 Conclusion and Future Work

Programming models for heterogeneous multicore systems are important yet challenging. In this paper, we described the design and implementation of runtime libraries for a parallel programming standard, the Multicore Task Management API (MTAPI). MTAPI enables application-level task parallelism on embedded devices with symmetric or asymmetric multicore processors. We showed that MTAPI provides a straightforward way to develop portable and scalable applications targeting heterogeneous systems. Our experimental results for MTAPI on different benchmarks show competitive performance compared to OpenMP while being more flexible. In the future, we will target further platforms such as DSPs.

Our sincere gratitude goes to the anonymous reviewers, and many thanks to Markus Levy, President of the Multicore Association, for his continued support.