
1 Introduction

Embedded multicore systems are widely used in areas such as networking, automobiles, and robotics. Some of these systems even compete with HPC platforms [1] and promise to deliver high GFLOPS/Watt. Since parallelism has become a major driving force in computing, microprocessor vendors concentrate on integrating accelerators together with the central processing unit (CPU) on the same platform. The current trend is that such platforms are heterogeneous in nature, i.e., the operating systems and memory spaces of their processing units usually differ [2] from those of traditional platforms. For example, Qualcomm's heterogeneous processor Snapdragon 810 integrates an ARM Cortex CPU and an Adreno 430 GPU on the same chip. Such an integration produces hardware platforms that satisfy the requirements of embedded systems regarding performance, flexibility, and energy consumption. Unfortunately, the software used to port applications to such systems is often still immature and typically fine-tuned for a specific platform. As a result, maintaining a single code base across multiple platforms is not feasible, which is a major concern.

Programming models designed for high performance computing (HPC) platforms are not necessarily the best choice for embedded multicore systems, especially when these systems have limited resources such as a small number of cores or limited memory. Among these models, OpenMP [3] is a high-level, directive-based model that has recently been extended to support accelerators. However, embedded systems still differ from traditional accelerators such as GPUs, so OpenMP in its current state is not yet well suited for embedded devices. An alternative is OpenCL [4], which is known for its portability, but OpenCL involves a steep learning curve, making it challenging to adopt on a given hardware platform.

To tackle these challenges, a group of leading-edge companies formed the Multicore Association (MCA) to address the programming challenges of heterogeneous embedded multicore platforms. MCA's primary objective is the definition of a set of open specifications and application program interfaces (APIs) to facilitate multicore product development. MCA offers industry-standard APIs for data sharing among different types of cores (the Multicore Resource Management API, MRAPI), for inter-core communication (the Multicore Communication API, MCAPI), and for task management (the Multicore Task Management API, MTAPI). Since the APIs are system agnostic, they facilitate the development of portable code, thus making it feasible to run the same application on more than one hardware platform.

This paper makes the following contributions:

  • Creates a light-weight, task-based portable software stack to target resource-constrained heterogeneous embedded multicore systems using MTAPI

  • Assesses the software stack through evaluation case studies on an embedded platform equipped with ARM processors and GPUs

  • Showcases two open source MTAPI implementations for programmers to use

Note: This paper does not aim to compare and contrast the two implementations. Instead, the goal is to discuss how they can be used by software developers to port applications to heterogeneous multicore systems.

The rest of the paper is organized as follows: Sect. 2 discusses the state of the art and Sect. 3 gives an overview of MTAPI. The design and implementation strategies of our runtime library (RTL) are given in Sect. 4. Section 5 discusses the experimental results and Sect. 6 presents conclusions along with some ideas for future work.

2 Related Work

In this section, we discuss some state-of-the-art parallel programming models for heterogeneous multicore systems from the task parallelism perspective.

OpenMP has been widely used in HPC for exploiting shared-memory parallelism [5]; only recent advancements in the standard added support for heterogeneous systems. OpenMP 3.1 ratified tasking, and task parallelism for multicore SoCs was implemented by several frameworks [6,7,8] deployed on shared-memory systems. OpenMP 4.0 extended tasks to support task dependencies, which was evaluated in [9], again using traditional shared-memory architectures.

Other task-based efforts include Intel's TBB [10], which treats operations as tasks and assigns them to multiple cores through a runtime library. Like most such frameworks, however, TBB targets desktop or server applications and is not designed for low-footprint embedded and heterogeneous systems.

Cilk [11] is a set of C language extensions developed by MIT for multithreaded parallel computing. While Cilk simplifies task-parallel applications, it only supports shared-memory environments, which limits its applicability to homogeneous systems.

OpenCL [4] is a standard designed for data-parallel processing and used to program CPUs, GPUs, DSPs, FPGAs, etc. Although the standard can target multiple platforms, its steep learning curve makes it hard to adopt.

OmpSs (OpenMP SuperScalar) [12] is a task-based programming model that exploits parallelism based on pragma annotations. OmpSs has been extended to many-core processors with accelerators, such as systems with multiple GPUs. However, OmpSs needs special compiler support, which limits its usability for embedded, heterogeneous systems.

StarPU [13] is a tasking API that allows developers to design applications for heterogeneous environments. StarPU's runtime schedules the tasks and the corresponding data transfers among the CPU and GPU accelerators. However, the necessary extension plug-in for GCC puts constraints on deploying StarPU to embedded systems with limited resources or bare-metal devices.

As discussed above, there are many approaches that explore task parallelism. However, they may not be best suited for embedded platforms, which, unlike traditional platforms, have scarce resources and sometimes do not even run an OS. Additionally, many embedded systems are subject to real-time constraints and forbid dynamic memory allocation during operation, a requirement the discussed approaches ignore.

Our prior work in [14] uses MCAPI to establish communication through well-pipelined DMA protocols between the Freescale P4080's Power Architecture cores and its specialized RegEx Pattern Matching Engine (PME) accelerator. We also created an abstraction layer for easy programmability by translating OpenMP to MRAPI [15, 16]. Designed and implemented in ANSI C, MCAPI and MRAPI do not require specific compiler support or user-defined language extensions.

3 MTAPI Overview

Figure 1 gives an overview of MTAPI. Applications can be developed by directly calling the MTAPI interface or via further abstraction layers such as OpenMP (a translation from OpenMP to MTAPI is described in [17]). MTAPI can be implemented on most operating systems or even bare metal thanks to its simple design and minimal dependencies.

Fig. 1. MTAPI framework

In the following, we describe the main concepts of MTAPI.

Node: An MTAPI node is an independent unit of execution. A node can be a process, a thread, a thread pool, a general-purpose processor, or an accelerator.

Job and Action: A job is an abstraction representing the work and is implemented by one or more actions. For example, a job can be implemented by one action on the CPU and another action on the GPU. The MTAPI system binds tasks to the most suitable actions during runtime.

Task: An MTAPI task is an instance of a job together with its data environment. Tasks are very light-weight with fine granularity which allows creating, scheduling, and executing numerous tasks in parallel. A task can be offloaded to a neighboring node other than its origin node depending on the dynamic action binding. Therefore, optimized and efficient scheduling algorithms are desired for task management on heterogeneous multicore platforms.

Queue: A queue is defined by the MTAPI specification to guarantee sequential execution of tasks.

Group: MTAPI groups are defined for synchronization purposes. A group is similar to a barrier in other task models. Tasks attached to the same group must complete before execution continues past a call to mtapi_group_wait.
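To make these concepts concrete, the following minimal C sketch registers one action for a job, starts several tasks in a group, and waits for their completion. The domain, node, and job IDs are illustrative placeholders, status checks are omitted for brevity, and the ID and attribute constants (MTAPI_TASK_ID_NONE, MTAPI_GROUP_ID_NONE, the MTAPI_DEFAULT_* attributes) follow the MTAPI C reference; treat it as a sketch rather than an excerpt from either implementation evaluated later.

#include <mtapi.h>
#include <stdio.h>

/* Illustrative IDs; a real application chooses its own. */
#define EXAMPLE_DOMAIN 1
#define EXAMPLE_NODE   1
#define EXAMPLE_JOB    1

/* Action function: the work behind the job. It squares the integer
 * passed via 'args' and writes the result to 'result_buffer'. */
static void square_action(
    const void* args, mtapi_size_t args_size,
    void* result_buffer, mtapi_size_t result_buffer_size,
    const void* node_local_data, mtapi_size_t node_local_data_size,
    mtapi_task_context_t* task_context) {
  int x = *(const int*)args;
  *(int*)result_buffer = x * x;
}

int main(void) {
  mtapi_status_t status;
  mtapi_info_t info;

  /* Bring up this node (status checks omitted for brevity). */
  mtapi_initialize(EXAMPLE_DOMAIN, EXAMPLE_NODE,
                   MTAPI_DEFAULT_NODE_ATTRIBUTES, &info, &status);

  /* Register the action as the local implementation of the job. */
  mtapi_action_hndl_t action = mtapi_action_create(
      EXAMPLE_JOB, square_action, MTAPI_NULL, 0,
      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_job_hndl_t job = mtapi_job_get(EXAMPLE_JOB, EXAMPLE_DOMAIN, &status);

  /* Start four tasks in one group and wait for all of them. */
  mtapi_group_hndl_t group =
      mtapi_group_create(MTAPI_GROUP_ID_NONE, MTAPI_NULL, &status);
  int in[4] = {1, 2, 3, 4}, out[4];
  for (int i = 0; i < 4; ++i) {
    mtapi_task_start(MTAPI_TASK_ID_NONE, job, &in[i], sizeof(int),
                     &out[i], sizeof(int), MTAPI_DEFAULT_TASK_ATTRIBUTES,
                     group, &status);
  }
  mtapi_group_wait_all(group, MTAPI_INFINITE, &status);

  for (int i = 0; i < 4; ++i) printf("%d^2 = %d\n", in[i], out[i]);

  mtapi_action_delete(action, MTAPI_INFINITE, &status);
  mtapi_finalize(&status);
  return 0;
}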

Related work on MTAPI includes an implementation by the European Space Agency (ESA) [18] for the LEON4 processor, a synthesizable VHDL model of a 32-bit processor compliant with the SPARC V8 architecture. Wallentowitz et al. [19] presented a baseline implementation and plans for deploying MCAPI and MTAPI on tiled many-core SoCs.

This project is a collaboration with Siemens, who created their own industry-grade MTAPI implementation as part of a larger open source project called Embedded Multicore Building Blocks (EMB\(^2\)) [20]. EMB\(^2\) has been specifically designed for embedded systems and the typical requirements that accompany them, such as predictable memory consumption, which is essential for safety-critical applications, and real-time capability. For the latter, the library supports task priorities and affinities, and the scheduling strategy can be optimized for non-functional requirements such as minimal latency and fairness.

Besides the task scheduler, EMB\(^2\) provides parallel algorithms like loops and reductions, concurrent data structures, and high-level patterns for implementing stream processing applications. These building blocks are largely implemented in a non-blocking (lock-free) fashion, thus preventing frequently encountered pitfalls like lock contention, deadlocks, and priority inversion. As another advantage in real-time systems, the algorithms and data structures give certain progress guarantees [21].

We evaluate both implementations and demonstrate the usability and applicability of MTAPI for heterogeneous embedded multicore platforms.

4 MTAPI Design and Usage

4.1 Job Scheduling and Actions

As mentioned earlier, MTAPI decomposes computations into multiple tasks, schedules them among the available processing units, and combines the results after synchronization. Here a task is defined as a light-weight operation that describes the job to be done. During task creation, however, the task does not yet know with which action it will be associated. MTAPI provides a dynamic binding policy between tasks and actions, which allows jobs to be scheduled on more than one hardware type; the scheduler handles the load balancing. Depending on where the associated action is located, a task is marked either as a local task or as a remote task: if the task is assigned to an action residing on the same node, it is a local task; otherwise it is a remote task. Figure 2 gives an example of the relationship between tasks and actions. In the example, tasks a, b, and d are assigned to actions a, b, and d, respectively, on remote nodes other than node 1, thus becoming remote tasks. On the other hand, task c is associated with action c on node 1, making it a local task. Each node consists of different processors.

Fig. 2. MTAPI Job and Action

The MTAPI RTL defines an abstract interface for thread control including thread creation, termination, and synchronization using mutexes or semaphores. MTAPI kernel developers may implement this interface with particular thread libraries for the target platform, thus making MTAPI portable across a wide range of architectures. This portable and flexible approach is one of the appealing factors of MTAPI.
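As a purely hypothetical illustration (the names and layout below are not the actual UH-MTAPI interface), such an abstraction can be pictured as a small table of function pointers that each port binds to its native primitives, e.g. POSIX threads on Linux or a custom scheduler and locks on bare metal:

#include <pthread.h>
#include <stdlib.h>

/* Hypothetical portability layer: each port fills this table with its
 * native threading and synchronization primitives. */
typedef struct {
  int (*thread_create)(void** handle, void* (*entry)(void*), void* arg);
  int (*thread_join)(void* handle);
  int (*mutex_lock)(void* mutex);
  int (*mutex_unlock)(void* mutex);
} thread_ops_t;

/* Example binding to POSIX threads. */
static int px_create(void** h, void* (*entry)(void*), void* arg) {
  pthread_t* t = (pthread_t*)malloc(sizeof(pthread_t));
  *h = t;
  return pthread_create(t, NULL, entry, arg);
}
static int px_join(void* h)   { return pthread_join(*(pthread_t*)h, NULL); }
static int px_lock(void* m)   { return pthread_mutex_lock((pthread_mutex_t*)m); }
static int px_unlock(void* m) { return pthread_mutex_unlock((pthread_mutex_t*)m); }

static const thread_ops_t posix_ops = { px_create, px_join, px_lock, px_unlock };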

Listing 1.1 demonstrates a matrix multiplication program using MTAPI. In this code, two action functions implement the matrix multiplication job: ActionFunction_GPU is implemented as a CUDA kernel, while ActionFunction_CPU is implemented as a sequential CPU kernel. After defining the two action functions, we initialize the MTAPI environment by attaching both actions to the same matrix multiplication job. We then create three tasks; arg_GPU and arg_CPU are pointers to the matrix data. These tasks are assigned to the different actions for execution, so the GPU and the CPU carry out the computation in parallel.

Listing 1.1. Matrix multiplication using MTAPI with CPU and GPU actions
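The original listing is not reproduced here; the following C sketch illustrates its structure under stated assumptions: the IDs, the argument struct, and matmul_seq are placeholders, the GPU action calls the sequential routine as a stand-in for the actual CUDA kernel launch so that the sketch stays self-contained, and only two of the three tasks are shown.

#include <mtapi.h>

#define MM_DOMAIN 1   /* illustrative IDs */
#define MM_NODE   1
#define MM_JOB    1

typedef struct { const float *a, *b; float *c; int n; } mm_args_t;

/* Plain sequential kernel used by the CPU action. */
static void matmul_seq(const float* a, const float* b, float* c, int n) {
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      float s = 0.0f;
      for (int k = 0; k < n; ++k) s += a[i * n + k] * b[k * n + j];
      c[i * n + j] = s;
    }
}

static void ActionFunction_CPU(const void* args, mtapi_size_t args_size,
    void* res, mtapi_size_t res_size, const void* nld, mtapi_size_t nld_size,
    mtapi_task_context_t* ctx) {
  const mm_args_t* m = (const mm_args_t*)args;
  matmul_seq(m->a, m->b, m->c, m->n);
}

static void ActionFunction_GPU(const void* args, mtapi_size_t args_size,
    void* res, mtapi_size_t res_size, const void* nld, mtapi_size_t nld_size,
    mtapi_task_context_t* ctx) {
  const mm_args_t* m = (const mm_args_t*)args;
  /* In the original listing this launches a CUDA kernel (built with nvcc);
   * the sequential routine stands in here to keep the sketch self-contained. */
  matmul_seq(m->a, m->b, m->c, m->n);
}

void run_matmul(mm_args_t* arg_CPU, mm_args_t* arg_GPU) {
  mtapi_status_t status;
  mtapi_info_t info;
  mtapi_initialize(MM_DOMAIN, MM_NODE, MTAPI_DEFAULT_NODE_ATTRIBUTES,
                   &info, &status);

  /* Both actions implement the same job; the runtime binds each task
   * to one of them at execution time. */
  mtapi_action_create(MM_JOB, ActionFunction_CPU, MTAPI_NULL, 0,
                      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_action_create(MM_JOB, ActionFunction_GPU, MTAPI_NULL, 0,
                      MTAPI_DEFAULT_ACTION_ATTRIBUTES, &status);
  mtapi_job_hndl_t job = mtapi_job_get(MM_JOB, MM_DOMAIN, &status);

  mtapi_task_hndl_t t1 = mtapi_task_start(MTAPI_TASK_ID_NONE, job,
      arg_GPU, sizeof(*arg_GPU), MTAPI_NULL, 0,
      MTAPI_DEFAULT_TASK_ATTRIBUTES, MTAPI_GROUP_NONE, &status);
  mtapi_task_hndl_t t2 = mtapi_task_start(MTAPI_TASK_ID_NONE, job,
      arg_CPU, sizeof(*arg_CPU), MTAPI_NULL, 0,
      MTAPI_DEFAULT_TASK_ATTRIBUTES, MTAPI_GROUP_NONE, &status);

  mtapi_task_wait(t1, MTAPI_INFINITE, &status);
  mtapi_task_wait(t2, MTAPI_INFINITE, &status);
  mtapi_finalize(&status);
}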

4.2 Inter-Node Communication

Essentially, each node has one receiver thread and one sender thread. These threads initialize the MCAPI environment and create MCAPI endpoints for message passing through MCAPI function calls; together they form the MCAPI communication layer between the nodes within a domain. Technically, the data and information transported between MTAPI nodes are packed into MCAPI messages, which MCAPI then transports across nodes for task load balancing, information updates, and synchronization. Each message contains the domain ID, node ID, and port ID. Once a message is created, it is inserted into a centralized message queue on the node, where it waits for the sender to initiate the communication. Every message is assigned a priority: high-priority messages, such as action updates, are inserted at the head of the message queue, while low-priority messages, such as load-balancing messages, are inserted at the tail. The sender wraps each MTAPI message into an MCAPI message, according to its type, and sends it to its destination node. The receiver thread keeps listening to its neighboring nodes to check whether an MCAPI message has been sent to this node. Upon receipt of an MCAPI message, the receiver decodes it and creates an MTAPI message carrying the necessary information. The receiver then pushes the newly created MTAPI message into the message queue, where it waits for the next processing cycle of the sender thread, and continues listening to its neighboring nodes. In the UH-MTAPI design, a priority scheduler manages this centralized message queue and keeps the messages sorted by priority.
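The following hypothetical sketch illustrates the message layout and the priority-ordered enqueue described above; the field names, message types, and queue structure are illustrative and not UH-MTAPI's actual data structures. The sender thread pops messages from the head of this queue, wraps each one into an MCAPI message, and sends it to the endpoint of the destination node.

#include <stddef.h>

/* Illustrative message types and layout. */
typedef enum { MSG_ACTION_UPDATE, MSG_LOAD_BALANCE, MSG_SYNC } msg_type_t;

typedef struct msg {
  msg_type_t  type;
  unsigned    domain_id, node_id, port_id;  /* MCAPI addressing information */
  unsigned    priority;                     /* 0 = highest priority */
  size_t      payload_size;
  const void* payload;
  struct msg* next;
} msg_t;

typedef struct { msg_t* head; msg_t* tail; } msg_queue_t;

/* High-priority messages (e.g. action updates) are inserted at the head of
 * the centralized queue; low-priority ones (e.g. load balancing) at the tail. */
static void enqueue(msg_queue_t* q, msg_t* m) {
  m->next = NULL;
  if (m->priority == 0) {                     /* urgent: insert at head */
    m->next = q->head;
    q->head = m;
    if (q->tail == NULL) q->tail = m;
  } else {                                    /* default: append at tail */
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
  }
}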

5 Performance Evaluation

In this section, we evaluate the Siemens MTAPI implementation and UH-MTAPI. We select applications from BOTS [7] and the Rodinia benchmarks [22] to demonstrate their performance. The benchmarks are executed on NVIDIA's Jetson TK1 embedded development platform [23] with a Tegra K1 processor, which integrates a 4-Plus-1 quad-core ARM Cortex-A15 CPU and a Kepler GPU with 192 cores. We use the GCC OpenMP implementation shipped with the board by NVIDIA as a reference for comparison.

SparseLU Benchmark: The SparseLU factorization benchmark from BOTS computes an LU factorization of a sparse matrix. A sparse matrix contains submatrix blocks that may not be allocated, and the unallocated blocks lead to load imbalance. Thus, task parallelism performs better than work-sharing directives such as OpenMP's parallel for. In the SparseLU factorization, tasks are created only for the allocated submatrix blocks to reduce the overhead caused by this imbalance.

The sparse matrix contains \(50\times 50\) submatrices, where each submatrix has size \(100\times 100\) on both hardware platforms. We collect multiple metrics such as execution time, matrix size, and number of threads. The execution time for calculating the speed-up is measured on the CPU for the core part of the computation, excluding I/O and initial setup.
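A simplified sketch of this task-creation pattern is shown below for a single factorization phase; the block layout, the argument struct, and the helper names are assumptions, and the real BOTS port additionally handles the dependencies between its lu0, fwd, bdiv, and bmod phases. The MTAPI setup (job, group) follows the sketch in Sect. 3.

#include <mtapi.h>
#include <stddef.h>

#define NB 50   /* number of submatrix blocks per dimension */

typedef struct { float* block; int bs; } block_args_t;   /* illustrative */

/* Start one MTAPI task per allocated submatrix block; unallocated (NULL)
 * blocks are skipped, which avoids the imbalance of a plain parallel loop. */
static void process_blocks(float* blocks[NB][NB], int bs,
                           mtapi_job_hndl_t job, mtapi_group_hndl_t group) {
  static block_args_t args[NB][NB];         /* arguments must outlive the tasks */
  mtapi_status_t status;
  for (int i = 0; i < NB; ++i) {
    for (int j = 0; j < NB; ++j) {
      if (blocks[i][j] == NULL) continue;   /* no task for unallocated blocks */
      args[i][j].block = blocks[i][j];
      args[i][j].bs = bs;
      mtapi_task_start(MTAPI_TASK_ID_NONE, job,
                       &args[i][j], sizeof(block_args_t), MTAPI_NULL, 0,
                       MTAPI_DEFAULT_TASK_ATTRIBUTES, group, &status);
    }
  }
  mtapi_group_wait_all(group, MTAPI_INFINITE, &status);  /* per-phase barrier */
}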

Figure 3(a) shows the speed-up using different implementations. The UH-MTAPI implementation demonstrates comparable performance with the Siemens MTAPI implementation as well as GCC’s OpenMP version. Both Siemens and UH-MTAPI implementations achieve a roughly linear speed-up which indicates their scalability on multicore processors.

Heartwall Benchmark: The Heartwall tracking benchmark is an application from Rodinia [22] that tracks the changing shape of a mouse's heart wall. We reorganized it by splitting the loop parallelism into tasks, where each task handles a chunk of the image data. The image-processing procedures are encapsulated in an action function that processes the data associated with the corresponding tasks. Figure 3(b) shows the speed-up over a single thread. We observe that the task parallelism provided by UH-MTAPI matches the performance of the data parallelism offered by OpenMP's parallel for and of the Siemens MTAPI implementation. However, none of the three versions meets the expectation of linear speed-up as the number of threads increases.

Fig. 3. Speed-up for the SparseLU and Heartwall benchmarks with OpenMP, Siemens MTAPI, and UH-MTAPI on the NVIDIA Tegra TK1 board

Matrix-Matrix Multiplication: This benchmark (multiplication of dense matrices) is relatively compute-intensive; the complexity of a traditional multiplication of two square matrices is \(\mathcal {O}(n^3)\). Although matrix multiplication can be implemented using work-sharing directives such as OpenMP's parallel for, the computation takes a long time due to the limited number of CPU threads and poor data locality. In contrast, heterogeneous systems with accelerators such as GPUs are a good fit for such algorithms, specifically because their architecture with a large number of processing units allows many threads to run concurrently. Additionally, GPU matrix-matrix multiplication algorithms are potentially more cache friendly than CPU algorithms [24]. We implemented different types of action functions targeting the different processing units: the CPU action is implemented in C++ while the GPU action relies on CUDA [25]. Moreover, we designed four different approaches to execute the benchmark, and an additional approach was used by Siemens to achieve maximum performance using both CPU and GPU:

  • ARM-Seq: Sequential implementation on the ARM CPU.

  • MTAPI-CPU: MTAPI implementation with a single action for the ARM CPU.

  • MTAPI-CPU-GPU: MTAPI implementation with actions for both CPU and GPU.

  • MTAPI-GPU: MTAPI implementation with a single action for the GPU.

  • MTAPI-CPU-GPU-Opt: Same as MTAPI-CPU-GPU, but with work item sizes tailored to the particular needs of the respective computation units and with the copying of data to the GPU overlapped with computation.

Figure 4(a) shows the normalized execution times for matrix sizes 128, 256, 512, and 1024 for UH-MTAPI; Fig. 4(b) shows the results for Siemens MTAPI. We observe that the ARM action has comparable performance with the GPU action for matrices with sizes less than 128.

Fig. 4. Normalized execution time for matrix multiplication with UH-MTAPI and Siemens MTAPI on the NVIDIA Tegra TK1 board

The data copied between the CPU and the GPU poses a major communication overhead. However, as the matrix size increases, the data-copying time becomes negligible relative to the computation, which is why the GPU action outperforms the CPU action. A simple distribution of the work to both processing units did not yield a speedup: the CPU action is far slower than the GPU action, and with equally sized work items the GPU finishes while the CPU is still calculating. For this reason, the optimized version uses bigger work items for the GPU and smaller ones for the CPU. Moreover, data is transferred asynchronously, thus hiding the transfer time behind computation. This technique results in a speedup in all tested cases, but the contribution of the CPU shrinks with increasing matrix size, as expected.
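The following host-side sketch outlines this overlap, assuming row-major \(n\times n\) matrices in pinned host memory (allocated with cudaMallocHost), preallocated device buffers dA, dB, and dC, and a separately compiled kernel wrapper launch_matmul_gpu; the split factor gpu_share and all names are illustrative rather than the code used in the experiments.

#include <cuda_runtime.h>

/* Assumed wrapper around the CUDA matrix-multiplication kernel computing the
 * first 'rows' rows of C on the given stream; implemented separately with nvcc. */
void launch_matmul_gpu(const float* dA, const float* dB, float* dC,
                       int n, int rows, cudaStream_t stream);

/* Sequential CPU kernel for the remaining rows (row_begin .. n-1). */
static void matmul_cpu_rows(const float* a, const float* b, float* c,
                            int n, int row_begin) {
  for (int i = row_begin; i < n; ++i)
    for (int j = 0; j < n; ++j) {
      float s = 0.0f;
      for (int k = 0; k < n; ++k) s += a[i * n + k] * b[k * n + j];
      c[i * n + j] = s;
    }
}

/* gpu_share > 0.5 reflects that equally sized work items would leave the GPU
 * idle while the CPU is still calculating. */
void matmul_overlapped(const float* a, const float* b, float* c, int n,
                       float* dA, float* dB, float* dC, float gpu_share) {
  int rows_gpu = (int)(gpu_share * n);
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  /* Enqueue the copies and the kernel asynchronously ... */
  cudaMemcpyAsync(dA, a, (size_t)rows_gpu * n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  cudaMemcpyAsync(dB, b, (size_t)n * n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  launch_matmul_gpu(dA, dB, dC, n, rows_gpu, stream);
  cudaMemcpyAsync(c, dC, (size_t)rows_gpu * n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  /* ... and compute the CPU's (smaller) share while the GPU works. */
  matmul_cpu_rows(a, b, c, n, rows_gpu);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
}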

6 Conclusion and Future Work

Programming models for heterogeneous multicore systems are important yet challenging. In this paper, we described the design and implementation of runtime libraries for a parallel programming standard, the Multicore Task Management API (MTAPI). MTAPI enables application-level task parallelism on embedded devices with symmetric or asymmetric multicore processors. We showed that MTAPI provides a straightforward way to develop portable and scalable applications targeting heterogeneous systems. Our experimental results for MTAPI on different benchmarks show competitive performance compared to OpenMP while being more flexible. In the future, we will target further platforms such as DSPs.

Our sincere gratitude goes to the anonymous reviewers, and many thanks to Markus Levy, President of the Multicore Association, for his continued support.