Proceeding Downloads
An Evaluation of Emerging Many-Core Parallel Programming Models
In this work we directly evaluate several emerging parallel programming models: Kokkos, RAJA, OpenACC, and OpenMP 4.0, against the mature CUDA and OpenCL APIs. Each model has been used to port TeaLeaf, a miniature proxy application, or mini-app, that ...
Discovering Pipeline Parallel Patterns in Sequential Legacy C++ Codes
Since the "free lunch" of processor performance is over, parallelism has become the new trend in hardware and architecture design. However, parallel resources deployed in data centers are underused in many cases, given that sequential programming is still ...
Embedding Semantics of the Single-Producer/Single-Consumer Lock-Free Queue into a Race Detection Tool
- Manuel F. Dolz,
- David del Rio Astorga,
- Javier Fernández,
- J. Daniel García,
- Félix García-Carballeira,
- Marco Danelutto,
- Massimo Torquati
The rapid progress of multi-/many-core architectures has left data-intensive parallel applications not yet fully able to attain maximum performance. The advent of parallel programming frameworks offering structured patterns has alleviated ...
Accelerating Dynamic Data Race Detection Using Static Thread Interference Analysis
Precise dynamic race detectors report an error if and only if two or more threads concurrently make conflicting accesses to the same memory location. They insert instrumentation at compile time to perform runtime checks on all memory accesses to ensure that all races ...
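The "if and only if" criterion above can be illustrated with a minimal happens-before check. This is a sketch assuming vector-clock timestamps, a common technique in precise dynamic race detection; it is not the paper's static thread interference analysis, and the `Access` record and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """Hypothetical record of one instrumented memory access."""
    thread: int
    address: int
    is_write: bool
    clock: tuple  # vector clock of the accessing thread at the time of the access


def happens_before(a: tuple, b: tuple) -> bool:
    """True if vector clock a is ordered strictly before vector clock b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def is_race(a: Access, b: Access) -> bool:
    """Two accesses race if they come from different threads, touch the same
    address, at least one is a write, and neither happens before the other."""
    if a.thread == b.thread or a.address != b.address:
        return False
    if not (a.is_write or b.is_write):
        return False  # two reads never conflict
    return not happens_before(a.clock, b.clock) and not happens_before(b.clock, a.clock)
```

For example, a write by thread 0 at clock `(1, 0)` and a read of the same address by thread 1 at clock `(0, 1)` are unordered and therefore race; if thread 1 first synchronizes with thread 0 so that its clock becomes `(1, 1)`, the write happens before the read and no race is reported.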
Efficient Parallelization of Complex Automotive Systems
As the automotive industry seeks to include more and more features in its vehicles while simultaneously attempting to reduce the number of "Electronic Control Units" (ECUs) that execute the corresponding embedded software, the necessary policy shift ...
Enhancing Metaheuristic-based Virtual Screening Methods on Massively Parallel and Heterogeneous Systems
Molecular docking through Virtual Screening is an optimization problem which can be approached with metaheuristic methods. The interaction between two chemical compounds (typically a protein or receptor and a small molecule or ligand) is measured with ...
Parallel Locality and Parallelization Quality
This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread across the available cores by fork machine instructions emitted by the compiler, for example at function calls and loop iterations. ...
Software-managed Cache Coherence for fast One-Sided Communication
Current many-core designs target core counts at which cache coherence becomes a serious challenge. Therefore, this paper discusses how one-sided communication can be implemented on a non-cache-coherent many-core CPU. The Intel SCC serves as an ...
Multitasking Real-time Embedded GPU Computing Tasks
In this study, we consider the specific characteristics of workloads that involve multiple real-time embedded GPU computing tasks and design several schedulers that use alternative approaches. Then, we compare the performance of the schedulers and determine ...
Flow Driven GPGPU Programming combining Textual and Graphical Programming
GPGPU (General-Purpose Computation on Graphics Processing Units) has become the most important invention of recent years in computer graphics and computer vision. Despite improvements to the two main programming platforms, CUDA (Compute Unified ...
Multi-GPU implementation of the Horizontal Diffusion method of the Weather Research and Forecast Model
The Weather Research and Forecasting (WRF) model, a next-generation mesoscale numerical weather prediction system, has been the subject of considerable work on GPU acceleration. However, work exploiting multi-GPU systems is limited. This work ...
JParEnt: Parallel Entropy Decoding for JPEG Decompression on Heterogeneous Multicore Architectures
The JPEG format is the de facto image compression standard, with billions of views every day. Parallelizing the entropy decoding step of the JPEG decompression algorithm remains a challenging problem, because codewords are of variable length, and the ...
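The sequential dependency described above can be seen in a minimal prefix-code decoder: the bit offset at which a codeword begins is known only after every earlier codeword has been decoded. This sketch uses a hypothetical four-symbol toy code rather than a real JPEG Huffman table, ignores JPEG-specific details such as restart markers and byte stuffing, and does not describe JParEnt's own method.

```python
# Hypothetical toy prefix code (not a real JPEG Huffman table).
TOY_CODE = {"0": "A", "10": "B", "110": "C", "111": "D"}


def decode_sequential(bits: str, code: dict) -> list:
    """Decode a prefix-coded bit string into (start_offset, symbol) pairs.

    Each codeword's start offset is determined only after all earlier
    codewords have been decoded, which is what makes naive parallel
    decoding (splitting the stream at arbitrary offsets) hard.
    """
    out = []
    start = 0      # bit offset where the current codeword begins
    current = ""   # bits accumulated for the current codeword
    for i, b in enumerate(bits):
        current += b
        if current in code:
            out.append((start, code[current]))
            start = i + 1
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return out
```

For the bit string `"0101100111"`, the decoder finds codewords starting at offsets 0, 1, 3, 6 and 7; none of those offsets could have been computed without first decoding the preceding codewords.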
On Guided Installation of Basic Linear Algebra Routines in Nodes with Manycore Components
Computational systems today are composed of basic components that combine multiprocessors and coprocessors of different types, typically several GPUs or MICs. Software previously developed and optimized for simpler systems needs to ...