Batched Gauss-Jordan Elimination for Block-Jacobi Preconditioner Generation on GPUs
In this paper, we design and evaluate a routine for the efficient generation of block-Jacobi preconditioners on graphics processing units (GPUs). Concretely, to exploit the architecture of the graphics accelerator, we develop a batched Gauss-Jordan ...
TaskInsight: Understanding Task Schedules Effects on Memory and Performance
Recent scheduling heuristics for task-based applications have managed to improve their performance by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insights ...
A high-performance portable abstract interface for explicit SIMD vectorization
This work establishes a scalable, easy-to-use, and efficient approach for exploiting SIMD capabilities of modern CPUs, without the need for extensive knowledge of architecture-specific instruction sets. We provide a description of a new API, known as UME:...
PETRAS: Performance, Energy and Thermal Aware Resource Allocation and Scheduling for Heterogeneous Systems
Many computing systems today are heterogeneous in that they consist of a mix of different types of processing units (e.g., CPUs, GPUs). Each of these processing units has different execution capabilities and energy consumption characteristics. Job ...
Reduction to Tridiagonal Form for Symmetric Eigenproblems on Asymmetric Multicore Processors
Asymmetric multicore processors (AMPs), such as those present in ARM big.LITTLE technology, have been proposed as a means to address the end of Dennard power scaling law. The idea of these architectures is to activate only the type (and number) of cores that ...
High Performance Detection of Strongly Connected Components in Sparse Graphs on GPUs
Detecting strongly connected components (SCC) has been broadly used in many real-world applications. To speed up SCC detection for large-scale graphs, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations are able ...
Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views
In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPU): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic ...
Assessing One-to-One Parallelism Levels Mapping for OpenMP Offloading to GPUs
The proliferation of accelerators in modern clusters makes efficient coprocessor programming a key requirement if application codes are to achieve high levels of performance with acceptable energy consumption on such platforms. This has led to ...
A Framework for Developing Parallel Applications with high level Tasks on Heterogeneous Platforms
Traditional widely used parallel programming models and methods focus on data distribution and are suitable for implementing data parallelism. They lack the abstraction of task parallelism and make it inconvenient to separate the applications' high ...
- Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores