ICS is well known as the premier technical forum for researchers to present their latest results and to discuss the state of the art in high-performance computing (HPC). ICS 2017 continues the tradition of featuring keynote presentations that emphasize new directions and results in HPC; strong, peer-reviewed technical presentations; and a carefully selected tutorial and workshop covering special topics of interest in HPC.
Demystifying automata processing: GPUs, FPGAs or Micron's AP?
Many established and emerging applications perform at their core some form of pattern matching, a computation that maps naturally onto finite automata abstractions. As a consequence, in recent years there has been a substantial amount of work on high-...
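The core idea, that pattern matching maps naturally onto finite automata, can be sketched with a tiny hypothetical DFA (an illustration, not from the paper) that accepts any input containing the substring "ab":

```python
# Minimal DFA illustrating pattern matching as automata processing.
# States: 0 = start, 1 = just saw 'a', 2 = accept (saw "ab").
def make_dfa():
    def step(state, ch):
        if state == 2:           # accept state is absorbing
            return 2
        if ch == 'a':            # an 'a' always (re)starts a potential match
            return 1
        if state == 1 and ch == 'b':
            return 2             # completed "ab"
        return 0
    return step

def matches(text):
    step = make_dfa()
    state = 0
    for ch in text:              # one table lookup per input symbol
        state = step(state, ch)
    return state == 2
```

GPUs, FPGAs, and Micron's Automata Processor differ mainly in how many such per-symbol transitions they can evaluate concurrently across automata and inputs.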
Enabling scalability-sensitive speculative parallelization for FSM computations
Finite state machines (FSMs) are the backbone of many applications, but are difficult to parallelize due to their inherent dependencies. Speculative FSM parallelization has shown promise on multicore machines with up to eight cores. However, as hardware ...
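The enumerative idea behind speculative FSM parallelization can be sketched as follows (an illustrative toy, not the paper's system): each input chunk is processed for every possible entry state, which can happen in parallel, and the resulting per-chunk state maps are composed afterward.

```python
# Illustrative sketch of speculative/enumerative FSM parallelization.
def run_chunk(step, states, chunk):
    # For each possible entry state, compute the exit state after the chunk.
    out = {}
    for s in states:
        cur = s
        for ch in chunk:
            cur = step(cur, ch)
        out[s] = cur
    return out

def speculative_run(step, states, start, text, nchunks=4):
    n = len(text)
    bounds = [n * i // nchunks for i in range(nchunks + 1)]
    chunks = [text[bounds[i]:bounds[i + 1]] for i in range(nchunks)]
    # In a real system the per-chunk maps are computed in parallel;
    # here they are computed serially for clarity.
    maps = [run_chunk(step, states, c) for c in chunks]
    state = start                # composing the maps is a cheap serial pass
    for m in maps:
        state = m[state]
    return state
```

The cost of enumerating all entry states per chunk is exactly the scalability concern the abstract alludes to: it grows with the number of FSM states.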
SPIRIT: a framework for creating distributed recursive tree applications
An important set of applications, from diverse domains such as cosmological simulations, data mining, and computer graphics, involve repeated, depth-first traversal of trees. As these applications operate over massive data sets, it is often necessary to ...
Frequent subtree mining on the automata processor: challenges and opportunities
Frequency counting of complex patterns such as subtrees is more challenging than for simple itemsets and sequences, as the number of possible candidate patterns in a tree is much higher than in one-dimensional data structures, with dramatically higher ...
Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs
This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations,...
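To make the problem setting concrete, here is a pure-Python sketch of a batched interface over many independent, variable-size matrix products; a real framework would group and dispatch these to GPU kernels rather than loop serially, and the names here are hypothetical.

```python
# Plain triple-loop matrix product for one (possibly small) problem.
def matmul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner)) for j in range(cols)]
            for i in range(rows)]

def batched_matmul(batch):
    # batch: list of (A, B) pairs with independent, possibly different sizes.
    # Each problem is independent, so a GPU framework can execute them
    # concurrently; here they are computed one after another.
    return [matmul(A, B) for A, B in batch]
```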
Packet coalescing exploiting data redundancy in GPGPU architectures
General Purpose Graphics Processing Units (GPGPUs) are becoming a cost-effective hardware approach for parallel computing. Many workloads running on GPGPUs place heavy stress on the memory system, creating network bottlenecks near memory controllers. We ...
Dynamic scheduling for efficient hierarchical sparse matrix operations on the GPU
We introduce a hierarchical sparse matrix representation (HiSparse) tailored for the graphics processing unit (GPU). The representation adapts to the local nonzero pattern at all levels of the hierarchy and uses reduced bit length for addressing the ...
Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs
Convolutional networks (ConvNets), largely running on GPUs, have become the most popular approach to computer vision. Now that CPUs are closing the FLOPS gap with GPUs, efficient CPU algorithms are becoming more important. We propose a novel parallel ...
HPAT: high performance analytics with scripting ease-of-use
Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We ...
Simplification and runtime resolution of data dependence constraints for loop transformations
Loop transformations such as tiling, parallelization or vectorization are essential tools in the quest for high-performance program execution. Precise data dependence analysis is required to determine whether the compiler can apply a transformation or ...
Optimizing recursive task parallel programs
We present DECAF, a new optimization that improves recursive task parallel (RTP) programs by reducing task creation and termination overheads. DECAF reduces task termination (join) operations by aggressively increasing the scope of join ...
Fast segmented sort on GPUs
Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. Along with the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly important role ...
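A reference definition of segmented sort (serial Python sketch, not the paper's GPU kernel): each segment, delimited by an offsets array, is sorted independently within the flat array.

```python
# Segmented sort: sort each independent segment of a flat array.
# offsets[i]..offsets[i+1] delimits segment i (standard CSR-style offsets).
def segmented_sort(data, offsets):
    out = list(data)
    for i in range(len(offsets) - 1):
        lo, hi = offsets[i], offsets[i + 1]
        out[lo:hi] = sorted(out[lo:hi])   # segments are independent jobs
    return out
```

Because the segments are independent and typically vary widely in length, the GPU challenge is load-balancing many small sorts rather than the sorting itself.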
Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU
The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels and in particular, the sparse matrix-vector product (SpMV). Despite great strides, ...
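For context, a minimal serial SpMV over the standard CSR format (a generic textbook sketch, not the paper's adaptive GPU scheme):

```python
# Sparse matrix-vector product y = A*x with A in CSR form:
# row_ptr[r]..row_ptr[r+1] indexes the nonzeros of row r in vals/col_idx.
def spmv_csr(row_ptr, col_idx, vals, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

The per-row nonzero counts in `row_ptr` are exactly what makes naive row-per-thread GPU mappings imbalanced, motivating the locally adaptive scheme in the title.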
On improving performance of sparse matrix-matrix multiplication on GPUs
Sparse matrix-matrix multiplication (SpGEMM) is an important primitive for many data analytics algorithms, such as Markov clustering. Unlike the dense case, where performance of matrix-matrix multiplication is considerably higher than matrix-vector ...
A performance analysis framework for exploiting GPU microarchitectural capability
GPUs are widely used in accelerating deep neural networks (DNNs) for their high bandwidth and parallelism. But tuning the performance of DNN computations is challenging, as it requires a thorough understanding of both underlying architectures and ...
GraphGrind: addressing load imbalance of graph partitioning
We investigate how graph partitioning adversely affects the performance of graph analytics. We demonstrate that graph partitioning induces extra work during graph traversal and that graph partitions have markedly different connectivity than the original ...
Automatic topology mapping of diverse large-scale parallel applications
Topology-aware mapping aims at assigning tasks to processors in a way that minimizes network load, thus reducing the time spent waiting for communication to complete. Many mapping schemes and algorithms have been proposed. Some are application or domain ...
Design and implementation of bandwidth-aware memory placement and migration policies for heterogeneous memory systems
Heterogeneous memory systems that comprise memory nodes based on widely-different device technologies (e.g., DRAM and nonvolatile memory (NVM)) are emerging in various computing domains ranging from high-performance to embedded computing. Despite the ...
Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation
Modern chip multiprocessors (CMPs) employ on-chip networks to enable communication between the individual cores. Operations such as coherence and synchronization generate a significant amount of the on-chip network traffic, and often create network ...
Way-combining directory: an adaptive and scalable low-cost coherence directory
Today, general-purpose commercial multicores approaching one hundred cores are already a reality, and even thousand-core chips are being prototyped. Maintaining coherence across such a high number of cores in these manycore architectures requires careful ...
Iteration-fusing conjugate gradient
This paper presents the Iteration-Fusing Conjugate Gradient (IFCG) approach, an evolution of the Conjugate Gradient method that consists of i) letting computations from different iterations overlap and ii) splitting linear ...
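For reference, here is the textbook (unfused) Conjugate Gradient in pure Python, showing the per-iteration kernels whose overlap IFCG targets; this is the classical baseline, not the paper's algorithm.

```python
# Classical Conjugate Gradient for a symmetric positive-definite system Ax = b.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def cg(A, b, iters=50, tol=1e-12):
    n = len(b)
    x = [0.0] * n
    r = list(b)                  # residual r = b - A*x0 with x0 = 0
    p = list(r)                  # search direction
    rs = dot(r, r)
    for _ in range(iters):
        Ap = matvec(A, p)        # the SpMV: the dominant kernel per iteration
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)       # global reduction: a synchronization point
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

The reductions (`dot`) force synchronization between iterations in this baseline; overlapping work across iterations is what relaxes those barriers.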
Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques
In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that, being standard ...
HiPA: history-based piecewise approximation for functions
Applications that can tolerate a certain degree of inaccuracy offer opportunities for performance improvement and/or power reduction through techniques that produce approximate results. Such techniques have been proposed at many levels of the system ...
Efficient SIMD and MIMD parallelization of hash-based aggregation by conflict mitigation
As the rate of data generation is growing exponentially each year, data aggregation has become one of the most common and expensive operations for data analysis. Previous efforts to accelerate data aggregation have been mainly focused on multi-core CPUs,...
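The operation being accelerated is essentially a hash-based group-by; here is a serial Python sketch of it (the SIMD/MIMD conflict-mitigation strategy is the paper's contribution and is not reproduced here):

```python
# Hash-based aggregation: group-by-key sum over a stream of (key, value) pairs.
def hash_aggregate(keys, values):
    table = {}
    for k, v in zip(keys, values):
        # In parallel SIMD/MIMD versions, two lanes or threads updating the
        # same slot at once constitute a conflict; serially this is just a
        # dictionary update.
        table[k] = table.get(k, 0) + v
    return table
```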
Revisiting phased transactional memory
In recent years, Hybrid TM (HyTM) has been proposed as a transactional memory approach that leverages the advantages of both hardware (HTM) and software (STM) execution modes. HyTM assumes that concurrent transactions can have very different phases ...
Hardware/software cooperative caching for hybrid DRAM/NVM memory architectures
Non-Volatile Memory (NVM) has recently emerged for its nonvolatility, high density, and energy efficiency. Hybrid memory systems composed of DRAM and NVM offer the best of both worlds, because NVM can provide larger capacity and near-zero standby power ...
SSDUP: a traffic-aware ssd burst buffer for HPC systems
Many high performance computing (HPC) applications are highly data intensive. Current HPC storage systems still use hard disk drives (HDDs) as their dominant storage devices, which suffer from disk head thrashing when accessing random data. New storage ...
libPRISM: an intelligent adaptation of prefetch and SMT levels
Cristobal Ortega, Miquel Moreto, Marc Casas, Ramon Bertran, Alper Buyuktosunoglu, Alexandre E. Eichenberger, and Pradip Bose
Current microprocessors include several knobs for modifying hardware behavior to improve performance under different workload demands. Impractical and time-consuming offline profiling is needed to evaluate the design space and find the ...