DOI: 10.1145/3079079
ICS '17: Proceedings of the International Conference on Supercomputing
ACM 2017 Proceeding
General Chairs:
  • William D. Gropp
  • Pete Beckman
Program Chairs:
  • Zhiyuan Li
  • Francisco J. Cazorla
Publisher:
  • Association for Computing Machinery
  • New York
  • NY
  • United States
Conference:
ICS '17: 2017 International Conference on Supercomputing, Chicago, Illinois, June 14-16, 2017
ISBN:
978-1-4503-5020-4
Published:
14 June 2017

Abstract

ICS is well known as the premier technical forum for researchers to present their latest results and to discuss the state of the art in high-performance computing (HPC). ICS 2017 continues the strong tradition of featuring keynote presentations emphasizing new directions and results in HPC; strong, peer-reviewed technical presentations; and a carefully selected tutorial and workshop covering special topics of interest in HPC.

SESSION: Automata and tree-mining optimization
research-article
Demystifying automata processing: GPUs, FPGAs or Micron's AP?

Many established and emerging applications perform at their core some form of pattern matching, a computation that maps naturally onto finite automata abstractions. As a consequence, in recent years there has been a substantial amount of work on high-...

research-article
Public Access
Enabling scalability-sensitive speculative parallelization for FSM computations

Finite state machines (FSMs) are the backbone of many applications, but are difficult to parallelize due to their inherent dependencies. Speculative FSM parallelization has shown promise on multicore machines with up to eight cores. However, as hardware ...

research-article
Public Access
SPIRIT: a framework for creating distributed recursive tree applications

An important set of applications, from diverse domains such as cosmological simulations, data mining, and computer graphics, involve repeated, depth-first traversal of trees. As these applications operate over massive data sets, it is often necessary to ...

research-article
Frequent subtree mining on the automata processor: challenges and opportunities

Frequency counting of complex patterns such as subtrees is more challenging than for simple itemsets and sequences, as the number of possible candidate patterns in a tree is much higher than one-dimensional data structures, with dramatically higher ...

SESSION: GPUs - part 1
research-article
Public Access
Novel HPC techniques to batch execution of many variable size BLAS computations on GPUs

This paper presents a software framework for solving large numbers of relatively small matrix problems using GPUs. Our approach combines novel and existing HPC techniques to methodically apply performance analysis, kernel design, low-level optimizations,...

research-article
Packet coalescing exploiting data redundancy in GPGPU architectures

General Purpose Graphics Processing Units (GPGPUs) are becoming a cost-effective hardware approach for parallel computing. Many executions on the GPGPUs place heavy stress on the memory system, creating network bottlenecks near memory controllers. We ...

research-article
Dynamic scheduling for efficient hierarchical sparse matrix operations on the GPU

We introduce a hierarchical sparse matrix representation (HiSparse) tailored for the graphics processing unit (GPU). The representation adapts to the local nonzero pattern at all levels of the hierarchy and uses reduced bit length for addressing the ...

SESSION: Compilation techniques
research-article
Compile-time optimized and statically scheduled N-D convnet primitives for multi-core and many-core (Xeon Phi) CPUs

Convolutional networks (ConvNets), largely running on GPUs, have become the most popular approach to computer vision. Now that CPUs are closing the FLOPS gap with GPUs, efficient CPU algorithms are becoming more important. We propose a novel parallel ...

research-article
HPAT: high performance analytics with scripting ease-of-use

Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We ...

research-article
Simplification and runtime resolution of data dependence constraints for loop transformations
Article No.: 10, pp 1–11
https://doi.org/10.1145/3079079.3079098

Loop transformations such as tiling, parallelization or vectorization are essential tools in the quest for high-performance program execution. Precise data dependence analysis is required to determine whether the compiler can apply a transformation or ...

research-article
Optimizing recursive task parallel programs
Article No.: 11, pp 1–11
https://doi.org/10.1145/3079079.3079102

We present a new optimization DECAF that optimizes recursive task parallel (RTP) programs by reducing the task creation and termination overheads. DECAF reduces the task termination (join) operations by aggressively increasing the scope of join ...

SESSION: GPUs - part 2
research-article
Public Access
Fast segmented sort on GPUs
Article No.: 12, pp 1–10
https://doi.org/10.1145/3079079.3079105

Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. Along with the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly important role ...

research-article
Globally homogeneous, locally adaptive sparse matrix-vector multiplication on the GPU
Article No.: 13, pp 1–11
https://doi.org/10.1145/3079079.3079086

The rising popularity of the graphics processing unit (GPU) across various numerical computing applications triggered a breakneck race to optimize key numerical kernels and in particular, the sparse matrix-vector product (SpMV). Despite great strides, ...

research-article
Public Access
On improving performance of sparse matrix-matrix multiplication on GPUs
Article No.: 14, pp 1–11
https://doi.org/10.1145/3079079.3079106

Sparse matrix-matrix multiplication (SpGEMM) is an important primitive for many data analytics algorithms, such as Markov clustering. Unlike the dense case, where performance of matrix-matrix multiplication is considerably higher than matrix-vector ...

research-article
A performance analysis framework for exploiting GPU microarchitectural capability
Article No.: 15, pp 1–10
https://doi.org/10.1145/3079079.3079083

GPUs are widely used in accelerating deep neural networks (DNNs) for their high bandwidth and parallelism. But tuning the performance of DNN computations is challenging, as it requires a thorough understanding of both underlying architectures and ...

SESSION: Application load imbalance, task and data mapping
research-article
GraphGrind: addressing load imbalance of graph partitioning
Article No.: 16, pp 1–10
https://doi.org/10.1145/3079079.3079097

We investigate how graph partitioning adversely affects the performance of graph analytics. We demonstrate that graph partitioning induces extra work during graph traversal and that graph partitions have markedly different connectivity than the original ...

research-article
Public Access
Automatic topology mapping of diverse large-scale parallel applications
Article No.: 17, pp 1–10
https://doi.org/10.1145/3079079.3079104

Topology-aware mapping aims at assigning tasks to processors in a way that minimizes network load, thus reducing the time spent waiting for communication to complete. Many mapping schemes and algorithms have been proposed. Some are application or domain ...

research-article
Design and implementation of bandwidth-aware memory placement and migration policies for heterogeneous memory systems
Article No.: 18, pp 1–10
https://doi.org/10.1145/3079079.3079092

Heterogeneous memory systems that comprise memory nodes based on widely-different device technologies (e.g., DRAM and nonvolatile memory (NVM)) are emerging in various computing domains ranging from high-performance to embedded computing. Despite the ...

SESSION: Hardware design
research-article
Carpool: a bufferless on-chip network supporting adaptive multicast and hotspot alleviation
Article No.: 19, pp 1–11
https://doi.org/10.1145/3079079.3079090

Modern chip multiprocessors (CMPs) employ on-chip networks to enable communication between the individual cores. Operations such as coherence and synchronization generate a significant amount of the on-chip network traffic, and often create network ...

research-article
Way-combining directory: an adaptive and scalable low-cost coherence directory
Article No.: 20, pp 1–10
https://doi.org/10.1145/3079079.3079096

Today, general-purpose commercial multicores approaching one hundred cores are already a reality and even thousand core chips are being prototyped. Maintaining coherence across such a high number of cores in these manycore architectures requires careful ...

SESSION: Runtimes and algorithms for parallel-application performance and reliability support
research-article
Iteration-fusing conjugate gradient
Article No.: 21, pp 1–10
https://doi.org/10.1145/3079079.3079091

This paper presents the Iteration-Fusing Conjugate Gradient (IFCG) approach, an evolution of the Conjugate Gradient method that consists of i) letting computations from different iterations overlap and ii) splitting linear ...

research-article
Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques
Article No.: 22, pp 1–10
https://doi.org/10.1145/3079079.3079093

In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that, being standard ...

research-article
HiPA: history-based piecewise approximation for functions
Article No.: 23, pp 1–10
https://doi.org/10.1145/3079079.3079107

Applications that can tolerate a certain degree of inaccuracy offer opportunities for performance improvement and/or power reduction through techniques that produce approximate results. Such techniques have been proposed at many levels of the system ...

SESSION: Data aggregation and hardware/software co-design approaches
research-article
Public Access
Efficient SIMD and MIMD parallelization of hash-based aggregation by conflict mitigation
Article No.: 24, pp 1–11
https://doi.org/10.1145/3079079.3079080

As the rate of data generation is growing exponentially each year, data aggregation has become one of the most common and expensive operations for data analysis. Previous efforts to accelerate data aggregation have been mainly focused on multi-core CPUs,...

research-article
Revisiting phased transactional memory
Article No.: 25, pp 1–10
https://doi.org/10.1145/3079079.3079094

In recent years, Hybrid TM (HyTM) has been proposed as a transactional memory approach that leverages the advantages of both hardware (HTM) and software (STM) execution modes. HyTM assumes that concurrent transactions can have very different phases ...

research-article
Hardware/software cooperative caching for hybrid DRAM/NVM memory architectures
Article No.: 26, pp 1–10
https://doi.org/10.1145/3079079.3079089

Non-Volatile Memory (NVM) has recently emerged for its nonvolatility, high density and energy efficiency. Hybrid memory systems composed of DRAM and NVM have the best of both worlds, because NVM can offer larger capacity and have near-zero standby power ...

research-article
SSDUP: a traffic-aware ssd burst buffer for HPC systems
Article No.: 27, pp 1–10
https://doi.org/10.1145/3079079.3079087

Many high performance computing (HPC) applications are highly data intensive. Current HPC storage systems still use hard disk drives (HDDs) as their dominant storage devices, which suffer from disk head thrashing when accessing random data. New storage ...

research-article
libPRISM: an intelligent adaptation of prefetch and SMT levels
Article No.: 28, pp 1–10
https://doi.org/10.1145/3079079.3079101

Current microprocessors include several knobs to modify the hardware behavior in order to improve performance under different workload demands. Impractical and time-consuming offline profiling is needed to evaluate the design space to find the ...

Contributors
  • University of Illinois Urbana-Champaign
  • Argonne National Laboratory

Acceptance Rates

Overall Acceptance Rate: 584 of 2,055 submissions, 28%

Year       Submitted   Accepted   Rate
ICS '21    157         39         25%
ICS '15    160         40         25%
ICS '14    160         34         21%
ICS '13    202         43         21%
ICS '06    141         37         26%
ICS '03    171         36         21%
ICS '02    144         31         22%
ICS '01    133         45         34%
ICS '00    122         33         27%
ICS '99    180         57         32%
ICS '97    135         45         33%
ICS '96    116         50         43%
ICS '95    120         49         41%
ICS '94    114         45         39%
Overall    2,055       584        28%