Distributed storage control unit for the Hitachi S-3800 multivector supercomputer
- Katsuyoshi Kitai
- Tadaaki Isobe
- Tadayuki Sakakibara
- Shigeko Yazawa
- Yoshiko Tamaki
- Teruo Tanaka
- Kouichi Ishii
This paper discusses the storage control unit of the Hitachi S-3800 supercomputer series, which is capable of achieving 8 GFLOPS in each of up to four shared-memory multiprocessors. This storage control unit is distributed to the V-SCs (vector-processor-...
A model for dataflow based vector execution
Although the dataflow model has been shown to allow the exploitation of parallelism at all levels, research of the past decade has revealed several fundamental problems: Synchronization at the instruction level, token matching, coloring and re-labeling ...
Synchronized access to streams in SIMD vector multiprocessors
The synchronized and simultaneous access to several vectors that form a single stream is typical in SIMD vector multiprocessors as well as in MIMD superscalar multiprocessors with decoupled access. In this paper we propose a block-interleaved storage ...
The privatizing DOALL test: a run-time technique for DOALL loop identification and array privatization
Current parallelizing compilers cannot identify a significant fraction of fully parallel loops because they have complex or statically insufficiently defined access patterns. For this reason, we have developed the Privatizing DOALL test—a technique for ...
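As background for this entry, a minimal sketch of array privatization, the transformation the test enables (illustrative only; this is not the paper's Privatizing DOALL test, and the loop shown is a hypothetical example): a scratch array that is always written before it is read within each iteration carries no true cross-iteration dependence, so each iteration can be given a private copy and the loop run as a DOALL.

```python
# Hypothetical example: the scratch slot t[0] looks like a shared
# dependence, but every read is preceded by a write in the same
# iteration, so t is privatizable.
def serial(a):
    t = [0.0]                    # shared scratch array
    out = [0.0] * len(a)
    for i in range(len(a)):
        t[0] = a[i] * 2.0        # write-before-read: privatizable
        out[i] = t[0] + 1.0
    return out

def privatized(a):
    # Each iteration gets its own private scratch value, so the
    # iterations are independent and could execute in parallel.
    def body(i):
        t0 = a[i] * 2.0          # private copy of the scratch
        return t0 + 1.0
    return [body(i) for i in range(len(a))]
```

Both versions compute the same result; only the privatized form is a valid DOALL.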
Reducing data communication overhead for DOACROSS loop nests
If the iterations of a loop nest cannot be partitioned into independent tasks, data communication for data dependence is inevitable in order to execute them on parallel machines. This kind of loop nest is referred to as a DOACROSS loop nest.
This paper ...
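For readers unfamiliar with the term, a minimal sketch of what makes a loop DOACROSS (a hypothetical example, not taken from the paper): iteration i reads a value written by iteration i-1, so iterations cannot run fully independently, and on a parallel machine that value must be communicated or synchronized between the processors executing the two iterations.

```python
# Hypothetical DOACROSS loop: a[i] depends on a[i-1], a cross-iteration
# (flow) dependence. Partitioning iterations across processors therefore
# forces data communication along the dependence chain.
def doacross(b):
    a = [0] * len(b)
    for i in range(1, len(b)):
        a[i] = a[i - 1] + b[i]
    return a
```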
Evaluating automatic parallelization for efficient execution on shared-memory multiprocessors
We present a parallel code generation algorithm for complete applications and a new experimental methodology that tests the efficacy of our approach. The algorithm optimizes for data locality and parallelism, reducing or eliminating false sharing. It ...
An evaluation of directory protocols for medium-scale shared-memory multiprocessors
This paper considers alternative directory protocols for providing cache coherence in shared-memory multiprocessors with 32 to 128 processors, where the state requirements of DirN may be considered too large. We consider DiriB, i=1,2,4, DirN, Tristate (...
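As background, a toy sketch in the style of a limited-pointer directory with broadcast on overflow (illustrative only; the class below is hypothetical and is not one of the protocols evaluated in the paper): the directory entry tracks up to i sharers exactly, and once the pointers overflow it falls back to broadcasting invalidations to all nodes.

```python
# Toy limited-pointer directory entry: i exact sharer pointers,
# broadcast after pointer overflow (hypothetical sketch).
class LimitedPointerDir:
    def __init__(self, i):
        self.i = i
        self.sharers = set()
        self.broadcast = False

    def add_sharer(self, node):
        if self.broadcast or node in self.sharers:
            return
        if len(self.sharers) < self.i:
            self.sharers.add(node)
        else:
            self.broadcast = True    # pointer overflow: lose precision

    def invalidate_targets(self, all_nodes):
        # On a write, invalidations go to the tracked sharers, or to
        # every node once the entry has overflowed into broadcast mode.
        return set(all_nodes) if self.broadcast else set(self.sharers)
```

The trade-off the paper studies is exactly this loss of precision: fewer pointers mean smaller directory state but more unnecessary invalidation traffic after overflow.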
An evaluation of a compiler optimization for improving the performance of a coherence directory
Both hardware-controlled and compiler-directed mechanisms have been proposed for maintaining cache coherence in large-scale shared-memory multiprocessors, but both of these approaches have significant limitations. We examine the potential performance ...
Parallelisation of the SDEM distinct element stress analysis code on the KSR-1
The SDEM code models systems of interacting blocks of rock using the distinct element (DE) method, which represents these systems as discontinuums with each block acting under Newton's laws of motion. The data structures associated with the DE method ...
Ultrasonic wave propagation on parallel machines
“ULTSON” is a 2D code which solves the elastodynamic equations in a regular structured mesh. It has been developed at EDF to be used for non-destructive testing of nuclear power plants. Today, the code runs on classical architectures like Cray (YMP or ...
An efficient approach to computing fixpoints for complex program analysis
A chief source of inefficiency in program analysis using abstract interpretation comes from the fact that a large context (i.e., problem state) is propagated from node to node during the course of an analysis. This problem can be addressed and largely ...
Optimal local register allocation for a multiple-issue machine
This paper presents an algorithm that allocates registers optimally for straight-line code running on a generic multi-issue computer. On such a machine, an optimal register allocation is one that minimizes the number of issue slots that the code ...
Scheduling reductions
In order to detect more parallelism in scientific programs, one may extract a parallelism relative to reductions. This paper presents such a method which schedules programs with explicit computations of reductions. We describe the way the reductions are ...
A dominating set model for broadcast in all-port wormhole-routed 2D mesh networks
A new model for broadcast in wormhole-routed networks is proposed. The model uses and extends the concept of dominating sets in order to systematically develop efficient broadcast algorithms for all-port wormhole-routed systems, in which each node can ...
The interaction between virtual channel flow control and adaptive routing in wormhole networks
Multiprocessor interconnection networks based on low dimensional mesh or torus topologies and employing wormhole switching have become increasingly popular. Two concepts that have been proposed to improve the performance of such networks are Virtual ...
Fault-tolerant wormhole routing in tori
We present a method to enhance wormhole routing algorithms for deadlock-free fault-tolerant routing in tori. We consider arbitrarily-located faulty blocks and assume only local knowledge of faults. Messages are routed via shortest paths when there are ...
Performance of the CM-5 scalable file system
Assessing the performance and software interactions of emerging parallel input/output systems is a critical first step in input/output software tuning. Moreover, understanding the system response to well-understood, synthetic input/output patterns is ...
Communication in the KSR1 MPP: performance evaluation using synthetic workload experiments
We have developed an automatic technique for evaluating the communication performance of massively parallel processors (MPPs). Both communication latency and the amount of communication are investigated as a function of a few basic parameters that ...
Architecture implications of high-speed I/O for distributed-memory computers
We consider the problem of high-speed I/O for a single application running on multiple nodes of a distributed-memory parallel computer. Our model is that the parallel system is connected to an I/O system that provides the interface between the internal ...
Combining static and dynamic scheduling on distributed-memory multiprocessors
Loops are a large source of parallelism for many numerical applications. An important issue in the parallel execution of loops is how to schedule them so that the workload is well balanced among the processors. Most existing loop scheduling algorithms ...
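As background for the scheduling problem this entry addresses, a sketch of one classic dynamic policy, guided self-scheduling (shown only to illustrate how chunked loop scheduling trades overhead against load balance; it is not the hybrid scheme proposed in the paper): each of p processors repeatedly grabs ceil(remaining / p) iterations, so chunks shrink toward the end of the loop and the final imbalance is small.

```python
import math

def gss_chunks(n_iters, p):
    """Chunk sizes handed out by guided self-scheduling for n_iters
    iterations on p processors (illustrative sketch)."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        c = math.ceil(remaining / p)   # next chunk: ceil(remaining / p)
        chunks.append(c)
        remaining -= c
    return chunks
```

Static schemes fix the assignment at compile time (no runtime overhead, poor balance for irregular work); dynamic schemes like this one balance at runtime at the cost of synchronization on the shared iteration counter.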
An optimal upper bound on the minimal completion time in distributed supercomputing
We first consider an MIMD multiprocessor configuration with n processors. A parallel program, consisting of n processes, is executed on this system—one process per processor. The program terminates when all processes are completed. Due to ...
Compiler techniques for maximizing fine-grain and coarse-grain parallelism in loops with uniform dependences
In this paper, an approach to the problem of exploiting parallelism within nested loops is proposed. The proposed method first finds out all the initially independent computations, and then, based on them, identifies the valid partitioning bases to ...
Data and program restructuring of irregular applications for cache-coherent multiprocessor
Applications with irregular data structures such as sparse matrices or finite element meshes account for a large fraction of engineering and scientific applications. Domain decomposition techniques are commonly used to partition these applications to ...
Nonzero structure analysis
Because the efficiency of sparse codes depends strongly on the size and structure of the input data, the peculiarities of the nonzero structures of sparse matrices must be accounted for in order to avoid unsatisfactory performance. Usually, this implies ...
Techniques to overlap computation and communication in irregular iterative applications
There are many applications in CFD and structural analysis that can be more accurately modeled using unstructured grids. Parallelization of implicit methods for unstructured grids is a difficult and important problem. This paper deals with coloring ...
Performance analysis of a synchronous, circuit-switched interconnection cached network
In many parallel applications, each computation entity (process, thread etc.) switches the bulk of its communication between a small group of other entities. We call this phenomenon switching locality. The Interconnection Cached Network (ICN) is a ...
An analysis model on nonblocking multirate broadcast networks
Designing efficient interconnection networks with powerful connecting capability remains a key issue for parallel and distributed computing systems. Much progress has been made on nonblocking broadcast networks, which can realize all one-to-many ...
Exploiting cache affinity in software cache coherence
Cache affinity is important to the performance of scalable shared memory multiprocessors. For multiprocessors without hardware cache coherence support, software cache coherence is the only alternative. Most existing software cache schemes ignore cache ...
Performance evaluation of hybrid hardware and software distributed shared memory protocols
Hardware distributed shared memory (DSM) systems efficiently support fine grain sharing of data by maintaining coherence at the level of individual cache lines and providing automatic replication in processor caches. Software DSM systems, on the other ...
Limited area numerical weather forecasting on a massively parallel computer
A data-parallel implementation on a SIMD platform of an operational numerical weather forecast model is presented. The performances of two popular numerical techniques within these models are discussed, namely finite difference (gridpoint) methods and ...
Proceedings of the 8th international conference on Supercomputing