- Sponsor:
- SIGARCH
Branch folding in the CRISP microprocessor: reducing branch delay to zero
A new method of implementing branch instructions is presented. This technique has been implemented in the CRISP Microprocessor. With a combination of hardware and software techniques the execution time cost for many branches can be effectively reduced ...
An evaluation of branch architectures
Branch instructions form a significant fraction of executed instructions, and their design is thus a crucial component of any architecture. This paper examines three alternatives in the design of branch instructions: delayed vs. non-delayed branches, ...
Checkpoint repair for out-of-order execution machines
Out-of-order execution and branch prediction are two mechanisms that can be used profitably in the design of Supercomputers to increase performance. Unfortunately this means there must be some kind of repair mechanism, since situations do occur that ...
Instruction issue logic for high-performance, interruptable pipelined processors
The performance of pipelined processors is severely limited by data dependencies. In order to achieve high performance, a mechanism to alleviate the effects of data dependencies must exist. If a pipelined CPU with multiple functional units is to be used ...
Fast temporary storage for serial and parallel execution
There is an apparent conflict between the hardware requirements for fast parallel execution and the hardware requirements for fast serial execution. For example, fast vector execution is achieved by maintaining high execution concurrency over extended ...
Performance analysis and design of a logic simulation machine
The high costs associated with logic simulation of large VLSI circuits have led to the need for new computer architectures tailored to the simulation task. Such architectures have the potential for significant speed-ups over software-based logic ...
A modular systolic architecture for image convolutions
This paper describes a modular, systolic design for two-dimensional convolution which is a frequent and computationally intensive operation in low-level image processing. The design consists of a one-dimensional array of homogeneous cells, each with a ...
A template matching algorithm using optically-connected 3-D VLSI architecture
Three-dimensional VLSI (in short, 3-D VLSI) is a new device technology that is expected to realize high performance systems. In this paper, we propose an image processing architecture based on 3-D VLSI consisting of optically-connected layers. Since the ...
Mapping data flow programs on a VLSI array of processors
With the advent of VLSI, relatively large processing arrays may be realized in a single VLSI chip. Such regularly structured arrays take considerably less time to design and test, and fault-tolerance can easily be introduced into them. However, only a ...
Analytical modeling and architectural modifications of a dataflow computer
Dataflow computers are an alternative to the von Neumann architectures and are capable of exploiting the large amount of parallelism inherent in many computer applications. This paper deals with the performance analysis of the Manchester dataflow computer ...
A unified resource management and execution control mechanism for data flow machines
This paper presents a unified resource management and execution control mechanism for data flow machines. The mechanism integrates load control, depth-first execution control, cache memory control and a load balancing mechanism. All of these mechanisms ...
High performance integrated Prolog processor IPP
To realize the highest performance possible for a sequential processor, and to realize utilization of a large amount of existing software, an integrated Prolog processor (IPP) and its optimized compiler are now being developed. A tagged architecture ...
Performance studies of a parallel Prolog architecture
This paper presents a new multiprocessor architecture for the parallel execution of logic programs, developed as part of the Aquarius Project. This architecture is designed to support AND-parallelism, OR-parallelism, and intelligent backtracking. We ...
An experimental VLSI Prolog interpreter: preliminary measurements and results
This work presents the preliminary results of a project oriented to the design and VLSI implementation of a Prolog interpreter. Even though the interpretative approach is considered an inefficient way to execute high-level languages when compared to ...
Deterministic and stochastic modeling of parallel garbage collection: towards real-time criteria
The study of garbage collection for a logic programming language machine has exhibited fundamental differences with the more popular functional programming garbage collection. These differences yield behaviours that cannot be observed with classical ...
Architectural issues in designing symbolic processors in optics
This paper analyzes potential optical architectures for AI applications (such as knowledge-based systems). Our goal was to investigate architectures most suitable for implementation completely in optics. While optical computing appears to hold much ...
Rearrangeability of multistage shuffle/exchange networks
In this paper we study the rearrangeability of multistage shuffle/exchange networks. Although a theoretical lower bound of (2 log_2 N - 1) stages for rearrangeability of a network with N = 2^n inputs and outputs has been known, the sufficiency of (2 log_2 N -...
Optimized mesh-connected networks for SIMD and MIMD architectures
A class of mesh networks with wrap-around links is obtained from a class of circulant graphs by means of a graph isomorphism. We demonstrate how to obtain, from the adjacency pattern of the graph, simple parameters that serve to construct a planar ...
Performance evaluation of reduced bandwidth multistage interconnection networks
This paper presents and evaluates a class of buffered interconnection networks which provide performance and cost levels intermediate to a bus and a delta network. These networks, referred to as hybrid networks, are formed by beginning with a delta ...
Hardware support for interprocess communication
In recent years there has been increasing interest in message-based operating systems, particularly in distributed environments. Such systems consist of a small message-passing kernel supporting a collection of system server processes that provide such ...
Architecture of a message-driven processor
We propose a machine architecture for a high-performance processing node for a message-passing, MIMD concurrent computer. The principal mechanisms for attaining this goal are the direct execution and buffering of messages and a memory-based architecture ...
Effect of storage allocation/reclamation methods on parallelism and storage requirements
The write after read/write synchronizations (the anti- and output-dependence constraints) inhibit the parallelism exhibited by Fortran programs. These constraints can be avoided by allocating storage for the values generated in a program dynamically, so ...
Cache design of a sub-micron CMOS system/370
An innovative cache accessing scheme based on high MRU (most recently used) hit ratio [1] is proposed for the design of a one-cycle cache in a CMOS implementation of System/370. It is shown that with this scheme the cache access time is reduced by 30 ~ ...
An architectural perspective on a memory access controller
In this paper a CMOS memory access controller chip is described that provides the basis for achieving high-performance 68020-based (68030-based) systems. This controller matches the speed of the memory system to that of the microprocessor by providing a ...
Organization and analysis of a gracefully-degrading interleaved memory system
A hardware mechanism has been proposed to reconfigure an interleaved memory system. The reconfiguration scheme is such that, at any instant, all fault-free memory banks in the memory system are utilized in an interleaved manner. A performance metric is ...
Correct memory operation of cache-based multiprocessors
This paper shows that cache coherence protocols can implement indivisible synchronization primitives reliably and can also enforce sequential consistency. Sequential consistency provides a commonly accepted model of behavior of multiprocessors. We ...
Hierarchical cache/bus architecture for shared memory multiprocessors
A new, large scale multiprocessor architecture is presented in this paper. The architecture consists of hierarchies of shared buses and caches. Extended versions of shared bus multicache coherency protocols are used to maintain coherency among all ...
Multiprocessor cache design considerations
In this paper, cache design is explored for large high-performance multiprocessors with hundreds or thousands of processors and memory modules interconnected by a pipelined multistage network. The majority of the multiprocessor cache studies in the ...
Performance evaluation of multiple register sets
In this paper a DEC VAX with multiple register sets is evaluated under many differently sized register sets. Both the number of register sets and the number of registers per set were varied. Performance, measured in terms of memory traffic, is compared ...
Index Terms
- Proceedings of the 14th annual international symposium on Computer architecture