Critical issues in mapping neural networks on message-passing multicomputers
Connectionist models, such as artificial neural systems, offer an intrinsically concurrent computational paradigm. We investigate the architectural requirements for efficiently simulating large neural networks on a multicomputer system with thousands of ...
Multinomial conjunctoid statistical learning machines
Multinomial Conjunctoids are supervised statistical modules that learn the relationships among binary events. The multinomial conjunctoid algorithm precludes the following problems that occur in existing feedforward multi-layered neural networks: (a) ...
A bit-plane architecture for optical computing with two-dimensional symbolic substitution
A novel architecture based on optical technology is presented for constructing parallel computers. The architecture exploits optics for its ultra-high speed, massive parallelism, and dense connectivity. The processing is based on a new technique called ...
The reconfigurable arithmetic processor
The Reconfigurable Arithmetic Processor (RAP) is an arithmetic processing node for a message-passing, MIMD concurrent computer. It incorporates on one chip several serial, 64 bit floating point arithmetic units connected by a switching network. By ...
The performance potential of multiple functional unit processors
In this paper, we look at the interaction of pipelining and multiple functional units in single processor machines. When implementing a high performance machine, a number of hardware techniques may be used to improve the performance of the final system. ...
Exploiting parallel microprocessor microarchitectures with a compiler code generator
With advances in VLSI technology, microprocessor designers can provide more microarchitectural parallelism to increase performance. We have identified four major forms of such parallelism: multiple microoperations issued per cycle, multiple result ...
Analysis of memory referencing behavior for design of local memories
Memory referencing behavior is analyzed via the study of traces for the purpose of developing new local memory structures and management techniques. A novel trace processing technique called flattening reduces the dependence of the results on the ...
Performance evaluation of on-chip register and cache organizations
Chip area is a critical resource in the design of VLSI processors. There are many different alternative designs that could fill this chip area. This paper compares several different local memory organizations applicable for single-chip processors. ...
On the inclusion properties for multi-level cache hierarchies
The inclusion property is essential in reducing the cache coherence complexity for multiprocessors with multilevel cache hierarchies. We give some necessary and sufficient conditions for imposing the inclusion property for fully- and set-associative ...
A simulation study of two-level caches
We report on a trace-driven simulation study to examine the effect of a two-level cache hierarchy in uniprocessors. A simulation model of a multiple-cycle-per-instruction processor was constructed to estimate the total cycles required to execute a ...
Hyperswitch network for the hypercube computer
The performance of a parallel algorithm depends in large part on the interconnection topology of the multicomputer system. The method presented in this paper realizes a kind of interconnection network, called a hyperswitch network, that is achieved ...
Analysis of bus hierarchies for multiprocessors
In order to build large shared-memory multiprocessor systems that take advantage of current hardware-enforced cache coherence protocols, an interconnection network is needed that acts logically as a single bus while avoiding the electrical loading ...
Extra group network: a cost-effective fault-tolerant multistage interconnection network
This paper introduces a new class of fault-tolerant multistage interconnection networks, dubbed Extra Group Networks (EGNs). An EGN-m of size N is designed to have m + 1 unique-path multistage networks of size N/m. This approach of constructing the ...
A partial-multiple-bus computer structure with improved cost effectiveness
This paper addresses the design and performance analysis of partial-multiple-bus interconnection networks. One such structure, called processor-oriented partial-multiple-bus (or PPMB), is proposed. It serves as an alternative to the conventional ...
Flagship: a parallel architecture for declarative programming
The Flagship project aims to produce a computing technology based on the declarative style of programming. A major component of that technology is the design for a parallel machine which can efficiently exploit the implicit parallelism in declarative ...
Toward a dataflow/von Neumann hybrid architecture
Dataflow architectures offer the ability to trade program level parallelism in order to overcome machine level latency. Dataflow further offers a uniform synchronization paradigm, representing one end of a spectrum wherein the unit of scheduling is a ...
Resource requirements of dataflow programs
Parallel execution of programs requires more resources and more complex resource management than sequential execution. If concurrent tasks can be spawned dynamically, programs may require an inordinate amount of resources when the potential parallelism ...
Priority-driven, preemptive I/O controllers for real-time systems
Current I/O controller architectures inhibit the use of priority-driven preemptive scheduling algorithms that can guarantee hard deadlines in real-time systems. This paper examines the effect of three I/O controller architectures upon schedulable ...
A kernel-independent, pipelined architecture for real-time 2-D convolution
Existing architectures for 2-D convolution suffer from such drawbacks as inflexibility with respect to image and/or kernel sizes (systolic arrays) or data distribution and collection overhead (SIMD processor arrays). This paper introduces a pipelined ...
Exploiting bit level concurrency in real-time geometric feature extractions
Geometric feature extraction can be characterized as a computationally intensive task in the environment of real-time automated vision systems requiring algorithms with a high degree of parallelism and pipelining under the raster-scan I/O constraint. ...
Measuring VAX 8800 performance with a histogram hardware monitor
This paper reports the results of a study of VAX 8800 processor performance using a hardware monitor that collects histograms of the processor's micro-PC and memory bus status. The monitor keeps a count of all machine cycles executed at each micro-PC ...
Multiprocessor cache analysis using ATUM
The design of high-performance multiprocessor systems necessitates a careful analysis of the memory system performance of parallel programs. Lacking multiprocessor address traces, previous multiprocessor performance studies using analytical models had ...
Trade-offs between devices and paths in achieving disk interleaving
There is a continuing need to improve the performance of disk subsystems, and one of the key factors of a disk subsystem's performance is the data transfer rate. While it is clear that increasing the data transfer rate would reduce the service time for ...
Design of a concurrent computer for solving systems of linear equations
In this paper we describe the design of a systolic array of Householder processor elements, which is dedicated to the solution of large (dense) systems of linear equations. The array is capable of executing two different algorithms. One for the solution ...
The white dwarf: a high-performance application-specific processor
This paper presents the design and implementation of a high-performance special-purpose processor, called The White Dwarf, for accelerating finite element analysis algorithms. The White Dwarf CPU contains two Am29325 32-bit floating-point processors and ...
Solving partial differential equations in a data-driven multiprocessor environment
Partial differential equations can be found in a host of engineering and scientific problems. The emergence of new parallel architectures has spurred research in the definition of parallel PDE solvers. Concurrently, highly programmable systems such as ...
Scrambled storage for parallel memory systems
A scrambled storage scheme is proposed for storing arrays of N × N elements in N = 2^n parallel memory modules to allow conflict-free access to various array partitions. It is shown that the scheme allows conflict-free access to rows, columns, square ...
The architecture of a Linda coprocessor
We describe the architecture of a coprocessor that supports the communication primitives of the Linda parallel programming environment in hardware. The coprocessor is a critical element in the architecture of the Linda Machine, an MIMD parallel ...
Deadlock avoidance for systolic communication
Under the systolic communication model, each cell (or processor) in a parallel processing system can operate directly on data residing at the cell's input queues and move computed results directly to the cell's output queues. Incoming and outgoing ...
Cache performance of vector processors
An instruction-level simulator for the IBM 3090 with VF (vector facility) has been developed for studying the performance of vector processors and their memory hierarchies. Initial use of the simulator is to understand the program locality of real ...