Export Citations
No abstract available.
Architecture and compiler tradeoffs for a long instruction wordprocessor
A very long instruction word (VLIW) processor exploits parallelism by controlling multiple operations in a single instruction word. This paper describes the architecture and compiler tradeoffs in the design of iWarp, a VLIW single-chip microprocessor ...
Tradeoffs in instruction format design for horizontal architectures
With recent improvements in software techniques and the enhanced level of fine grain parallelism made available by such techniques, there has been an increased interest in horizontal architectures and large instruction words that are capable of issuing ...
Overlapped loop support in the Cydra 5
The CydraTM 5 architecture adds unique support for overlapping successive iterations of a loop to a very long instruction word (VLIW) base. This architecture allows highly parallel loop execution for a much larger class of loops than can be vectorized, ...
Architectural support for synchronous task communication
This paper describes the motivation for a set of intertask communication primitives, the hardware support of these primitives, the architecture used in the Sylvan project which studies these issues, and the experience gained from various experiments ...
The fuzzy barrier: a mechanism for high speed synchronization of processors
Parallel programs are commonly written using barriers to synchronize parallel processes. Upon reaching a barrier, a processor must stall until all participating processors reach the barrier. A software implementation of the barrier mechanism using ...
Efficient synchronization primitives for large-scale cache-coherent multiprocessors
This paper proposes a set of efficient primitives for process synchronization in multiprocessors. The only assumptions made in developing the set of primitives are that hardware combining is not implemented in the inter-connect, and (in one case) that ...
A software instruction counter
Although several recent papers have proposed architectural support for program debugging and profiling, most processors do not yet provide even basic facilities, such as an instruction counter. As a result, system developers have been forced to invent ...
Efficient debugging primitives for multiprocessors
Existing kernel-level debugging primitives are inappropriate for instrumenting complex sequential or parallel programs. These functions incur a heavy overhead in their use of system calls and process switches. Context switches are used to alternately ...
Sheaved memory: architectural support for state saving and restoration in pages systems
The concept of read-one/write-many paged memory is introduced and given the name sheaved memory. It is shown that sheaved memory is useful for efficiently maintaining checkpoints in main memory and for providing state saving and state restoration for ...
Reference history, page size, and migration daemons in local/remote architectures
We address the problem of paged main memory management in the local/remote architecture subclass of shared memory multiprocessors. We consider the case where the operating system has primary responsibility and uses page migration as its main tool. We ...
Translation lookaside buffer consistency: a software approach
We discuss the translation lookaside buffer (TLB) consistency problem for multiprocessors, and introduce the Mach shootdown algorithm for maintaining TLB consistency in software. This algorithm has been implemented on several multiprocessors, and is in ...
Failure correction techniques for large disk arrays
The ever increasing need for I/O bandwidth will be met with ever larger arrays of disks. These arrays require redundancy to protect against data loss. This paper examines alternative choices for encodings, or codes, that reliably store information in ...
A unified vector/scalar floating-point architecture
In this paper we present a unified approach to vector and scalar computation, using a single register file for both scalar operands and vector elements. The goal of this architecture is to yield improved scalar performance while broadening the range of ...
Data buffering: run-time versus compile-time support
Data-dependency, branch, and memory-access penalties are main constraints on the performance of high-speed microprocessors. The memory-access penalties concern both penalties imposed by external memory (e.g. cache) or by under utilization of the local ...
A real-time support processor for ada tasking
Task synchronization in Ada causes excessive run-time overhead due to the complex semantics of the rendezvous. To demonstrate that the speed can be increased by two orders of magnitude by using special purpose hardware, a single chip VLSI support ...
The runtime environment for Scheme, a Scheme implementation on the 88000
We are implementing a Scheme development system for the Motorola 88000. The core of the implementation is an optimizing native code compiler, together with a carefully designed runtime system. This paper describes our experiences with the 88000 as a ...
Program optimization for instruction caches
This paper presents an optimization algorithm for reducing instruction cache misses. The algorithm uses profile information to reposition programs in memory so that a direct-mapped cache behaves much like an optimal cache with full associativity and ...
Using registers to optimize cross-domain call performance
This paper describes a new technique to improve the performance of cross-domain calls and returns in a capability-based computer system. Using register optimization information obtained from the compiler, a trusted linker can minimize the number of ...
The design of nectar: a network backplane for heterogeneous multicomputers
Nectar is a “network backplane” for use in heterogeneous multicomputers. The initial system consists of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching latency of 700 nanoseconds. The system can be ...
A message driven OR-parallel machine
A message driven architecture for the execution of OR-parallel logic languages is proposed. The computational model is based on well known compilation techniques for Logic Languages. We present first the multiple binding mechanism for the OR-parallel ...
Evaluating the performance of software cache coherence
In a shared-memory multiprocessor with private caches, cached copies of a data item must be kept consistent. This is called cache coherence. Both hardware and software coherence schemes have been proposed. Software techniques are attractive because they ...
Analysis of cache invalidation patterns in multiprocessors
To make shared-memory multiprocessors scalable, researchers are now exploring cache coherence protocols that do not rely on broadcast, but instead send invalidation messages to individual caches that contain stale data. The feasibility of such directory-...
The effect of sharing on the cache and bus performance of parallel programs
Bus bandwidth ultimately limits the performance, and therefore the scale, of bus-based, shared memory multiprocessors. Previous studies have extrapolated from uniprocessor measurements and simulations to estimate the performance of these machines. In ...
Available instruction-level parallelism for superscalar and superpipelined machines
Superscalar machines can issue several instructions per cycle. Superpipelined machines can issue only one instruction per cycle, but they have cycle times shorter than the latency of any functional unit. In this paper these two techniques are shown to ...
Micro-optimization of floating-point operations
This paper describes micro-optimization, a technique for reducing the operation count and time required to perform floating-point calculations. Micro-optimization involves breaking floating-point operations into their constituent micro-operations and ...
Limits on multiple instruction issue
This paper investigates the limitations on designing a processor which can sustain an execution rate of greater than one instruction per cycle on highly-optimized, non-scientific applications. We have used trace-driven simulations to determine that ...
Recommendations
Acceptance Rates
Year | Submitted | Accepted | Rate |
---|---|---|---|
ASPLOS '19 | 351 | 74 | 21% |
ASPLOS '18 | 319 | 56 | 18% |
ASPLOS '17 | 320 | 53 | 17% |
ASPLOS '16 | 232 | 53 | 23% |
ASPLOS '15 | 287 | 48 | 17% |
ASPLOS '14 | 217 | 49 | 23% |
ASPLOS XV | 181 | 32 | 18% |
ASPLOS XIII | 127 | 31 | 24% |
ASPLOS XII | 158 | 38 | 24% |
ASPLOS X | 175 | 24 | 14% |
ASPLOS IX | 114 | 24 | 21% |
ASPLOS VIII | 123 | 28 | 23% |
ASPLOS VII | 109 | 25 | 23% |
Overall | 2,713 | 535 | 20% |