- Sponsor: SIGARCH
Using hybrid branch predictors to improve branch prediction accuracy in the presence of context switches
Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, ...
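The snippet above motivates hybrid prediction without showing the mechanism. As a minimal sketch of a generic tournament-style hybrid predictor (table sizes and component choices here are illustrative, not taken from the paper), a chooser table of saturating counters selects per branch between a bimodal predictor and a global-history gshare predictor:

```python
class TwoBitCounter:
    """Standard 2-bit saturating counter, initialized weakly not-taken."""
    def __init__(self):
        self.state = 1

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)


class HybridPredictor:
    """Sketch of a tournament predictor: a chooser table picks, per branch,
    between a bimodal component (PC-indexed) and a gshare component
    (PC XOR global-history indexed)."""
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.bimodal = [TwoBitCounter() for _ in range(1 << bits)]
        self.gshare = [TwoBitCounter() for _ in range(1 << bits)]
        self.chooser = [TwoBitCounter() for _ in range(1 << bits)]
        self.history = 0  # global branch history register

    def predict(self, pc):
        i = pc & self.mask
        g = (pc ^ self.history) & self.mask
        use_gshare = self.chooser[i].predict()
        return self.gshare[g].predict() if use_gshare else self.bimodal[i].predict()

    def update(self, pc, taken):
        i = pc & self.mask
        g = (pc ^ self.history) & self.mask
        p_bim = self.bimodal[i].predict()
        p_gsh = self.gshare[g].predict()
        # Train the chooser toward whichever component was correct,
        # but only when the two components disagree.
        if p_bim != p_gsh:
            self.chooser[i].update(p_gsh == taken)
        self.bimodal[i].update(taken)
        self.gshare[g].update(taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

The chooser is the key to context-switch resilience studied by such papers: after state is disturbed, the chooser can fall back on whichever component recovers accuracy first.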
An analysis of dynamic branch prediction schemes on system workloads
Recent studies of dynamic branch prediction schemes rely almost exclusively on user-only simulations to evaluate performance. We find that an evaluation of these schemes with user and kernel references often leads to different conclusions. By analyzing ...
Correlation and aliasing in dynamic branch predictors
Previous branch prediction studies have relied primarily upon the SPECint89 and SPECint92 benchmarks for evaluation. Most of these benchmarks exercise a very small amount of code. As a consequence, the resources required by these schemes for accurate ...
Decoupled hardware support for distributed shared memory
This paper investigates hardware support for fine-grain distributed shared memory (DSM) in networks of workstations. To reduce design time and implementation cost relative to dedicated DSM systems, we decouple the functional hardware components of DSM ...
MGS: a multigrain shared memory system
Parallel workstations, each comprising 10-100 processors, promise cost-effective general-purpose multiprocessing. This paper explores the coupling of such small- to medium-scale shared memory multiprocessors through software over a local area network to ...
COMA: an opportunity for building fault-tolerant scalable shared memory multiprocessors
Due to the increasing number of their components, Scalable Shared Memory Multiprocessors (SSMMs) have a very high probability of experiencing failures. Tolerating node failures therefore becomes very important for these architectures particularly if ...
Evaluation of design alternatives for a multiprocessor microprocessor
In the future, advanced integrated circuit processing and packaging technology will allow for several design options for multiprocessor microprocessors. In this paper we consider three architectures: shared-primary cache, shared-secondary cache, and ...
Memory bandwidth limitations of future microprocessors
This paper makes the case that pin bandwidth will be a critical consideration for future microprocessors. We show that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Using a ...
Missing the memory wall: the case for processor/memory integration
Current high performance computer systems use complex, large superscalar CPUs that interface to the main memory through a hierarchy of caches and interconnect systems. These CPU-centric designs invest a lot of power and chip area to bridge the widening ...
Don't use the page number, but a pointer to it
Most newly announced high performance microprocessors support 64-bit virtual addresses and the width of physical addresses is also growing. As a result, the size of the address tags in the L1 cache is increasing. The impact on chip area is ...
The difference-bit cache
The difference-bit cache is a two-way set-associative cache with an access time smaller than that of a conventional two-way cache and close to or equal to that of a direct-mapped cache. This is achieved by noticing that the two tags for a set have to differ ...
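Since the two tags in a set must differ in at least one bit, way selection can in principle be driven by a single stored bit position rather than two full tag comparisons. A toy sketch of that idea (the data structure and names are illustrative, not the paper's design):

```python
def first_diff_bit(tag_a, tag_b):
    """Index of the lowest bit where two (distinct) tags differ."""
    x = tag_a ^ tag_b
    assert x != 0, "the two tags in a set must differ"
    return (x & -x).bit_length() - 1


class DifferenceBitSet:
    """One set of a hypothetical two-way difference-bit cache: a single
    stored bit position selects the way, so way selection is as fast as a
    direct-mapped index; the full tag compare only confirms hit or miss."""
    def __init__(self, tag0, tag1):
        self.tags = [tag0, tag1]
        self.diff_bit = first_diff_bit(tag0, tag1)
        # Record which way holds a 1 at the difference-bit position.
        self.way_of_one = 0 if (tag0 >> self.diff_bit) & 1 else 1

    def lookup(self, tag):
        # Fast way select: examine one bit of the incoming tag.
        if (tag >> self.diff_bit) & 1:
            way = self.way_of_one
        else:
            way = 1 - self.way_of_one
        # Full comparison, off the way-selection critical path.
        return way, self.tags[way] == tag
```

Note that a miss can still route to either way; the full tag comparison catches it, which is why only way selection, not hit detection, is accelerated.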
Understanding application performance on shared virtual memory systems
Many researchers have proposed interesting protocols for shared virtual memory (SVM) systems, and demonstrated performance improvements on parallel programs. However, there is still no clear understanding of the performance potential of SVM systems for ...
Application and architectural bottlenecks in large scale distributed shared memory machines
Many of the programming challenges encountered in small to moderate-scale hardware cache-coherent shared memory machines have been extensively studied. While work remains to be done, the basic techniques needed to efficiently program such machines have ...
Increasing cache port efficiency for dynamic superscalar microprocessors
The memory bandwidth demands of modern microprocessors require the use of a multi-ported cache to achieve peak performance. However, multi-ported caches are costly to implement. In this paper we propose techniques for improving the bandwidth of a single ...
High-bandwidth address translation for multiple-issue processors
In an effort to push the envelope of system performance, microprocessor designs are continually exploiting higher levels of instruction-level parallelism, resulting in increasing bandwidth demands on the address translation mechanism. Most current ...
DCD—disk caching disk: a new approach for boosting I/O performance
This paper presents a novel disk storage architecture called DCD, Disk Caching Disk, for the purpose of optimizing I/O performance. The main idea of the DCD is to use a small log disk, referred to as cache-disk, as a secondary disk cache to optimize ...
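The core idea named in the snippet, using a small log disk as a secondary disk cache, can be modeled in a few lines: random small writes become sequential appends to the cache-disk, and an idle-time destage later moves the newest copy of each block to its home location. A toy model (all names and structures are illustrative, not the paper's implementation):

```python
class DCD:
    """Toy model of the Disk Caching Disk idea: writes are appended
    sequentially to a log (the cache-disk), and a background destage
    copies the latest version of each block to the data disk."""
    def __init__(self):
        self.data_disk = {}   # block -> data, in-place home locations
        self.cache_disk = []  # append-only log of (block, data) records
        self.log_index = {}   # block -> newest data still in the log

    def write(self, block, data):
        # Fast path: a sequential append, regardless of block address.
        self.cache_disk.append((block, data))
        self.log_index[block] = data

    def read(self, block):
        # The newest copy of a block may still live in the log.
        if block in self.log_index:
            return self.log_index[block]
        return self.data_disk.get(block)

    def destage(self):
        # Idle time: write latest versions in place, then reclaim the log.
        self.data_disk.update(self.log_index)
        self.cache_disk.clear()
        self.log_index.clear()
```

The win comes from the write path: many scattered small writes cost one sequential log write each, and only the final version of each block pays the random-write cost at destage time.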
Polling watchdog: combining polling and interrupts for efficient message handling
Parallel systems supporting multithreading, or message passing in general, have typically used either polling or interrupts to handle incoming messages. Neither approach is ideal; either may lead to excessive overheads or message-handling latencies, ...
Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor
Simultaneous multithreading is a technique that permits multiple independent threads to issue multiple instructions each cycle. In previous work we demonstrated the performance potential of simultaneous multithreading, based on a somewhat idealized ...
Evaluation of multithreaded uniprocessors for commercial application environments
As memory speeds grow at a considerably slower rate than processor speeds, memory accesses are starting to dominate the execution time of processors, and this will likely continue into the future. This trend will be exacerbated by growing miss rates due ...
Performance comparison of ILP machines with cycle time evaluation
Many studies have investigated performance improvement through exploiting instruction-level parallelism (ILP) with a particular architecture. Unfortunately, these studies indicate performance improvement using the number of cycles that are required to ...
Rotating combined queueing (RCQ): bandwidth and latency guarantees in low-cost, high-performance networks
Network service guarantees not only provide significant performance benefits to distributed computing systems (more balanced resource utilization, fast fault recovery, and fair network access), but they are also essential for many new applications ...
A router architecture for real-time point-to-point networks
Parallel machines have the potential to satisfy the large computational demands of emerging real-time applications. These applications require a predictable communication network, where time-constrained traffic requires bounds on latency or throughput ...
Coherent network interfaces for fine-grain communication
Historically, processor accesses to memory-mapped device registers have been marked uncachable to ensure their visibility to the device. The ubiquity of snooping cache coherence, however, makes it possible for processors and devices to interact with ...
Informing memory operations: providing memory performance feedback in modern processors
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the ...
Instruction prefetching of systems codes with layout optimized for reduced cache misses
High-performing on-chip instruction caches are crucial to keep fast processors busy. Unfortunately, while on-chip caches are usually successful at intercepting instruction fetches in loop-intensive engineering codes, they are less able to do so in large ...
Compiler and hardware support for cache coherence in large-scale multiprocessors: design considerations and performance study
In this paper, we study a hardware-supported, compiler directed (HSCD) cache coherence scheme, which can be implemented on a large-scale multiprocessor using off-the-shelf microprocessors, such as the Cray T3D. It can be adapted to various cache ...
Early experience with message-passing on the SHRIMP multicomputer
- Edward W. Felten,
- Richard D. Alpert,
- Angelos Bilas,
- Matthias A. Blumrich,
- Douglas W. Clark,
- Stefanos N. Damianakis,
- Cezary Dubnicki,
- Liviu Iftode,
- Kai Li
The SHRIMP multicomputer provides virtual memory-mapped communication (VMMC), which supports protected, user-level message passing, allows user programs to perform their own buffer management, and separates data transfers from control transfers so that ...
STiNG: a CC-NUMA computer system for the commercial marketplace
"STiNG" is a Cache Coherent Non-Uniform Memory Access (CC-NUMA) Multiprocessor designed and built by Sequent Computer Systems, Inc. It combines four processor Symmetric Multi-processor (SMP) nodes (called Quads), using a Scalable Coherent Interface (SCI)...
Index Terms
- Proceedings of the 23rd annual international symposium on Computer architecture