Microprocessors and Microsystems

Volume 33, Issues 7–8, October–November 2009, Pages 430-440
Hiding message delivery latency using Direct-to-Cache-Transfer techniques in message passing environments

https://doi.org/10.1016/j.micpro.2009.07.001

Abstract

Communication overhead is the key obstacle to reaching hardware performance limits. Most of this overhead is software-related, and a significant portion of it is attributed to message copying. To reduce this copying overhead, we have devised techniques that do not require copying a received message in order to bind it to its final destination. Instead, a late-binding mechanism, which involves address translation and a dedicated cache, facilitates fast access to received messages by the consuming process or thread.

We have introduced two policies, namely Direct to Cache Transfer (DTCT) and lazy DTCT, that determine whether a message, once bound, needs to be transferred into the data cache. We have studied the proposed methods in simulation and have shown their effectiveness in reducing the consuming process's access times to message payloads.

Introduction

Parallel architectures promise increased computing power and scalability for computation-intensive applications. Low-latency, high-bandwidth communication and the sharing of data among processing elements are critical to obtaining high performance in these environments. The raw bandwidth of networks has increased significantly in the past few years, and networking hardware supporting bandwidths on the order of gigabits per second has become available. However, the communication latency of traditional networking architectures and protocols has not kept pace with the growth in raw interconnect capacity or in processor performance [1], [2].

In addition to the above-mentioned hardware trends, software makes the situation worse. Applications issue a system call when they need to send (or receive) a message to (or from) the network in a cluster system. The layered nature of legacy networking software, with its expensive system calls and extra memory-to-memory copies, profoundly affects communication-subsystem performance as seen by running applications.

In traditional software messaging layers, there are usually four message-copying operations between the send buffer and the receive buffer, as shown in Fig. 1: from the sender's buffer to a system buffer, and from the system buffer to the network interface (NI) buffer; then, after the message crosses the network, from the NI buffer to a system buffer on the receive side, and from the system buffer to the receive buffer once the receive call is posted. Below, we elaborate further on the send and receive mechanisms.
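The four-copy path above can be sketched as a toy model. The buffer names and the copy counter are purely illustrative, not part of any real messaging layer:

```python
# Toy model of the traditional four-copy send/receive path. Each bytes()
# call stands in for one memory-to-memory copy; names are illustrative.

def traditional_send_receive(payload: bytes) -> tuple[bytes, int]:
    """Move a message from sender to receiver, counting copies."""
    copies = 0

    # Sender side: user send buffer -> system buffer
    system_buf_tx = bytes(payload); copies += 1
    # System buffer -> network interface (NI) buffer
    ni_buf_tx = bytes(system_buf_tx); copies += 1

    # ... the message crosses the network ...
    ni_buf_rx = ni_buf_tx

    # Receiver side: NI buffer -> system buffer
    system_buf_rx = bytes(ni_buf_rx); copies += 1
    # System buffer -> user receive buffer (once the receive is posted)
    recv_buf = bytes(system_buf_rx); copies += 1

    return recv_buf, copies

data, n = traditional_send_receive(b"hello")
# n == 4: two copies on the send side, two on the receive side
```

Zero-copy schemes such as the one proposed in this paper aim to eliminate the receive-side copies in particular, since those sit on the critical path of the consuming process.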

Different methods can be employed to move a message from a source location to the network, using either Direct Memory Access (DMA) transfers or regular copy operations. Owing to their better performance, contemporary systems leverage DMA techniques to transfer data between a processor's memory system and a Network Interface Card (NIC). These transfers, however, affect the cache subsystem, which experiences an increased number of cache misses that in turn disturb the working set of the running application. Given the increased gap between memory and processor performance, any additional cache misses further degrade system performance.

I/O operations directly involve memory subsystems through accessing destination or source buffers, or indirectly by manipulating message descriptors. The cache subsystem, which is affected by the execution of communication protocols, experiences misses that further affect the working set of the running application. In turn, the gap between processor and memory performance makes the cache misses expensive.

To overcome the aforementioned problems, various techniques have been proposed to enhance some aspects of communication performance. These include improving network performance, providing more efficient communication software layers, designing dedicated message processors, providing bulk data transfers, off-loading or on-loading communication protocol processing, integrating the NIC and the processor on the same chip, and using a cache to hold messages.

This paper is organized as follows. Section 2 discusses the background and related work; Section 3 presents the architectural extensions introduced to support efficient processing of sending and receiving messages in message passing environments. The implementation of the network cache is presented in Section 4; Section 5 describes the Direct-to-Cache data transfer policies. Sections 6 and 7 discuss the simulation environment, our assumptions, and the implementation; Section 8 discusses the obtained results, and the conclusions are presented in Section 9. Finally, we wrap up the paper with future work in Section 10.

Section snippets

Background and related work

High performance computing is increasingly concerned with efficient communication across the interconnect. System Area Networks (SANs), such as Myrinet [3], Quadrics [4], and InfiniBand [5], provide high bandwidth and low latency, while several user-level messaging techniques have removed the operating system kernel and protocol stack from the critical path of communications [6], [7]. A significant portion of the software communication overhead is attributed to message copying. Traditional …

The proposed architectural extension

The main goal of this work is to propose techniques to achieve zero-copy communication in message passing environments. A network cache has been proposed [20] in the processor architecture to transfer the received data near the place where it will be consumed.

Ideally, if the data destined to be consumed by a process or thread has been cached, that process or thread will encounter minimal delay in accessing it. Our aim is to introduce architectural extensions that will facilitate the …

Network cache implementation

The network cache mechanism presented in Section 3 (and in Fig. 2) calls for searches based on a network tag as long as the received message has not been bound to the receiving process. During binding, the message is identified through its message ID, while after the message is bound, it is identified through its process tag. These identifiers (that is, the network tag, the message ID and the process tag) need to be searchable. We have not specified the organization of the network cache, the implication …
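As a rough illustration of this lookup scheme, the sketch below models a network cache whose lines are found by network tag before binding, bound through their message ID, and found by process tag afterwards. All class and method names are hypothetical, and the linear search stands in for whatever associative organization the hardware would actually use:

```python
# Hypothetical software model of the network-cache lookup described above.
# Not the paper's implementation; names and structure are illustrative.

class NetworkCacheLine:
    def __init__(self, network_tag, message_id, payload):
        self.network_tag = network_tag
        self.message_id = message_id
        self.process_tag = None   # filled in at binding time
        self.payload = payload
        self.bound = False

class NetworkCache:
    def __init__(self):
        self.lines = []

    def insert(self, network_tag, message_id, payload):
        """A message arriving from the network allocates a line."""
        self.lines.append(NetworkCacheLine(network_tag, message_id, payload))

    def bind(self, message_id, process_tag):
        """Late binding: attach the consuming process's tag to the line."""
        for line in self.lines:
            if not line.bound and line.message_id == message_id:
                line.process_tag = process_tag
                line.bound = True
                return line
        return None

    def lookup(self, key):
        """Unbound lines match on network tag; bound lines on process tag."""
        for line in self.lines:
            tag = line.process_tag if line.bound else line.network_tag
            if tag == key:
                return line.payload
        return None
```

The point of the model is that the same line is reached through different identifiers over its lifetime, so the cache must support searches on all three without moving the payload.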

Data transfer policies

This section elaborates on the handling of arrived messages using the proposed extension. Specifically, two policies are introduced that determine when a message is to be bound and whether it is sent to the data cache. These policies are called Direct to Cache Transfer (DTCT) and lazy DTCT.
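The distinction between the two policies can be sketched as follows, with plain dictionaries standing in for the hardware network and data caches. The function names, and the transfer-on-first-access behaviour of lazy DTCT, are our illustrative reading of the policies rather than the paper's implementation:

```python
# Illustrative sketch of the two policies. Under DTCT a message moves into
# the data cache as soon as it is bound; under lazy DTCT it stays in the
# network cache until the consuming process actually touches it.

def bind_message(policy, network_cache, data_cache, msg_id, proc_tag):
    """Bind an arrived message (keyed by msg_id) to a consuming process."""
    payload = network_cache.pop(msg_id)
    if policy == "DTCT":
        data_cache[proc_tag] = payload            # eager transfer at bind time
    else:  # lazy DTCT: keep the bound line in the network cache
        network_cache[("bound", proc_tag)] = payload

def read_message(policy, network_cache, data_cache, proc_tag):
    """First access by the consuming process."""
    if proc_tag in data_cache:
        return data_cache[proc_tag]               # DTCT already moved it
    # lazy DTCT: transfer into the data cache on first access
    payload = network_cache.pop(("bound", proc_tag))
    data_cache[proc_tag] = payload
    return payload
```

Lazy DTCT avoids polluting the data cache with messages that are bound but never (or only much later) read, at the cost of a transfer on the first access.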

Methodology and simulation environment

The SimpleScalar infrastructure was selected to model and verify the proposed architecture. The SimpleScalar toolset provides an infrastructure for simulation and architectural modeling; it can model a variety of platforms ranging from simple unpipelined processors to detailed dynamically scheduled microarchitectures with multi-level memory hierarchies [25]. This infrastructure provides cycle-accurate simulation, which allows for a precise notion of time in running parallel …

Implementation

As stated previously, the SimpleScalar infrastructure was selected to model and verify the proposed architecture because of its cycle-accurate simulation. Owing to the large computational cost of simulating a complete cluster environment, we elected to simulate only one processor, with the network traffic provided by a separate entity. Thus, the simulation environment incorporates two threads working simultaneously.

Results

After preparing the environment and importing the information as discussed in the previous section into our simulator, we implemented our Network Processor extension and tested the effectiveness of this extension in handling short messages.

Our simulator is a modified version of sim-outorder from the SimpleScalar suite and runs on a dual Xeon processor at 2.4 GHz. As we are concerned with short messages (of a length identical to or less than that of a network cache line), we used the …

Conclusions and discussion

In this work we presented the results of the evaluation of a network processor extension specifically targeted to decreasing the message reception latency in an MPI environment. Our focus was to study the impact of the DTCT approaches on the message access time. In addition, we investigated the impact of the messaging on the data cache as well as the required size and organization of the data and network caches.

The simulations showed that by using the proposed network extension along with the …

Future work

This work has dealt with messages of a length identical to that of a cache line. As future work, we plan to extend the network cache to accommodate larger messages by dividing them into blocks. An annotating mechanism can be added to the network cache to keep track of messages that span more than one cache line. Whether a message kept inside the network cache should be treated as a whole message or as separate cache lines would need to be addressed. This mechanism and the associated …

References (29)

  • S. Hioki, Construction of staples in lattice gauge theory on a parallel computer, Parallel Computing (1996)
  • G.E. Moore, Cramming more components onto integrated circuits, Electronics (1965)
  • D.A. Patterson, Latency lags bandwidth, Communications of the ACM (2004)
  • N.J. Boden et al., Myrinet: a gigabit-per-second local area network, IEEE Micro (1995)
  • Quadrics Interconnect Homepage, ...
  • I.T. Association, InfiniBand Architecture Specification, ...
  • C. Dubnicki, A. Bilas, Y. Chen, S. Damianakis, K. Li, VMMC-2: efficient support for reliable connection-oriented...
  • S.H. Rodrigues, T.E. Anderson, D.E. Culler, High-performance local area communication with fast sockets, in: ...
  • M. Welsh et al., Memory management for user-level network interfaces, IEEE Micro (1998)
  • RDMA Consortium, Architectural Specifications for RDMA over TCP/IP, ...
  • M. Banikazemi et al., MPI-LAPI: an efficient implementation of MPI for IBM RS/6000 SP systems, IEEE Transactions on Parallel and Distributed Systems (2001)
  • H. Chu, Zero-copy TCP in Solaris, in: Proceedings of the USENIX Annual Technical Conference, San Diego, USA, January ...
  • M. Rangarajan, A. Bohra, TCP Servers: Offloading TCP Processing in Internet Servers. Design, Implementation and...
  • D. Dunning et al., The virtual interface architecture, IEEE Micro (1998)
    Farshad Khunjush (Khun Jush) received the B.Sc. and M.Sc. degrees in Computer Engineering from Shiraz University, Shiraz, Iran, in 1991 and 1995, respectively, and a Ph.D. degree in Electrical & Computer Engineering from the University of Victoria, Victoria, BC, Canada, in 2008. From 1995 to 2001, he was an Instructor at the Department of Electrical Engineering, Hormozgan University, BandarAbbas, Iran. He was a Post-Doctoral fellow at the Laboratory for Parallel and Intelligent Systems (LAPIS), Victoria, BC, Canada, from 2008 to 2009. He is currently an Assistant Professor in the Department of Computer Science and Engineering at Shiraz University, Shiraz, Iran.

    His research interests include Multi-Core & Parallel Computer Architectures, Multi-Core & Parallel Programming Paradigms, and High-Performance Interconnection Networks. Dr. Khunjush is a Member of the IEEE and ACM.

    Nikitas J. Dimopoulos received the B.Sc. degree in Physics from the University of Athens and the M.Sc. and Ph.D. degrees in Electrical Engineering from the University of Maryland, College Park, in 1975, 1976 and 1980, respectively. He joined the Department of Electrical and Computer Engineering, University of Victoria (UVic) in 1988, where he is currently Professor and Lansdowne Chair in Computer Engineering. He served as the Chair of the Department (1998–2003 and 2005–2008) and was Visiting Professor at the Computer Engineering Laboratory, Delft University of Technology (2001). Prior to his appointment at UVic, he was Assistant and then Associate Professor at the Department of Electrical Engineering, Concordia University (1980–1987) and member of the Technical Staff at the Jet Propulsion Laboratory, Pasadena, CA (1986–1987).

    He is an Associate Editor of the Journal of Circuits, Systems and Computers, and was a member of the BC Science Council's Computers and Computing Committee (1990–1995), the Canadian Foundation for Innovation Multidisciplinary Adjudication Committees (2004, 2006), the Expert Panel for the National Centres of Excellence in Commercialization and Research (2008) and external review panels for several ECE Departments in Canada (Queen's – 2000, Manitoba – 2008, Western Ontario – 2008).

    His fields of interest are in parallel computers, the Grid, power aware computing and neural networks. His research has been funded by NSERC, the Canadian Cable Labs Fund, ASI, CFI and CMC. He has published over 150 works in refereed journals and conferences. Professor Dimopoulos is a Senior Member of the IEEE and Fellow of the Engineering Institute of Canada.
