Journal of Systems and Software

Volume 95, September 2014, Pages 217-230

Xen2MX: High-performance communication in virtualized environments

https://doi.org/10.1016/j.jss.2014.04.036

Highlights

  • We discuss current methods of network access in virtualization platforms.

  • We identify key design choices for a VM-aware cluster interconnection protocol.

  • We introduce Xen2MX, a high-performance interconnection protocol for virtualized environments.

  • Xen2MX semantically enriches the guest-to-host communication.

  • Xen2MX is able to saturate a 10 Gbps link without requiring specialized hardware.

Abstract

Cloud computing infrastructures provide vast processing power and host a diverse set of computing workloads, ranging from service-oriented deployments to high-performance computing (HPC) applications. As HPC applications scale to a large number of VMs, providing near-native network I/O performance to each peer VM is an important challenge. In this paper we present Xen2MX, a paravirtual interconnection framework over generic Ethernet, binary compatible with Myrinet/MX and wire compatible with MXoE. Xen2MX combines the zero-copy characteristics of Open-MX with Xen's memory sharing techniques. Experimental evaluation of our prototype implementation shows that Xen2MX achieves nearly the same raw performance as Open-MX running in a non-virtualized environment. On the latency front, Xen2MX achieves 96% of the performance observed when no virtualization layers are present. Regarding throughput, Xen2MX saturates a 10 Gbps link, achieving 1159 MB/s, compared to 1192 MB/s in the non-virtualized case. Xen2MX also scales efficiently with the number of VMs, saturating the link even for smaller messages when 40 single-core VMs put pressure on the network adapters.

Introduction

Modern cloud data centers provide flexibility, dedicated execution, and isolation to a vast number of service-oriented applications (e.g., high-availability web services and core network services such as mail and DNS servers). These infrastructures, built on clusters of multicores, offer huge processing power, which makes them ideal for the mass deployment of compute-intensive applications. In the HPC context, applications often scale to a large number of nodes, creating the need for a high-performance interconnect that provides low-latency and high-bandwidth communication. Unfortunately, in the cloud context, I/O-intensive applications suffer from poor performance (Nanos et al., 2010, Santos et al., 2008, Youseff et al., 2006), due to various intermediate layers that abstract away the physical characteristics of the underlying hardware and multiplex the application's access to I/O resources. This limitation is one of the most important reasons that HPC applications are not widely deployed in virtualized environments (Simons and Buell, 2010).

Numerous studies both in native (Geoffray, 2001, Koukis et al., 2010) and virtualized environments (Diakhaté et al., 2008, Dong et al., 2009, Liu et al., 2006a, Menon et al., 2006, Nanos et al., 2010, Nanos et al., 2011, Nanos and Koziris, 2009, Nanos and Koziris, 2012, PCI, 2007, Youseff et al., 2006) explore the implications of alternative data-paths that increase the system's I/O throughput, helping applications overcome significant bottlenecks in data retrieval from storage or network devices. However, near-native I/O performance for Virtual Machines (VMs) in a generic cloud environment, built from off-the-shelf components, is still far from being achieved. One of the most important reasons for this shortcoming is the I/O interfaces provided by hypervisors and hardware: software approaches appear too intrusive (Ben-Yehuda et al., 2009, Landau et al., 2010, Liu et al., 2006b, Menon et al., 2006), while hardware approaches need specialized adapters (Auernhammer and Sagmeister, 2010, Dong et al., 2008, Mansley et al., 2007, Raj and Schwan, 2007).

I/O operations in virtualized environments are handled either by software layers within the hypervisor or by hardware extensions provided by specialized Network Interface Cards (NICs) (Fig. 1). The software layers are implemented either by (a) emulating device operations or by (b) the split driver model, provided by the paravirtualization (PV) concept (Whitaker et al., 2002). The hardware approach, case (c), is based on the device assignment mechanism, where the hypervisor allows the VM to interact with the hardware directly. Device assignment outperforms the previous methods in terms of both bandwidth and latency (Nanos et al., 2010, Yassour et al., 2008).
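
To make the split driver model concrete, the following sketch models, entirely in user space, the kind of shared request ring that a paravirtual frontend and backend communicate over: the frontend publishes request descriptors behind a producer index and the backend drains them. All names (xmx_req, xmx_ring, RING_SLOTS) are illustrative and not part of the Xen API; in a real driver the ring lives in a page shared between guest and host, and an event-channel notification replaces the polling shown here.

    /*
     * Minimal single-producer/single-consumer ring, modelling the split
     * driver model in user space.  Illustrative only; not Xen code.
     */
    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define RING_SLOTS 64                 /* must be a power of two */

    struct xmx_req {                      /* hypothetical request descriptor */
        uint32_t id;                      /* request identifier              */
        uint32_t len;                     /* payload length in bytes         */
        uint64_t gref;                    /* stand-in for a grant reference  */
    };

    struct xmx_ring {                     /* shared between "guest" and "host" */
        _Atomic uint32_t prod;            /* written by the frontend */
        _Atomic uint32_t cons;            /* written by the backend  */
        struct xmx_req slot[RING_SLOTS];
    };

    /* Frontend side: enqueue a request; returns 0 on success, -1 if full. */
    static int frontend_post(struct xmx_ring *r, const struct xmx_req *req)
    {
        uint32_t prod = atomic_load_explicit(&r->prod, memory_order_relaxed);
        uint32_t cons = atomic_load_explicit(&r->cons, memory_order_acquire);

        if (prod - cons == RING_SLOTS)
            return -1;                    /* ring full: caller backs off */

        r->slot[prod & (RING_SLOTS - 1)] = *req;
        /* Publish the slot contents before bumping the producer index. */
        atomic_store_explicit(&r->prod, prod + 1, memory_order_release);
        /* A real frontend would now kick the backend via an event channel. */
        return 0;
    }

    /* Backend side: drain pending requests, e.g. from a notification handler. */
    static void backend_poll(struct xmx_ring *r)
    {
        uint32_t cons = atomic_load_explicit(&r->cons, memory_order_relaxed);
        uint32_t prod = atomic_load_explicit(&r->prod, memory_order_acquire);

        while (cons != prod) {
            struct xmx_req *req = &r->slot[cons & (RING_SLOTS - 1)];
            printf("backend: req %u, %u bytes (gref %llu)\n",
                   req->id, req->len, (unsigned long long)req->gref);
            cons++;
        }
        atomic_store_explicit(&r->cons, cons, memory_order_release);
    }

    int main(void)
    {
        static struct xmx_ring ring;      /* stands in for a shared page */
        struct xmx_req req = { .id = 1, .len = 4096, .gref = 42 };

        frontend_post(&ring, &req);
        backend_poll(&ring);
        return 0;
    }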

Device assignment, however, implies that the adapter is available only to a specific VM, rendering this approach inapplicable to a cloud computing infrastructure, where, by definition, VMs share hardware resources. To address this limitation while providing near-native performance, I/O Virtualization (IOV) techniques have been introduced (Auernhammer and Sagmeister, 2010, Liu et al., 2006b, PCI, 2007). These methods retain the advantages of device assignment while allowing multiple VMs to share the same I/O device. The community has proposed several optimizations to IOV, but a major design issue remains unsolved: flexibility. When using IOV adapters, migration becomes more difficult, since the degree of heterogeneity between internal or external data-center availability zones increases. Additionally, the number of VMs that enjoy direct access to the network is limited by the hardware capabilities of the specific IOV-enabled adapter.

This is a serious concern: cloud providers need to be able to manage and manipulate VM access to the network, in order to achieve efficient consolidation while providing Quality of Service (QoS) guarantees and meeting Service Level Agreements (SLAs). By design, IOV bypasses the VM container, as multiplexing is realized entirely in hardware; although IOV adapters export an interface to control features such as traffic shaping and packet filtering, they do not provide a unified way to manage these capabilities. This further complicates the task of managing the network access of a specific range of VMs.

As we move toward the standardization of Ethernet in both Cloud computing and HPC, we need a way to study the effect of message-passing protocols in the Cloud, without the complexity of TCP/IP. However, current approaches do not provide such a software solution to efficiently exploit the hypervisors’ interface to the hardware. In previous studies (Nanos et al., 2010, Nanos et al., 2011, Nanos and Koziris, 2009), we have attempted to examine the trade-offs related to device sharing, using custom lower-level protocols. In this work, we move forward to a more generic design, in order to gain insight into the system's internals and, ultimately, optimize the way VMs communicate with the network.

We describe the design and implementation of Xen2MX, a high-performance interconnection framework for virtualized environments. Xen2MX contains many features that reduce or eliminate problems associated with traditional PV drivers in an HPC context. Specifically, it minimizes the overhead of event handling for latency-sensitive message exchange and enhances throughput by (a) using zero-copy data transfers for large messages, (b) re-using existing mappings between guest and host memory regions, and (c) decoupling control messages from data exchange using a high-availability consumer-producer scheme. Our design is applicable to any hypervisor that supports PV.
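
As a simplified illustration of point (b), the sketch below captures the idea of reusing guest-to-host mappings: the backend keeps a small cache keyed by the guest page frame number and performs the expensive map operation only on a miss. The names (lookup_or_map, host_map_guest_page) and the direct-mapped cache are assumptions made for the sketch; in Xen the expensive path would be a grant-table mapping and eviction would require an explicit unmap.

    /*
     * Sketch of a guest-page mapping cache on the backend side.
     * Hypothetical names; malloc()/free() stand in for map/unmap.
     */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define CACHE_SLOTS 256

    struct map_entry {
        uint64_t gpfn;                    /* guest page frame number (key) */
        void    *host_va;                 /* host-side virtual address     */
        int      valid;
    };

    static struct map_entry cache[CACHE_SLOTS];

    /* Stand-in for the expensive map operation (a grant mapping in Xen). */
    static void *host_map_guest_page(uint64_t gpfn)
    {
        printf("mapping gpfn %llu (expensive path)\n",
               (unsigned long long)gpfn);
        return malloc(4096);              /* placeholder for a real mapping */
    }

    /* Return a host mapping for gpfn, reusing a cached one when possible. */
    static void *lookup_or_map(uint64_t gpfn)
    {
        struct map_entry *e = &cache[gpfn % CACHE_SLOTS];

        if (e->valid && e->gpfn == gpfn)
            return e->host_va;            /* hit: no remapping needed */

        if (e->valid)
            free(e->host_va);             /* evict: a real backend would unmap */

        e->gpfn    = gpfn;
        e->host_va = host_map_guest_page(gpfn);
        e->valid   = 1;
        return e->host_va;
    }

    int main(void)
    {
        lookup_or_map(100);               /* miss: maps the page     */
        lookup_or_map(100);               /* hit: reuses the mapping */
        return 0;
    }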

The contributions of this paper can be summarized as follows:

  • We identify key design choices for a VM-aware cluster interconnection protocol and discuss current methods of network access in virtualization platforms (Section 2.3).

  • We introduce Xen2MX, a high-performance interconnection protocol for virtualized environments, binary compatible with MX and wire compatible with MXoE (Section 3).

  • We identify limitations in Xen's existing network approaches that lead to increased communication latency and unstable behavior. Xen2MX overcomes these limitations by employing an alternative approach that semantically enriches the guest-to-host communication. Our prototype implementation shows that Xen2MX is able to saturate a 10 Gbps link without requiring specialized hardware, at the expense of an ≈8% CPU utilization overhead (Section 4).

Specific results from the native MX benchmarks over our framework show that Xen2MX reduces the round-trip latency to as low as 14 μs, compared to 44 μs for a software bridge setup. Compared to directly attached adapters (or IOV), Xen2MX exhibits a 4% overhead. In terms of bandwidth, Xen2MX nearly saturates a 10 Gbps link, achieving 1159 MB/s, compared to 490 MB/s for the bridged case and 1192 MB/s for the directly attached case. We also evaluate the scalability of Xen2MX compared to the bridged setup (Section 4.4), while overwhelming the system with send and receive loads from a variable number of VMs (up to 40): Xen2MX achieves near-native throughput for 512 KB messages with 16 VMs.

The rest of this paper is organized as follows: first, we lay the groundwork in Section 2 by presenting the basic concepts of high-performance computing cluster interconnects. In Section 2.3 we present the requirements for efficient message-passing and elaborate on the current choices for communication in virtualized environments. Section 3 describes Xen2MX, while Section 4 presents a detailed evaluation of its performance with regard to latency, throughput, CPU utilization, and scaling. Finally, we discuss related research (Section 5) and conclude, presenting possible future endeavors (Sections 6 and 7).

Section snippets

Motivation

In this section we describe the components that Xen2MX is based on. We discuss cluster interconnection options for message exchange (Sections 2.1 and 2.2), and focus on the incompatibility between high-performance communication and virtualized environments. We present how our design balances the trade-offs between commodity hardware and near-native performance in a VM context with regard to flexibility.

Xen2MX: design and implementation

In this section, we present Xen2MX, our framework for high-performance communication in virtualized environments over Ethernet. We structure the communication stack using a frontend running on the guest VM and a backend running on the host. To interface with the application we use a communication library, while to communicate with the network we use the generic Linux-kernel Ethernet stack. Our design is based on the Xen split driver model, Myrinet/MX and Open-MX.
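
For illustration only, the send descriptor that the frontend hands to the backend could look like the structure below; the field names and layout are assumptions made for this sketch and do not reflect the actual Xen2MX ABI. Tiny, medium, and rendezvous message classes follow the MX semantics, and the grant references stand for the guest pages the backend maps for zero-copy transfers before injecting MXoE frames through the Linux Ethernet stack.

    /*
     * Hypothetical frontend-to-backend send descriptor; illustrative only.
     */
    #include <stdint.h>
    #include <stdio.h>

    enum xmx_msg_class {
        XMX_SEND_TINY,        /* payload copied inline with the descriptor */
        XMX_SEND_MEDIUM,      /* payload placed in pre-shared buffers      */
        XMX_SEND_RNDV,        /* rendezvous: zero-copy from granted pages  */
    };

    struct xmx_send_desc {
        uint8_t  msg_class;   /* one of enum xmx_msg_class                 */
        uint8_t  dest_mac[6]; /* MXoE peer address on the Ethernet segment */
        uint16_t nr_grefs;    /* number of granted guest pages (RNDV only) */
        uint32_t length;      /* total message length in bytes             */
        uint64_t match_info;  /* MX-style matching key                     */
        uint32_t gref[8];     /* grant references for zero-copy transfers  */
    };

    int main(void)
    {
        /* The backend would pull such descriptors off a shared ring and
         * hand the referenced pages to the Ethernet stack as MXoE frames. */
        printf("descriptor size: %zu bytes\n", sizeof(struct xmx_send_desc));
        return 0;
    }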

Performance evaluation

In this section we describe the experiments we performed to analyze the behavior of Xen2MX and identify specific characteristics of our approach compared to the common communication setups used in virtualized environments.

We set up two identical boxes, described in Table 1, and perform two basic experiments in order to illustrate the merits and shortcomings of our approach compared to the other methods of communication, as well as its performance in a real-life scenario:

  • Experiment 1. VM container

Related work

The advent of 10G Ethernet and its extensive use in cluster interconnects has given rise to a large body of literature on optimizing upper-level protocols, specifically, protocol handling and processing overheads (Recio et al., 2002, Karlsson et al., 2007, Goglin, 2011, Shalev et al., 2010). Several optimizations have been introduced: OS-bypass communication, zero-copy for TCP/IP, offloading the network stack on a separate, dedicated core, etc. Based on these findings, an efficient

Discussion

The literature suggests that paravirtualized approaches are the de facto standard for network I/O in generic cloud infrastructures. Our results provide a clear indication that Xen2MX could become a viable alternative to generic Ethernet virtual interfaces in an HPC context. In this section, we present possible uses of Xen2MX and highlight specific optimizations that could be employed in our framework:

  • Optimize message handling. At the moment, Xen2MX handles incoming messages via the control I/O

Conclusion

We have presented the design and implementation of Xen2MX, a framework for high-performance communication in commodity cloud infrastructures. Xen2MX is based on the paravirtualization concept, consisting of a backend and a frontend driver running on the host and the guest, respectively. Our implementation is based on Open-MX, exploiting its binary compatibility with Myrinet/MX and wire compatibility with MXoE. Xen2MX is open-source software, available online at

Acknowledgements

We would like to thank Nikos Nikoleris, Elisavet Kozyri, Stratos Psomadakis, and Dimitris Aragiorgis for their valuable contributions to this project. We would also like to thank Yoshio Turner, Georgios Goumas, Dimitrios Tsoumakos and Konstantinos Nikas for providing feedback on this work. Finally we thank the anonymous reviewers for their insightful comments.

Part of this work was supported by the Greek State Scholarship Foundation (SSF) (grant no. 5271). Additionally, this research has been

Anastasios Nanos is a PhD Candidate in Computer Engineering (ECE, NTUA – expected graduation: 2013). His area of specialization is I/O systems for high-performance and cloud computing, and particularly the efficient sharing of I/O devices in Virtualized Environments. His research interests include Communication Architectures for Clusters, Systems Software, I/O Virtualization, and Scalable Storage Architectures based on Clusters.

References (41)

  • P. Geoffray

    Opiom: off-processor IO with Myrinet

  • A. Gordon et al.

    ELI: bare-metal performance for I/O virtualization

  • A. Gordon et al.

    Towards exitless and efficient paravirtual I/O

  • W. Huang et al.

    Virtual machine aware communication libraries for high performance computing

  • S. Karlsson et al.

    MultiEdge: An Edge-based Communication Subsystem for Scalable Commodity Servers

    (2007)
  • K. Mansley et al.

    Getting 10 Gb/s from Xen: safe and fast device access from unprivileged domains

  • A. Kivity et al.

    KVM: the Linux virtual machine monitor

  • E. Koukis et al.

    Gmblock: optimizing data movement in a block-level storage sharing system over Myrinet

    Cluster Comput.

    (2010)
  • A. Landau et al.

    Plugging the hypervisor abstraction leaks caused by virtual networking

  • A. Menon et al.

    Optimizing network virtualization in Xen

