Virtual private grid: a command shell for utilizing hundreds of machines efficiently

https://doi.org/10.1016/S0167-739X(03)00036-0

Abstract

We describe the design and implementation of the virtual private grid (VPG), a shell that can utilize many machines distributed over multiple subnets. VPG works around common security policies and network configurations (e.g., firewalls, private IP addresses, DHCP) that restrict communication between machines and even break the uniqueness of IP addresses. VPG provides the following functions: (1) a unique nickname for each machine that does not depend on a DNS name or a fixed IP address; (2) job submission to any nicknamed machine; (3) redirection from/to a file on any nicknamed machine; (4) pipes between commands executed on any nicknamed machines. VPG implements these functions by constructing a self-stabilizing spanning tree among machines and forwarding messages via a path in the tree. We ran VPG on about 100 nodes (270 CPUs) and measured the turnaround time of a small job submission with VPG and with other tools: rsh, SSH, and globus-job-run. The experimental results show that VPG can submit a job faster than SSH and globus-job-run, since VPG performs authentication only when it constructs the tree.

Introduction

Today, computer users commonly have access to hundreds of machines spread across multiple subnets and geographically distributed sites. In such environments, they can potentially achieve high performance for certain types of parallel applications (e.g., parameter sweep applications). Job submission tools, such as Condor [6] and PBS [11], enable a user to submit many jobs to clusters/supercomputers in a local network.

However, when available machines are distributed over multiple subnets, it becomes difficult and cumbersome to utilize them with these tools. We explain this problem below.

Machines distributed over multiple subnets are usually managed by different administrators, who impose various restrictions on their use for the sake of security and ease of administration. Examples of these restrictions are:

  • (1) Firewall. A firewall protects local machines from malicious attacks by restricting access from external machines. For example, IP filtering, a typical firewall configuration, restricts connections from/to machines with particular IP addresses.

  • (2) Private IP. A private IP address is visible only within a subnet. Machines outside the subnet cannot establish direct connections to machines that have only a private IP address. In addition, because private IP addresses are visible only within a subnet, machines in different subnets may have the same private IP address. This breaks the uniqueness of IP addresses.

  • (3) DHCP client. DHCP is a mechanism that enables machines to obtain their network configuration from a DHCP server. The IP address of a DHCP client may change whenever it obtains a new configuration (typically when it reboots).

Because of the above restrictions, machines cannot necessarily establish a direct connection to every other machine and may not even have unique addresses. Hence, a user cannot easily submit a job across multiple subnets with existing tools; she/he has to work around the restrictions with ad hoc methods, which are found case by case with human intervention. For example, when submitting a job to machines behind a firewall, a user usually has to log onto a gateway machine first and then onto the target. Accessing a DHCP client requires some database that stores its current IP address. The situation becomes more complicated if those addresses are private IP addresses.
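The manual workaround described above amounts to finding a chain of logins through machines that are allowed to connect to one another. The following Python sketch illustrates that search under stated assumptions: the host names and the table of allowed connections are hypothetical (they are not taken from Fig. 1), and the breadth-first search is only an illustration of what a user does by hand, not VPG's own algorithm.

```python
from collections import deque

# Hypothetical table of "which host may open a connection to which host".
# Names and edges are illustrative only.
ALLOWED = {
    "laptop": ["gw-a"],
    "gw-a": ["node-a1", "gw-b"],
    "gw-b": ["node-b1"],
    "node-a1": [],
    "node-b1": [],
}


def login_chain(allowed, source, target):
    """Return the shortest list of hosts to log on through, or None."""
    previous = {source: None}  # host -> host we arrived from
    queue = deque([source])
    while queue:
        host = queue.popleft()
        if host == target:
            # Walk back to the source to recover the chain of logins.
            chain = []
            while host is not None:
                chain.append(host)
                host = previous[host]
            return chain[::-1]
        for nxt in allowed.get(host, []):
            if nxt not in previous:
                previous[nxt] = host
                queue.append(nxt)
    return None  # no sequence of allowed connections reaches the target


print(login_chain(ALLOWED, "laptop", "node-b1"))
```

Every entry in such a table has to be discovered and kept up to date by the user, which is the cost the rest of the paper aims to remove.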

In addition, we must consider that the number of available machines is large (e.g., >100) and that the topology of the network usually changes dynamically (e.g., machines may crash or the network may become disconnected). These also make it difficult for a user to manually work around the administrative restrictions. For instance, a user has to know which machines are available at present, and she/he may need to find a new job forwarding route whenever a machine crashes.

To summarize, the administrative restrictions significantly increase the user’s cost for utilizing remote machines, and consequently, obstruct smooth utilization of computational resources. A user would like to have a solution in which all machines can be reached directly and transparently, with names fixed over time.

To this end, we have developed virtual private grid (VPG), a command shell that can utilize hundreds of machines efficiently. It enables a user to easily submit jobs to remote machines by providing the following mechanisms:

  • (1) Nickname mechanism. It gives each machine a unique name that does not depend on a DNS name or a fixed IP address.

  • (2) Communication mechanism. A user can directly access any remote machine that is reachable from her/his local machine by applying existing tools (e.g., rsh, SSH [12]) several times in succession. A user accesses these machines merely by specifying their nicknames.

The above mechanisms can be implemented without modifying administrative policies, and they tolerate dynamic changes in the network topology, though some manual configuration is required.
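To make the nickname property concrete, the following Python sketch shows why a name chosen independently of the address survives both duplicate private IP addresses and DHCP address changes. It is a minimal illustration under stated assumptions: the class `NicknameTable`, its methods, the subnet labels, and the addresses are all hypothetical and do not describe how VPG itself stores or resolves nicknames.

```python
class NicknameTable:
    """Hypothetical table mapping nicknames to a machine's current address."""

    def __init__(self):
        self._current = {}  # nickname -> (subnet label, IP address)

    def register(self, nickname, subnet, ip):
        # Registering an existing nickname again only refreshes its address,
        # as would be needed after a DHCP client obtains a new configuration.
        self._current[nickname] = (subnet, ip)

    def resolve(self, nickname):
        return self._current[nickname]


table = NicknameTable()

# Two machines in different subnets share the same private IP address,
# yet their nicknames remain distinct.
table.register("alpha", "subnet-1", "192.168.0.2")
table.register("beta", "subnet-2", "192.168.0.2")

# A DHCP client receives a new address; its nickname does not change.
table.register("alpha", "subnet-1", "192.168.0.57")

print(table.resolve("alpha"))
print(table.resolve("beta"))
```

The user only ever types the nickname; whatever lies behind it is free to change.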

The remainder of this paper is organized as follows. Section 2 shows the difficulty of working around administrative restrictions through a practical motivating scenario and discusses issues in the implementation of the communication mechanism. Section 3 describes the user interface of VPG and the manual configurations that it requires. Section 4 presents the details of the communication mechanism. Section 5 shows the experimental results. Section 6 mentions related work. The final section summarizes the paper and states future work.

Section snippets

A practical scenario

In this section, we show the difficulty of working around administrative restrictions through a practical motivating scenario. Consider the network shown in Fig. 1. Harp, tuba, … in Fig. 1 represent host names. This network consists of three subnets including a DHCP client and a machine that has only a private IP. Firewalls restrict connections between different subnets; SSH to the gateway machines (harp, cscl0, and ise0) is the only allowed in-bound connection. Such a configuration is fairly

Overview of VPG

The following summarizes the functions provided by VPG:

  • It gives each machine a (per-user) unique name that does not depend on a DNS name or a fixed IP address (nicknaming).

  • It provides job submission to any nicknamed machine.

  • It provides redirection from/to a file on any nicknamed machine.

  • It provides pipes between commands executed on any nicknamed machines.

We make the above functions accessible by a combination of the simple shell syntax and existing commands (see Fig. 2). For example, a

Communication mechanism

VPG implements the communication mechanism by constructing a spanning tree and forwarding messages via a path in the tree. In this section, we describe details of this communication mechanism. First, we formalize administrative restrictions. Next, we mention the self-stabilizing spanning tree algorithm [2], [3]. With this algorithm, VPG daemons select necessary connections to make all the machines available. Then, we present the algorithm that calculates routes to participating machines for job

Experimental environment

We ran VPG in the network shown in Fig. 6. The network consists of three subnets, and machines are equipped with several operating systems (Solaris, Linux, and IRIX) and CPUs (SPARC, x86, PowerPC, and MIPS). We ran VPG daemons on about 100 nodes. In this experiment, daemons constructed a spanning tree that had a diameter of 5.
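The diameter matters because messages are forwarded along the path in the tree, so the diameter bounds the number of hops between any two machines. The following Python sketch computes a tree path and the diameter for a small example. It is an illustration under stated assumptions: the node names and tree shape are hypothetical and far smaller than the experimental network, and the breadth-first searches are a standard technique rather than the routing algorithm used by VPG.

```python
from collections import deque

# Hypothetical spanning tree as an undirected adjacency list.
TREE = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
}


def bfs(tree, start):
    """Return (distance, parent) maps for every node reachable from start."""
    dist = {start: 0}
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in tree[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                parent[nbr] = node
                queue.append(nbr)
    return dist, parent


def tree_path(tree, src, dst):
    """Return the unique path from src to dst in a connected tree."""
    _, parent = bfs(tree, src)
    path = []
    node = dst
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def diameter(tree):
    """Longest path length in edges, found with two BFS passes."""
    dist, _ = bfs(tree, next(iter(tree)))
    far_end = max(dist, key=dist.get)  # one end of a longest path
    dist, _ = bfs(tree, far_end)
    return max(dist.values())


print(tree_path(TREE, "c", "e"))
print(diameter(TREE))
```

In this example the longest path runs from `c` to `e`, so a message between those two nodes crosses every edge on that path.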

Comparison to other job submission tools

We compared VPG with three other job submission tools: rsh, SSH, and globus-job-run (globus-job-run is a remote job submission tool provided by Globus). rsh used Rhost

Resource management

Many remote job submission tools on clusters or on Grid environments have been developed (e.g., Globus [5], Condor [6], Nimrod [4]).

To the authors’ knowledge, none of them focuses on integrating many ‘desktop’ resources that are typically configured without DNS names and with DHCP/private IP addresses. Such machines constitute a large fraction of compute resources. The original Globus is blocked by typical firewall configurations and cannot submit jobs from outside a firewall to the inside.

Summary and future work

In this paper, we have described VPG, a shell that can easily utilize hundreds of machines distributed over multiple subnets. The shell gives each machine a unique nickname that does not depend on a DNS name or a fixed IP address. In addition, it provides a job submission, redirection, and pipe on any nicknamed machine. VPG implements these functions by constructing a self-stabilizing spanning tree among machines and forwarding messages via a path in the tree. The latest implementation of VPG

Kenji Kaneda is a Master Course student at the Department of Computer Science, University of Tokyo. He received the BE degree in Information Science from University of Tokyo in 2001. His current major research interests are in the area of parallel and distributed systems.

References (13)

  • C. Scott, P. Wolfe, M. Erwin, Virtual Private Networks, 2nd ed., O’Reilly,...
  • Y. Afek, S. Kutten, M. Yung, Memory efficient self-stabilizing protocols for general network, in: Proceedings of the...
  • S. Aggarwal, S. Kutten, Time optimal self-stabilizing spanning tree algorithms, in: Proceedings of the 13th Conferences...
  • R. Buyya, D. Abramson, J. Giddy, Nimrod/G: an architecture for a resource management and scheduling system in a global...
  • K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, S. Tuecke, A resource management architecture...
  • J. Frey, T. Tannenbaum, I. Foster, M. Livny, S. Tuecke, Condor-G: a computation management agent for...
There are more references available in the full text version of this article.

Kenjiro Taura is an Associate Professor at the Department of Information and Communication Engineering, University of Tokyo. He was born in 1969, and received his BS, MS, and DSc degrees from University of Tokyo in 1992, 1994, and 1997, respectively. His major research interests include parallel/distributed computing and programming languages. He is a Member of ACM and IEEE.

Akinori Yonezawa is a Professor at the Department of Computer Science, University of Tokyo, and an ACM Fellow. He received his PhD in Computer Science from MIT in 1977. His current major research interests are in the areas of concurrent/parallel computation models, programming languages, object-oriented computing, and distributed computing. He is the author and editor of several books including: “Object-oriented concurrent programming” (MIT Press, 1987); “ABCL: an object-oriented concurrent system” (MIT Press, 1990). He served as an Associate Editor of ACM Transactions on Programming Languages and Systems (TOPLAS).
