1 Introduction

A performance benchmark (hereinafter referred to simply as a benchmark) typically involves running an artificial workload (e.g., simulating user interactions) on an instrumented application (i.e., specialized with code to record detailed execution logs) deployed on one or more computing nodes whose resources of interest (e.g., CPU, memory, disk I/O) are monitored while the workload is running. Benchmarks are commonly used to evaluate and compare the performance of applications under different configurations and systems through the execution of equivalent, if not identical, workloads. For example, TPC [22] is a well-known family of benchmarks used to evaluate and compare, among other relevant metrics, the number of transactions a database system can process per second.

Also, in experimental research in computer systems, hypotheses have to be confirmed by compelling evidence found in both resource monitoring logs (e.g., CPU utilization) and execution logs (e.g., response times) generated by the execution of carefully constructed benchmarks. As an illustrative case, we – project Elba researchers – and our collaborators have executed thousands of large-scale benchmarks over the years to study interesting phenomena in cloud computing environments, leading to the discovery of previously unknown transient resource bottlenecks (e.g., in CPU utilization, disk I/O) lasting only a few tens or hundreds of milliseconds and arising at quite low average utilization levels (e.g., less than 40% in the case of CPU). Later, we also confirmed the hypothesis that these so-called millibottlenecks can propagate through the components of a distributed system and have their effects amplified, causing significant performance bugs (e.g., the response time long-tail problem) [15].

Currently, one of our major research goals is to discover and study new types and sources of millibottlenecks. So far, we have been able to discover millibottlenecks in CPU utilization due to several causes: the garbage collector of the Java Virtual Machine 1.5 [23, 25], DVFS (dynamic voltage and frequency scaling) for energy savings [24], and interference from noisy neighbors in consolidated cloud environments [12]. We have also found disk I/O millibottlenecks due to dirty page flushing [26].

Although executing complex benchmarks (e.g., of n-tier systems) is essential to collect the data needed to achieve our goal of discovering and studying new types and sources of millibottlenecks, their construction is challenging. For example, performance debugging of distributed systems requires the analysis of detailed event logs to reconstruct the complete execution path of anomalous requests (e.g., the ones taking seconds to be processed while the majority is still served within a few milliseconds) across different computing nodes. Furthermore, when looking for their root causes, the aforementioned millibottlenecks pose a resource monitoring challenge due to their short life spans and variety: to support their observation, we need a diverse set of resource monitors with low overhead and very short sampling periods (at 50 ms intervals, we can reliably detect millibottlenecks of 100 ms or longer).

Furthermore, reproducibility is a cornerstone of the scientific method: the research community must be able to independently validate published results by reproducing their experiments. As a result, an increasing number of academic journals (e.g., Science [17]) now require the submission of software, scripts, and data used to obtain claimed results along with the article itself.

Automation of complex benchmark workflows is thus essential to guarantee that the original sequence of tasks or, alternatively, another correct sequence of tasks (i.e., one satisfying orchestration dependencies such as server initialization order) is precisely followed when reproducing them. Common tasks include copying artifacts from remote servers, installing libraries, configuring software, initializing applications, running workloads, and archiving results.

However, ensuring the reproducibility of benchmarks usually demands more than simply releasing their software, scripts, and data – it is also necessary to satisfy their hidden and implicit dependencies (e.g., required libraries, compiler versions). Manually recreating complex benchmark environments used in published research is typically intricate, error-prone, and time-consuming. Consequently, in previous versions of our benchmarks, hidden software dependencies such as required libraries were satisfied through the use of pre-built operating system images; in turn, implicit dependencies such as the network naming conventions adopted in our benchmarks’ scripts limited their execution to a single scientific cloud infrastructure called Emulab [5].

Benchmarks should therefore rely on as few hidden and implicit dependencies as possible in order to be highly configurable and easily portable. For example, the use of pre-built operating system and container images to satisfy hidden software dependencies makes it difficult to increase the configuration spaces of benchmarks (e.g., adding support for multiple library versions would require building an exponential number of these images, one for each possible combination of library versions), which is also critical to our goal of discovering and studying new types and sources of millibottlenecks. Moreover, restricting the execution of benchmarks to a single public cloud infrastructure due to the network naming conventions adopted in their scripts is a major inconvenience because computing nodes are becoming an increasingly scarce resource in these cloud infrastructures. For a better understanding of benchmark dependencies, we refer the reader to Table 1.

Table 1. Commonly found benchmark dependencies according to their target.

In summary, three concrete needs arise from these challenges of software instrumentation, resource monitoring, workflow automation, and dependency management for the construction, execution, and reproduction of complex benchmarks:

  1. Low-overhead software specialization to record detailed logs of interesting events (e.g., request arrivals, remote procedure calls) across computing nodes while a benchmark workload is running.

  2. Fine-grained (order of milliseconds), low-overhead tools to monitor many different resources (e.g., CPU, memory, network, disk I/O) of computing nodes while a benchmark workload is running.

  3. An appropriate workflow language for the specification of benchmarks, ideally enabling the declaration of all sorts of dependencies so it becomes easier to meet them at runtime (e.g., by adding workflow tasks to install software dependencies instead of relying on pre-built operating system and container images) and to eventually extend configuration spaces (e.g., by adding alternative workflow execution paths to support multiple public cloud infrastructures).

In this work, we present the next generation of the Elba toolkit, available as a Beta release, and show how we have used it for experimental research in computer systems, taking RUBBoS, a well-known n-tier system benchmark, as an example. We first show, in Sect. 2, how we have leveraged milliScope – the Elba toolkit’s monitoring and instrumentation framework – to collect log data from benchmark executions at unprecedentedly fine granularity, as well as how we have specified benchmark workflows with WED-Make – a declarative workflow language whose main characteristic is precisely to facilitate the declaration of dependencies. Moreover, the execution of workflows specified with WED-Make is driven by the declared dependencies themselves, guaranteeing their satisfaction. Then, in Sect. 3, we show how to execute WED-Makefiles (i.e., workflow specifications written with WED-Make). Next, in Sect. 4, we show how we have successfully reproduced the experimental verification of the millibottleneck theory of performance bugs, first presented in [15], in multiple cloud environments and systems, relying on as few hidden and implicit dependencies as possible. Finally, we present related work in Sect. 5 and summarize our conclusions in Sect. 6.

2 RUBBoS Monitoring and Workflow Specification with the Elba Toolkit

2.1 RUBBoS Benchmark

Most of our studies on interesting phenomena in cloud computing environments have stemmed from executions of RUBBoS [16] – an n-tier system benchmark that simulates user requests to a bulletin board web application similar to Slashdot [19]. As shown in Fig. 1, the RUBBoS web application is composed of 4 tiers: Apache HTTP servers [21], Tomcat application servers [1], C-JDBC middleware servers [2], and MySQL database servers [10].

When the workload is running, the RUBBoS client application generates HTTP requests simulating 24 types of user interactions (e.g., creating and viewing stories). Specifically, this client application sends HTTP requests to the Apache servers, responsible for serving HTML pages, which often need to forward these requests to the Tomcat servers, responsible for handling the application logic. These Tomcat servers send SQL queries to the C-JDBC servers, which forward them to the MySQL servers.

Fig. 1. RUBBoS architecture.

2.2 RUBBoS Monitoring with milliScope

milliScope [9] is our resource and event monitoring framework for n-tier systems, built precisely to discover and study transient phenomena such as millibottlenecks in cloud computing environments.

The resource monitoring component of milliScope contains open-source tools like sar [20], Collectl [4], and iostat to monitor resources like CPU, memory, network, and disk I/O with low overhead and fine granularity. These tools have all been reliably used in production systems for many years and are under active development.
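The following is a rough sketch of how such monitors could be started alongside a benchmark run (this is not milliScope’s actual launcher; the intervals, flags, and output paths are illustrative and may vary across tool versions):

    #!/bin/bash
    # Sketch: start fine-grained resource monitors before the workload runs.
    RESULTS_DIR=/tmp/monitor-logs
    mkdir -p "$RESULTS_DIR"

    # collectl supports sub-second sampling (here, 50 ms) of CPU, disk, and network.
    collectl -i .05 -scdn -P -f "$RESULTS_DIR/collectl" &

    # sar records system-wide activity to a binary file once per second.
    sar -o "$RESULTS_DIR/sar.bin" 1 > /dev/null 2>&1 &

    # iostat reports extended per-device I/O statistics every second.
    iostat -x 1 > "$RESULTS_DIR/iostat.log" &

    echo "monitors started; run the benchmark workload now"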

The event monitoring component of milliScope comprises software (e.g., Apache HTTP Server, Tomcat application server) specialized with low-overhead event loggers that mainly serve the purpose of identifying the execution boundaries of requests (i.e., the instants at which each server was called and returned a response for every request). To impose minimal overhead, these event loggers simply extend native log components. As messages pass from one server to another, a unique identifier generated for each request is logged along with a timestamp and the event type (e.g., sent message to Tomcat server, received message from Tomcat server). In this way, it is possible to reconstruct the complete execution path of requests, as shown in Fig. 2.

Fig. 2. Reconstruction of the execution path (with latency data) of a request generated by the RUBBoS client application, showing its execution boundaries.
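As a rough illustration of how such boundary events can be stitched together, the following sketch joins per-tier event logs by request identifier (the file names and the line format – request ID, timestamp, event type – are assumptions for illustration and do not reproduce milliScope’s actual log format):

    #!/bin/bash
    # Sketch: reconstruct the execution path of one request from per-tier
    # event logs. Assumed line format: <request_id> <timestamp_ms> <event_type>
    REQ_ID="$1"   # e.g., the identifier of an anomalously slow request

    # Concatenate the event logs collected from all tiers, keep only the lines
    # belonging to the request of interest, and order them by timestamp.
    cat apache_events.log tomcat_events.log cjdbc_events.log mysql_events.log \
      | awk -v id="$REQ_ID" '$1 == id' \
      | sort -k2,2n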

By integrating fine-grained monitoring of many different resources and highly detailed event logging, both with low overhead, milliScope has enabled us to analyze the performance of distributed systems across a wide variety of use cases – in particular, finding correlations between millibottlenecks and interesting events like server queue overflows and very long response times.

2.3 RUBBoS Workflow Specification with WED-Make

WED-Make, the major novelty of this new Elba toolkit release, is a workflow language for the specification of complex benchmarks that facilitates the declaration of dependencies. A WED-Makefile (i.e., a workflow specification written with WED-Make) comprises:

  • an initial guard;

  • a final guard;

  • a set of tasks.

In a WED-Makefile, the initial guard makes explicit the dependencies for starting the workflow execution. These dependencies are declared as logical predicates defined over variables of interest (as in Bash, a $ has to be placed in front of a variable identifier to retrieve its value). In the following, we present the initial guard specified in the RUBBoS WED-Makefile:

Figure a. Initial guard of the RUBBoS WED-Makefile.

In the RUBBoS WED-Makefile, variables with the suffix _NET_NODES represent the hostnames of the computing nodes provisioned for each tier (WEB, APP, MIDDL, and DB). Hence, to start the execution of the RUBBoS workflow, these variables have to be defined (i.e., their values have to be different from the empty string).
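Although the exact syntax appears in the listing above, the initial guard conceptually amounts to a conjunction of non-emptiness predicates; a rough Bash approximation (for illustration only, not actual WED-Make syntax) would be:

    # Sketch: Bash approximation of the RUBBoS initial guard.
    # The workflow may start only if hostnames were provisioned for every tier.
    if [ -n "$WEB_NET_NODES" ] && [ -n "$APP_NET_NODES" ] \
       && [ -n "$MIDDL_NET_NODES" ] && [ -n "$DB_NET_NODES" ]; then
        echo "initial guard satisfied: workflow execution may start"
    fi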

The final guard defined in a WED-Makefile, on the other hand, makes explicit the dependencies for terminating the workflow execution. For example, after running the RUBBoS workload, a compressed archive containing the log data generated by all computing nodes must be created and stored. As we can see in the final guard of the RUBBoS WED-Makefile presented below, the only dependency for terminating the execution of the RUBBoS workflow is precisely that the path to this archive has already been set (variable BENCH_RESULTSTARBALL represents the path to this archive).

Figure b. Final guard of the RUBBoS WED-Makefile.

In a WED-Makefile, each task has a name, a script written in Bash, and a guard that makes explicit the dependencies for the execution of this script. It is worth noting that the variables used in task guards can be interchangeably used as Bash variables with the same identifiers in task scripts. The RUBBoS workflow comprises around 90 tasks for copying artifacts from remote servers, installing libraries, configuring software, initializing applications, running workloads, archiving results, etc. In the following, we present the task specified in the RUBBoS WED-Makefile to collect resulting log data from the Apache HTTP servers.

Figure c. Task WebCollectResultsApacheServer of the RUBBoS WED-Makefile, which collects resulting log data from the Apache HTTP servers.

In this example, declared dependencies ensure that:

  • $WEB_NET_NODES != "" – hostnames of provisioned computing nodes were defined for the web tier.

  • $WEB_NET_USERNAME != "" – username to remotely access these computing nodes was defined.

  • $WEB_HTTPD_HOMEDIR != "" – Apache directory was already created in these computing nodes.

  • $WEB_FS_RESULTSDIR != "" – directory to store resulting log data was already created in these computing nodes.

  • $WEB_HTTPD_STOPPEDAT != "" – Apache HTTP server was already stopped.

  • $WEB_HTTPD_RESULTSDIR = "" – directory to store resulting log data specifically from the Apache HTTP server was not created in these computing nodes yet.

We also remark that the task script of the previous example sets the value of WEB_HTTPD_RESULTSDIR. Therefore, this script can be successfully executed at most once, because the logical predicate $WEB_HTTPD_RESULTSDIR = "" is part of its associated guard.
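A minimal sketch of what the script of task WebCollectResultsApacheServer might look like is shown below (the directory layout, remote commands, and log paths are assumptions; only the variable names come from the guard above):

    #!/bin/bash
    # Sketch: collect Apache HTTP server logs from every web-tier node.
    # WEB_NET_NODES, WEB_NET_USERNAME, WEB_HTTPD_HOMEDIR, and WEB_FS_RESULTSDIR
    # are set by earlier tasks; everything else is illustrative.
    resultsdir="$WEB_FS_RESULTSDIR/httpd"

    for node in $WEB_NET_NODES; do
        # Create the per-server results directory on the remote node and copy
        # the Apache logs into it.
        ssh "$WEB_NET_USERNAME@$node" \
            "mkdir -p $resultsdir && cp -r $WEB_HTTPD_HOMEDIR/logs $resultsdir/"
    done

    # Recording the directory in WEB_HTTPD_RESULTSDIR updates the data state and,
    # because the guard requires this variable to be empty, prevents re-execution.
    WEB_HTTPD_RESULTSDIR="$resultsdir"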

To our benefit, we already had an implementation of the RUBBoS workflow (albeit an ad-hoc one, with an orchestration script dictating the order in which task scripts had to be executed). Thus, our work mainly consisted of porting it to a WED-Makefile and of finding and declaring its hidden and implicit dependencies.

3 RUBBoS Execution

3.1 Execution Model

The execution of WED-Makefiles is inspired by the WED-flow approach for modeling workflows [6]: as previously mentioned, it is driven by declared dependencies. More specifically, dependencies declared in guards have to be satisfied by the data state of workflow instances (i.e., by the valuation of their variables).

Therefore, before executing a workflow specified in a WED-Makefile, the initial data state (i.e., the initial valuation of variables) must be defined in a separate configuration file written as a plain Bash script. The global variables resulting from the execution of this configuration file constitute the initial data state. In the following, we present an example of a RUBBoS workload configuration with 500 concurrent connections:

Figure d. Example of a RUBBoS workload configuration with 500 concurrent connections.

We remark that some variables are not supposed to be defined in the configuration file (e.g., variables representing the start time of servers). In this case, as in Bash, their values are initialized to empty strings.
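A hypothetical excerpt of such a configuration file is sketched below (apart from the concurrency level and the _NET_NODES variables, the variable names and values are assumptions):

    #!/bin/bash
    # Sketch of a RUBBoS configuration file (plain Bash, defining the initial
    # data state). Variable names other than the *_NET_NODES ones are illustrative.

    # Hostnames of the computing nodes provisioned for each tier.
    WEB_NET_NODES="web1"
    APP_NET_NODES="app1 app2"
    MIDDL_NET_NODES="middl1"
    DB_NET_NODES="db1 db2"

    # Workload parameters: 500 concurrent client connections.
    BENCH_NUMCLIENTS=500
    BENCH_RAMPUP_SECONDS=120
    BENCH_RUNTIME_SECONDS=600

    # Variables such as server start times are intentionally left undefined here;
    # they default to empty strings and are set by tasks at runtime.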

If the initial guard specified in the WED-Makefile is satisfied by the initial state, the workflow execution can be started. Then, once the dependencies for executing a task are satisfied, the execution of its script can be triggered. Global variables resulting from the execution of a task script are used to update the current data state. For example, at the end of the execution of the script of task WebCollectResultsApacheServer, presented in Sect. 2, the value of variable WEB_HTTPD_RESULTSDIR is updated. Finally, the workflow execution is terminated when the final guard defined in the WED-Makefile is satisfied by the current data state.
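The overall execution model can be summarized by the following simplified loop (a conceptual sketch only: the helper names are hypothetical, and the actual engine, described next, evaluates guards inside database transactions rather than in a sequential loop):

    #!/bin/bash
    # Conceptual sketch of the dependency-driven execution of a WED-Makefile.
    # guard_satisfied and run_task_script stand in for the guard evaluation and
    # script execution performed by the real engine; TASKS lists the task names.

    source configuration.sh            # defines the initial data state

    guard_satisfied initial || exit 1  # the initial guard must hold to start

    until guard_satisfied final; do
        for task in "${TASKS[@]}"; do
            if guard_satisfied "$task"; then
                run_task_script "$task"   # resulting globals update the data state
            fi
        done
    done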

3.2 Implementation

The execution of WED-Makefiles needs concurrency control capabilities because more than one task guard can be satisfied at the same time (e.g., for tasks mounting filesystems in computing nodes of different tiers). Therefore, we have used the PostgreSQL [14] database management system to implement execution engines for WED-Makefiles, leveraging its battle-tested concurrency control mechanisms to maintain the ACID properties of transactions.

In summary, the WED-Makefile is first translated into SQL commands that create a single table and multiple stored procedures, which together constitute the execution engine itself. This table’s columns and rows represent, respectively, variables and workflow instances (i.e., each row represents exactly the data state of the corresponding workflow instance), as shown in Fig. 3. These stored procedures instantiate workflows, check the satisfaction of guards, and even encapsulate task script executions (more specifically, we have used PostgreSQL 9.6 with a plugin called plsh [13] that enables stored procedures written in Bash to be run inside transactions).

Fig. 3. WED-Makefile execution engines store the data state of ongoing and past executions of the workflow in database tables.
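In very simplified terms, triggering a task corresponds to a transaction that re-checks the task guard against the stored data state, runs the task script, and writes the resulting variables back to the instance row. The sketch below illustrates the idea with hypothetical table and column names; in the real engine, these steps happen inside the generated stored procedures, which also run the task’s Bash script via plsh:

    #!/bin/bash
    # Illustrative only: guard re-check and data state update for one task of
    # workflow instance 1, executed as a single transaction.
    psql -d wedmake -c "
    BEGIN;
    -- Row-level locking ensures that two concurrent task executions cannot
    -- operate on the same workflow instance at the same time.
    UPDATE wed_state
       SET web_httpd_resultsdir = '/results/httpd'
     WHERE instance_id = 1
       AND web_httpd_stoppedat <> ''   -- Apache servers already stopped
       AND web_httpd_resultsdir = '';  -- task not executed yet
    COMMIT;
    "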

An inconvenience of using a single table, however, is the fact that most open-source database management systems use a row-level locking mechanism to guarantee serializability, thus preventing two or more concurrent transactions (task scripts, in this case) from operating on the same row (workflow instance, in this case). Fortunately, there exists a natural separation of variables in n-tier benchmarks like RUBBoS – one group of variables for each tier. Hence, to achieve maximum parallelism when executing RUBBoS, we have split that single table into multiple tables – one for each group of variables related to the same tier – to be joined by execution identifiers.

4 RUBBoS Reproduction

As aforementioned, the initial data state of instances of a workflow specified in a WED-Makefile has to be defined in a separate configuration file containing parameters (e.g., software versions, workload size and duration). Reproducing a benchmark workflow specified in a WED-Makefile should thus just be a matter of using the same parameters used in the original benchmark execution. However, hidden and implicit dependencies commonly frustrate attempts to reproduce benchmark executions.

In the case of RUBBoS, hidden software dependencies such as required libraries were satisfied through the use of pre-built operating system images, hindering our ability to increase its configuration space. We remark that finding these hidden dependencies is difficult because they only reveal themselves when the benchmark execution fails for not meeting them. Even so, we managed to implement tasks that install most of the needed software and libraries at runtime (in effect, building the equivalent of those custom operating system images on the fly), making their dependencies explicit and eventually abandoning the pre-built operating system images.
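Such an installation task might look as follows (a hedged sketch: the package names, the APP_NET_USERNAME and APP_LIBS_INSTALLED variables, and the package manager are assumptions, not the actual RUBBoS tasks):

    #!/bin/bash
    # Sketch of a runtime installation task replacing a pre-built OS image.
    for node in $APP_NET_NODES; do
        ssh "$APP_NET_USERNAME@$node" \
            "sudo apt-get update && sudo apt-get install -y openjdk-8-jdk ant"
    done

    # Recording completion in the data state makes the dependency explicit for
    # downstream tasks (e.g., the one that deploys the Tomcat application servers).
    APP_LIBS_INSTALLED="$(date +%s)"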

On the other hand, finding implicit dependencies such as hard-coded values (e.g., directory paths) and embedded Emulab [5] network naming conventions in the RUBBoS scripts was trivial with a tool like grep, so we easily declared these dependencies. Later, we also added alternative workflow execution paths to support a newer public cloud infrastructure called CloudLab [3].
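For instance, a search along the following lines (the patterns are examples only) is enough to surface hard-coded paths and testbed-specific hostnames in the benchmark scripts:

    #!/bin/bash
    # Sketch: locate implicit dependencies hard-coded in the RUBBoS scripts.
    grep -rn "emulab.net" rubbos-scripts/       # embedded network naming conventions
    grep -rnE "/mnt/|/opt/" rubbos-scripts/     # hard-coded directory paths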

5 Related Work

Other approaches have been proposed for monitoring resources and events in computer systems. We highlight Dapper [18], which uses sampling to reduce event monitoring overhead, and SysViz, a Fujitsu product with limited adoption.

Another workflow language based on WED-flow is WED-SQL [11]. In a similar way, execution engines for workflows specified with WED-SQL are generated through translation into SQL commands.

To enable the construction of benchmarks with large configuration spaces, Mulini [8] has leveraged template-based code generation techniques to render custom Bash scripts that implement and orchestrate benchmark executions.

Recently, the Popper convention [7] proposed the use of DevOps configuration tools to manage the execution of reproducible experiments. For example, Popper suggests using containers (e.g., Docker images) to meet software dependencies and configuration tools (e.g., Ansible) to orchestrate experiments.

6 Conclusions

Benchmark executions have led us to the discovery of previously unknown interesting phenomena in cloud computing environments. However, the need for low-overhead software specialization to record detailed event logs, for low-overhead, fine-grained monitoring of many different resources, and for an appropriate workflow language to handle benchmark dependencies has posed significant challenges to the construction, execution, and reproduction of complex benchmarks.

Unlike widely adopted ad-hoc approaches, the new release of our Elba toolkit presented in this paper enables the systematic construction, execution, and reproduction of the complex benchmarks needed to achieve our goal of discovering and studying new types and sources of millibottlenecks. In particular, milliScope has enabled us to collect log data from benchmark executions at unprecedentedly fine granularity. In turn, WED-Make, the major novelty of this new Elba toolkit release, not only facilitates the declaration of benchmark dependencies but also guarantees their satisfaction.