
1 Introduction

Business improvement ideas often do not lead to actual improvements [9, 10]. Contemporary Business Process Management Systems (BPMSs) enable quick deployment of new process versions, but they do not offer support for validating the improvement assumptions embedded in a new version. Support for validating such assumptions during process redesign is similarly limited.

The AB testing approach from DevOps can be adopted in Business Process Management to provide fair validation support. A new process version can be deployed alongside the old version on the same process engine, so that both versions (A and B) are operational in parallel. User requests can be routed to either version using various instance routing algorithms. Based on the performance of each version, the routing configuration can be adjusted dynamically to ultimately find the best-performing version. This general idea was introduced in AB-BPM [14]. However, the routing algorithm proposed in AB-BPM does not address scenarios where process performance is measured through multiple Process Performance Indicators (PPIs) that may become available at different times. It also does not support evaluating processes for which some of these PPIs may never become available, or processes that are affected by external factors (e.g., weather conditions).

In this paper, we address these shortcomings by revising the routing mechanism of AB-BPM. We propose a pluggable instance router architecture that allows routing algorithms to asynchronously collect and evaluate PPIs. We also propose a routing algorithm, ProcessBandit, that can be plugged into the instance router. ProcessBandit finds a good routing policy in the presence of delays and incompleteness in observing PPIs, and also when the true performance depends on contextual external conditions. We show that our approach identifies the best routing policy given the observed performance, and that it scales horizontally. We also demonstrate the overall approach using a synthetic case study.

The remainder of the paper starts with a discussion of the background, key requirements, and related work in Sect. 2. Section 3 describes the architecture of the instance router and the details of the ProcessBandit algorithm. In Sect. 4, we analyze the behaviour of the algorithm and study a use case. Section 5 discusses our approach and draws conclusions.

2 Background

2.1 Problem Description and Requirements

Business process improvement efforts are often analysed by measuring four performance dimensions: time, cost, flexibility, and quality. Improvement decisions have to reflect trade-offs between these dimensions [7, 13]. In many cases, shortcomings in one dimension may not be compensable by improvements in other dimensions. For example, low user satisfaction cannot be compensated by faster execution. In addition, the relationships between these dimensions may not be intuitive. This is illustrated by an anecdote from a leading European bank. The bank improved its loan approval process by cutting turnaround time from one week down to a few hours. However, this resulted in a steep decline in customer satisfaction: customers with a negative notice would complain that their application might have been declined unjustifiably; customers with a positive notice would inquire whether their application had been checked with due diligence.

This anecdote shows that customer preferences are difficult to anticipate before deployment and that there is a need to carefully test improvement hypotheses in practice. To date, only the AB-BPM approach [14] supports the idea of using the principles of AB testing to address these problems. As an early proposal, AB-BPM has a number of limitations regarding the utilization of PPIs.

First, multiple process performance indicators may be involved in determining the better version. For instance, both the user satisfaction score and the process execution time are plausible PPIs for a process instance. Second, not all of the required PPIs may be observable at the same time. A PPI such as the user satisfaction score is obtained at different times, with delays of varying length. The evaluation mechanism should therefore support collecting and aggregating PPIs asynchronously, i.e., at different points in time. Finally, some process instances will never produce all of the PPIs. It is likely, for example, that the users who do not respond to requests for satisfaction scores will outnumber those who do. We should therefore also be able to handle missing or incomplete PPI observations.

Another aspect to consider is the effect of contextual factors. The performance of a process can be influenced by factors such as resource constraints, the environment, and market fluctuations. One example of the influence of weather has been observed in the “teleclaims” process of an insurance company [1]. The call centers of this company normally receive around 9,000 incoming calls per week. During a storm season, however, the volume can reach up to 20,000 calls per week. To manage this influx, managers manually escalate cases in order to maintain quality and meet deadlines. Identifying and acting on such contextual factors is crucial for finding the best process version.

From the above analysis, we derive four key requirements and propose approaches outlined in Table 1. To address these requirements, we have implemented a two-pronged solution: a pluggable architecture for instance routing, and a routing algorithm that asynchronously learns about process performance.

Table 1. Requirements of the AB testing system and our approach

2.2 Related Work

AB testing is a commonly used approach for performing randomized sequential experiments, and it is widely used to test micro-changes in web applications [6, 10, 11]. When applying AB testing to business process versions, randomized experiments can inadvertently introduce risks such as loss of revenue. The risks in this context are higher than for standard web applications, where the changes and their effects are small (e.g., the placement of buttons). Therefore, user requests should be distributed according to the performance of the process versions.

We introduced the idea of AB testing for business process versions in AB-BPM [14], where we modeled this routing challenge as a contextual multi-armed bandit problem [2, 4, 12]. We proposed LtAvgR, which is based on LinUCB [5, 12] – a well-known contextual multi-armed bandit algorithm. LtAvgR dynamically adjusts how user requests are routed to each version by observing numerical rewards derived from process performance. LtAvgR defines an experimentation phase where observing rewards is emphasized over optimal routing, and a post-experimentation phase where the best routing policy is selected based on the observed rewards. LtAvgR updates its learning by averaging historical rewards, which enables it to support long-running processes. However, LtAvgR can only handle scenarios where all PPIs are available at the same time. In this paper, we propose ProcessBandit, an algorithm that addresses this limitation.

Multi-armed bandit algorithms have been adopted for various kinds of experiment designs [4]. However, the effect of feedback delay and the impact of partial rewards have not been well studied. Furthermore, the effect of sparseness of some rewards, such as those derived from user satisfaction scores, has not been considered for multi-armed bandits. Temporal-difference (TD) learning can be used to converge towards the best routing configuration in the presence of delayed rewards [17, Chap. 6]. Silver et al. [16] propose an asynchronous concurrent TD learning approach for maximizing metrics such as customer satisfaction by learning from partial customer interactions. This approach is suited to sequential scenarios, like marketing campaigns, where interactions can affect the customer state. We propose a simpler multi-armed bandit algorithm that handles asynchronous learning with partial rewards, adapted to scenarios where only one interaction (process instantiation) needs to be observed.

Approaches for prediction based on imbalanced data include techniques such as oversampling and undersampling [8, Chap. 2]. Such techniques make assumptions about what balanced data should look like and do not introduce new knowledge. In our scenario, imbalance occurs when only a subset of all PPIs is observed. Since AB testing aims to remove implicit assumptions, we avoid sampling techniques. Instead, we ensure that the routing algorithm learns mostly from observations that have all PPIs.

3 Solution

Our solution consists of two parts. First, we propose a pluggable and scalable architecture that enables a routing algorithm to learn the best routing policy even when PPIs are missing, delayed, or incomplete, and when performance is affected by contextual factors. Second, we propose a routing algorithm named ProcessBandit that learns routing policies by utilizing this architecture.

3.1 The Instance Router

The instance router is a modular system composed of an Asynchronous Task Queue, the Controller, the Routing Algorithm, the Context Module, the Tasks Module, and the Rewards Module. Figure 1 shows the architecture of the system.

Fig. 1. The architecture of the instance router

[Algorithm listing (figure a)]

The instance router assigns an incoming request to an instance of one of the deployed process versions, A or B. Upon receiving a request, the Controller invokes the Context Module to extract contextual information from the request and construct a feature vector. If required by the Routing Algorithm, the Context Module also captures and stores the hypothesized contextual factors associated with each request. These hypothesized contextual factors are specified at the start of the AB test. If a contextual factor is confirmed through analysis of the stored values, the Context Module integrates it into the feature vector. Using this feature vector, the Controller invokes the Routing Algorithm, which instantiates a process and returns an identifier. The Controller uses this process instance identifier to schedule an update task on the Asynchronous Task Queue.
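
To make this control flow concrete, a minimal sketch of the Controller's request handling is given below (Python, with hypothetical module and method names; the paper does not prescribe these interfaces):

def handle_request(request, context_module, routing_algorithm, task_queue):
    # Build the contextual feature vector from the incoming request.
    features = context_module.feature_vector(request)
    # If asked for, store hypothesized contextual factors for later correlation analysis.
    if routing_algorithm.detects_contextual_factors:
        context_module.record_hypothesized_factors(request)
    # The routing algorithm picks a version and instantiates the process on the BPMS.
    instance_id = routing_algorithm.route(features)
    # Schedule an asynchronous update task that later collects PPIs and derives a reward.
    task_queue.schedule("update_reward", instance_id=instance_id)
    return instance_id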

An update task asynchronously polls the BPMS for the PPIs of the instantiated process and calculates a numerical reward from them. When only a subset of the desired PPIs is available, the Tasks Module re-schedules the update task so that the missing PPIs can be collected and evaluated at a later point in time. In such cases, a temporary reward is calculated using the available PPIs. Reward calculation is delegated to the Rewards Module. The Routing Algorithm learns a routing policy from these numerical rewards.
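
A corresponding update task could look roughly as follows (again a sketch with assumed helper names; the 1 s re-scheduling delay is the one used in our prototype, cf. Sect. 4):

RETRY_DELAY_SECONDS = 1  # re-scheduled tasks are retried after 1 s

def update_reward(bpms, rewards_module, routing_algorithm, task_queue, instance_id):
    ppis = bpms.get_ppis(instance_id)        # poll the BPMS for the PPIs observed so far
    complete = ppis.has_all()                # have all desired PPIs been observed?
    reward = rewards_module.reward(ppis)     # temporary reward if PPIs are incomplete
    routing_algorithm.update(instance_id, reward, complete=complete)
    if not complete:
        # Re-schedule so the missing PPIs can be collected and evaluated later.
        task_queue.schedule_in(RETRY_DELAY_SECONDS, "update_reward", instance_id)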

3.2 ProcessBandit Algorithm

We propose ProcessBandit, a routing algorithm that can be plugged into this architecture. The algorithm distributes requests to process versions, asynchronously observes the PPIs associated with each request, and learns the best routing policy given the observed process performance.

The pseudo code for sampling a process version (or “arm”, in multi-armed bandit terminology) to test its performance is shown in Algorithm 1. The algorithm maintains averages of complete, incomplete, and overall rewards for each d-dimensional context in the corresponding matrices, denoted b. These values are updated asynchronously according to the performance of each process instance.

The algorithm consists of an experimentation (P1) and a post-experimentation (P2) phase. When contextual factor detection is enabled, the experimentation phase is further divided into a pre-contextual factor (P1A) and a post-contextual factor (P1B) phase. The phases are configured using an exponential decay function \(exp(\lambda )\), an experimentation request threshold M, and a pre-contextual factor reward threshold. The request count is incremented on process instantiation; the reward count is incremented when a reward calculated from all PPIs is received.

The algorithm uses an approach similar to LinUCB [12] to select a candidate arm \(a_\text {linucb}\) such that the expected reward is maximized. In the experimentation phase, it chooses either \(a_\text {linucb}\) or the alternate arm, based on a probability sampled from the exponential decay function. Asynchronous reward updates are scheduled for all decisions made in the experimentation phase.
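
A minimal sketch of this selection step is shown below; we assume a decay of the form \(e^{-n/\lambda }\) over the request count n, which is one possible instantiation of the exponential decay function:

import math
import random

def select_arm(a_linucb, arms, request_count, lam, in_experimentation):
    """Pick a_linucb, or explore the alternate arm with a decaying probability."""
    if not in_experimentation:
        return a_linucb                            # post-experimentation: exploit only
    explore_prob = math.exp(-request_count / lam)  # assumed shape of the decay function
    if random.random() < explore_prob:
        alternatives = [a for a in arms if a != a_linucb]
        return random.choice(alternatives)         # explore the alternate version
    return a_linucb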

Asynchronous Reward Update. We define the ideal PPI vector \(p_{\text {ideal}}\) as the vector that represents the best possible value for each PPI. We also introduce a reference vector \(p_{\text {ref}}\), whose values can be used as substitutes for missing PPIs. In AB testing scenarios, historical data of one process version is available; this can inform the choice of \(p_{\text {ref}}\). Finally, we define the effective vector \(p_{\text {eff}}\) as the vector that contains all PPI values used to evaluate a reward. If not all PPIs are available at the time a completed process instance is observed, the effective vector \(p_{\text {eff}}\) is constructed using the components of the reference vector \(p_{\text {ref}}\) in place of the unavailable PPIs. If and when these PPIs become available, an update is applied by removing the effect of the previous \(p_{\text {eff}}\) and then applying the new effective vector. This addresses requirements R1 and R2.
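
The following sketch illustrates the construction of \(p_{\text {eff}}\) and the later correction, with None marking an unobserved PPI and the per-context reward statistics simplified to running sums (the names are ours, not the paper's):

import numpy as np

def effective_vector(observed, p_ref):
    """Build p_eff: substitute reference values for PPIs that are not (yet) observed."""
    return np.array([o if o is not None else r for o, r in zip(observed, p_ref)])

def revise_reward(reward_sums, context, arm, old_reward, new_reward):
    """Remove the effect of the previously applied (temporary) reward and apply the
    reward recomputed from the newly arrived PPIs. reward_sums is a simplification
    of the per-context reward statistics (b) kept by the algorithm."""
    reward_sums[context][arm] += new_reward - old_reward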

Using \(p_{\text {eff}}\), rewards can be calculated through a point-based or a classification-based approach, as illustrated in Fig. 2. In the point-based method, the Rewards Module constructs \(p_{\text {eff}}\), applies weights to the PPIs (if any), and then normalizes all components of \(p_{\text {eff}}\) and \(p_{\text {ideal}}\). After normalization, it calculates the effective reward from the Euclidean distance between \(p_{\text {eff}}\) and \(p_{\text {ideal}}\). The objective of the algorithm is thus to choose versions that produce shorter distances between the effective vector and the ideal vector.
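
A sketch of the point-based reward is given below; min-max normalization is assumed as one possible normalization scheme, and the distance is negated so that larger rewards are better, in line with the negative reward scale used later:

import numpy as np

def point_based_reward(p_eff, p_ideal, p_min, p_max, weights=None):
    """Negative Euclidean distance to the ideal vector after normalization (sketch)."""
    p_min = np.asarray(p_min, dtype=float)
    p_max = np.asarray(p_max, dtype=float)
    scale = p_max - p_min
    eff = (np.asarray(p_eff, dtype=float) - p_min) / scale
    ideal = (np.asarray(p_ideal, dtype=float) - p_min) / scale
    if weights is not None:
        eff, ideal = eff * np.asarray(weights), ideal * np.asarray(weights)
    return -float(np.linalg.norm(eff - ideal))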

The point-based approach is intuitive and easy to implement. However, it makes the implicit assumption that a decrease in one PPI can be compensated by an increase in another [3, Chap. 2]. In many real-world scenarios this is not the case: while an increase in costs may be compensated by better processing times, lower user satisfaction may not be compensated by any other metric. In addition, small, insignificant differences in distance can accumulate and affect routing. In such scenarios, a better approach is to classify performance into categories aligned with business goals.

[Algorithm listing (figure b)]

In the classification-based approach, domain experts therefore design reward classes and assign a weight to each class; the weights represent the relative importance of the classes. \(p_{\text {eff}}\) is constructed as above, and the reward is the weight of the class it falls into. As depicted in Fig. 2, the most important class \(C_1\) has the highest weight \(w_1\), \(C_2\) has a lower weight \(w_2\), and so on. The objective of the algorithm is to choose versions that produce the highest average weight.
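
The classification-based reward can be sketched as follows; the class boundaries and weights below are illustrative placeholders for domain-expert input, not the ones used in our experiments:

def classification_reward(p_eff, classes):
    """classes: list of (predicate, weight) pairs ordered from most to least
    important; the reward is the weight of the first class p_eff falls into."""
    for predicate, weight in classes:
        if predicate(p_eff):
            return weight
    raise ValueError("p_eff does not fall into any reward class")

# Illustrative example with two PPIs (satisfaction, margin) and weights -1, -2, -3:
classes = [
    (lambda p: p[0] >= 4 and p[1] >= 0.15, -1),  # high satisfaction, high margin
    (lambda p: p[0] >= 3, -2),                   # acceptable satisfaction
    (lambda p: True, -3),                        # everything else
]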

Fig. 2. Reward design approaches: rewards can be categorical or distance-based.

The algorithm specification is independent of how reward values are derived from PPIs. Without loss of generality, we typically choose rewards on a negative scale (e.g., \(w_1 \mapsto -1, \dots , w_5 \mapsto -5\)). The learning rate and convergence do, however, depend on the quantity of rewards. In particular, the algorithm can be misled by a large quantity of rewards derived from partially observed metrics: if only a small percentage of process instances provide information about all PPIs, the effect of the rewards derived from these instances can be diluted by rewards derived from incomplete PPI observations of other instances. To ensure that such dilution does not occur, the algorithm keeps track of the ratio of complete and incomplete rewards for each version in each context, and compares it against a user-defined threshold \(\tau \). To accommodate \(\tau \), Algorithm 2 starts in bootstrap mode for the first few requests, during which rewards are collected regardless of \(\tau \). The algorithm accepts a partial reward either during the bootstrap period, while the number of complete rewards is below a certain threshold, or when the current reward ratio is less than or equal to \(\tau \). This usage of reward ratios addresses requirement R3.
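
The acceptance rule for partial rewards can be sketched as follows (a simplification of Algorithm 2; we read the reward ratio as incomplete over complete rewards, and bootstrap_threshold is an assumed parameter name):

def accept_partial_reward(n_complete, n_incomplete, tau, bootstrap_threshold):
    """Decide whether an incomplete (partial) reward may update the learner."""
    if n_complete < bootstrap_threshold:
        return True                              # bootstrap: accept regardless of the ratio
    return (n_incomplete / n_complete) <= tau    # otherwise cap the share of partial rewards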

Contextual Factor Detection and Context Integration. Algorithm 3 shows our solution to requirement R4 – the context integration mechanism. The algorithm starts in the pre-contextual factor phase (P1A). In this phase, contextual feature vectors are constructed using the information available with the user requests (e.g., age group). Hypothesized contextual factors are observed and stored by the Controller for future analysis. When the pre-contextual factor reward threshold is reached, the correlation between the hypothesized contextual factors and process performance is analysed. If the correlation is above a pre-determined threshold, the algorithm state is reset to accommodate the new contextual information. This marks the beginning of the post-contextual factor experimentation phase (P1B). From this point onward, contextual feature vectors are constructed using both the information from user requests and the observed values of the contextual factors. Finally, when the experimentation request count is reached, the algorithm switches to the post-experimentation phase (P2). In this phase, the algorithm stops learning from new requests; however, to account for long delays between process instantiation and reward observation, it continues learning from the requests made in phase P1B.
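
The context integration step can be sketched as follows, using Pearson correlation as one possible correlation measure (the method names are hypothetical):

import numpy as np

def integrate_contextual_factor(factor_values, rewards, threshold, algorithm):
    """After the pre-contextual-factor reward threshold is reached, test whether the
    hypothesized factor correlates with performance; if so, reset the algorithm state
    and extend the feature vector (transition from phase P1A to P1B)."""
    corr = np.corrcoef(factor_values, rewards)[0, 1]
    if abs(corr) >= threshold:
        algorithm.reset_state()
        algorithm.add_context_feature("contextual_factor")
        return True
    return False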

[Algorithm listing (figure c)]

In summary, we address requirements R1 and R2 through asynchronous partial reward updates using effective and ideal vectors. Requirement R3 is addressed by maintaining a user-defined reward ratio \(\tau \) between complete and incomplete rewards, and handling updates accordingly. Finally, R4 is addressed by identifying contextual factors and integrating them into the contextual feature vectors.

4 Evaluation

In this section, we analyse the behaviour of our approach, and specifically of the ProcessBandit algorithm, in the presence of contextual factors and in scenarios where an important PPI is available only for a small number of requests. We also evaluate the response times of the algorithm under various infrastructure settings. Finally, we demonstrate the approach using an example process.

The instance router is prototyped in Python and served by the Nginx HTTP serverFootnote 1 and the uWSGI application serverFootnote 2. We use RedisFootnote 3 as the asynchronous task queue and data store. Two worker processes operate on the asynchronous queue. Tasks that require rescheduling are re-scheduled after 1 s.

4.1 Convergence Characteristics

In the following experiments, we study how ProcessBandit routes requests to process versions and whether the AB tests converge to the best routing policy given the rewards. We consider two baselines: random-udr, a naïve randomized routing algorithm with uniform request distribution, and LtAvgR [14].

Table 2. PPI configuration.

Our experiment setup consists of a simulated BPMS that returns two PPIs, user satisfaction and profit margin, for two process versions, A and B. The versions perform differently depending on the context, X or Y, and on a contextual factor f. Table 2 summarizes the PPIs returned by each version under the various conditions. These PPIs are mapped to the reward design models shown in Fig. 2. \(p_{\text {ideal}}\) represents a user satisfaction score of 5 and a profit margin of 20%; \(p_{\text {ref}}\) represents a user satisfaction score of 5 and a profit margin of 10%. We chose an optimistic reference point based on the assumption that users who do not provide satisfaction scores are happy and that the profit margin is good. Rewards are derived using the classification model in Fig. 2 with the weight mapping \(\{ C_1 \mapsto w_1, C_2 \mapsto w_2, \ldots , C_5 \mapsto w_5\}\) such that \({w_1 = -1}, {w_2 = -2}, \ldots , {w_5 = -5}\).

We define the following key terms that we use in the experiments below:

\(t_\text {ppi1}\): the time between request invocation and observation of the first PPI,
\(t_\text {obs}\): the time between request invocation and observation of the full reward,
d: the delay between the first and the second PPI, such that \(t_\text {obs} = t_\text {ppi1} + d\),
\(\rho \): the ratio between the average request inter-arrival rate and \(t_\text {obs}\).

Convergence is shown by evaluating regret over time. Regret is defined as the difference between the sum of rewards associated with the optimal solution and the sum of rewards collected by pulling the chosen arms [19]. The objective of our algorithm is to reach a configuration where the average regret of future actions tends to zero. Graphically, this is the case when the cumulative regret curve becomes parallel to the x-axis. Given the initial uncertainty about the performance of the versions, the algorithm needs to start with experimentation and hence, by necessity, accumulates some regret at first.
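
For reference, the cumulative regret curves reported below are computed along these lines, where the optimal reward of a request is the reward of the best version for that request's context:

def cumulative_regret(optimal_rewards, collected_rewards):
    """Running sum of the per-request gap between the optimal and the collected reward."""
    total, curve = 0.0, []
    for best, got in zip(optimal_rewards, collected_rewards):
        total += best - got
        curve.append(total)
    return curve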

Overall behavior. Figure 3 shows the cumulative regret of the algorithms using various probabilistic decay functions, with \(M=500\), \(d=0.2 \cdot t_{\text {ppi1}}\), and \(f=1\) for all process instances. In this experiment, we emulate business processes by sampling the completion time of each version from the process execution data of one of the processes of the BPIC 2015 Challenge [18]. The best routing policy can be found only if the contextual factor f, the context information, and both PPIs are available. To make the algorithms comparable, the regret of LtAvgR and random-udr is calculated using the weights of the reward classes in the same manner as for ProcessBandit. We observe that ProcessBandit correctly distributes requests to the better-performing versions and converges to the best routing policy. LtAvgR, in contrast, consistently makes the wrong decision because it never sees the actual value of the second PPI and is incapable of updating past rewards.

Fig. 3. Cumulative regret of the various algorithms with \(d=0.2 \cdot t_{\text {ppi1}}\).

Fig. 4. Convergence at various reward delays.

Figure 4 shows the cumulative regret of ProcessBandit for various delays between the observation times of the first and second PPI. In this experiment, we use deterministic completion times for each process instance so that the PPI delay is the same for all observations. We observe that the regret curves have similar characteristics and that the algorithm converges to the best routing policy in all cases. There are some small differences in cumulative regret across the scenarios; however, these differences do not support a conclusive relationship between the delays and the overall regret. The magnitude of the cumulative regret can be affected by the non-determinism inherent in the algorithm's experimentation phase and by the order in which the PPIs were observed.

Partial rewards and failures. In this section, we evaluate the convergence characteristics of ProcessBandit when only one PPI can be observed for some instances. We use an experiment setup with \(f = 1\), \(\rho =160\), \(d=0.2 \cdot t_{\text {ppi1}}\), and a constant execution time for all process instances. Each process instance returns the first PPI (profit margin) immediately after execution. The other PPI (user satisfaction) is either never returned or returned after a delay – as can be expected when, e.g., users are asked to participate in a short survey.

Fig. 5. Convergence with various reward ratio and experimentation phase parameters.

We define p as the percentage of process instances that return both PPIs. We compare the regret characteristics of ProcessBandit with random-udr, because random-udr is agnostic to p. Figure 5 shows the convergence characteristics for the parameter values \(\lambda =100\), \(M=500\), and \(\tau =0\) or \(\tau =1\), respectively. It depicts the behavior for values of p that highlight when convergence happens (e.g., 30% for \(\tau =1\)) and when it does not (20% in the same configuration). We observe that by increasing \(\tau \) from 0 to 1, the algorithm can converge for smaller values of p.

The resilience to partial rewards depends on the values of \(\lambda \), \(\tau \), and M. There must be enough complete observations in every context to reach a state where the current reward ratio is equal to or below \(\tau \). In some cases where the best routing policy is found, the algorithm can temporarily perform worse than random-udr. When \(\lambda \) is increased to 500 (not shown in the figures), convergence is achieved with \(p=20\%\). This improves further to \(p=10\%\) when \(\tau \) is increased to 1.5, and finally to \(p=2.5\%\) when M is increased to 750.

4.2 Response Time

We measure end-to-end response time and throughput using two servers, one for the instance router and one for the BPMS. We define our SLA in terms of response time and correctness: the system adheres to the SLA if it serves 100% of requests in under 300 ms. We host these components on Amazon EC2 M4 large instances with 2 vCPUs and 8 GB RAM. We use the reward setup described in Table 2. Figure 6 shows the response times of the tested configurations, each named with the convention Phase-Configuration.

Fig. 6. Response times of various stages and configurations of ProcessBandit.

Fig. 7. Response time in phase P1A; i and q are the numbers of servers for the instance router and the task queue, respectively.

Response times are shown as the average over all requests during a 5-minute burst of the corresponding workload. We observe that throughput is lower and responses are slower when contextual factor detection is enabled. This is caused by the instance router making an additional request to the BPMS to observe the value of f. Performance can be improved by adjusting the sampling rate of f; for example, factors like weather conditions could be sampled once per minute instead.

We observe that CPU utilization is generally high (above 90%) and proportional to the workload, while memory utilization is low (5–7%). The random-udr algorithm serves up to 400 requests per second under our SLA criterion. On the same infrastructure, ProcessBandit achieves a throughput between 10% and 25% of that of random-udr, depending on the configuration.

Based on this observation, we conduct a second experiment to test the horizontal scalability of ProcessBandit in its slowest configuration. This configuration, deployed on a single machine, serves as the baseline. We then deploy the instance router and the asynchronous task queue on separate servers and horizontally scale the number of instance router servers. We evaluate response time and throughput for three deployment configurations; the results are shown in Fig. 7. Because the instance router is the bottleneck, increasing the number of instance router servers increases the throughput.

Table 3. User satisfaction model

Fig. 8. Performance classification

4.3 AB Test with Synthetic Process

We demonstrate our approach using process versions from the domain of helicopter pilot licensing, as introduced in [14]. The process consists of six activities: Schedule, Eligibility Test, Medical Test, Theory Test, Practical Test, and License Processing. We add two contextual factors associated with an applicant: age group and applicant type (new or returning). The probability of success in the Medical Test activity is set higher for younger age groups; for the other activities, success probabilities are the same across age groups.

Table 4. Performance of versions A and B

Activities in Version A of the process are ordered sequentially such that a scheduling activity occurs before each test activity. In Version B, a single scheduling activity is performed at the start, which determines the schedules of all tests and thus reduces the costs of multiple scheduling activities. We use the activity costs and durations outlined in [14]. Using these process versions, we design an experiment where process performance is determined by two PPIs: satisfaction score and cost. Rewards are derived from the four categories shown in Fig. 8. Satisfaction scores are derived from the outcome and the duration of the process: the satisfaction score is high if the license is approved and processing is fast, and low otherwise. Satisfaction scores also depend on whether the applicant is new or returning – we assume that returning applicants are harsher on the older version, as shown in Table 3. While the age group is treated as known context, the applicant type is treated as a hypothesized contextual factor.

Fig. 9. Probability of receiving a satisfaction score over time

To simulate a scenario where the satisfaction score is not always available, we assume that satisfaction scores are collected within 60 days after process completion. Applicants are notified four times after process completion: on the 7th, 14th, 21st, and 42nd day. The cumulative probability of response is assumed to jump after each notification, a behavior similar to the response rates of a web-based career survey [15]. The response probabilities are shown in Fig. 9.
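
In the simulation, whether and when a satisfaction score arrives can be drawn from such a step-wise cumulative response curve, e.g. as sketched below (the probability values are illustrative placeholders, not the exact values behind Fig. 9):

import random

# Cumulative probability of having received a response by a given day
# (illustrative values; jumps follow the notifications on days 7, 14, 21, and 42).
RESPONSE_CURVE = [(7, 0.15), (14, 0.25), (21, 0.32), (42, 0.38), (60, 0.40)]

def sample_satisfaction_delay():
    """Return the day a satisfaction score arrives, or None if it never does."""
    u = random.random()
    for day, cum_prob in RESPONSE_CURVE:
        if u < cum_prob:
            return day
    return None  # applicant never responds within the 60-day window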

With this setup, the algorithm needs to account for two PPIs (R1), the delay in receiving satisfaction scores (R2), the limited availability of satisfaction scores (R3), and the effect of the applicant type on the PPIs (R4). The results of performing AB tests on this setup are shown in Table 4. We observe that, in all cases, more requests are sent to the version that performs better on average (shown in bold).

5 Discussion and Conclusion

Summary. We introduce ProcessBandit, a dynamic process instance routing algorithm that learns a routing policy based on process performance. The algorithm is supported by a modular architecture. ProcessBandit meets all of our requirements and, while not very fast, can be scaled horizontally. It makes sound decisions in scenarios where performance is determined by delayed PPIs which may be fully observed only for some process instances. It also identifies contextual factors at runtime and uses these factors to make routing decisions.

Discussion. ProcessBandit assumes that overall performance can be summarized using the mean: rewards are averaged per context and version, which is how the algorithm learns its routing policy. As shown in Table 4, the rewards are not always normally distributed. If other statistical properties are important in decision-making, the algorithm's mechanism for estimating rewards should be changed. Average performance can also be ineffective in scenarios where deviations from the norm are rare but very important. For example, a tiny number of instances may take an exceptionally long time; the rewards for such cases are received late and have a negligible effect on the mean. Such cases can be handled by an upper bound on the duration: if the process does not complete within an acceptable time, a strong negative reward can be assigned.
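
Such an upper bound could be realized along the following lines (a sketch; the parameter names and the choice of the worst class weight are ours):

def timeout_reward(elapsed_days, max_acceptable_days, worst_weight=-5):
    """Assign the strongest negative reward if an instance exceeds the acceptable
    duration; otherwise defer to the regular PPI-based reward calculation."""
    if elapsed_days > max_acceptable_days:
        return worst_weight
    return None  # no override; compute the reward from PPIs as usual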

There is a threat to external validity from using synthetic datasets. In previous work [14], we demonstrated that AB-BPM can work on real-world datasets by deriving a single PPI from the available data. Despite our best efforts, we could not find or produce a real-world dataset that combines all features (multiple PPIs, context, etc.) needed to evaluate ProcessBandit. We aimed to minimize this risk by producing synthetic datasets based, where possible, on parameters taken from the literature [15], from industry (described in [14]), and from a BPI Challenge [18].

Our contextual factor detection approach is based on correlation. In our experiments, we set the correlation thresholds low so that the context sensitivity could be evaluated. The detection of contextual factors does not, however, need to be based on correlation: our approach is not tied to how these factors are identified. Contextual factor detection is nevertheless a challenge in itself that needs further investigation.

Conclusion and Future Work. Unlike prior work on AB testing, our solution provides a risk-managed approach tailored to the requirements of business processes. We demonstrate that this solution meets these requirements by evaluating the behaviour of the routing algorithm, the horizontal scalability of the approach, and its effectiveness on a synthetic business process. Our future plans include extending the approach to accommodate other statistical properties in reward evaluation, as well as an upper bound on duration. We also plan to collaborate with domain experts to conduct field tests in industry.