Abstract
To meet the need for continuous business process improvement, the application of DevOps principles has been proposed. In particular, the AB-BPM methodology applies AB testing—a DevOps practice—and reinforcement learning to increase the speed and quality of business process improvement efforts. In this paper, we provide an industry perspective on this approach, assessing prerequisites, suitability, requirements, risks, and additional aspects of the AB-BPM methodology and supporting tools. Our qualitative study follows the grounded theory research methodology, including 16 semi-structured interviews with BPM practitioners. The main findings indicate: (1) a need for expert control during reinforcement learning-driven experiments in production, (2) the importance of involving the participants and aligning the method culturally with the respective setting, (3) the necessity of an integrated process execution environment, and (4) the long-term potential of the methodology for effective and efficient validation of algorithmically (co-)created business process variants, and their continuous management.
1 Introduction
Business processes are crucial for creating value and delivering products and services. Improving these processes is essential for gaining a competitive edge and enhancing value delivery, as well as increasing efficiency and customer satisfaction. This makes business process improvement (BPI) a key aspect of business process management (BPM), which is described as “the art and science of overseeing how work is performed in an organization to ensure consistent outcomes and to take advantage of improvement opportunities” [1].
DevOps, an integration of development and operations, is “a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality” [2]; it is widely applied in the software industry. A new line of research in the field of BPM proposes using DevOps principles, like AB testing, to facilitate continuous BPI with a method called AB-BPM [3]. AB testing assesses different software feature versions in the production environment with real users, usually in end user-facing parts of the software. The current version of the feature is only retired if the test data supports the improvement hypothesis. Applying such rapid validation of improvements to processes is a departure from the traditional BPM life cycle, where the possibility of not meeting improvement hypotheses in production is rarely considered so thoroughly, leading to expensive do-overs [3, 4]. Going beyond traditional AB testing, AB-BPM proposes the application of reinforcement learning (RL) to utilize performance measurements while the experiments are still running, dynamically balancing between routing incoming process instances to the currently better-performing version and exploring versions whose performance characteristics are not yet well known.
The AB-BPM method has not yet been systematically analyzed regarding the needs of BPM practitioners. Such an evaluation of the methodology is of interest for multiple reasons. First, AB testing has thus far mostly been applied to user interface and algorithm choice questions, but not to business process-centric design problems [5]. This shift in application area potentially presents a whole new set of opportunities and challenges for the BPM community to consider. BPM practitioners’ insights into the AB-BPM methodology can be of value to the wider BPM community, since they can uncover hurdles and possibilities on the path to more automation in the field of process (re-)design. Second, the applicability of and confidence in the specific AB-BPM methodology itself need to be increased to enable fruitful further development, since it presents a paradigm leap. This is particularly important given the relatively low success rate of software projects: One-fifth of all software projects fail, while a large percentage of the rest conclude without reaching their primary goals [6]. Poorly captured and maintained requirements have been identified as a significant contributing factor to the failure of software projects [7], which is why the viewpoints of BPM practitioners are important for the continued success of this and adjacent lines of research and development.
Therefore, this paper presents a qualitative study, with the overarching research question being: How do BPM practitioners assess the value of AB-BPM, and which implications does this have for the further development of the methodology and supporting tools? We used a range of subquestions to answer this question, covering negative prior BPI experiences, prerequisites, feasibility, risks, desired tool features, suitability, and long-term vision.
In order to answer these questions, BPM experts from a large enterprise software company took part in semi-structured interviews. Data collection and analysis were guided by the grounded theory (GT) [8] research methodology. The main contributions of this paper are (Footnote 1):
1. An assessment of the AB-BPM methodology regarding feasibility, prerequisites, and suitability in light of various BPM context factors;
2. An overview of risks to be mitigated and feature requirements for an AB-BPM tool to be usable in practice;
3. An understanding of long-term perspectives and practitioners’ visions for the methodology;
4. A set of implications and themes that emerge from observations of patterns throughout the aforementioned categories.
In the following, we first introduce the relevant background in Sect. 2. Then, we present a detailed explanation of the applied research methodology in Sect. 3. We subsequently present our findings in Sect. 4. The results are interpreted, and implications are discussed in Sect. 5. Finally, we present our main conclusions in Sect. 6.
2 Background
This section first gives an overview of BPM. AB-BPM is then explained in more detail. Subsequently, relevant concepts like AB testing, RL, and multi-armed bandits are presented.
2.1 BPM
Organizations of all industries and sizes perform various combinations of activities to achieve desired results, be it the production of physical goods or the provision of services. These combinations of activities are called business processes [10]. They are often standardized and documented, meaning that each time an organization tries to achieve a particular result, it uses a similar combination of activities. Such a standardized business process can be modeled with the Business Process Model and Notation (BPMN) [11], which is arguably the most popular standard for modeling business processes in industry and academia [12]. The BPMN standard enables the execution of BPMN models, meaning that the standard includes non-visual properties that allow for the creation of user interfaces, decision logic, and connections to external programs/processes and web services [1]. A business process management system (BPMS) is an information system that uses an explicit description of an executable process model in the form of a BPMN model to execute business processes and manage relevant resources and data. It presents a centralized, model-driven way of business process execution and intelligence [1]. Since business processes are central to organizational value creation, describing, analyzing, and optimizing them have been of considerable interest to businesses, and the last decades have resulted in many scientific advances in this field of research [13].
BPI is a central part of BPM [1]. The traditional BPM life cycle (see Fig. 1) is generally sequential [1]. When a company tries to improve a certain process, it must first discover the current as-is process. This could mean, for example, that data from IT systems is analyzed or that workshops are conducted with employees. Afterward, the process is analyzed for weaknesses and improvement opportunities. Based on these insights, the process is redesigned. To aid process designers, there is a wealth of theoretical consideration and a variety of redesign heuristics they can make use of [14]. The possible redesigns are then analyzed, and the most promising version is chosen to replace the currently running process. Then, after implementing the new process version and retiring the old version, the process is monitored again for problems and weaknesses. In case any problems arise, the cycle is repeated.
Traditional BPM life cycle, adapted from [1]
A process participant is a company-internal actor performing tasks of a process [1], i.e., an employee of the organization executing the process. (In this work, “participant” refers to a study interviewee; only when using “process participant” do we mean employees.)
2.2 AB-BPM
The sequential improvement approach of the traditional BPM life cycle hardly considers the possibility of improvement hypotheses being wrong. However, there is evidence that this could occur relatively often. Research on BPI has shown that \(75\%\) of BPI ideas did not lead to an improvement: Half of them had no effect, and a quarter even had detrimental outcomes [4]. This effect can also be seen in the context of web application management. In a study conducted at Microsoft, only a third of website improvement ideas actually had a positive impact [15]. This is especially problematic due to the high cost that the traditional BPM life cycle incurs:
1. The analysis before the implementation is often lengthy to diminish the risk of implementing bad versions. However, this analysis is often in vain, as evidenced by the research results mentioned above.
2. The correction of the problems means repeating the whole life cycle, incurring even more costs without a clear improvement hypothesis.
Furthermore, comparing process performance before and after the implementation is problematic in and of itself because changing environmental factors may be the primary driver of changes in process performance (or lack thereof).
To mitigate these problems, Satyal et al. propose using AB testing when transitioning from the analysis to the implementation phase [3]. This would mean that the redesigned version is deployed in parallel with the old process version in the production environment, allowing for a fair comparison. Since AB testing is not traditionally used in such a high-risk and long-running setting as BPM, the authors extend the usual testing methodology with RL (specifically, a multi-armed bandit algorithm). Using RL methods should allow the obtained learnings to be applied faster and minimize the risk of exposing customers to suboptimal process versions for too long. Altogether, AB-BPM could enable a shorter theoretical analysis of the redesign, in line with the DevOps mantra “fail fast, fail cheap.” Figure 2 presents the improved AB-BPM life cycle.
AB-BPM life cycle, adapted from [3]
In addition to the RL-supported AB testing of improvement hypotheses, the complete AB-BPM methodology proposes further test and analysis techniques, such as simulation and shadow testing. This inquiry focuses on the RL-supported AB testing of process variants, because it is at the methodology’s core, with the other steps supporting the design of the AB tests. Furthermore, business process simulation has already been subject to relatively extensive research scrutiny in the BPM field [16]. References to AB-BPM in this work solely refer to the RL-supported AB testing of business process variants. A graphical overview of the broader AB-BPM methodology, with the focus of our investigation highlighted, is shown in Fig. 3.
Overview of the AB-BPM methodology, from [3]. Our investigation focuses on the activities highlighted by the blue rectangle
Recent work presents a prototype of AB-BPM that extends the method with possibilities for more human control. The prototype is referred to as human-in-the-loop AB-BPM (HITL-AB-BPM) [17]. The tool and extension of the method aim to mitigate the risk of a hands-off approach by introducing features for human interference. However, the HITL-AB-BPM extension was not part of this study and not presented to the participants. A discussion of HITL-AB-BPM in the context of this work can be found in Sect. 5.
2.3 DevOps
The term DevOps implies a coupling of development and operations. While bringing the functions of software development and operations organizationally together is a core idea of DevOps, it includes a whole range of other concepts. In fact, DevOps is a principle from software engineering and operations that includes various methodologies employed to reduce the time from developers making changes to a system to those changes running in production, while increasing quality [2]. DevOps is of interest to the BPM community since many of the problems that DevOps aims to solve in the field of software map to problems that the BPM community faces with BPI, for example, poor communication strategy, lack of methodology, little monitoring and evaluation of outcomes, little consultation with stakeholders, poor engagement with employees, and under-resourced implementation teams [18, 19]. One of the methodologies used in DevOps is AB testing, which can be seen as tackling the problems of lack of methodological rigor, monitoring, and evaluation. The application of AB testing to BPM and BPI is one of the core ideas behind the AB-BPM methodology, as described above.
2.4 AB testing
The main goal of AB testing is to quickly determine whether a particular change will improve important performance metrics [2]. Two versions are tested using randomized experiments in a production environment (A vs. B). The new version is often only made available to a select group of consumers, limiting any potential adverse impacts. The method is mainly used to introduce and test changes to web applications, user interfaces, recommender systems, and advertisements [20]. Numerous online controlled experiments are regularly conducted by companies like Amazon and Meta to test changes to their consumer-facing products [20]. Duolingo, a popular language-learning app, uses AB testing so frequently that they have even publicly presented their internal AB testing dashboard [21].
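To make the mechanics concrete, the following minimal sketch shows the two ingredients of a classic AB test: random assignment of incoming users and a statistical comparison of outcomes at the end. It is an illustration only; the 50/50 split, the outcome counts, and the choice of a two-proportion z-test are our own assumptions, not part of any specific method cited above.

```python
import random
from math import sqrt, erf

def assign_variant() -> str:
    """Randomly route an incoming user to variant A or B (50/50 split)."""
    return random.choice(["A", "B"])

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical outcome counts after the experiment has run
z, p = two_proportion_z_test(success_a=480, n_a=5000, success_b=540, n_b=5000)
print(f"z={z:.2f}, p={p:.4f}")  # retire A only if p is small and B is better
```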
2.5 Reinforcement learning
Reinforcement learning can be seen as a third subcategory of machine learning, in addition to supervised and unsupervised learning. While supervised learning aims to learn how to label elements from a collection of labeled data, and unsupervised learning tries to find hidden structures in unlabeled data, RL has the goal of optimizing the behavior of a software agent based on a numeric reward in an interactive environment [22].
In RL, the agent is situated in a specific environment, as shown in Fig. 4, and has to decide on an action. This action is then evaluated, and a reward is calculated. The overall objective is reward maximization over a sequence of actions. Note that the chosen action might affect not only the current reward but also the following situation and thereby subsequent rewards. Learning which choices to make in what situation happens, essentially, through trial and error [22]. Sutton and Barto note that “these two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning” [22].
Action–reward feedback loop [22]
RL is successfully applied in a range of domains, going beyond the well-known examples of competitive games (e.g., in the case of AlphaGo [23]). An example of such a domain is process control [24]: Google was able to improve energy efficiency in Google data centers by up to \(40\%\) by using RL agents to control the parameters of various types of cooling equipment deployed in the data center, which affect the temperature and energy with complex dependencies and relationships [25].
In the following, we explain multi-armed bandits (MAB), which solve a stateless subproblem of RL.
2.5.1 Multi-armed bandits
The core problem behind MABs is making a particular choice from a set of options. The method is named after slot machines (“one-armed bandits”): An analogy is a player sitting in front of a slot machine with multiple levers, needing to decide which lever to pull at any given time. More generally, the player can be mapped to the MAB agent and the levers to the choice set [22]. After the agent has selected an action, the environment reacts accordingly, and a reward is calculated. The agent learns based on this reward, with the goal of maximizing the cumulative reward over time [22]. The trade-off between exploitation and exploration is one of the key challenges for MAB algorithms and in RL in general. Exploiting means choosing tried-and-true options where the rewards are more certain. Exploration aims to trial choices that could eventually result in greater rewards [22]. In AB-BPM, the options or arms are the two distinct versions of a business process, and the reward is some process performance indicator. An extension of general MABs, contextual MABs, enables the agents to use contextual variables that might affect the reward distribution before making a choice [26].
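As an illustration of the exploration/exploitation trade-off in this setting, the sketch below implements epsilon-greedy action selection with incremental mean updates, both standard techniques from [22]. It is not the AB-BPM routing algorithm; the variant names and the cycle-time-based reward are assumptions for illustration.

```python
import random

class EpsilonGreedyRouter:
    """Routes incoming process instances to variant 'A' or 'B'.

    With probability epsilon it explores (random variant); otherwise it
    exploits the variant with the best average reward observed so far.
    """

    def __init__(self, epsilon: float = 0.1):
        self.epsilon = epsilon
        self.counts = {"A": 0, "B": 0}
        self.avg_reward = {"A": 0.0, "B": 0.0}

    def choose(self) -> str:
        if random.random() < self.epsilon:
            return random.choice(["A", "B"])                   # explore
        return max(self.avg_reward, key=self.avg_reward.get)   # exploit

    def update(self, variant: str, reward: float) -> None:
        """Incremental mean update after a process instance completes."""
        self.counts[variant] += 1
        n = self.counts[variant]
        self.avg_reward[variant] += (reward - self.avg_reward[variant]) / n

# Illustrative use: the reward could be, e.g., the negative cycle time
router = EpsilonGreedyRouter()
variant = router.choose()
router.update(variant, reward=-12.5)  # instance took 12.5 h
```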
The MAB algorithm employed in the AB-BPM methodology presented in [3], called long-term average router (LTAvgR), is based on LinUCB. LinUCB is a contextual MAB algorithm that has been developed with the aim of and commercially employed for personalizing news article recommendations [27].
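For reference, a compact sketch of disjoint LinUCB as described in [27] follows; LTAvgR builds on this algorithm but adds its own long-term reward handling, which is not shown here. The context features, dimensionality, and parameter values are illustrative assumptions.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB [27]: one ridge-regression model per arm (variant).

    For context x, each arm a is scored with an estimated reward plus an
    upper confidence bound: theta_a^T x + alpha * sqrt(x^T A_a^{-1} x).
    """

    def __init__(self, arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = {a: np.eye(dim) for a in arms}    # accumulates X^T X + I
        self.b = {a: np.zeros(dim) for a in arms}  # accumulates rewards * x

    def choose(self, x: np.ndarray) -> str:
        def ucb(a):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]
            return theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return max(self.A, key=ucb)

    def update(self, arm: str, x: np.ndarray, reward: float) -> None:
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Context could encode, e.g., customer type or order value (assumed features)
bandit = LinUCB(arms=["A", "B"], dim=3)
x = np.array([1.0, 0.0, 250.0])  # hypothetical instance context
arm = bandit.choose(x)
bandit.update(arm, x, reward=1.0)
```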
3 Research method
In order to address the research questions, we applied the grounded theory (GT) research method [8]. GT is suitable for building a theory and answering questions about areas where little is known. We selected GT because so far, no research has been conducted on the industry perspective of applying AB testing for BPI. After an initial round of data collection with semi-structured interviews (purposive sampling in GT [8]), we wanted to further fill knowledge gaps and expand on some of the outcomes of the first round. Therefore, we approached the second stage of the GT methodology (also called theoretical sampling [8]) with a second round of semi-structured interviews. By modifying the questions asked and interviewing more participants until we reached a state of theoretical saturation, this second round can be seen as both an extension of the results and a validation, according to GT principles [8]. Regarding the names of the coding stages used in this work, we decided to use the overarching phase terms of initial, intermediate, and advanced coding as proposed by [28]. For reference, we include the mapping of the terms used here to related terms used in other works: (initial coding \(\rightarrow \) open coding, initial coding), (intermediate coding \(\rightarrow \) selective coding, axial coding, focused coding), (advanced coding \(\rightarrow \) theoretical coding, selective coding). Even though there are subtle differences between these terms, we believe the overall terms, together with the elaborations below, more appropriately describe our research process. In the following paragraphs, we describe the research methods used in more detail. An overview is shown in Fig. 5.
3.1 Cohort I
Expert selection
We recruited experts from a multi-national software company (MNSC) with more than 100,000 employees. The company develops enterprise software, and the majority of study participants are employees of a subunit that specializes in developing BPM software and providing BPM services. Due to the study’s exploratory nature, the aim was to obtain a perspective from a broad range of experts. For this purpose, we set a number of goals for the selection of the experts: (1) to include people who develop BPM software as well as people working in consulting (however, not necessarily both at once); (2) to cover various areas of technical skills, e.g., software engineering and data science; and (3) to include participants with experience in business process improvement initiatives. The aim for the first round of interviews was to have a panel of ten experts. After reaching out to eleven people, ten agreed to take part.
Interviews
As is common in GT research, we conducted semi-structured qualitative interviews with subject matter experts, aiming to capture a wide range of ideas and thoughts on the complexities of a topic by openly engaging in conversation with subjects and analyzing these interactions [29, 30]. Since the order and wording of initial questions and follow-up questions are highly flexible in semi-structured interviews, the interview guide is more of a collection of topics to be covered and not a verbatim script. There were also minor adjustments to the interview guide during the interview phase in response to gained knowledge and differing relevance of questions for certain participants, in line with standard practice. Such adjustments are considered unproblematic since the goal is not to compare different subgroups, to test a hypothesis, or to find out how many people hold certain beliefs, but to find out what kind of beliefs are present [30]. We used the following interview guideline, given in a condensed version (Footnote 2):
1. Prior experience and pain points with BPI,
2. Short introduction to AB-BPM (not a question, short presentation; 5–10 min),
3. Execution of AB tests/feasibility,
4. Suitability,
5. Prerequisites to adopt the AB-BPM method,
6. Risks,
7. Tool requirements,
8. Open discussion.
A detailed version of the interview guide can be found in the digital appendix. Due to the slight adjustments and variations of the interview questions, the interview guide represents the merging of the questions asked. The interviews were conducted by Author I, but they were recorded and transcribed for later review and discussion with the other authors.
Analysis
During and after the first round of interviews, the transcripts were coded, and topics were consolidated (GT phase initial coding [8]). We did not use a preformed list of codes but instead created all codes from the data. Most of the main categories of codes were already anticipated by the researchers (see interview guides in the digital appendix), but there were no preconceived items for the content of the categories. Furthermore, all within-category grouping (e.g., see risks and tool features in Sect. 4) is based solely on the data. The codes we created are a mixture of analytical, theoretical, and natural codes [31]. This means that some codes are created from abstractions of what study participants said, some are theoretical concepts that align with what interviewees mentioned, and some are verbatim codes (also referred to as in vivo codes). The coding was mainly done by Author I, but all researchers met frequently (once weekly during active study periods) to discuss the codes and verify and refine them. In case of disagreement, the final decision was made on a consensus basis, by further discussing and refining until reaching unanimity. To minimize the risk of bias and improve the quality of the discussion and resulting codes, all interviews were recorded and transcribed, and these documents were used in the collaborative refinement of the coding. After the initial coding, relevant topics for further study in Cohort II were selected. This selection was based on the collected data, literature research, and thorough discussion among the researchers (Footnote 3). This ongoing work with the collected data, but also with external data and intuition, can be mapped to the GT principles of constant comparative analysis, coupled with our own theoretical sensitivity [8].
3.2 Cohort II
Expert selection
For the second round, we did not set a specific number of participants beforehand, but instead interviewed new participants who seemed to offer promising expertise (see the paragraph on interviews below for details on topics) until we did not gather any substantially new information. After interviewing four more participants, we already felt that our questions were sufficiently answered; however, we decided to still carry out two more interviews that had already been scheduled at that time. The decision to stop the interview process was made after all authors felt that a sufficient level of data and theory had been collected and built. That is, we conducted interviews until we reached theoretical saturation [8]. Note that this term has been the topic of some debate [33], which is why we attempted to provide such a detailed account of how we determined the stopping point for participant inclusion.
Finally, this led to the inclusion of six more participants, resulting in a total of 16 participants. Note that for the second cohort of the study, we relaxed condition (3) from the selection of participants for Cohort I, in order to find answers to specific questions by selecting the most suitable BPM practitioners regardless of their BPI experience. Furthermore, we also included a process improvement expert from one of the customers of the MNSC, to further increase the general applicability of the results. An overview of all participants is given in Table 1. To better describe the expert panel, we also include some aggregate metrics. A graphical representation of the years of experience is shown in Fig. 6. Because of the diverse backgrounds of the participants, we wanted to gauge the BPM competencies among the interviewed panel of experts. We therefore asked them to self-rate which particular BPM competencies they feel they have expertise in. The results are given in Table 2, expressed as the percentage of interviewees who said they have expertise in a certain competency. The competency model and more detailed descriptions can be found in [34]; participants were provided with the detailed description of each competency, not only the title.
Interviews
In the second round of interviews, there were several topics we wanted to explore. First, we wanted to learn more about the perceived (technical) feasibility of the methodology. Second, the first round of interviews revealed that a broad range of objectives for the reward function would be an important tool feature. Therefore, we wanted to learn more about this requirement. Last, we wanted to find out how the methodology is seen in the broader context of the evolution of business process management and business process improvement. Thus, we tried to learn more about long-term visions. We chose these topics either because of outstanding interest in the first cohort or because of theoretical gaps we felt still needed more inquiry. As in the previous cohort, no items were created prior to the interviews and the coding of the data within these topics. Again, a more detailed version of the interview guide can be found in the digital appendix. The interviews varied more between participants in Cohort II than in Cohort I, owing to the participant selection and the more specific research inquiry. The guide represents the merging of the questions asked. The interviews were conducted by Author I, but they were recorded and transcribed for later review and discussion with the other researchers.
Analysis
Based on the data from both cohorts, we condensed the knowledge along relevant axes (see Sect. 4), in the GT stage of intermediate coding [8]. We provide some examples of the aggregation from initial to intermediate codes in the digital appendix. Regarding distribution of work and collaboration, the procedure was the same as in Cohort I. The second round of interviews led to not only heightened confidence in the already observed items, but also the addition of some new items. Furthermore, the new codes and discussions made us work through the initial interviews again and see if any of the new topics had already been mentioned in Cohort I. This is another application of the GT technique constant comparative analysis [8]. We also added the new code category long-term visions. Finally, we connected various topics of the identified item categories to create an overarching story line on the implications of the findings on the AB-BPM methodology and how it could evolve through the lens of BPM practitioners. Here we follow the GT concept of advanced coding, “build[ing] a story that connects the categories and produces a discursive set of theoretical propositions” [8]. These results can be found in the discussion of the findings in Sect. 5.
4 Results
The main insights from the interviews are described in more detail in the following subsections. The order of presentation follows a semantically meaningful arc of technology development. First, we present current pain points in Sect. 4.1 and then move on to findings on suitability in Sect. 4.2. Section 4.3 discusses feasibility and prerequisites, after which we explore potential risks to keep in mind (Sect. 4.4). We end this section with a presentation of tool feature requirements in Sect. 4.5 and long-term visions in Sect. 4.6. As software vendor employees and process experts from its customers, the study participants’ answers, to some extent, reflect the experience of the wider industry, i.e., the customers’ challenges. Statements of the experts are marked as quotations, and the numbers refer to the interview numbers from Table 1. Note that in the following, we present multiple lists of items and codes, where only a subset of those are discussed in more detail in the paper itself. For the other codes, we provide more detailed explanations and quotes from the interviews in the digital appendix.
Before going into the detailed subresults, we want to give our impression of the overall sentiment toward AB-BPM, which was mainly positive. Multiple consultants brought up that some companies they worked with tried testing new process versions in production and comparing them with the status quo. However, the tests were mostly unstructured and consisted only of a few instances or even no “real” instances (i.e., only mocked/simulated instances in the production environment). This means that any drawn conclusions are not dependable, due to the low number of instances and lack of statistical rigor when it comes to controlling confounding factors. Thus, AB testing provides a useful process improvement method that supports the structured testing of alternative versions, with one interviewee mentioning: “I think this is a direction we need to move towards” (Int10). The overall sentiment is well captured by this statement: “Even though I was mentioning quite some potential roadblocks, I want to emphasize that this has big potential. I think it is a really valuable thing to do, but it will take some time to get there, and it might not be applicable for all use cases” (Int11).
4.1 Negative prior experiences
To find out how the methodology relates to existing pain points in BPI initiatives, we asked the interviewees about their prior negative BPI experiences (before presenting the AB-BPM methodology to them). We first present a list of the overall findings, and subsequently describe some points in more detail. The choice of the points discussed in detail is based on how relevant we perceive them to be for the AB-BPM methodology and accompanying tools:
- Badly communicated process changes;
- Chaotic data;
- Difficult implementation;
- Lack of long-term focus;
- Lack of process knowledge;
- Lack of resources and sponsorship;
- Missing dimensions in data;
- Process drift;
- Resistance to change;
- Unclear impact of process changes.
Missing dimensions in data
One of the prior negative experiences that participants noted concerns the data available during BPI efforts. One participant noted that “the transformation [...] was taking place without enough data about the customer experience” (Int5). Another participant asked: “How can we trust that we are looking at the right data?” (Int10). Overall, the notion seems to be that limited data or only partial coverage of relevant performance metrics leads to challenges during process improvement. This problem is, to some extent, exacerbated by the AB-BPM methodology, where a system is supposed to make autonomous BPI decisions based on the available data. This underlines the need for broad metric coverage (see Sect. 4.3).
Resistance to change
Multiple interview participants have noted difficulties with resistance to change among process participants. One interviewee summarized: “What I have seen in the past years is: people are resistant to change. So even about a tiny change, people are not happy with it. So usually the heads of the center of excellence [...] need to put a lot of effort in” (Int11). As with the point above, this problem is also important to keep in mind when evaluating and further developing AB-BPM, since it proposes applying experimental changes in production.
Unclear impact of process changes
Some study participants criticized the unclear impact of process improvements during/after BPI projects. One participant said: “I do not think anyone currently really knows what effect one process change, or even a change of a process group, really has on the overall performance” (Int13). This may be due to constantly changing environmental factors and the resulting difficulty of comparing process data collected at different points in time. This highlights the possible positive impact that AB-BPM could have on BPI efforts, by giving BPM experts a better data basis to evaluate improvement efforts.
4.2 Suitability
One question to the study participants was about the suitability of the method regarding contextual factors. The study participants were not presented with a list of categories but were free to elaborate on their intuition. More concretely, they were asked what type of processes and what surrounding circumstances (e.g., company, market, industry) they think the methodology would be well- or ill-suited for. Their statements were then mapped to the categorization of BPM contexts by [35]. The result is given in Table 3. The characteristics in italics represent special cases for factors where every characteristic was deemed suitable (to some extent); we give more details on these in the following.
Focus
No agreement could be reached on whether AB-BPM is suitable for radical changes. Satyal et al. [3] present the method primarily for evolutionary changes, while some study participants believe it is also suitable for larger, more exploratory process changes. There was a consensus among the interviewees that evolutionary changes would work well. One participant said: “I believe it will work well if we make and test changes piece by piece, as opposed to major changes” (Int10). As a reason for this, another participant noted that they “lean towards smaller changes because you have better control over these” (Int1). Regarding more substantial process changes, some participants believed the AB-BPM method would be suitable: “I believe it is suitable for both, as long as the routing to each version works” (Int8). However, a majority thought it would be more difficult to make radical process changes, mostly because radical changes often include changes in underlying information systems: “In the most radical case, I am thinking of changing the IT system. You can not rely on some tests in that case, it would be way too costly. It mostly includes discussions spanning multiple years and once you even get to using the new system, the decision to purchase it has already been made” (Int8). Somewhat larger changes within the same information system may be feasible, however.
Value contribution
Using the AB-BPM method might be especially useful in core processes. This is because non-core processes are found at many companies and, given industry best practices, mostly perform well enough. One study participant noted that it is advisable to “differentiate where you differ.” They said, “as a sports shoe company, we could strive to have the best finance processes, but that won’t make people buy our shoes—we need better shoes and better shoe quality to win in the marketplace” (Int7). They therefore recommended using standard processes for everything but the core processes. This is already common practice and also suggested by academic studies [36]. Another interviewee also recommended the usage “for strategically critical process areas” (Int2). For core processes, however, experimentation with the AB-BPM method would be highly favorable.
Competitiveness
In general, there were no opinions indicating that any level of market competitiveness would render the method unsuitable. Study participants noted, however, that highly competitive markets would increase the need for such a tool to allow for faster process iterations, “to stay competitive” (Int5).
4.3 Feasibility and prerequisites
In order to assess the viability of the AB-BPM methodology, we asked interviewees about the feasibility of and prerequisites for the methodology from both a technical and a cultural standpoint. This led us to uncover a list of prerequisites (as given in Table 4) they deemed necessary for the implementation and adoption of AB-BPM in organizations, as well as possible routes for technical implementation. Overall, the methodology is seen as feasible given a high BPM and IT maturity, management buy-in, as well as a suitable culture: “I think if the maturity and knowledge is there, this is a super interesting concept which many companies would be eager to try” (Int11).
In the following, we outline the interviewees’ stances on routes for the technical implementation of the methodology. The implementation and adoption of AB-BPM as presented in [3] assumes the existence of a BPMS that allows for the direct deployment of BPMN models. However, most processes are executed by non-BPMS software, i.e., they are not executed from models directly [37]. Therefore, whether the usage of a BPMS is a requirement for technical feasibility is a research question of this study. Altogether, AB testing of business processes seems technologically feasible without a BPMS (see NWF in Table 4). One interviewee noted: “I do not think that it is a problem that processes are executed over several IT systems since you only need to be able to start either process version. The route they are going afterward, even if it is ten more systems, is no longer relevant” (Int8). However, if we want to use live analytics to route incoming process instantiation requests (e.g., as proposed with RL) without a BPMS, we would need something like an extract–transform–load (ETL) tool. ETL software is responsible for retrieving relevant data from various sources while bringing the data into a suitable format [38]. There are other possible solutions besides ETL tools (e.g., all relevant systems proactively push data to a common store), but essentially there is a need to centralize data for analysis. Such an ETL tool might, however, be highly complex due to the many systems processes can potentially touch. One study participant noted that when a BPMS does not exist, “you will have to put a lot of effort into mining performance data; it would be more difficult to get the same data from process mining, covering every path and system” (Int5). Relying on a BPMS would not only have the benefit of easier data collection and access but would also make deploying and executing experimental processes more straightforward. In fact, most study participants deemed a BPMS, or something similar, to be a prerequisite (see YWF in Table 4). One study participant stated, however, that while some central execution platform would probably be needed, it remains unclear whether it has to be in the shape of current BPMSs. Overall, there seemed to be the notion that the integrated, model-driven way of orchestrating and executing business processes offered by BPMSs is the “direction the industry should move towards” (Int7).
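As a minimal illustration of such data centralization, the following sketch (using pandas) maps extracts from two hypothetical systems onto a common event-log schema from which per-case performance metrics can be computed. All system names, column names, and values are our own assumptions, not from the study.

```python
import pandas as pd

# Hypothetical extracts from two systems touched by the same process
crm = pd.DataFrame({"ticket": [7], "step": ["Approve"], "ts": ["2024-05-02 10:00"]})
erp = pd.DataFrame({"case_no": [7], "task": ["Ship"], "done_at": ["2024-05-02 15:30"]})

def normalize(df, case_col, activity_col, time_col, source):
    """Map a system-specific extract onto a common event-log schema."""
    out = df.rename(columns={case_col: "case_id",
                             activity_col: "activity",
                             time_col: "timestamp"})
    out["timestamp"] = pd.to_datetime(out["timestamp"])
    out["source"] = source
    return out[["case_id", "activity", "timestamp", "source"]]

event_log = pd.concat([
    normalize(crm, "ticket", "step", "ts", "CRM"),
    normalize(erp, "case_no", "task", "done_at", "ERP"),
]).sort_values(["case_id", "timestamp"])
# From here, per-case metrics (e.g., cycle time) can be derived and fed
# to the routing agent as rewards.
```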
4.4 Risks
A key goal of this work is to determine the AB-BPM method’s principal risks that hinder its usage and implementation in organizations. The risks that were mentioned during the interviews are listed in Table 5, and the items have been categorized as follows: Culture comprises risks regarding the working culture and employees of the company; Results includes risks regarding results, decisions, and outcomes; Operations consists of risks regarding the implementation and execution of the AB-BPM method itself, but also of normal business operations; Legal includes risks regarding the cost and loss of income caused by legal uncertainty [39].
In the following, some risks are explained in more detail. This can be either due to their high relevance in the context of existing theory, or because of especially insightful data collected on them.
Employee dissatisfaction due to feeling like one’s job is about to be automated
Overall, the theme of employee experience was a recurring topic in the interviews. The overall notion was that the approach hinges on the integration and participation of the employees. This could present a challenge, as one participant mentioned: “In my experience, as soon as a process manager walks in, and maybe even starts talking about metrics, the employees start to worry about their jobs” (Int16). Many of the tool features in Sect. 4.5 and themes in Sect. 5 tackle this issue, albeit presumably only partly. Further work on how to create a positive employee experience in such a dynamic, continually changing environment is therefore vital.
Employees purposely acting with a certain goal for the experiment in mind
Employees may have goals that are not aligned with process optimization initiatives and may act accordingly to trick the system into reaching their desired outcomes. One participant noted: “People may, openly or subversively, take non-conform actions - trying to push ahead versions they like and slow down versions they dislike” (Int7). This finding can be related to research on workers resisting algorithmic control, which has been discussed under the term algoactivism [40]. The main tactics of individual resistance that have been observed in practice include “non-cooperation, leveraging algorithms, and personal negotiation with clients” [40].
Non-aware testing difficult
Currently, AB testing works under the assumption that users are not or only minimally aware of the experiment. This is because AB tests are mostly used for visual or algorithm changes in online services that are consumed by a high volume of users [5, 41]. AB-BPM presents a departure here. Interviewees noted that non-aware testing (i.e., testing where people interacting with the process do not know of the test/changes) would probably not be possible, for example: “In internal processes the people will notice that something changed and that they are part of a test. I mean they know their usual working steps and will notice that something is off” (Int2), and “I think either we go for maximal transparency and cleanly explain the objectives and what is happening, or we do it absolutely non-aware. But I strongly doubt that the latter is possible for most processes” (Int9). This will pose novel challenges both regarding change management (see also risk CHM) and regarding the statistical modeling. Even though the parallel tests in production would increase reliability of measurements in comparison with the traditional sequential life cycle, effects of testing awareness would need to be taken into account: “How does it influence peoples’ behavior when they notice they are part of a test?” (Int9).
Not enough instances
As mentioned in the paragraph above, AB testing commonly relies on a large volume of instances (e.g., website visits). Besides the benefit that such use cases can practically be treated as non-aware, a large volume also helps with reaching statistical significance in the results. Therefore, “high volume” processes might be more suitable (Int8), or one needs to ensure that the smaller number of instances is taken into account in the underlying decision model.
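One way to take a small number of instances into account, sketched below, is Bayesian inference over the variants’ success rates: with few observations, the posteriors remain wide, which guards against premature conclusions. This is our illustration, not part of the AB-BPM method; the counts and the uniform Beta(1, 1) priors are assumptions.

```python
import random

def prob_b_beats_a(success_a, n_a, success_b, n_b, draws=100_000):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors.

    With few instances the posteriors stay wide, so the probability stays
    close to 0.5 unless the evidence is strong -- a built-in guard against
    overconfident conclusions from small samples.
    """
    wins = 0
    for _ in range(draws):
        p_a = random.betavariate(1 + success_a, 1 + n_a - success_a)
        p_b = random.betavariate(1 + success_b, 1 + n_b - success_b)
        wins += p_b > p_a
    return wins / draws

# Hypothetical low-volume process: 18 vs. 22 successful cases out of 30 each
print(prob_b_beats_a(success_a=18, n_a=30, success_b=22, n_b=30))
```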
Unclear results due to high process variance and process drift
As mentioned before, the execution of business processes can differ from how they were intended to be executed and is subject to (unintended) changes over time. This phenomenon, called process drift [42], leads to a high variance of executed process versions. Variance and drift pose a risk for the AB-BPM method since it is then unclear whether process participants execute the two versions as they are intended. If the process cases vary from the intended way of execution, it is hard to draw conclusions from the results since they might be based on a change that occurred spontaneously instead of the planned process changes. One example might be that “people exchange emails instead of following the steps in the process execution software” (Int2).
Unquestioningly following machine-generated analysis results leading to erroneous decisions
Many interviewees noted that solely relying on the algorithm’s interpretation of the data might cause problems. One study participant noted that “such models are always an abstraction of reality [...] and relying on them completely can lead to mistakes” (Int1). This topic also came up during the discussion of bad prior experiences, when a study participant noted that sometimes wrong decisions were made because of a lack of understanding of data. One example is the use of team performance metrics, which are often highly subjective (e.g., workload estimates in some project management methods), without context. Putting data into context and not unquestioningly following statistical reasoning is, therefore, a core challenge that needs to be addressed.
4.5 Tool features
The elicitation of requirements for a tool that executes and supports the AB-BPM method was also part of the study, and the identified feature requirements are presented in the following. First, we present a list of the elicited desired tool features (see Table 6). Afterward, more details on some of the items are provided. The feature requirements have been categorized into presentation, procedure, and support; this categorization was created during the coding of the interviews, based on the data. Presentation includes features regarding the presentation of data, or features that are more focused on the front end of the tool in general; Procedure covers features regarding the underlying technical or methodological procedure; Support includes features that already exist in the AB-BPM method but that were not presented to the study participants during the introduction to AB-BPM (see Sect. 2); they, therefore, support the equivalent suggestions by [3].
In the following, some tool features are explained in more detail. This can be either due to their high relevance in the context of existing theory, or because of especially insightful data collected on them.
Capture process participants’ sentiments and feedback on process variants
A common theme throughout the interviews was the high importance of the experience of process participants (e.g., see risks EDA, EPG, and TST in Table 5). Capturing how employees experience the experiments and process versions was therefore also an important concern. One interviewee noted the potential discrepancy between process analysts’ and process participants’ views on changes, and that this needs to be taken into account: “What I have experienced in projects is that the process analysts go out of some initiative saying this variant is clearly better, but the people involved are unhappy. They might feel like the prior way of doing things was more interesting or less hectic. We need to think about that, too” (Int9). Another expert noted that this feature could also increase acceptance and therefore the success of the methodology: “When you involve the people executing the process, the acceptance will be better, since they realize they have a say in the changes” (Int11).
Communicating process changes efficiently for teaching and enablement of employees
The need for process participants to learn how new versions have to be executed was stressed by multiple interviewees. One study participant stated that “one needs to notify the people working on steps in the process of the changes” (Int8). More “enablement is needed to teach employees the changes” (Int4), and another study participant noted that “seeing how this [aspect of change management] can be integrated would be an interesting question” (Int10). This would go beyond just teaching single steps and also create openness and transparency about goals and project setup (see also prerequisite TRA in Sect. 4.3), allowing for “a lot of change, even in parallel, without people being lost” (Int10). Change notification management is a feature that has already received research attention in the context of business process management software [43, 44]; an integration and extension of existing approaches could be useful.
Offer broad range of possible metrics to take into account
Given the fact that the routing of process instances to either version would happen automatically during the experiment, based on the obtained rewards, the interviewees highlighted the importance of a tool for allowing flexible reward metric setting: “We need to make sure that we have relevant metrics, and that we have multiple metrics to choose from” (Int7). Another participant noted: “Goal setting in business environments is extremely difficult. So when you talk about setting metrics to optimize for, this is a very interesting perspective. What kind of metrics do you measure and what kind of metrics you put as goals? What do you optimize for? For example, if you’re only focusing on profits, there is of course a lot of... you’re trying to reduce something that’s very complex to a single thing” (Int6).
One potential solution for this would be to have a user interface feature where participants could select various process performance indicators to take into account, and how important they each are for this experiment. To learn more about this idea, we inquired more about this during Cohort II. Overall, we found multiple reward design challenges that need to be taken into account when incorporating multiple, flexibly settable rewards:
1. Correlation of metrics;
2. Sparse rewards for some metrics;
3. Need for guidance on reward selection.
First, there is a need to handle potential correlation of selected metrics, since this could lead to suboptimal decision making. One potential solution could be analyzing historical data in that regard and notifying experts of any potential correlations. Second, some metrics could be measured too seldom for meaningful results. This also needs to be taken into account for the usability and usefulness of such a flexible reward selection feature. Third, it would be preferable to have certain reward metric configurations that work out of the box. One interviewee noted that they “would not start with making process experts configure whatever metrics they want. I would give them stuff out of the box. And then, over time, you learn where configuration is needed and in what way” (Int14). Another participant succinctly captured this line of reasoning, saying “companies want solutions, not only tools” (Int13).
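The sketch below illustrates two of these points: a correlation check on historical data (first challenge) and the combination of several expert-weighted metrics into one scalar reward. The metric names, values, weights, and the 0.8 correlation threshold are illustrative assumptions, not elicited requirements.

```python
import numpy as np

# Hypothetical historical per-instance metrics (columns are assumptions)
metrics = {"cycle_time_h": np.array([12.0, 9.5, 14.0, 8.0]),
           "cost_eur":     np.array([230.0, 190.0, 260.0, 170.0]),
           "csat":         np.array([4.1, 4.4, 3.9, 4.6])}
weights = {"cycle_time_h": -0.4, "cost_eur": -0.2, "csat": 0.4}  # expert-set

# 1) Warn the expert about strongly correlated metric pairs
names = list(metrics)
corr = np.corrcoef([metrics[n] for n in names])
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > 0.8:
            print(f"warning: {names[i]} and {names[j]} are highly correlated")

# 2) Combine z-scored metrics into a single scalar reward per instance
def composite_reward(row: dict) -> float:
    total = 0.0
    for name, w in weights.items():
        mu, sigma = metrics[name].mean(), metrics[name].std()
        total += w * (row[name] - mu) / (sigma or 1.0)
    return total

print(composite_reward({"cycle_time_h": 10.0, "cost_eur": 200.0, "csat": 4.5}))
```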
Potential exclusion of certain instance types
In practice, applying this methodology truly randomly to incoming process instances might not be ideal. One participant noted: “We have to be aware of the different levels of criticality in some processes. How important are certain customers? Which customers would probably dislike being part of an experiment or differing process variant?” (Int7). In essence, an exclusion of instances based on certain parameters might be useful. Customer (type) is one possible dimension to consider; other context factors, such as time of invocation or monetary value, are further exemplary criteria. However, the effects this has on the validity and generalization of results have to be carefully considered. This requirement raises the question of whether certain process participant groups might need to be excluded, too. Indeed, some of the participants noted potential change management risks that indicate this (CHM), for example: “Experts that have been at the company for a long-time might not be particularly open to it. I have experienced them to be animals of habit, and they mostly don’t want to change their routine even a bit. You are going to have some change management issues there” (Int4). Note that this also raises questions regarding the validity of results.
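A minimal sketch of such an exclusion guard follows, reusing a router object like the one sketched in Sect. 2.5.1; excluded instances always receive the established variant. The fields, thresholds, and the fallback to variant A are hypothetical, and, as noted above, any such filtering narrows the population the results generalize to.

```python
def eligible_for_experiment(instance: dict) -> bool:
    """Guard applied before the bandit; fields and thresholds are illustrative."""
    if instance.get("customer_tier") == "strategic":
        return False                        # do not experiment on key accounts
    if instance.get("order_value_eur", 0) > 100_000:
        return False                        # cap the monetary exposure per test
    return True

def route(instance: dict, router) -> str:
    """Excluded instances bypass the experiment and get the established variant."""
    return router.choose() if eligible_for_experiment(instance) else "A"
```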
Presetting stop and notification criteria
Participants noted that a tool should offer the ability to warn the conductors of the experiment of problematic process states: “I think there should be some conditions where it is clear that they have to be upheld at all times. The tool should be able to capture those and include mechanisms to ensure that in case something goes wrong, we do something” (Int7). The topic of what to do when this situation arises and some constraints are not met anymore was brought up by multiple participants. Another interviewee put it as follows: “If we talk about production testing, we need to keep in mind what happens if a test was not successful. When do we stop? What happens to users in those instances?” (Int7).
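The following sketch illustrates how such preset criteria could be checked against live experiment metrics, distinguishing between notifying the experiment conductor and aborting the test. The metric names, thresholds, and actions are our own illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    """A hard constraint on a running experiment; names are illustrative."""
    metric: str
    threshold: float
    action: str  # "notify" the expert or "abort" the experiment

GUARDRAILS = [
    Guardrail("error_rate", 0.05, "abort"),        # stop routing to B entirely
    Guardrail("avg_cycle_time_h", 24.0, "notify"), # alert the process expert
]

def check_guardrails(live_metrics: dict) -> list:
    """Return the actions triggered by the current live metrics."""
    triggered = []
    for g in GUARDRAILS:
        if live_metrics.get(g.metric, 0.0) > g.threshold:
            triggered.append(f"{g.action}: {g.metric} exceeded {g.threshold}")
    return triggered

print(check_guardrails({"error_rate": 0.08, "avg_cycle_time_h": 10.0}))
```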
Result can be different variants for recognized contextual factors
In traditional AB testing, the outcome of the experiment is often that either the new or old version is better, and then the winning version will be applied for everyone. However, an interviewee noted that it could also be useful to allow for the selection of various winning versions, depending on the process instantiation context: “I should be able to know if I have multiple variants, which variant works best for which customer type, for example” (Int7).
See potential impact beforehand (amount and business-wise)
According to the study participants, process experts should be able to see estimations on possible impacts beforehand to support an informed decision-making process, e.g., how many customers or what order volume would be affected by the test. They would hope “to see relatively quickly how many people are part of it and what is the business impact” (Int9).
4.6 Vision
Besides examining the methodology and potential tool choices directly, we also wanted to understand how AB-BPM relates to the wider trajectory of the BPM discipline, especially regarding future opportunities. To this end, we asked the interviewees about the connection of AB-BPM to industry trends and the possibilities for potential next steps. The findings are listed in Table 7 and described in more detail in the following.
Automatic validation of algorithmic process changes
Participants noted that the methodology would allow for “automated decision making” (Int12) on which process version performs better, and brought up the idea of using this capability to validate automatically generated process versions: “You wouldn’t need to analyze manually or ask anyone, the proof would be in the data. This opens up the possibility of using, for example, AI to generate process versions and then this tool to make decisions on their quality” (Int15). This notion is especially relevant given the current interest in the usage of generative machine learning methods in general, and in BPM in particular [45, 46, 47]: “Instead of human designed versions you could use some sort of generative algorithm to create process changes before the tests. I think such models often inhibit some sort of creativity too, so I think that would be interesting” (Int15).
Base model/transfer learning
In the currently proposed AB-BPM method, learning which variant is better does not consider any prior knowledge when the experiment starts, i.e., the agent starts without any knowledge. Given that the approach currently uses a relatively simple MAB, this makes sense: the only prior assumption one could encode is that one version is better than the other at the beginning of the experiment. However, if the methodology evolves further, e.g., to detect various process context patterns (see also tool wish VRC in Table 6) or to be aware of the process structure itself (e.g., see the description of vision item IPD below), one could start building general process knowledge, which can be used when starting a new experiment. One of the interviewees remarked on this as follows: “It would make sense to not always start at zero, but to be able to provide some sort of base intelligence, on top of which more specific data can be added” (Int15).
In-process decision learning
This item actually originates from Cohort I (and was one motivating factor for this line of inquiry in Cohort II), with an interviewee saying: “In my opinion, it would be desirable to take into account both information from when an instance starts, but also during instance execution. Each process contains multiple decision points in the process where you could apply such a learning algorithm to decide which direction to take” (Int7). The suggestion can be linked to two concepts in BPM. First, one could use MABs and RL to improve upon the status quo of how the decision points of a given process model are executed, e.g., to improve static concepts such as decision model and notation (DMN) specifications. With DMN, process modelers set conditions for each path based on various process conditions at design time, and these are executed and evaluated at runtime by so-called DMN rule engines [1]. Allowing decision policies to improve continually has the potential to enhance outcomes. Second, there are conceptual parallels to prescriptive process monitoring, which monitors ongoing process instances and applies specific treatments to aid in reaching certain instance objectives. In fact, existing work proposes the use of reinforcement learning for deciding when to modify a given process instance, e.g., [48, 49].
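To make the first idea concrete, the following sketch (ours; not a feature of existing DMN engines) replaces a static decision rule at a single gateway with a UCB1 bandit over the outgoing paths; the path names and the reward signal are illustrative.

```python
import math

class DecisionPointBandit:
    """UCB1 bandit choosing among the outgoing paths of one gateway."""

    def __init__(self, paths):
        self.counts = {p: 0 for p in paths}
        self.means = {p: 0.0 for p in paths}
        self.total = 0

    def choose(self):
        for p, n in self.counts.items():
            if n == 0:
                return p  # play every path once before applying UCB1
        def ucb(p):
            bonus = math.sqrt(2 * math.log(self.total) / self.counts[p])
            return self.means[p] + bonus
        return max(self.counts, key=ucb)

    def update(self, path, reward):
        """Incorporate the instance outcome observed after taking `path`."""
        self.total += 1
        self.counts[path] += 1
        n = self.counts[path]
        self.means[path] += (reward - self.means[path]) / n  # running mean

gateway = DecisionPointBandit(["manual_review", "auto_approve"])
path = gateway.choose()
# ... continue the instance along `path`; once its outcome is known ...
gateway.update(path, reward=1.0)
```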
Long-term learning
The current AB-BPM approach executes both variants in parallel in order to decide, at the end, which version is better. One participant, however, said that it would be preferable to simply keep learning and always use the version best suited to current environmental factors: “However, it could be that the situation changes over time and one option is not needed anymore after some time. It would be nice to develop in this direction where we keep modifying based on current information” (Int15). This suggestion might be especially valuable if the methodology comes to include the execution of more than two variants, as well as the identification of different variants for different instantiation context parameters (see tool features MTT and VRC). Importantly, when using a routing agent long term, one might want to account for the potential non-stationarity of process performance in the underlying algorithm (the current AB-BPM algorithm from [3] does not consider this), as noted by another participant (Int12).
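One standard remedy, sketched below for illustration (see the textbook treatment in [22]), is to replace sample means with constant-step-size updates, so that recent observations dominate and the router can track performance that drifts over time.

```python
ALPHA = 0.05  # fixed step size; larger values forget the past faster

estimates = {"A": 0.0, "B": 0.0}  # tracked performance per variant

def update(variant, reward, alpha=ALPHA):
    # Exponential recency-weighted average instead of a sample mean:
    # old evidence decays geometrically, suiting non-stationary rewards.
    estimates[variant] += alpha * (reward - estimates[variant])

update("A", reward=0.9)
```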
5 Discussion
In this section, we discuss the implications of the results obtained from the interviews, put the findings in context, and present emerging themes. In Sect. 4, the results are categorized with the general technology life cycle and requirements analysis in mind. Here, we take a more abstract view to identify patterns that recur across the categories. We identify four core themes, namely:
1. Involvement of executing experts,
2. Participant involvement and cultural alignment,
3. Integrated process execution environment, and
4. Semi-self-improving business processes.
With these themes, we aim to propose tentative theoretical concepts based on the presented results. We hope these theoretical implications can guide researchers and practitioners in taking steps toward the realization of this relatively ambitious and transformative BPI approach. The themes are also visually related to the AB-BPM methodology in Fig. 7. In the following, each theme is elaborated in more detail. Finally, we conclude this section by pointing out the limitations and threats to validity of our study.
5.1 Involvement of executing experts
Involving human experts in machine learning systems is common practice nowadays and is often referred to as human-in-the-loop (HITL) [50]. In fact, as described in Sect. 2, in a previous publication we presented a HITL extension of the AB-BPM methodology that allows for expert control. Although the HITL-AB-BPM extension [17] was not presented to the study participants, remarkably, many of the points mentioned in the interviews and surveys closely matched aspects of HITL-AB-BPM. This can be seen as support for the idea of human expert involvement and points to a need for further research. The main features of HITL-AB-BPM are batching and an emergency exit. Batching refers to splitting the experiment into distinct batches, with the possibility for a human expert to modify the routing proposed by the RL agent for each batch. The emergency exit, in turn, allows the human expert to choose, at any given time, the version to which all future requests are routed. Batching can potentially mitigate several of the identified risks (BLE, LRE) and partly or wholly satisfies multiple tool feature requirements (EXC, MRL). Similarly, the emergency exit option also limits risks (DNO, DIE, SEP) and realizes tool features (EES).
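The following minimal sketch illustrates, under assumed interfaces of our own choosing, how these two control mechanisms interact: the experiment proceeds in discrete batches, the expert may override the agent’s proposed routing ratio for a batch, and the emergency exit pins all further instances to one variant.

```python
import random

class HitlExperiment:
    """Batched AB routing with expert override and an emergency exit."""

    def __init__(self, variants):
        self.variants = variants  # e.g., ["A", "B"]
        self.pinned = None        # set by the emergency exit

    def run_batch(self, instances, proposed_ratio, expert_ratio=None):
        """Route one batch; the expert may replace the agent's ratio."""
        ratio = expert_ratio if expert_ratio is not None else proposed_ratio
        routed = []
        for inst in instances:
            if self.pinned is not None:
                routed.append((inst, self.pinned))
            else:
                v = self.variants[0] if random.random() < ratio else self.variants[1]
                routed.append((inst, v))
        return routed

    def emergency_exit(self, variant):
        """Route all future instances to `variant`, bypassing the agent."""
        self.pinned = variant

exp = HitlExperiment(["A", "B"])
exp.run_batch(range(100), proposed_ratio=0.7)                    # agent's plan
exp.run_batch(range(100), proposed_ratio=0.9, expert_ratio=0.5)  # expert override
exp.emergency_exit("A")  # e.g., after observing a critical problem with B
```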
Furthermore, a whole range of other items points to additional human-in-the-loop mechanisms, i.e., they speak to the same underlying notion of balancing expert/human and machine decision making. For example, a risk that could be mitigated by additional HITL features is IGK (problems due to improperly set goals/reward); a tool wish that points to more HITL features is PSN (presetting stop and notification criteria); and the vision of a base model (BTL) could be partially realized by enriching such a model with expert knowledge.
Generally speaking, the collected data strongly supports the idea of extending the AB-BPM method with elements of human expert control, for increasing both the viability and potentially the outcome quality of the methodology.
5.2 Participant involvement and cultural alignment
Besides the aforementioned influence of experts on the AB-BPM execution, the findings also indicate a need for the participation of process participants, both to make AB-BPM culturally viable and also to improve the potential outcomes that can be obtained by applying it.
One of the prerequisites that could potentially be fulfilled by enabling participant involvement in AB-BPM is the need for organizational transparency (TRA). For example, an implementation of AB-BPM that captures participant feedback (CPS) could increase the transparency of change processes and thereby minimize problems arising from distrust among employees (EDA, TST, EPG). Another desired feature for the involvement of process participants, effective enablement and training (COM), could increase transparency and build trust, but also reduce process drift (UVD) and allow for early participant feedback, thereby potentially improving overall BPI initiative results. Together with the previous section on the involvement of executing experts, a clear picture emerges: Human oversight and control are a vital requirement not only for the humans running semiautonomous systems, but also for the people whose lives are affected by such systems. However, involvement in the experiments may only be a partial solution to making the method culturally viable and successful. Several other findings point toward intricate social and organizational caveats to navigate, such as the need for an experiment culture and buy-in from decision makers. Further studies of the application of the methodology could investigate potential problems and solutions in this domain.
5.3 Integrated process execution environment
We understand an integrated process execution environment to be a platform that centrally orchestrates process execution among a variety of systems and resources, while capturing process-relevant data. Examples of such platforms are BPMSs and WfMSs. As discussed in Sect. 4.3, some participants do consider parts of the methodology feasible even without an integrated process execution environment. However, the general line of reasoning of the practitioners points toward the need for an integrated, process-aware execution and observation environment. This is important both for the seamless deployment of parallel process variants and for the availability of rich real-time data for dynamic, semiautonomous decision making. Thus, an integrated process execution environment is supported not only by the explicit answers regarding the BPMS prerequisite (YWF), but also by a variety of other codes, such as other prerequisites, risks, and tool features. It would help to provide a broad and continuous data basis (BCM) and present a robust IT basis (HIT). Due to the more process-focused execution, it could potentially mitigate the risk of disturbed operations through the experiments (DNO) and also streamline the implementation and execution of multiple process variants in parallel (DIE). Furthermore, it would support the realization of several requested tool features. Through the central orchestration of and connection to third-party systems, it could support processes spanning multiple systems and the gathering of data from them (ETL). By facilitating data availability, an integrated process execution environment could also help equip the tool with a modifiable, broad range of metrics to take into account (BRK).
5.4 Semi-self-improving business processes
RL-supported AB testing for BPI can be considered a substantial step toward the vision of self-improving business processes that require merely limited human involvement in the BPM life cycle. Recently, such visions have been proposed in the context of position papers on AI-augmented BPM [51], as well as on BPM in the era of generative artificial intelligence [45]. These proposals do not claim that the BPM life cycle can be fully automated, but rather discuss the potential of AI-based systems to augment human tasks, moving more of the efforts involved from humans to machines, while potentially achieving better results. Enabling the AI-augmented semi-automated improvement of business processes by automating most parts of the BPM life cycle can be considered one of the cornerstones and long-term goals of both proposals.
In this context, the AB-BPM methodology is, presumably, the furthest-reaching approach that has been systematically studied in the literature. This notion is grounded in the vision item AVC (automatic validation of algorithmic process changes) brought up by the participants. Taking a visionary stance, one could go even further and propose that new process variants with the potential to improve performance could be identified automatically and on a continuous basis, to then be systematically assessed using reinforcement learning, while under-performing variants are eventually and automatically discarded. This idea is partly supported by the vision item LTL (long-term learning). Such rapid, semi-automated, and data-driven variant management could be especially important given the potential widespread application of generative AI to process improvement and modeling: it would move the bottleneck from creating new process versions to validating them and finding their ideal use cases.
However, one may claim that, in light of the results of our study, the practicability of such proposals is highly speculative. After all, our results highlight that utilizing AB testing and reinforcement learning for automating business process improvement is, while potentially promising, fraught with challenges of a mostly engineering and socio-organizational nature, which have so far not been emphasized. Examples of the former are insufficiently flexible systems for business process execution and a lack of broad, continuously measured metrics for process performance assessment (prerequisites YWF and BCM); examples of the latter are a lack of BPM maturity and of an experiment culture (prerequisites HBM and XCU). Here, an open question is to what extent new capabilities provided by emerging generative AI-based solutions can alleviate some of these challenges. As outlined in [45], generative AI models such as large language models can potentially help contextualize knowledge and data to aid decision making along the AB-BPM life cycle, with the human remaining in the loop. For example, unstructured textual data about a process can be turned into queries that are then executed on the process data to provide a more nuanced and potentially qualitative overview of process performance beyond standard process performance indicators.
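A purely illustrative sketch of this last idea follows; the event-log schema and the question are invented for the example, `ask_llm` stands in for any text-completion API, and a human should review generated queries before execution, keeping the human in the loop.

```python
# Hypothetical text-to-query step in the spirit of [45]; not part of the
# AB-BPM tooling evaluated in this study.

PROMPT_TEMPLATE = """Translate the analyst question into SQL over the table
events(case_id TEXT, activity TEXT, timestamp TIMESTAMP, variant TEXT).
Return only the SQL query.

Question: {question}"""

def text_to_process_query(question: str, ask_llm) -> str:
    """ask_llm: any callable mapping a prompt string to a completion string."""
    return ask_llm(PROMPT_TEMPLATE.format(question=question))

# Example usage with a placeholder model client (hypothetical):
# sql = text_to_process_query(
#     "Which variant had the shorter median cycle time last month?",
#     ask_llm=my_llm_client.complete,
# )
```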
5.5 Limitations
The presented study has several limitations. One limitation of the results is the lack of quantification regarding which overarching topic axes and underlying items are most important. The previous publication [9] includes a preliminary ranking. However, the focus and main contribution of this work is the qualitative elicitation of knowledge on the AB-BPM methodology. For the selection of the most relevant items and the discussion of related items in this section, we relied on our own knowledge of the research field, following the GT philosophy of applying theoretical sensitivity to the collected data [8].
A possible threat to validity is that most study participants are employed by the same company. We tried to attenuate this by selecting experts with extensive experience from various teams and backgrounds, as well as by including an external process expert at one of the company’s customers. Additionally, the consultants brought in their experience with business process improvement projects from other companies. Overall, we aimed to reach business process practitioners with expertise in the various fields relating to the future of AB-BPM, online experiments, and the parallel execution of business process variants in production, covering software engineering for BPM, BPM consulting, and machine learning applications in BPM.
Another potential threat to validity is the fact that the study participants are not primarily process owners or other key stakeholders of a process as enacted by one particular organization. We deliberately selected a diverse set of study participants who can, for example, draw on their experience across several processes and organizations (consultant) or have deep expertise working with a broad range of requirements for business process management and process execution software (designer, lead engineer). Because we can reasonably assume that the application of reinforcement learning to business process AB testing is not top of mind for today’s process owners, we argue that our participant selection allows for a better informed and more nuanced qualitative assessment. Still, studying the perspectives of process owners and other direct process stakeholders can be considered valuable future research that would augment the results presented in this paper.
Lastly, in this study the methodology has not been evaluated side by side in comparison with other BPI frameworks [52]. However, this is because the focus of this work is not to find out how the methodology compares to others, but rather to further improve upon the methodology at hand to get to a point where it is mature enough to be compared methodologically.
6 Conclusion
The main aim of this study was to obtain practitioners’ perspectives on the AB-BPM method. Using qualitative research methods, we shed light on the requirements for the further development of AB-BPM tools and the underlying method. The results offer guidance to researchers as well as practitioners on the challenges and opportunities of applying online controlled experiments in the field of business process management, and more concretely in business process improvement. Beyond the AB-BPM methodology itself, the results have implications for further applications of DevOps concepts in BPM, as well as for AI-augmented BPM research.
Overall, the study participants perceived the methodology as advantageous in comparison with the status quo, given certain prerequisites for adoption and in suitable scenarios. The four main insights from the study are:
1) more possibilities for human expert intervention and more interaction between the RL agent and the human expert are core requirements;
2) options for the involvement of process participants and other measures are needed to make AB-BPM culturally viable;
3) an integrated process execution environment is necessary to facilitate the seamless deployment of parallel process variants and deliver the real-time data needed for dynamic RL and routing;
4) in the long run, the methodology could aid in effective and efficient validation of algorithmically (co-)created business process variants, and their continuous management.
The openness of the semi-structured interviews facilitated the discovery of future research opportunities, e.g., studying companies carrying out unstructured process tests in a production environment, tool-driven training, the participation of process participants in BPI initiatives, and reward engineering for AB-BPM. To further validate the methodology, extended design science work, real-world application, and comparison with other business improvement methods could provide valuable insights.
Notes
Partial results of this paper were already published in [9]. Contributions 1 and 2 are extended from the previous paper, Contribution 3 is completely new, and Contribution 4 is mostly new, having been mentioned merely as a side note in the previous paper. For this extension, we interviewed six more participants (bringing the total to sixteen) to reach a point of theoretical saturation, which is the natural stopping point for data collection in a GT study. The methodological footing of the findings has therefore been strengthened significantly.
Note that the categories prior experiences and prerequisites were only presented tangentially in the prior publication [9]. However, we present the results in more detail in this work, including additional insights into those categories from the data of Cohort II.
For the aforementioned prior paper [9], we validated and ranked the items of the categories where participants of the first round had provided the most information (risks and tool wishes), in the style of a shortened Delphi study [32]. However, since the ranking was somewhat preliminary, we decided to further validate the knowledge with more interviews for this extension, in line with common GT principles. This led to the validation of many items, as well as to some new ones. Therefore, this paper does not include the ranking; interested readers can find this information in the prior publication.
References
Dumas, M., La Rosa, M., Mendling, J., Reijers, H.A.: Fundamentals of Business Process Management. Springer, Berlin, Heidelberg (2018). https://doi.org/10.1007/978-3-662-56509-4
Bass, L., Weber, I., Zhu, L.: DevOps: A Software Architect’s Perspective, 1st edn. Addison-Wesley, New York (2015)
Satyal, S., Weber, I., Paik, H.-Y., Di Ciccio, C., Mendling, J.: Business process improvement with the AB-BPM methodology. Inf. Syst. 84, 283–298 (2019). https://doi.org/10.1016/j.is.2018.06.007
Holland, C.W., Cochran, D.: Breakthrough Business Results With MVT: A Fast, Cost-Free “Secret Weapon” for Boosting Sales, Cutting Expenses, and Improving Any Business Process, 1st edn. Wiley, Hoboken, NJ (2005)
Ros, R., Runeson, P.: Continuous experimentation and A/B testing: a mapping study. In: Proceedings of the 4th International Workshop on Rapid Continuous Software Engineering. RCoSE ’18, pp. 35–41. Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3194760.3194766
Chaos Manifesto: Technical Report. The Standish Group International, West Yarmouth, USA (2018)
Hussain, A., Mkpojiogu, E.O., Kamal, F.M.: The role of requirements in the success or failure of software projects. Int. Rev. Manag. Mark. 6(7), 306–311 (2016)
Chun Tie, Y., Birks, M., Francis, K.: Grounded theory research: a design framework for novice researchers. SAGE Open Med. 7, 205031211882292 (2019)
Kurz, A.F., Kampik, T., Pufahl, L., Weber, I.: Reinforcement learning-supported AB testing of business process improvements: an industry perspective. In: International Conference on Business Process Modeling, Development and Support, pp. 12–26. Springer (2023)
Aguilar-Savén, R.S.: Business process modelling: review and framework. Int. J. Prod. Econ. 90(2), 129–149 (2004)
Rosing, M., White, S., Cummins, F., Man, H.: Business process model and notation-BPMN. In: Rosing, M., Scheer, A.-W., Scheel, H. (eds.) The Complete Business Process Handbook, pp. 433–457. Morgan Kaufmann, Boston (2015)
Bonnet, F., Decker, G., Dugan, L., Kurz, M., Misiak, Z., Ringuette, S.: Making BPMN a true lingua franca. BPM Trends (2014)
Aalst, W.M.P., La Rosa, M., Santoro, F.M.: Business process management: don’t forget to improve the process! Bus. Inf. Syst. Eng. 58(1), 1–6 (2016)
Reijers, H.A., Liman Mansar, S.: Best practices in business process redesign: an overview and qualitative evaluation of successful redesign heuristics. Omega 33(4), 283–306 (2005). https://doi.org/10.1016/j.omega.2004.04.012
Kohavi, R., Longbotham, R., Sommerfield, D., Henne, R.M.: Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Disc. 18(1), 140–181 (2009)
Rosenthal, K., Ternes, B., Strecker, S.: Business process simulation: A systematic literature review. In: ECIS, p. 199 (2018)
Kurz, A.F., Santelmann, B., Großmann, T., Kampik, T., Pufahl, L., Weber, I.: HITL-AB-BPM: Business Process Improvement with AB Testing and Human-in-the-Loop. Proceedings of the Demo Session of the 20th International Conference on Business Process Management (2022)
Radnor, D.Z.: Review of Business Process Improvement Methodologies in Public Services. AIM – the UK’s research initiative on management, 94 (2010)
Díaz, J., López-Fernández, D., Pérez, J., González-Prieto, Á.: Why are many businesses instilling a devops culture into their organization? Empir. Softw. Eng. 26(2), 25 (2021). https://doi.org/10.1007/S10664-020-09919-3
Kohavi, R., Longbotham, R.: Online controlled experiments and A/B testing. In: Sammut, C., Webb, G.I. (eds.) Encyclopedia of Machine Learning and Data Mining, pp. 922–929. Springer, Boston, MA (2017)
Aprameya, L.: Improving Duolingo, one experiment at a time (2020). https://blog.duolingo.com/improving-duolingo-one-experiment-at-a-time/ Accessed 2024-05-07
Sutton, R.S., Barto, A.G.: Reinforcement Learning, Second Edition: An Introduction. MIT Press, Cambridge, Massachusetts (2018)
Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of go without human knowledge. Nature 550(7676), 354–359 (2017). https://doi.org/10.1038/nature24270
Nian, R., Liu, J., Huang, B.: A review on reinforcement learning: introduction and applications in industrial process control. Comput. Chem. Eng. 139, 106886 (2020). https://doi.org/10.1016/j.compchemeng.2020.106886
Evans, R., Gao, J.: DeepMind AI reduces Google data centre cooling bill by 40%. https://deepmind.google/discover/blog/deepmind-ai-reduces-google-data-centre-cooling-bill-by-40/ (2016). Accessed 17-08-2024
Burtini, G., Loeppky, J., Lawrence, R.: A survey of online experiment design with the stochastic multi-armed bandit. arXiv:1510.00757 [cs, stat] (2015). https://doi.org/10.48550/arXiv.1510.00757. Accessed 2024-08-18
Li, L., Chu, W., Langford, J., Schapire, R.E.: A contextual-bandit approach to personalized news article recommendation. In: Proceedings of the 19th International Conference on World Wide Web (WWW ’10), pp. 661–670. Association for Computing Machinery, New York, NY, USA (2010). https://doi.org/10.1145/1772690.1772758
Mills, J., Birks, M., Hoare, K.: Grounded theory. In: Mills, J., Birks, M. (eds.) Qualitative Methodology: A Practical Guide, pp. 107–122. SAGE Publications, London (2014). https://doi.org/10.4135/9781473920163.n7
Robson, C.: Real World Research: A Resource for Social Scientists and Practitioner-Researchers, reprint edn. Blackwell, Oxford (1999)
Brinkmann, S., Kvale, S.: InterViews: Learning the Craft of Qualitative Research Interviewing, 3rd edn. SAGE Publications, Los Angeles (2014)
Kuckartz, U.: Qualitative text analysis: a systematic approach. In: Kaiser, G., Presmeg, N. (eds.) Compendium for Early Career Researchers in Mathematics Education, pp. 181–197. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-15636-7_8
Paré, G., Cameron, A.-F., Poba-Nzaou, P., Templier, M.: A systematic assessment of rigor in information systems ranking-type Delphi studies. Inf. Manag. 50(5), 207–217 (2013)
Sebele-Mpofu, F.Y.: Saturation controversy in qualitative research: complexities and underlying assumptions. A literature review. Cogent Soc. Sci. 6(1), 1838706 (2020). https://doi.org/10.1080/23311886.2020.1838706
Sonteya, T., Seymour, L.F.: Towards an understanding of the business process analyst: an analysis of competencies. J. Inf. Technol. Educ.: Res. 11, 043–063 (2012). https://doi.org/10.28945/1568
Vom Brocke, J., Zelt, S., Schmiedel, T.: On the role of context in business process management. Int. J. Inf. Manag. 36(3), 486–495 (2016)
Stiehl, V., Danei, M., Elliott, J., Heiler, M., Kerwien, T.: Effectively and Efficiently Implementing Complex Business Processes: A Case Study. In: Lübke, D., Pautasso, C. (eds.) Empirical Studies on the Development of Executable Business Processes, pp. 33–57. Springer, Cham (2019)
Kampik, T., Weske, M.: Event log generation: an industry perspective. In: Augusto, A., Gill, A., Bork, D., Nurcan, S., Reinhartz-Berger, I., Schmidt, R. (eds.) Enterprise, Business-Process and Information Systems Modeling, pp. 123–136. Springer, Cham (2022)
Vassiliadis, P.: A survey of extract-transform-load technology. Int. J. Data Warehousing Min. (IJDWM) 5(3), 1–27 (2009)
Tsui, T.C.: Experience from the Anti-Monopoly Law Decision in China. Rochester, NY (2013)
Kellogg, K.C., Valentine, M.A., Christin, A.: Algorithms at work: the new contested terrain of control. Acad. Manag. Ann. 14(1), 366–410 (2020). https://doi.org/10.5465/annals.2018.0174
Quin, F., Weyns, D., Galster, M., Silva, C.C.: A/B testing: a systematic literature review. J. Syst. Softw. 211, 112011 (2024). https://doi.org/10.1016/j.jss.2024.112011
Sato, D.M., De Freitas, S.C., Barddal, J.P., Scalabrin, E.E.: A survey on concept drift in process mining. ACM Comput. Surveys (CSUR) 54(9), 1–38 (2021)
Yan, Z., Dijkman, R., Grefen, P.: Business process model repositories: framework and survey. Inf. Softw. Technol. 54(4), 380–395 (2012). https://doi.org/10.1016/j.infsof.2011.11.005
La Rosa, M., Reijers, H.A., Van Der Aalst, W.M., Dijkman, R.M., Mendling, J., Dumas, M., Garcia-Banuelos, L.: Apromore: an advanced process model repository. Expert Syst. Appl. 38(6), 7029–7040 (2011)
Kampik, T., Warmuth, C., Rebmann, A., Agam, R., Egger, L.N.P., Gerber, A., Hoffart, J., Kolk, J., Herzig, P., Decker, G., Aa, H., Polyvyanyy, A., Rinderle-Ma, S., Weber, I., Weidlich, M.: Large process models: a vision for business process management in the age of generative AI. KI - Künstliche Intelligenz (2024). https://doi.org/10.1007/s13218-024-00863-8
Busch, K., Rochlitzer, A., Sola, D., Leopold, H.: Just tell me: prompt engineering in business process management. In: Aa, H., Bork, D., Proper, H.A., Schmidt, R. (eds.) Enterprise, Business-Process and Information Systems Modeling, pp. 3–11. Springer, Cham (2023)
Klievtsova, N., Benzin, J.-V., Kampik, T., Mangler, J., Rinderle-Ma, S.: Conversational process modelling: state of the art, applications, and implications in practice. In: Di Francescomarino, C., Burattin, A., Janiesch, C., Sadiq, S. (eds.) Business Process Management Forum, pp. 319–336. Springer, Cham (2023)
Bozorgi, Z.D., Dumas, M., Rosa, M.L., Polyvyanyy, A., Shoush, M., Teinemaa, I.: Learning When to Treat Business Processes: Prescriptive Process Monitoring with Causal Inference and Reinforcement Learning. In: Indulska, M., Reinhartz-Berger, I., Cetina, C., Pastor, O. (eds.) Advanced Information Systems Engineering. Lecture Notes in Computer Science, pp. 364–380. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-34560-9_22
Metzger, A., Kley, T., Palm, A.: Triggering proactive business process adaptations via online reinforcement learning. In: Fahland, D., Ghidini, C., Becker, J., Dumas, M. (eds.) Business Process Management. Lecture Notes in Computer Science, vol. 12168, pp. 273–290. Springer, Cham (2020)
Wu, X., Xiao, L., Sun, Y., Zhang, J., Ma, T., He, L.: A survey of human-in-the-loop for machine learning. Futur. Gener. Comput. Syst. 135, 364–381 (2022)
Dumas, M., Fournier, F., Limonad, L., Marrella, A., Montali, M., Rehse, J.-R., Accorsi, R., Calvanese, D., De Giacomo, G., Fahland, D., Gal, A., La Rosa, M., Völzer, H., Weber, I.: AI-augmented business process management systems: a research manifesto. ACM Trans. Manag. Inf. Syst. 14(1), 1–19 (2023). https://doi.org/10.1145/3576047
Malinova, M., Gross, S., Mendling, J.: A study into the contingencies of process improvement methods. Inf. Syst. 104, 101880 (2022)
Funding
Open Access funding enabled and organized by Projekt DEAL.
Additional information
Communicated by Han van der Aa and Rainer Schmidt.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.