1 Introduction

IT management has evolved from a human-centric and labor-intensive activity to a process driven by automation, with a few notable exceptions, such as change and service request management. Traditionally performed via ticketing systems, the current process involves several humans coordinating its execution: forming, submitting, and analyzing requests, obtaining approvals where needed, assigning work to a subject matter expert, performing the work, updating records, and notifying the original requester upon completion. Although an underlying service management platform enables the process, the platform merely facilitates the exchange of messages between human performers. As a result, the process remains time-consuming and involves many people, each with their own distinct role in the process.

In this article, we present our work on automating change management in a large managed service provider environment. Our work was motivated by the difficulties inherent to the largely manual change management workflow described in the Information Technology Infrastructure Library (ITIL) - a set of detailed best practices for IT Service Management. Not only does the manual nature of the process make a multitude of human errors possible (choosing the wrong endpoint, misinterpreting the request, obtaining the wrong approval, miscommunicating), but change requests also wait a long time in a queue to be analyzed, approved, or reviewed by a subject matter expert. Thus, aside from process automation, we have also aimed to automate various business functions, such as approvals and determining entitlements. We have found that automation benefits change management in several ways: not only is the process faster, as it bypasses several manual steps and the need for coordination, but it also reduces process complexity (and, implicitly, risk) and offers predictable outcomes.

2 System Architecture

We have built an automated change management workflow starting from the ITIL specification, aiming to keep the ITIL functionality intact while automating as much of the process as possible. This workflow has a reduced number of personas, and in most cases an automation role does all the work. Humans are only needed to initiate changes, approve changes that are not pre-approved automatically, or perform manual pre- and post-execution steps where needed. Below, in Sect. 2.1 we describe the building blocks of our automated implementation, and in Sect. 2.2 we illustrate the automated functionality using the AIX memory management use case.

2.1 Automated Functionality

We have identified a set of key components that must be automated to streamline the ITIL change management workflow. These building blocks, and the mapping graph between the ITIL and our Self-Service Delivery (SSD) workflows, are presented in Fig. 1 and described in more detail below.

Fig. 1. ITIL (left) vs. SSD (right) automated change management process.

  1. Defining User Entitlements - users are added to groups that give them specific rights to initiate or approve different types of change requests, perform capacity approvals, or manually execute specific operations on the endpoints.

  2. Providing an Interface that Validates User Requests - users specify change requests through interfaces or chat bots that provide structure to the received requests and eliminate ambiguity and request misinterpretation.

  3. Retrieving Up-to-date Server State - real-time access to endpoint state provides accurate input for building change requests and allows the system to automatically validate change request outcomes. Scripts discover the state of each managed node's resources (file systems, memory, CPU, cron jobs, etc.) and store it in a repository. Discovery runs before a change is made to an endpoint (to check the current state) and after (to validate the execution result).

  4. Developing Resource Models and Validators - each managed resource is associated with a software model and a set of validators that check the correctness and the technical feasibility of the change requests for that resource.

  5. Defining Business Policies for Pre-approved Requests - business policies allow pre-approving change requests with parameters within acceptable ranges, and limit manual approvals to a handful of special cases. They also eliminate the need to monitor a system after a pre-approved change was made, as the successful execution of such a change guarantees its correctness.

  6. Providing Change Window Schedules and Rules - each request type has (in the business policies) a flag that specifies whether it needs a change window, or can be executed immediately. If a change window is needed, the requester can choose one from a list computed using change window schedules and rules, or let the change execute during the next available change window.

  7. Generic Business Process Diagram for Change Requests - all change requests, regardless of their type, follow the same business cycle illustrated in Fig. 2. After initiation, requests undergo syntactic, technical feasibility, and business policy compliance checks. Requests that pass all the checks are automatically approved. Requests that fail the business policy checks are approved manually. Approved requests are checked for any pre-requisites, and scheduled for execution immediately or in a change window. After execution, the system takes any post-execution steps, discovers the new endpoint state, and determines whether the change was successful or needs to be backed out (a minimal code sketch of this lifecycle follows the list).
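
As a minimal sketch of this generic lifecycle (in Python, with hypothetical names such as ChangeRequest and process_request; the production workflow is considerably richer), the checks, approval routing, scheduling, and back-out decision could be chained as follows:

```python
# Hypothetical sketch of the generic change request lifecycle (item 7);
# names and structures are illustrative, not the production implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ChangeRequest:
    change_type: str       # e.g., "aix_memory"
    endpoint: str          # managed node identifier
    parameters: dict       # structured, validated request parameters
    needs_window: bool     # from the business policy for this change type
    approved: bool = False

def process_request(req: ChangeRequest,
                    validators: List[Callable[[ChangeRequest], bool]],
                    is_pre_approved: Callable[[ChangeRequest], bool],
                    manual_approval: Callable[[ChangeRequest], bool],
                    execute: Callable[[ChangeRequest], bool]) -> str:
    # Syntactic and technical feasibility checks from the resource model.
    if not all(check(req) for check in validators):
        return "rejected: failed validation"
    # Business policy compliance: pre-approve, otherwise route to a human approver.
    req.approved = is_pre_approved(req) or manual_approval(req)
    if not req.approved:
        return "rejected: approval denied"
    # Scheduling: run immediately, or wait for the next available change window.
    if req.needs_window:
        return "scheduled for next change window"
    # Execute, re-discover the endpoint state, and back out on failure.
    return "completed" if execute(req) else "backed out"
```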

2.2 Case Study: Memory Allocation for AIX LPARs

To illustrate how the building blocks described above automate change management, consider memory allocation on AIX Logical Partitions (LPARs) managed by Hardware Management Consoles (HMCs). LPARs and HMCs are the AIX equivalents of virtual machines and hypervisors, respectively. The LPAR memory specification includes: minimum memory - the smallest amount acceptable to boot and operate with; desired memory - the amount of memory used under normal conditions; and maximum memory - the high watermark that will never be exceeded.

The entitlements ensure that logged-in users only see the machines on which they are authorized to manage the memory. A user interface retrieves the minimum, desired, and maximum memory for the selected LPAR, as well as the total and free memory on the HMC, from the server state repository. The software model and the validators for the AIX memory resource check whether the request is technically feasible, i.e., that the amount of memory requested for the LPAR is less than the free memory available on the HMC. Next, requests are checked for compliance with the business policies that govern memory management. These policies specify that requests are pre-approved, except when an LPAR would be allocated less than 1 GB or more than 12 GB of memory, or when allocating memory to the LPAR drops the amount of free memory available on the HMC below \(10\%\) of the total memory. If the new requested desired memory does not fall between the current values of the minimum and maximum memory, the change will require a server reboot and will run during a change window; otherwise, it can proceed immediately. Finally, the process described above follows the business process diagram shown in Fig. 2.
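
A minimal sketch of these validator and policy checks follows; the function names are illustrative, and treating the feasibility check as applying to the additional memory requested (the difference between the new and current desired values) is our assumption.

```python
# Hypothetical sketch of the AIX LPAR memory checks; the thresholds (1 GB, 12 GB,
# 10%) follow the business policy in the text, all other details are assumptions.
def additional_memory(new_desired_gb: float, current_desired_gb: float) -> float:
    return new_desired_gb - current_desired_gb

def is_technically_feasible(new_desired_gb: float, current_desired_gb: float,
                            hmc_free_gb: float) -> bool:
    # The extra memory requested must be available on the HMC.
    return additional_memory(new_desired_gb, current_desired_gb) < hmc_free_gb

def is_pre_approved(new_desired_gb: float, current_desired_gb: float,
                    hmc_free_gb: float, hmc_total_gb: float) -> bool:
    # Manual approval when the LPAR would end up with less than 1 GB or more than
    # 12 GB, or when the HMC would be left with less than 10% of its memory free.
    if new_desired_gb < 1 or new_desired_gb > 12:
        return False
    remaining = hmc_free_gb - additional_memory(new_desired_gb, current_desired_gb)
    return remaining >= 0.10 * hmc_total_gb

def needs_change_window(new_desired_gb: float, current_min_gb: float,
                        current_max_gb: float) -> bool:
    # A desired value outside the current [minimum, maximum] range forces a
    # reboot, so the change must wait for a change window.
    return not (current_min_gb <= new_desired_gb <= current_max_gb)
```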

3 System Analysis

We enhance a prior model ([1, 2]) to analyze and quantify the complexity of the change management process. We keep the construction of the overall complexity metric based on execution, coordination, and business object complexity, and retain the coordination complexity model. We refine the base model to better reflect three key factors of complexity in IT change management: execution, coordination (links), and business objects. Figure 2 shows the T tasks evaluated for complexity. Complexity analysis is performed on a per-task basis (\(C\_exe\), \(C\_link\), and \(C\_bo\) are respectively the execution, coordination, and business object complexity of a task), and it also includes inter-task (between tasks i and \(i+1\)) coordination (\(C\_link_{i,i+1}\)) and business object complexity (\(C\_bo_{i,i+1}\)):

$$\begin{aligned} C_{total} = \sum _{i=1}^{T} (C\_exe_i + C\_link_i + C\_bo_i) + \sum _{i=1}^{T-1} (C\_link_{i,i+1} + C\_bo_{i,i+1}) \end{aligned}$$
(1)
Fig. 2. Task breakdown of ITIL reference model and SSD process.

Each task t consists of a set of execution blocks \(T_t\) and a set of decision blocks \(D_t\), and its execution complexity is the sum of the complexities of the components in \(T_t\) and \(D_t\). The complexity of an execution block is the product of its baseline execution complexity \(C\_base_i\) and the number of roles involved in the execution \(R_i\). \(C\_base_i\) takes the value 0 for automated, \(\{1,2\}\) for tool-assisted, and \(\{2,3\}\) for manual execution. The complexity of a decision block is the product of three factors: \(g_i=\{1,2\}\), which accounts for how well the decision is guided; \(c_i=\{1,2,3\}\), which factors in the risk/impact if a wrong decision is made; and \(R_i\), the number of roles participating in the decision block:

$$\begin{aligned} C\_exe_t = \sum _{i=1}^{T_t} (C\_base_i R_i) + \sum _{i=1}^{D_t} g_i c_i R_i \end{aligned}$$
(2)

The coordination complexity of a task t is the sum of the complexities of the links that connect its execution blocks. We define the complexity of a link l as the product of its coordination complexity \(LinkType_l\) and the number of roles involved in that link minus one (\(R_l - 1\)). \(LinkType_l\) takes integer values that account for the communication complexity between two execution blocks: 1 for a straight pass, 2 when one back-and-forth exchange is needed, and 3 when multiple back-and-forth exchanges are needed.

$$\begin{aligned} C\_link_t = \sum _{l=1}^{L_t} LinkType_l (R_l - 1) \end{aligned}$$
(3)

The business object complexity captures the difficulty of sending, acquiring, and understanding the information communicated between two execution blocks. We denote by \(BO_t\) the set of all the business objects that are passed between the execution blocks of a task t. The complexity of a business object o is the product of its ambiguity factor \(ambi_o\) and the number of roles involved in the object exchange \(R_o\). The ambiguity factor \(ambi_o\) takes the following values: 1 when data can be readily looked up (e.g., ID, category, etc.); \(\{2,3\}\) when data represents system or state information that needs to be discovered (e.g., filesystem path, run state of a server, etc.); and \(\{4,5\}\) when data is complex and may need further user input and entitlement verification (e.g., sudo rights for a user, system folder access permissions, etc.).

$$\begin{aligned} C\_bo_t = \sum _{o=1}^{BO_t} ambi_o R_o \end{aligned}$$
(4)

Note that we can use Eqs. 3 and 4 to compute the inter-task coordination and business object complexities by looking at the links and business objects exchanged between tasks instead of between execution blocks. By plugging Eqs. 2, 3, and 4 into Eq. 1, we compute the total complexity of the change process, accounting for all the execution blocks, coordination efforts, and business objects produced.
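
To make the metric concrete, the following Python sketch computes Eqs. 1-4 over hypothetical data structures; the weights (\(C\_base\), \(g\), \(c\), \(LinkType\), \(ambi\)) and role counts are assigned per process, as described above.

```python
# Hypothetical sketch of the complexity metric in Eqs. 1-4; the data structures
# are illustrative, the arithmetic follows the definitions in the text.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExecBlock:
    c_base: int        # 0 automated, {1,2} tool assisted, {2,3} manual
    roles: int         # R_i: number of roles involved

@dataclass
class DecisionBlock:
    guidance: int      # g_i in {1,2}
    impact: int        # c_i in {1,2,3}
    roles: int         # R_i

@dataclass
class Link:
    link_type: int     # 1 straight pass, 2 one back-and-forth, 3 multiple
    roles: int         # R_l

@dataclass
class BusinessObject:
    ambiguity: int     # ambi_o: 1 lookup, {2,3} discovered state, {4,5} complex
    roles: int         # R_o

@dataclass
class Task:
    exec_blocks: List[ExecBlock] = field(default_factory=list)
    decisions: List[DecisionBlock] = field(default_factory=list)
    links: List[Link] = field(default_factory=list)                 # intra-task
    business_objects: List[BusinessObject] = field(default_factory=list)

def c_exe(t: Task) -> int:                                          # Eq. 2
    return (sum(b.c_base * b.roles for b in t.exec_blocks)
            + sum(d.guidance * d.impact * d.roles for d in t.decisions))

def c_link(links: List[Link]) -> int:                               # Eq. 3
    return sum(l.link_type * (l.roles - 1) for l in links)

def c_bo(objects: List[BusinessObject]) -> int:                     # Eq. 4
    return sum(o.ambiguity * o.roles for o in objects)

def c_total(tasks: List[Task],
            inter_links: List[List[Link]],
            inter_objects: List[List[BusinessObject]]) -> int:      # Eq. 1
    # inter_links[i] and inter_objects[i] describe the hand-off between
    # task i and task i+1 (T-1 entries for T tasks).
    per_task = sum(c_exe(t) + c_link(t.links) + c_bo(t.business_objects)
                   for t in tasks)
    inter_task = sum(c_link(ls) + c_bo(bs)
                     for ls, bs in zip(inter_links, inter_objects))
    return per_task + inter_task
```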

We carried out the computation for the ITIL reference change management process, the change management process implemented by a client, and the SSD change process, for the DB, Hardware, and Network change categories. Figure 3 shows the results. We can observe that, for each category, the client's process tends to be more complex than the ITIL reference model. This is expected, as ITIL is a reference, and additional process steps and coordination are typically needed when a client implements the change process according to the ITIL reference. Overall, SSD significantly reduces the complexity across all the change categories evaluated, showing a reduction of \(66\%-70\%\) compared to the ITIL reference process and a reduction of \(79\%-80\%\) compared to the client change process.

Fig. 3. ITIL, Client and SSD complexity scores for DB, Hardware and Network changes.

4 System Evaluation

To provide a quantitative estimate of the time savings introduced by our automation process, we have analyzed data from the ticketing system repository for three accounts served by IBM (A, an IT services customer; B, a logistics customer; and C, a financial services customer). The change request records contain a text description of the change, the date and time when the request was received (\(\mathbf {t_{received}}\)), when its execution started (\(\mathbf {t_{exec-start}}\)) and ended (\(\mathbf {t_{exec-end}}\)), and when it was closed (\(\mathbf {t_{closed}}\)). We analyze change requests in the database, hardware, networking, and OS management categories. The total time taken by a change request is the sum of the pre-execution, execution, and post-execution times, calculated as: \(t_{pre-execution} = t_{exec-start} - t_{received}\), \(t_{execution} = t_{exec-end} - t_{exec-start}\), and \(t_{post-execution} = t_{closed} - t_{exec-end}\).
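
For reference, the per-request time breakdown follows directly from the four timestamps; the short sketch below uses hypothetical field names and made-up dates.

```python
# Hypothetical sketch of the per-ticket time breakdown; the timestamps in the
# example are invented, the formulas mirror those given above.
from datetime import datetime

def time_breakdown(received: datetime, exec_start: datetime,
                   exec_end: datetime, closed: datetime) -> dict:
    return {
        "pre_execution": exec_start - received,   # queueing, analysis, approvals
        "execution": exec_end - exec_start,       # actual change execution
        "post_execution": closed - exec_end,      # monitoring and closure
    }

# Example: received Monday morning, executed in a Wednesday-night change window,
# closed on Friday after monitoring.
example = time_breakdown(datetime(2019, 3, 4, 9, 0),
                         datetime(2019, 3, 6, 22, 0),
                         datetime(2019, 3, 6, 23, 30),
                         datetime(2019, 3, 8, 17, 0))
```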

Table 1. Pre-execution times (automated and current) and post-execution times (current)

Table 1 shows the pre- and post-execution times for the three accounts. The pre-execution time for the automation represents a conservative upper bound, where we assumed that each automated change request will wait for the next available change window, calculated using the schedule for each account (account A has three change windows a week, while accounts B and C have two change windows a week). We calculated this upper bound because we could not determine from the available data whether a given request would execute immediately or in a change window. Even under these conservative assumptions, the automation significantly reduces the pre-execution time, by between \(55\%\) and \(82\%\) for the different client accounts. The pre-execution times vary between accounts, depending on the complexity of the implementation of the ITIL processes currently in place. We did not see significant improvements in the change execution time; this is not surprising, as considerable research and effort has already been put into the execution of changes. The post-execution time is larger for the accounts where it is customary to monitor the systems where a change took place for several days before closing that change. As this monitoring becomes unnecessary for automated pre-approved changes, we expect our system to considerably cut down the post-execution time, by a percentage proportional to the percentage of pre-approved requests.

5 Related Work

ServiceNow [3] is a commercially available IT Service Management framework that includes both service catalog creation and self-service capabilities, but requires a high degree of customization [4]. Configuration management software (such as Chef [5] or Ansible [6]) allows discovering state and making changes on the endpoints, but does not support the business aspects of the change management process, including entitlements, validation, change windows, and compliance with business policies. From the perspective of analyzing the complexity of IT service management, [7] analyzes key performance indicators and their inter-relationships to reason about and schedule the transformation of service delivery systems, while [8] proposes a framework for minimizing human errors in change management from the point of view of change preparation and execution. An infrastructure for evaluating change risk is proposed in [9], which examines the history of similar changes performed on endpoints with similar configurations. A model to quantify the complexity of the IT service management process, and the business value of introducing new IT processes, is introduced in [1] and [2].

6 Conclusion and Future Work

We have presented a change management system that automates the ITIL workflow while preserving its functionality, and a model to measure the reduction in complexity brought by the automation. Going forward, we plan to investigate using Terraform [10] for orchestration and OpenWhisk [11] for implementing the actions in the workflows. By gathering data as our solution is deployed in new accounts, we aim to validate the correlation between the complexity analysis model and the time it takes to process various change requests.