
1 Introduction

To improve the efficiency of IT service management processes, much focus has been placed on automation [1] to reduce human error and streamline operations. Consider the IT service change management process, which is designed to ensure that codified procedures are followed for all changes to IT infrastructure, such as adding filesystems, recycling database instances, upgrading memory, and so on. Traditionally, a human expert would submit a ticket to initiate a change request. The request would be assigned to another human expert (typically in delivery), who would execute it. Without standardization and automation, this has often resulted in inconsistent execution (e.g., each administrator executing their own scripts to process the requests) and incomplete or inaccurate requests (e.g., missing parameters, unknown system state, etc.). Compared to earlier days, current user-interface based tools have significantly reduced the burden of accessing a specific IT system, checking the availability of IT components and packages on the system, and executing appropriate commands with the required parameters. To illustrate this, we show a sample user interface in Fig. 1. The user interface shows a couple of drop-down menus for executing database-specific change requests such as table restore, table backup, Oracle reporting, and so on. IT change service catalogs are then created to present a common front-end UI for all available IT change service requests. A user is expected to browse and navigate through a catalog service in a largely self-service fashion.

Fig. 1. Catalog service for executing database-specific change requests.

Through our experience in designing, developing, and delivering self-service based change management services for various clients, we have encountered two interesting problems. First, as the number of supported change request types, systems, middleware, and applications grows, the number of service APIs also grows rapidly. Although a hierarchical organization can be created to assist the user in navigating the service catalog, that organization may not conform to a client’s IT change organization or fit its users’ level of IT expertise. Keyword based search is another popular method of navigating service catalogs. Unfortunately, it works poorly on IT service changes due to the large variance in utterances and terminologies used to describe a specific change task, and it assumes the user knows which terminology to search with. Second, a service API requires a specific set of parameters to be filled in, typically presented to the user as a set of list and/or text boxes; the user may not understand which values are expected in which system context, as the API developers intended. To overcome these challenges in an efficient, flexible, and user-friendly way, we propose a catalog recommendation service (Cataloger) that takes a natural language change request as its input and identifies the right service API. It also extracts from the change request any parameters required by the API.

Contribution. Cataloger provides the following capabilities: (1) a classification approach that categorizes IT change requests into categories, tasks, and actions, where actions can be mapped to APIs, (2) a sequential classification approach that identifies parameters in IT change requests that can be mapped to specific API parameters, (3) a feedback based approach that uses the parameter classification process to improve the classification of categories, tasks, and actions, and to train the change request classifiers on new client data sets. For the evaluation, we consider IT change request types such as database and hardware from multiple clients. We believe our approach generalizes to other change categories such as operating system (OS) management, middleware, networking, applications, and so on.

Organization. First, we describe the related work. Then, we describe the overall process of associating change requests with a specific catalog service. In the process, we describe the classification techniques to identify categories, tasks, and actions and to extract parameters from change requests. Further, we propose a feedback mechanism that improves the classification of categories, tasks, and actions. We evaluate the classification and parameter extraction techniques on multi-client change requests using traditional approaches, and we evaluate the proposed feedback mechanism. Finally, we summarize our findings and discuss future work.

2 Related Work

In our contribution, we emphasize matching the intent of users’ change requests to APIs. In terms of identifying intents from change requests, Lucca et al. [13], Kadar et al. [10], and Bogojeska et al. [5] propose supervised approaches based on support vector machines, logistic regression, and random forests, respectively. The limitation of such supervised approaches is that they do not consider the multi-label or hierarchical multi-label nature of the intents. There have been several works on minimizing the labeling effort using active learning [21] and hierarchical clustering techniques [14]; however, they do not address matching the intent of users’ change requests to the intent of an API or its parameters. Moreover, none of these approaches extract parameters from change requests that can be mapped to the parameters of an API. Le et al. [12] map text descriptions to corresponding APIs; however, they do not focus on discovering the intents of the text descriptions.

We find two limitations with the existing approaches: (1) the approaches for classifying change requests are independent of the hierarchical intents of APIs. To map change requests to APIs, we need approaches that can extract hierarchical, or at least multiple, intents from a change request. For example, the change request restore a database xyz has the labels category: database, task: database backup, and action: restore database. (2) Existing approaches ignore the parameters of APIs that can be extracted from change requests using sequential classification techniques. The extracted parameters can be reasoned about to improve the classification of change requests.

3 The Cataloger Approach

We label an IT change request with a hierarchy of categories, tasks, and actions (CTAs): (1) a category describes an IT change task by its broad technology service area. For example, database, hardware, and os management are categories. Categories are long-standing and generally technology neutral. (2) A task refers to a group of similar change activities under a category. For example, the database category has tasks such as backup, management, user administration, etc. Tasks are technology dependent and distinct from each other. (3) An action maps to a specific automation API, which is technology and service provider dependent. For example, the database management task has actions such as create and drop a database (DB) and increase and reduce a tablespace size, as provided by a service catalog.

CTAs provide us with multiple scopes of increasingly refined coverage when matching APIs to a user’s intent. In Table 1, we show examples of change requests and their hierarchical labels. Based on CTA, we represent a change request (\({\mathsf {CR}}\)) as a tuple of category (\({\mathsf C}\)), task (\({\mathsf T}\)), action (\({\mathsf A}\)), and parameters (\({\mathsf {PR}}\)), i.e., \({\mathsf {CR}}\) = \(\langle \) \({\mathsf C}\), \({\mathsf T}\), \({\mathsf A}\), \({\mathsf {PR}}\) \(\rangle \). To map a \({\mathsf {CR}}\) to a catalog service, we first identify the \(\langle \) \({\mathsf C}\), \({\mathsf T}\), \({\mathsf A}\) \(\rangle \) associated with the \({\mathsf {CR}}\), adopting different classification techniques to obtain the labels. Second, based on the obtained label \({\mathsf A}\), we extract the parameters \({\mathsf {PR}}\) needed for a user to execute the catalog service. Third, based on the parameters \({\mathsf {PR}}\), we evaluate whether the \({\mathsf {CR}}\) has been classified correctly into \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\).

Table 1. Change requests and the hierarchical labels for each request.

Consider the example \({\mathsf {CR}}\) increase CPU for X to 8vcpu from Table 1. Based on the classification technique, we first identify \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) as hardware, cpu, and increase cpu, respectively. Then, we extract parameters such as the server name X and the number of cpus, i.e., 8vcpu. We then use the parameters to verify whether the identified \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) are valid. If not, we iterate through other values of \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) to match the extracted \({\mathsf {PR}}\). We further describe this feedback approach in Sect. 3.3. Accordingly, our process involves the following four subprocesses (Fig. 2): (1) preprocess CR: we preprocess users’ change requests by removing stop words and stemming them. This reduces the feature space and thereby the chances of over-fitting. (2) Identify CTA: we investigate several classification techniques, namely single-label classification (SLC), multi-label classification (MLC), and hierarchical multi-label classification (HMLC), to determine which technique is most effective in predicting CTAs. (3) Extract parameters: we adopt sequential classification techniques, namely conditional random fields (CRF) and long short-term memory networks (LSTM), for extracting API-specific parameters from change requests. (4) Feedback: we use the extracted parameters to validate the predicted \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) and, if needed, revise the classification (Sect. 3.3).
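
As a small illustration of the CR = ⟨C, T, A, PR⟩ representation and the worked example above, the following Python sketch is purely illustrative; the class and field names are our own and not part of Cataloger.

```python
# A minimal, purely illustrative sketch of the CR tuple; the class and field
# names are hypothetical and not part of Cataloger.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class ChangeRequest:
    text: str
    category: str = ""          # C, e.g. "hardware"
    task: str = ""              # T, e.g. "cpu"
    action: str = ""            # A, e.g. "increase cpu"
    parameters: Dict[str, str] = field(default_factory=dict)  # PR

cr = ChangeRequest(
    text="increase CPU for X to 8vcpu",
    category="hardware", task="cpu", action="increase cpu",
    parameters={"server": "X", "amount": "8vcpu"},
)
```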

In the following subsections, we first present details on the main subprocesses: identifying CTAs and extracting parameters. Then, we describe the feedback mechanism.

Fig. 2. The process illustrating the identification of a catalog service.

3.1 Classification of Change Requests

As a first step, we extract n-grams from the change requests \({\mathsf {CR}}\). We preprocess the n-grams by removing stop words and stemming them with the Porter stemmer. We remove punctuation such as {@, &, -, _, #, <, >, (, ), [, ], {, }, *, +, =, :}. To vectorize the words, we build a tf-idf vectorizer tf-idf(tr, d), where tr represents the words in the change request d. To normalize tf-idf(tr, d), we use L2-normalization, which scales each tf-idf vector to unit norm. We consider a change request \({\mathsf {CR}}\) to have multiple labels, namely a category \({\mathsf C}\), a task \({\mathsf T}\), and an action \({\mathsf A}\), organized into a hierarchy based on a specific catalog service, where \({\mathsf C}\) is the parent of \({\mathsf T}\), which is the parent of \({\mathsf A}\). For classifying the labels, we evaluate a set of well-known techniques. To the best of our knowledge, these techniques have not been applied and evaluated in the context of IT change requests.
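
A minimal sketch of this preprocessing and vectorization step, assuming NLTK and scikit-learn; the stop word list, tokenizer, and n-gram range are illustrative choices rather than the exact ones used in Cataloger.

```python
# Preprocessing + tf-idf vectorization sketch (illustrative choices).
# Requires: nltk.download("stopwords") once.
import re
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))
PUNCT = re.compile(r"[@&\-_#<>()\[\]{}*+=:]")

def preprocess(change_request: str) -> str:
    """Remove punctuation and stop words, then stem the remaining tokens."""
    text = PUNCT.sub(" ", change_request.lower())
    tokens = [stemmer.stem(t) for t in text.split() if t not in stop_words]
    return " ".join(tokens)

requests = ["Increase CPU for X to 8vcpu", "Restore a database xyz"]
# L2 normalization follows the description in the text; ngram_range is assumed.
vectorizer = TfidfVectorizer(preprocessor=preprocess, ngram_range=(1, 2), norm="l2")
X = vectorizer.fit_transform(requests)  # sparse tf-idf feature matrix
```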

Single-Label Classification (SLC) For single-label classification, we try two approaches, SLC-A and SLC-B. In the first approach, SLC-A, we append the labels \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) to create new labels \({\mathsf C}\):\({\mathsf T}\):\({\mathsf A}\). The number of classes generated in this approach is the same as the number of action labels in \({\mathsf A}\). To predict the classes, we build a classifier that uses a linear Support Vector Machine (SVM) [7]. For training the model, we consider as input the set of change requests \({\mathsf {CR}}\) and the classes \({\mathsf C}\):\({\mathsf T}\):\({\mathsf A}\). Figure 3(a) captures the approach at the level of tasks \({\mathsf T}\). For brevity, we omit the action-level nodes.

Fig. 3. Two approaches for single-label classification.

In the second approach, SLC-B, we create individual classifiers for \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). For example, in Fig. 3(b), we first create a classifier to identify whether a \({\mathsf {CR}}\) belongs to a label in \({\mathsf C}\), i.e., hardware or database. Then, for each label in \({\mathsf C}\), we create classifiers to predict its corresponding tasks in \({\mathsf T}\). For example, for the database and hardware labels, we create separate classifiers to predict their respective tasks. We repeat the process for \({\mathsf A}\), creating classifiers with respect to the labels in \({\mathsf T}\). For example, we create separate classifiers for database management and user admin; for hardware, we create a classifier for cpu. To create the classifiers for \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\), we use the linear SVM [7]. For prediction, we start by predicting the labels in \({\mathsf C}\). Based on the predicted label and its confidence score, we determine which classifier to invoke for the labels in \({\mathsf T}\). Figure 3(b) captures the approach as SLC-B at the level of \({\mathsf T}\). The approach is similar to that of Barutcuoglu et al. [2].
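
The following sketch contrasts the two single-label strategies, assuming scikit-learn; the toy data and variable names are illustrative.

```python
# SLC-A vs. SLC-B sketch (toy data, illustrative names).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

requests = ["restore a database xyz", "add 2vCPU to server X"]
cats = ["database", "hardware"]
tasks = ["dbbackup", "cpu"]
actions = ["restore db", "increase cpu"]

# SLC-A: one classifier over concatenated C:T:A labels.
slc_a_labels = [f"{c}:{t}:{a}" for c, t, a in zip(cats, tasks, actions)]
slc_a = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(requests, slc_a_labels)

# SLC-B: a classifier per node; only the top (category) level is shown here.
# At prediction time, the predicted category selects which task-level
# classifier to invoke, and so on down to actions.
slc_b_root = make_pipeline(TfidfVectorizer(), LinearSVC()).fit(requests, cats)

print(slc_a.predict(["restore db abc"]), slc_b_root.predict(["restore db abc"]))
```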

Multi-label Classification (MLC) For multi-label classification, we adopt two approaches. One, RAKEL (RAndom K-labELsets) [19], which considers an ensemble of label powerset (LP) classifiers [6]. Two, the classifier chain (CC) approach [17], which models correlations between labels while maintaining acceptable computational complexity.

The RAKEL approach [19] uses the concept of k-labelsets. Consider the labels \(L = 1, \ldots , |L|\) from \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). A k-labelset is a set Y \(\subseteq \) L with k = |Y|. For simplicity, we use \(L^k\) to denote the set of all distinct k-labelsets. RAKEL creates an ensemble of m LP classifiers. For each i in 1, \(\ldots \), m, RAKEL selects a k-labelset \(Y_i\) from \(L^k\) without replacement. Then, it learns an LP classifier \(h_i\): X \(\rightarrow \) P(\(Y_i\)). For classifying a new instance, i.e., a change request, each model \(h_i\) provides a binary decision \(h_i\)(x, \(\lambda _j\)) for each label in its k-labelset \(Y_i\). For each label \(\lambda _j\) \(\in \) L, RAKEL computes the average of these decisions; if the average is greater than 0.5, RAKEL outputs a positive result for that label.

The CC approach [17] involves |L| binary classifiers. The classifiers are linked in a chain where each classifier is trained to predict \(l_j\) \(\in \) L. Consider an input domain x = [\(x_1\), \(\ldots \), \(x_d\)] with d attributes extracted from \({\mathsf {CR}}\)s and a set of labels \(\mathcal {L}\) = [1, \(\ldots \), L] that corresponds to the labels in \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). Each instance x is associated with a subset of labels represented as an L-vector [\(y_1\), \(\ldots \), \(y_L\)], where \(y_j\) is 1 if label j is associated with the instance x. For training, the approach considers training data D = {(\(x_i\), \(y_i\))} with N samples, where \(y_j^i\) denotes the jth label of the ith example. During the training phase, the approach forms a classifier chain h = (\(h_1\), \(\ldots \), \(h_L\)), where each \(h_j\) in the chain is a classifier responsible for learning and predicting the binary association of the jth label given the attribute space augmented with the prior binary relevance predictions in the chain. Classification begins at \(h_1\) and propagates along the chain: the jth binary classifier predicts the relevance of the jth label.
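
A sketch of the two multi-label strategies, assuming scikit-learn for the classifier chain and scikit-multilearn for RAKEL; the toy binary label matrix covers only a small slice of the CTA labels.

```python
# Multi-label classification sketch: classifier chain and RAKEL (illustrative data).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

requests = ["restore a database xyz", "backup db abc",
            "add 2vCPU to server X", "reduce cpu on Y"]
# Label columns: database, hardware, dbbackup, cpu (a tiny slice of CTA labels).
Y = np.array([[1, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 0, 1]])
X = TfidfVectorizer().fit_transform(requests)

# Classifier chain: one binary classifier per label, each one also seeing the
# predictions of the previous classifiers in the chain as extra features.
cc = ClassifierChain(LogisticRegression(), order="random", random_state=0).fit(X, Y)
print(cc.predict(X))

# RAKEL (ensemble of label-powerset classifiers over random k-labelsets),
# e.g. via scikit-multilearn (assumed to be installed):
#   from skmultilearn.ensemble import RakelD
#   rakel = RakelD(base_classifier=LogisticRegression(), labelset_size=3).fit(X, Y)
```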

Hierarchical Multi-label Classification (HMLC) Hierarchical multi-label classification considers both the labels and the hierarchy constraint among the labels when creating a classifier. We examine two approaches. One, the CLUS-HMC approach of Vens et al. [20], which learns one tree to predict all the classes. Two, CSSAG, provided by Bi and Kwok [3], which uses the Condensing Sort and Select Algorithm (CSSA) to find an optimal approximating subtree of a tree.

CLUS-HMC [20] is a decision tree based learner and considers the following: one, all the labels present in \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) as its classes CL with a partial ordering \(\le _h\) among the classes, i.e., \(c_1\) \(\le _h\) \(c_2\); two, a set of T examples (\(x_i\), \(S_i\)), where \(x_i\) are the features extracted from a change request \({\mathsf {CR}}_i\) and \(S_i\) \(\subseteq \) CL; three, a quality criterion q that rewards models with high predictive accuracy and low complexity. The goal of the approach is to find f: \({\mathsf {CR}}\) \(\rightarrow \) 2\(^{CL}\) such that f maximizes q and c \(\in \) f(x) \(\implies \) \(\forall \) c’ \(\le _h\) c : c’ \(\in \) f(x). The approach uses the predictive clustering tree (PCT) framework [4] to view a decision tree as a hierarchy of clusters. In the framework, the top node corresponds to a single cluster that is recursively partitioned into smaller clusters.
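
The hierarchy constraint stated above (c in f(x) implies every ancestor c’ ≤_h c is also in f(x)) can be written as a small consistency check; this is not the CLUS-HMC learner itself, and the parent map below is a hypothetical slice of the CTA hierarchy.

```python
# Illustrative check of the hierarchy constraint, not the CLUS-HMC algorithm.
from typing import Dict, Set

# Hypothetical parent map over a slice of the CTA hierarchy.
PARENT: Dict[str, str] = {
    "dbbackup": "database",
    "restore db": "dbbackup",
    "cpu": "hardware",
    "increase cpu": "cpu",
}

def is_hierarchy_consistent(predicted: Set[str]) -> bool:
    """Every predicted label must have all of its ancestors predicted too."""
    for label in predicted:
        node = label
        while node in PARENT:
            node = PARENT[node]
            if node not in predicted:
                return False
    return True

print(is_hierarchy_consistent({"database", "dbbackup", "restore db"}))  # True
print(is_hierarchy_consistent({"restore db"}))                          # False
```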

CSSAG [3] uses kernel dependency estimation (KDE) to reduce a large number of labels to manageable single-label learning problems. To preserve the hierarchy information among the labels, CSSAG uses the Condensing Sort and Select Algorithm to find an optimal approximating subtree of a tree. The subtree is used to construct a multi-label that is consistent with respect to the tree. For CSSAG, the training data is represented as {(\(\mathbf {x_i}\), \(\mathbf {y_i}\))}, where \(\mathbf {x_i}\) represents the features extracted from a change request \({\mathsf {CR}}_i\) and belongs to an input space \(\mathcal {X}\), \(\mathbf {y_i}\) \(\in \) {0,1}\(^d\) is an output vector, and d is the number of labels in \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). Each \(\mathbf {y_i}\) can have more than one nonzero entry.

Comparing the Classification Approaches The SLC approaches have some limitations. In the SLC-A approach, the classes at the lower levels have low frequencies in the data. In the SLC-B approach, the number of classifiers increases with the number of labels in \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). The multi-label approaches do not suffer from the limitations of SLC; however, they do not consider the hierarchical organization of the labels. We evaluated the classifiers on different change requests. The details of our data set and the evaluation methodology are described in Sect. 4. Figure 4 shows the results for the classification of \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) using SLC-A, SLC-B, CC, LP, CLUS-HMC, and CSSAG, respectively. Our evaluations show that CC has the best performance. We believe multiple reasons contribute to this outcome: one, CTA favors multi-label approaches over single-label ones; two, CTA has only three hops between the root (\({\mathsf C}\)) and the leaves (\({\mathsf A}\)) and is a complete tree, so the organization is too simple for HMLC to be advantageous. However, we are able to leverage the hierarchy of CTA when using the CC approach in our feedback mechanism, as we describe in Sect. 3.3.

3.2 Extracting Parameters from Change Requests

The classification subprocess associates a change request with a service API. Next, we extract parameters, if present, from the change request. We consider two methods: conditional random fields (CRF) [8, 15] and long short-term memory networks (LSTMs) [9]. Some approaches use Hidden Markov Models (HMM) [22] to extract method specifications. However, HMM based models assume conditional independence among the observations, whereas CRFs do not require such independence assumptions. Apart from CRF and HMM, there are ontology and rule based approaches [12, 16, 18]; however, as data from different clients is included, they are prone to fail.

Conditional Random Fields (CRF) We adopt a named entity recognition technique based on CRF to extract parameters \({\mathsf {PR}}\) from change requests \({\mathsf {CR}}\). A change request contains a set of words that can be represented as observations \(\mathbf {x}\). Each word can be associated with a label \(\mathbf {y}\) that represents a state. \({\mathsf {PR}}\) contains a set of parameters that is a subset of the labels \(\mathbf {y}\). Given \(\mathbf {x}\) and \(\mathbf {y}\), CRF captures the relationship between (\(\mathbf {x}\), \(\mathbf {y}\)) as feature functions. For the classification, CRF employs discriminative modeling, where the distribution p(\(\mathbf {y}\) \(|\) \(\mathbf {x}\)) is learned directly from the data. Feature functions in a CRF are of two types: one based on the state-state pair (\(\mathbf {y}_t\), \(\mathbf {y}_{t-1}\)) and another based on the state-observation pair (\(\mathbf {x}_t\), \(\mathbf {y}_{t}\)).
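
A minimal sketch of CRF-based parameter extraction, assuming the sklearn-crfsuite package; the token features are a small subset of those listed in Sect. 4.2, and the training examples and tag names are illustrative.

```python
# CRF parameter-extraction sketch (sklearn-crfsuite assumed; toy training data).
import sklearn_crfsuite

def token_features(tokens, i):
    w = tokens[i]
    return {
        "lower": w.lower(),
        "is_digit": w.isdigit(),
        "is_alnum": w.isalnum() and not w.isalpha() and not w.isdigit(),
        "is_upper": w.isupper(),
        "prev": tokens[i - 1].lower() if i > 0 else "<bos>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<eos>",
    }

train = [
    (["add", "2GB", "RAM", "to", "server", "abc01"],
     ["action", "amount", "NOUN", "ADP", "NOUN", "server name"]),
    (["increase", "CPU", "for", "X", "to", "8vcpu"],
     ["action", "NOUN", "ADP", "server name", "ADP", "amount"]),
]
X = [[token_features(toks, i) for i in range(len(toks))] for toks, _ in train]
y = [tags for _, tags in train]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, y)
test = ["add", "4GB", "RAM", "to", "server", "xyz02"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```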

Long Short-Term Memory Network (LSTM) Apart from CRF, we consider LSTM for extracting parameters from change requests, since LSTMs have recently been used for named entity recognition [11]. An LSTM is a recurrent neural network (RNN) that takes a sequence of inputs (\(x_1, x_2, \ldots , x_n\)) and outputs another sequence (\(h_1, h_2, \ldots , h_n\)). The LSTM captures long-range dependencies by incorporating a memory cell. Using several gates, the LSTM controls the proportion of the input to give to the memory cell and the proportion of the previous state to forget. The gates are composed of a sigmoid neural network layer and a pointwise multiplication operation.
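
A minimal sketch of an LSTM tagger for this task, assuming TensorFlow/Keras; the vocabulary size, tag set size, sequence length, and toy data are hypothetical, and real use would require integer-encoding the tokens and tags.

```python
# LSTM sequence-tagging sketch (TensorFlow/Keras assumed; hypothetical sizes).
import numpy as np
from tensorflow.keras import layers, models

VOCAB, TAGS, MAX_LEN = 500, 6, 12  # illustrative sizes

model = models.Sequential([
    layers.Embedding(VOCAB, 32, mask_zero=True),            # token ids -> vectors
    layers.LSTM(32, return_sequences=True),                  # one hidden state per token
    layers.TimeDistributed(layers.Dense(TAGS, activation="softmax")),  # tag per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Toy integer-encoded batch: 2 padded requests, one tag id per token.
X = np.random.randint(1, VOCAB, size=(2, MAX_LEN))
y = np.random.randint(0, TAGS, size=(2, MAX_LEN))
model.fit(X, y, epochs=1, verbose=0)
print(model.predict(X).shape)  # (2, MAX_LEN, TAGS)
```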

Comparing CRF and LSTM We evaluated the performance of CRF and LSTM against the change request data set of Sect. 4. Figure 5 shows the performance of the two. We find that LSTM performs much worse than CRF; we believe this is because we do not have sufficient training data. On the other hand, CRF requires the feature set to be specified as input, while LSTM does not.

3.3 Feedback Approach

In our automated process, we first classify the change request and then extract parameters from it. Through our experiments, we made the following two observations: one, with CTA, the classification of a \({\mathsf {CR}}\) is more accurate at the \({\mathsf C}\) and \({\mathsf T}\) levels than at \({\mathsf A}\). This is not surprising: due to CTA’s hierarchical nature, we expect a loss of accuracy further down the hierarchy (hierarchical loss). Two, if the classification is wrong, parameter extraction has no chance of producing a valid set of parameter matches (parameter confusion). Hence, it is intuitive to use a failed parameter extraction (for a \({\mathsf {CR}}\)) as negative API feedback to the classifier, and there is a good probability of finding a positive API match by performing parameter extraction on the immediate sibling APIs of the negative one. Our feedback mechanism has the following advantages:

  • improved accuracy of classification: classification approaches rely on specific words that help in identifying a relevant \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). Since change requests share similar words, such as action verbs and nouns, confusion increases while narrowing down from categories to actions.

  • a method for onboarding a new client’s \({\mathsf {CR}}\)s without manual labeling: feedback from the parameter extractor supplies the new client’s \({\mathsf {CR}}\)s with correct labels for training the change request classifiers.

  • decoupled training of catalog classifiers and parameter extractors: onboarding new CTA types only requires training the catalog classifiers; onboarding catalog APIs (under an existing CT) only requires training a new parameter extractor for the API.

In Table 2, consider the first two change requests, which have been identified with the labels \({\mathsf C}\) = database, \({\mathsf T}\) = dbbackup, and \({\mathsf A}\) = backupdb. Clearly, the first change request does not fall into the database category; however, keywords such as backup led to the false classification. Similarly, keywords such as add and server led to the false classification of the fourth change request in Table 2.

Table 2. Examples of change requests with confusion.

To avoid such misclassification, we rely on the parameters extracted from a change request to identify \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). Each action \({\mathsf A}\) can be associated with special parameters in \({\mathsf {PR}}\) that are relevant to \({\mathsf A}\). For example, backup database has a special parameter such as the backup mode (online, offline), increase tablespace has a special parameter such as the buffer size, and increase cpu or memory has special parameters such as the amount of cpu or memory to be increased. In our approach, first, based on the default parameters of each action, we assign weights to the parameters that indicate the specificity of the action. From the weights of an action’s parameters we compute its expected_weight. Table 3 shows examples of action parameters and their weights.

Table 3. Examples of parameters of actions and their weights.

For any parameter that occurs in more than one action, we consider its weight to be 1. For example, for parameters such as database and server names, we consider the weight as 1. For parameters specific to an action, we consider the weight as 2; for example, we assign a weight of 2 to mode for database backup and to buffer for tablespace. For an incoming change request, we extract its parameters as described in Sect. 3.2. From the parameters, we determine the actual_weight. We use \(param\_confusion\) = \(\frac{actual\_weight }{expected\_weight }\) to reason about \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). Consider the examples of cpu change requests from Table 2. For the change request Add 2vCPU to server X, the extracted parameters are the action (add), the amount (2vCPU), and the server (X). We compute the actual weight by combining the weights of the action, amount, and server from Table 3. The \(param\_confusion\) is 1 since we could identify all the parameters from the change request. For the change request add outbound servers to X for RDP access, we could identify only the action (add) and the server (X); thus, the \(param\_confusion\) is 0.5. Setting a high threshold for \(param\_confusion\), e.g., 0.8, puts this change request into the others category.
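
A sketch of the param_confusion computation under the weighting scheme just described; the weight table is an illustrative slice, not the full Table 3.

```python
# param_confusion sketch: actual_weight / expected_weight per action.
from typing import Dict, Iterable

# Hypothetical slice of Table 3: expected parameters and weights per action.
ACTION_PARAMS: Dict[str, Dict[str, int]] = {
    "increase cpu": {"action": 1, "amount": 2, "server": 1},
    "backup db":    {"action": 1, "mode": 2, "database": 1},
}

def param_confusion(action: str, extracted: Iterable[str]) -> float:
    """Ratio of the weights of extracted parameters to the expected weights."""
    weights = ACTION_PARAMS[action]
    found = set(extracted)
    actual = sum(w for p, w in weights.items() if p in found)
    expected = sum(weights.values())
    return actual / expected

# "Add 2vCPU to server X": all expected parameters were extracted.
print(param_confusion("increase cpu", ["action", "amount", "server"]))  # 1.0
# "add outbound servers to X for RDP access": only action and server found.
print(param_confusion("increase cpu", ["action", "server"]))            # 0.5
```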

4 Evaluation Methods

We evaluate our approach shown in Fig. 2 using real-world data. For the evaluation, we create datasets with change requests collected from different clients of IBM. Table 4 shows the details of the datasets for the categories database and hardware. We chose the database and hardware categories since they are the most common categories across clients.

Table 4. Datasets prepared from various clients.
Fig. 4. The results for classification of change requests for different clients in terms of macro precision, recall, and f-measure.

For the database changes, the tasks we consider are (1) backup, (2) management, (3) run operations, and (4) user admin. For the hardware changes, the tasks we consider are (1) cpu and (2) memory. For the database management task, we consider the (1) create database, (2) drop database, (3) increase tablespace, and (4) reduce tablespace actions. For the database backup task, we consider the (1) backup and (2) restore actions. For the database run operations task, we consider the (1) run sql script, (2) start database, and (3) stop database actions. For the database user admin task, we consider the (1) grant user and (2) revoke user actions. For the hardware cpu task, we consider the (1) increase cpu and (2) reduce cpu actions. For the hardware memory task, we consider the (1) increase memory and (2) reduce memory actions.
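
For reference, the evaluated CTA hierarchy above can be written out as a nested structure; this lists only what the text states.

```python
# The evaluated CTA hierarchy from this section, as a nested dict (reference only).
CTA_HIERARCHY = {
    "database": {
        "backup": ["backup", "restore"],
        "management": ["create database", "drop database",
                       "increase tablespace", "reduce tablespace"],
        "run operations": ["run sql script", "start database", "stop database"],
        "user admin": ["grant user", "revoke user"],
    },
    "hardware": {
        "cpu": ["increase cpu", "reduce cpu"],
        "memory": ["increase memory", "reduce memory"],
    },
}
```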

4.1 Evaluation of CTA Classification Approaches

In the first step of the classification, we label the data with respect to \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\). For the labels, we collect annotations from two annotators and compute their inter-rater agreement score. Then, we resolve the ambiguities to achieve a satisfactory agreement score (> 80%). For the evaluation, we consider the six approaches (1) SLC-A, (2) SLC-B, (3) CC, (4) LP, (5) CLUS-HMC, and (6) CSSAG described in Sect. 3.1. For each approach, we perform three-fold cross-validation. For each fold, we collect results in terms of macro precision, recall, and f-measure, and we report the results averaged over the folds. Figure 4 shows the results for the classification of \({\mathsf C}\), \({\mathsf T}\), and \({\mathsf A}\) using SLC-A, SLC-B, CC, LP, CLUS-HMC, and CSSAG.

From the results, we observe that CC and LP perform better than the other approaches across all the datasets. CC and LP may perform better than the hierarchical approaches CLUS-HMC and CSSAG for the following reasons: one, the depth of the hierarchy we consider is short (i.e., 3); two, for the classification we consider the abstracts of the change requests rather than their descriptions, since in most cases the descriptions were missing.

4.2 Evaluation of Parameter Extraction Approaches

To evaluate the extraction of parameters, we employ CRF and LSTM. For the evaluation, we create separate datasets for each action. Then, we annotate the words of each change request. For example, we annotate the change request [add, 2GB, RAM, to, server, abc01] as [action, amount, , , , server name]. For the empty slots, we extract the postag of each word and replace the empty slots with the postags. Thus, for this change request the final set of labels is [action, amount, NOUN, ADP, NOUN, server name].

Since CRF needs features to train a model, we extract the following features from a change request: whether the word (1) is numeric, (2) is alphanumeric, (3) is in lower case, (4) is in upper case, (5) has an upper-case first letter, (6) is a verb, and (7) is a digit, (8) the postag of the word, and features related to (9) the previous and (10) the next word in the change request. Compared to CRF, LSTM is agnostic of input features, as it learns them directly from the data. Figure 5 shows the results for a few actions for CRF and LSTM evaluated across the four datasets. For brevity, we omit results for the other actions. The results show that CRF performs significantly better than LSTM. This is not surprising considering that LSTM needs much more data to train than CRF.
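
The per-token feature function below covers the ten features enumerated above; the POS tags are assumed to come from NLTK's universal tagset, which is one possible choice rather than necessarily the one used here.

```python
# Per-token CRF feature function covering the ten enumerated features.
# Requires: nltk.download("averaged_perceptron_tagger") and
#           nltk.download("universal_tagset") once.
import nltk

def crf_features(tokens, i):
    w = tokens[i]
    pos = [tag for _, tag in nltk.pos_tag(tokens, tagset="universal")]
    return {
        "is_numeric": w.isdigit(),                                           # (1), (7)
        "is_alnum": w.isalnum() and not w.isalpha() and not w.isdigit(),     # (2)
        "is_lower": w.islower(),                                             # (3)
        "is_upper": w.isupper(),                                             # (4)
        "init_cap": w[:1].isupper(),                                         # (5)
        "is_verb": pos[i] == "VERB",                                         # (6)
        "postag": pos[i],                                                    # (8)
        "prev_word": tokens[i - 1].lower() if i > 0 else "<bos>",            # (9)
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "<eos>",  # (10)
    }

print(crf_features(["add", "2GB", "RAM", "to", "server", "abc01"], 1))
```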

Fig. 5. The results for parameter extraction in terms of macro precision, recall, and f-measure.

Fig. 6. The results for classification in terms of macro precision, recall, and f-measure.

4.3 Evaluation of Feedback Approach

In this evaluation, we assess whether the parameters extracted from a change request can be used to improve its classification. Before evaluating, we create a balanced dataset for each action by oversampling the underrepresented actions. For the evaluation, we consider CC as the baseline approach to identify categories and CRF to extract parameters.

In the case of CC, the predicted labels for a database change request can be, e.g., [database], [database, management, increase tablespace], or [database, run operations, management, start database, increase tablespace]. Based on the labels, we first determine the actions to consider. For example, if the label is [database], we consider all the actions under the database category. Then, for each action, we extract parameters from the change request using CRF and compute \(param\_confusion\). Based on the \(param\_confusion\) values for each action, we determine the final label of the change request by choosing the action with the maximum value. We compare the label obtained from the feedback approach with the labels obtained from CC. Figure 6 shows the results, which indicate that the feedback approach achieves higher accuracy than CC alone.
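
A sketch of this feedback step: for each candidate action implied by the CC prediction, score the extracted parameters with param_confusion and keep the best-scoring action. The extractor stub and weight table are stand-ins for the CRF model and Table 3.

```python
# Feedback-step sketch: choose the action whose expected parameters best match
# what was extracted from the change request (stand-in extractor and weights).
from typing import Dict, Iterable, List

ACTION_PARAMS: Dict[str, Dict[str, int]] = {
    "increase cpu": {"action": 1, "amount": 2, "server": 1},
    "backup db":    {"action": 1, "mode": 2, "database": 1},
}

def param_confusion(action: str, extracted: Iterable[str]) -> float:
    weights = ACTION_PARAMS[action]
    found = set(extracted)
    return sum(w for p, w in weights.items() if p in found) / sum(weights.values())

def extract_params(request: str, action: str) -> List[str]:
    """Stand-in for the per-action CRF extractor of Sect. 3.2."""
    return ["action", "amount", "server"]  # pretend extraction succeeded

def feedback_label(request: str, candidate_actions: List[str]) -> str:
    scores = {a: param_confusion(a, extract_params(request, a)) for a in candidate_actions}
    return max(scores, key=scores.get)

# If CC only predicted a coarse label, all actions it implies become candidates;
# here two candidates are compared and the cpu action wins.
print(feedback_label("Add 2vCPU to server X", ["increase cpu", "backup db"]))
```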

5 Conclusion and Discussion

We present Cataloger, which classifies IT change requests into categories, tasks, and actions. For the classification, we employ six approaches: SLC-A, SLC-B, CC, LP, CLUS-HMC, and CSSAG. From the evaluation, we find that CC and LP perform better than the other approaches. To extract parameters, we employ sequential classification techniques, namely CRF and LSTM, and observe that CRF performs better than LSTM. For the feedback approach, we consider CC and CRF; the feedback approach, based on the extracted parameters, improves over the CC approach.

Our approach has several limitations. One, the dataset we use is not balanced across all the actions; we therefore plan to use clustering based approaches [21] to minimize the labeling effort and obtain more labels. Two, the datasets we create for specific actions to identify parameters are not large, which is why LSTM performed worse than CRF. We can increase the number of samples for each action to remove the dependency on the hand-crafted features required by CRF. Three, in the feedback approach, we propose a heuristic based on \(param\_confusion\) to make decisions. In the future, we plan to improve this heuristic to further improve the accuracy results.