1 Introduction

Customer care service centers for IT service providers have human agents who resolve a large number of infrastructure and application issues [7] on a daily basis. The issues are very diverse in terms of complexity and impact to the customer. Not surprisingly, the percentage of automated remediation tools for the IT domain has remained low (10–15 %). Generating automation cookbooks for these IT issues is of primary interest to service management. IT manuals, typically written for troubleshooting a stand-alone product of interest, often fall short of providing adequate guidelines for automation in the wake of complex inter-dependencies among components. Blogs and open forums like Stack Overflow, where human agents pose questions and provide answers, act as a better data source with rich content. However, in these uncontrolled environments, a lot of data cleaning and vetting needs to be done before they can turn into reliable data sources for automation. The good news is that the actual resolution work is often recorded along with the problem ticket logged for the IT issue; an example of one such recorded resolution is “Took Remote; Configured Lotus Notes; She was able to load mails; Issue Resolved”. If curated, this resolution history in IT environments can turn into a very valuable data source for a provider looking for automated remediation tools. The resolution actions taken to solve customer issues under service level agreements also have higher credibility than open source forums.

Currently, most service providers mandate that some remediation steps be manually entered by human agents, and this resolution data resides within the service providers’ ticket management systems. However, the written text suffers from quality problems. As is characteristic of any human-generated data source, every agent tends to write their low-level resolution notes differently from the others. Some humans seem to be naturally less inclined towards adequately recording their actions than others; e.g. their resolution summaries may contain only text like “problem closed” or “action taken”. Often there are distinct styles of writing, manifested through language, tone, usage of grammar and certain words. It is extremely challenging to extract relevant content out of many different styles. Manual analysis also confirmed that grammatically well written text does not always indicate good content. Given this context, and the over-arching need to move a step closer towards automation of IT issues, the community is faced with two immediate research challenges: (i) design efficient techniques to tap into the resolution notes data source, and mine and extract relevant resolution knowledge from it, and (ii) encourage/recommend agents to improve their summaries, enabling curation of high quality resolution text and enhancing its usefulness for mining and analytics. While there is active research on mining information from resolution text, there has been very little in-depth work in the context of IT services that actively persuades human agents to write better quality resolutions by providing them run-time recommendations.

In this paper we propose a system that addresses the curation problem with the help of specially designed automated data analysis techniques. The system focuses on helping human agents write better quality resolution text. As the agent enters the resolution text, the goal of the system is to (i) analyze the resolution text and assign it a score from the quality perspective and (ii) provide real-time feedback that identifies concrete areas of improvement in the resolution summary being created by the agent, with appropriate recommendations. This timely feedback helps agents write better quality resolution text, which in turn helps curate better content for automation. The main contributions of the paper are: (i) a comprehensive set of linguistic and non-linguistic features that capture the goodness of resolution text, (ii) a method for building a domain catalog (vocabulary) automatically, (iii) a prediction model for computing text quality scores on the fly, along with its in-depth analysis, and (iv) a recommendation model that gives pointed recommendations to agents on how to improve the text. We have developed a multinomial logistic regression based prediction model that uses the specially designed feature set and achieves an accuracy of 88.2 %, significantly higher than the accuracy of a naive text based classification model (68.5 %).

The rest of the paper is organized as follows: Sect. 2 motivates the need for a specialized scoring model. Section 3 describes the system that has been designed to help agents in data curation. Section 4 describes the features considered for good quality text and their extraction, and Sects. 5 and 6 present the prediction and recommendation models, respectively. The system is evaluated in Sect. 7.

2 Motivation for Specialized Text Quality Scoring Model

As a first step to address the problem of assessing resolution text quality, we started by trying to map the text quality to standard readability metrics like Flesch-Kincaid, Gunning Fog and others [2]. In order to carry out this experiment, the first task was to create reliable annotated data. We gathered 1000 random incident resolution texts from a repository of 60000 incident resolutions for a client account. These resolution texts were then independently annotated by two domain experts. For each resolution, each annotator gave a score from 1 to 5 (with 1 denoting poor quality and 5 denoting high quality). While assessing the quality score manually, aspects like re-usability, well-formedness, self sufficiency (or completeness), presence of only relevant information, clarity of exposition and appropriate cross referencing were considered for goodness. We obtained an inter-annotator agreement of 0.57 using Cohen’s Kappa [6], which indicates good agreement. The distribution of the scores is: 12.6 % of resolution texts had score 1, 13.1 % had score 2, 27.3 % had score 3, 21.6 % had score 4 and 25.4 % had score 5.

The correlation scores between these standard readability measures and resolution quality scores are shown in Table 1, where the abbreviations are: FK = Flesch-Kincaid, FR = Flesch Reading Ease and GF = Gunning Fog. As can be seen from the table, all measures show either no correlation or little correlation. This can be attributed to the fact that these readability metrics were created for typical English text, and a well written English sentence need not form a valid resolution text at all.
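As an illustration, this correlation check could be run along the following lines, assuming the 1000 annotated (text, score) pairs are available as parallel Python lists; the textstat and scipy packages are illustrative choices and are not named in the paper.

```python
# Sketch of the readability-vs-quality correlation study (Sect. 2).
# `resolutions` is a list of resolution strings, `scores` the matching
# expert quality scores (1-5). textstat/scipy are assumed tools.
import textstat
from scipy.stats import spearmanr

def readability_correlations(resolutions, scores):
    metrics = {
        "FK": textstat.flesch_kincaid_grade,
        "FR": textstat.flesch_reading_ease,
        "GF": textstat.gunning_fog,
    }
    results = {}
    for name, metric in metrics.items():
        values = [metric(text) for text in resolutions]
        rho, p_value = spearmanr(values, scores)
        results[name] = (rho, p_value)
    return results
```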

Table 1. Correlation: existing readability measures and resolution text quality
Table 2. Correlation: attributes and resolution quality

Having ruled out the use of off-the-shelf readability measures, we studied whether the resolution quality score has any correlation with other ticket attributes like severity, incident type, etc. The ticket attributes for studying correlation were selected based on (i) attribute type, (ii) significance and (iii) presence. Attributes of type timestamp were not considered. The unstructured text type attributes were represented as counts of words and sentences in order to study their correlation with quality. The significance of attributes was determined by their role in aiding the resolution process, e.g. resolver agents. Presence implies that the attribute is non-null in at least 90 % of data points. Table 2 shows the results obtained for the correlation study on the four main attributes of interest. There appears to be no correlation of the score with severity and type. However, there is a positive correlation of quality with the number of words in the resolution text, as suggested by the Spearman score. The number of sentences in the text did not show correlation. Finally, we studied whether scores have any correlation with the resolvers who write the resolution text. Agents (resolvers) with at least 15 ticket resolutions were picked and the variations in their quality scores were plotted as shown in Fig. 1. Though the amount of variation rules out agent driven prediction, we observed that agents who write good quality text continue to do so a majority of the time, while mediocre quality writers oscillate between good and bad. This indicates that there is good scope for helping agents write good quality text so that their writing quality can stabilize eventually.
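A sketch of this attribute study is shown below, under the assumption that each ticket is a dict with hypothetical keys 'resolution', 'score' and 'resolver'; it computes the Spearman correlations for the text-derived attributes and the per-agent score groupings behind Fig. 1.

```python
# Sketch of the attribute-correlation study and the per-agent grouping
# used for Fig. 1 (Sect. 2). Field names are illustrative assumptions.
from collections import defaultdict
from scipy.stats import spearmanr

def attribute_correlations(tickets):
    scores = [t["score"] for t in tickets]
    word_counts = [len(t["resolution"].split()) for t in tickets]
    sent_counts = [t["resolution"].count(".") + 1 for t in tickets]  # crude proxy
    return {
        "words_vs_score": spearmanr(word_counts, scores),
        "sentences_vs_score": spearmanr(sent_counts, scores),
    }

def scores_by_resolver(tickets, min_tickets=15):
    # Keep only agents with at least 15 resolutions, as in the paper.
    by_agent = defaultdict(list)
    for t in tickets:
        by_agent[t["resolver"]].append(t["score"])
    return {agent: s for agent, s in by_agent.items() if len(s) >= min_tickets}
```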

Fig. 1. Score variations by assignees

With no obvious correlated predictors for resolution quality, a machine learning approach on the unstructured resolution text was tried. The annotated resolution text was used to train an SVM based text classification model. The words in the resolution text became the features, and each resolution text was represented as a vector using the tf*idf score of each word-term. The ten-fold cross validation accuracy was found to be 68.5 %, which leaves a lot of room for improvement. This provided us the motivation to develop a specialized quality assessment system for resolution text that uses specialized features extracted from the text to learn a logistic regression based prediction model and provide recommendations.
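For reference, this baseline can be reproduced along the following lines with scikit-learn; the paper does not name its SVM implementation, so the choice of a linear SVM here is an assumption.

```python
# Baseline of Sect. 2: tf*idf representation of the raw resolution text
# fed to an SVM, evaluated with ten-fold cross validation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

def baseline_accuracy(resolutions, scores):
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    # Mean accuracy over 10 folds of the 1000 annotated resolutions.
    return cross_val_score(model, resolutions, scores, cv=10).mean()
```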

3 Automated Quality Assessment System

The automated scoring and recommendation system for data curation is shown in Fig. 2. The system takes as input a resolution text once an agent has completely entered it and passes it to a quality assessment model. The block called ‘Quality Assessment Model’ in the figure shows that the input text is processed to extract the features of the text (explained in Sect. 4) and then these features are used to predict the score using a multinomial logistic regression model, as explained in Sect. 5. The multinomial logistic regression based model for predicting scores uses a carefully chosen feature set such that the features meet the criteria of (i) being quantitative or categorical and (ii) being computable automatically from the input text. The model is trained on the feature set apriori and can be re-trained as required. The predicted score along with the feature information is fed into the recommendation process, which uses a rule based model to give pointed areas of improvement to the agent. The system provides real time feedback to the agent on how to improve the text. Upon receiving the recommendations, the agent can decide to improve the text by incorporating the feedback or, if the score is above a threshold, the agent has the option to store it in the ticket database as is. The value of the threshold is a business decision and we do not put any restrictions on it by design.
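The end-to-end flow can be summarized by the following sketch, where extract_features, score_model and recommend are placeholders for the components described in Sects. 4, 5 and 6; none of these names appear in the paper.

```python
# High-level flow of the system in Fig. 2, written as a minimal sketch.
def assess_resolution(text, extract_features, score_model, recommend,
                      store_threshold=4):
    features = extract_features(text)             # feature vector (Sect. 4)
    score = score_model.predict([features])[0]    # logistic regression (Sect. 5)
    recommendations = recommend(score, features)  # rule based model (Sect. 6)
    storable = score >= store_threshold           # threshold is a business decision
    return score, recommendations, storable
```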

Next, we explain the first step in the design of the quality assessment system, which is to identify and extract the features indicative of good resolution text. These features are then used to learn the prediction and recommendation models.

Fig. 2. Automated quality assessment system

4 Identified Features and Their Extraction

Feature identification is key to the quality assessment model. It is important to ensure that the features in the model capture the aspects that went into the manual evaluation of resolution text for quality scores. In order to achieve this, feature identification was done using a two step process. In the first step, we went through thousands of random samples of resolution texts and came up with a set of features that could potentially play a role in the quality scoring model. This step was guided by two considerations: (i) it should be possible to algorithmically extract the feature without any manual intervention, and (ii) the features should be an eclectic mix of linguistic and non-linguistic features. This is because not all linguistic features are relevant or amenable in the domain; e.g. resolution text is usually terse, making it non-amenable to linguistic cohesion or coherence, and many good-to-have domain based features are not linguistic in the conventional sense. The linguistic feature categories like syntactic structure, vocabulary, coherence and discourse relations as prescribed in [15] provided a starting point to identify the feature set suitable for linguistic features. For non-linguistic features, we used the features described in [5, 8, 20] as a starting point. In the second step, the domain experts who had annotated 1000 tickets for quality score were shown the list of features that we had obtained in the first step. They were asked to pick the features that played a significant role in at least 20 annotations. Based on their combined input, we finalized the feature set to be used for the regression based model. The features that were marked significant by both experts were picked. The categorized list of finalized features is presented in Table 3. All features are categorical or numeric. Some of the features require learning over historical data, in which case the learned model is stored in memory where it can be accessed efficiently at run time. The feature extraction is described next for each category.

Table 3. Features grouped by categories

4.1 Language

This category contains features that capture desired use of language. Most of these features are domain adaptations of standard linguistic features.

Verbosity - The resolution summary should not be too verbose. At the same time, being too terse does not capture enough information. Verbosity is captured using (i) Number of words - extracted by simply splitting the input text on spaces, and (ii) Number of lines - extracted by splitting the input text on the newline character.

Percentage of misspelled words - We use an English dictionary and determine the number of words which are not present in the dictionary.

Percentage of abbreviated words - Any abbreviations and acronyms that are colloquial and not standard are best avoided. For example, usage of HK instead of Housekeeping is bad for readability. This is found using heuristics such as: a word in all capital letters is an acronym, and a word with alphanumeric characters is likely to be a server name.

Has well formed sentences - Good quality resolution text typically uses sentences to describe the resolution. We have often seen poor resolution text which contains single words such as “resolved”, “reran” and so on. Good resolution text should adhere to the desired linguistic syntactic structure. We use the confidence score output of natural language processing parsers to determine if the resolution text has well formed sentences.
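A minimal sketch of how the verbosity, misspelling and abbreviation features might be computed is given below, assuming an English word list is available as a set of lower-cased words; the well-formedness check is omitted because the paper does not name the parser whose confidence score it uses.

```python
# Sketch of the language features in Sect. 4.1, using the heuristics
# described in the text; the dictionary is an assumed input.
import re

def language_features(text, dictionary):
    words = text.split()
    lines = [line for line in text.splitlines() if line.strip()]
    misspelled = [w for w in words
                  if w.isalpha() and w.lower() not in dictionary]
    # Heuristics from the paper: all-caps tokens are treated as acronyms,
    # alphanumeric tokens (e.g. server names) as abbreviations.
    abbreviated = [w for w in words
                   if (w.isupper() and len(w) > 1)
                   or re.fullmatch(r"(?=.*\d)(?=.*[A-Za-z])\w+", w)]
    return {
        "num_words": len(words),
        "num_lines": len(lines),
        "pct_misspelled": 100.0 * len(misspelled) / max(len(words), 1),
        "pct_abbreviated": 100.0 * len(abbreviated) / max(len(words), 1),
    }
```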

4.2 Presentation

This category contains features that capture the quality of presentation. Formatting the text appropriately is important to improve readability. As an example, an instruction list is much easier to follow if written in a bulleted style compared to a paragraph style of presentation.

Has Cause-Action-Prevention Information - We have observed that good quality resolution text often contains a contingency discourse relation, explicitly stating the “cause” of the problem, the “action” taken to resolve the problem and the way to “prevent” the problem from occurring in future. We analyze the text for the presence of these words or their abbreviations (such as C:, A:, P:) to determine if the text has Cause-Action-Prevention Information.

Has Bulleted-List - Good quality resolution text is often written in a nice bulleted list format using ASCII delimiters such as “*” and “-”. We search the text for a minimum of 3 such occurrences of these ASCII delimiters to determine if the resolution text has Bulleted-List.

Uses EMail/Commands To Aid Description - The copy-pasting of email conversations or low level shell commands often enhances the descriptiveness of the resolution text. This maps to a domain specific expansion discourse relation. We analyze the text for common email headers such as “From:”, “To:”, “Subject:” to determine if the resolution text uses emails to aid description.
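The presentation features lend themselves to simple pattern checks; the following sketch mirrors the heuristics described above, with the exact regular expressions being our illustrative choices rather than the paper's.

```python
# Sketch of the presentation features in Sect. 4.2.
import re

def presentation_features(text):
    has_cap = bool(re.search(r"\b(cause|action|prevention)\b|(^|\n)\s*[CAP]:",
                             text, re.IGNORECASE))
    bullets = re.findall(r"(?m)^\s*[\*\-]", text)
    has_bulleted_list = len(bullets) >= 3            # minimum of 3 delimiters
    uses_email = bool(re.search(r"(?m)^(From|To|Subject):", text))
    return {
        "has_cause_action_prevention": has_cap,
        "has_bulleted_list": has_bulleted_list,
        "uses_email_or_commands": uses_email,
    }
```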

4.3 Domain Relevance

Use of proper language and well formatted presentation does not guarantee that the text is indeed relevant to the domain. The presence of relevant domain information builds confidence that the text is indeed useful. In the following, we present the proposed domain relevance features that capture domain information. They are all numeric and use a domain vocabulary learned from historical data. The domain vocabulary (knowledge) is cataloged in the form of entities, operations and phrases by mining historical data. The domain features and their extraction are presented first. The automatic construction of the domain vocabulary, referred to as the catalog, is explained thereafter.

Entity/Operation Density - Entities in the IT domain are typically hardware objects or software on which some operation is performed, e.g. server, filesystem, database. Some examples of operations from the domain are copy, move, restart. A good resolution text should contain such entities and operations from the IT domain. The operations and entities are extracted from an input text by finding the verbs and nouns respectively in dependency parse tree relations of type dobj, nsubj, nsubjpass or nn generated by NLP parsers [13] for the text. For example, for the input text sent mail to the user, one of the dependency parse relations output is sent-VBD mail-NN dobj; being of type dobj, the verb sent is extracted as an operation and the noun mail as an entity. The extracted operations and entities in the dependency parse relations are then searched for in the domain catalog. The total number of matches is the value of the feature.

Action Phrases Density - Action phrases are combinations of an operation and an entity, e.g. server rebooted, backup completed. Action phrases provide better confidence compared to just entities or operations. The phrases that have a dependency parse tree relation of type ‘dobj’, ‘nsubj’ or ‘nsubjpass’ as generated by NLP parsers for an input text are marked as action phrases. For example, for the input text sent mail to the user, the phrases generated that match the desired types are: sent mail, sent user. These phrases are then searched for in the domain action phrase catalog. The feature is present if there is a match, and the total number of matches gives the phrase density value.
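A possible implementation of both density features is sketched below using spaCy as the dependency parser (the paper only refers to NLP parsers [13]); the three catalogs are assumed to be sets of lower-cased strings, and lemmatization is our design choice rather than the paper's.

```python
# Sketch of the domain relevance features in Sect. 4.3.
import spacy

nlp = spacy.load("en_core_web_sm")

def domain_features(text, entity_catalog, operation_catalog, phrase_catalog):
    doc = nlp(text)
    entity_op_matches, phrase_matches = 0, 0
    for tok in doc:
        if tok.dep_ in {"dobj", "nsubj", "nsubjpass"} and tok.head.pos_ == "VERB":
            op, ent = tok.head.lemma_.lower(), tok.lemma_.lower()
            # e.g. "sent mail" -> operation "send", entity "mail" after lemmatization
            if op in operation_catalog:
                entity_op_matches += 1
            if ent in entity_catalog:
                entity_op_matches += 1
            if f"{op} {ent}" in phrase_catalog:
                phrase_matches += 1
    return {"entity_operation_density": entity_op_matches,
            "action_phrase_density": phrase_matches}
```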

Table 4. Learning the domain - entities, operations
Table 5. Sample output of learnt relevant POS patterns
Table 6. Domain knowledge catalog sample (with frequencies for entities and operations)

Learning the Domain Catalog. The domain vocabulary is built mainly at two levels: (a) operation and entity keywords, and (b) action phrases that denote the actions taken historically.

(a) Entity and Operation Catalog: First, identification of text fragments from historical resolution text that are likely to represent actions is done using two methods: (i) using the POS patterns obtained with method 1 shown in Table 4, with illustrative samples shown in Table 5, and (ii) using phrases from the text that are of type ‘dobj’, ‘nsubj’, ‘nsubjpass’, or ‘nn’ as given by the dependency parse trees generated by NLP parsers.

Then, the domain vocabulary, in terms of operations and entities, is deduced from the extracted fragments. Verbs are marked as operations and nouns are marked as entities; for example, server reboot has server as entity and reboot as operation. These form the domain catalog of entities and operations. Table 6 shows examples of catalog entries for entities and operations obtained from the sample of 1000 tickets.

(b) Action Phrase Catalog: Phrases of type ‘dobj’, ‘nsubj’ or ‘nsubjpass’ as generated by dependency parse tree relations are added to the catalog of action phrases. We also use the entity and operation catalog to create succinct action phrases. This is done by associating operations with suitable entities based on proximity in the resolution text. The proximity rule relates entities and operations if they co-occur within an n-gram window (n ranging from 2 to 6) in the raw resolution text. In case of multiple operations/entities in an n-gram window, the first operation in the window is usually the most relevant and the last entity is usually the most meaningful. The phrases thus obtained are much more refined and less noisy. Table 6 shows a sample of action phrase catalog entries obtained using this method.

The domain catalog is stored in a data store to enable scoring and recommendation of text.
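The proximity rule for building succinct action phrases can be sketched as follows, with the catalogs kept as plain Python sets for illustration rather than in a data store.

```python
# Sketch of the action phrase construction in Sect. 4.3(b): pair an
# operation with an entity when they co-occur within an n-gram window
# (n between 2 and 6) of the raw resolution text.
def build_action_phrases(resolution_texts, operations, entities, window=4):
    phrases = set()
    for text in resolution_texts:
        tokens = [t.lower() for t in text.split()]
        for i in range(len(tokens)):
            chunk = tokens[i:i + window]
            ops = [t for t in chunk if t in operations]
            ents = [t for t in chunk if t in entities]
            if ops and ents:
                # Keep the first operation and the last entity in the window,
                # per the heuristic described in the text.
                phrases.add(f"{ops[0]} {ents[-1]}")
    return phrases
```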

Table 7. Model fitting information
Table 8. Statistically Significant Features

5 Quality Score Prediction Model

The prediction model for quality scoring is learnt using multinomial logistic regression. The extracted features described in Sect. 4 are the predictor (independent) variables and the quality score is the dependent variable. We chose multinomial logistic regression because the independent variables are nominal or numeric and the dependent variable (score) is a multi-level nominal variable with mutually exclusive and exhaustive categories. Multinomial logistic regression does not make any assumptions of normality, linearity, or homogeneity of variance for the independent variables. Since the dependent variable is a score, intuitively an ordinal regression model should be used, but the assumption in ordinal regression that the effect of the predictors on the odds of an event occurring in every subsequent category is the same for every category does not hold here, so it was ruled out. The annotated data as prepared in Sect. 7 was used to learn the multinomial logistic regression model. All the features identified in Sect. 4 were computed for the training dataset. The model was built incrementally, starting with three independent variables, namely, number of words, domain keywords and bulleted text. Then we incrementally included all the features as independent variables. As can be seen from Table 7, the “Final” row presents information on whether the variables we added statistically significantly improve the model compared to the intercept alone (i.e., with no variables added). The “Sig.” column shows that p = .001, which means that the full model statistically significantly predicts the dependent variable better than the intercept-only model alone. The McFadden pseudo R-square value was 0.742 and the Cox and Snell pseudo R-square value was 0.842. As can be seen from Table 8, the features that helped are fairly generic in nature and the model is not subject to overfitting.
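A minimal sketch of the training step with scikit-learn is shown below; the paper does not name the implementation it used, so this is an assumed setup.

```python
# Sketch of the score prediction model of Sect. 5: multinomial logistic
# regression over the Sect. 4 feature vectors.
from sklearn.linear_model import LogisticRegression

def train_score_model(X, y):
    # With the lbfgs solver and more than two classes, scikit-learn fits a
    # multinomial (softmax) logistic regression, matching the model in Sect. 5.
    model = LogisticRegression(solver="lbfgs", max_iter=1000)
    model.fit(X, y)   # X: feature vectors, y: expert scores 1-5
    return model
```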

Table 9. Recommendation rules

6 Recommendation Model for Improving Quality

The recommendations for improving resolution text quality are an important component of the proposed system for data curation. The recommendations depend on the predicted quality scores. The quality scores have the following interpretation: 1 - Poor, 2 - Below Average, 3 - Average, 4 - Good, 5 - Very Good. For each score, there is a set of recommendations that are provided to improve the text quality. Table 9 provides the rules for coming up with recommendations. For score levels 1 and 2, the recommendations are at both an overall level and a detailed level that capture specific areas of improvement, while from level 3 onwards the recommendations are more specific to the areas which need improvement. The specific areas of improvement are determined by the feature values as computed for predicting the quality score. If a feature does not have the desired value, then that is flagged as an area of improvement, as can be seen in the table. The agent who has entered the text gets all the recommendations corresponding to if-condition statements that are true. In addition to the features that were identified in Sect. 4, there is one more feature, namely, Affirmative Evidence, that is used in the recommendation rules. This feature was added after getting feedback from agents who were asked to evaluate the recommendations (see Footnote 1). This non-linguistic feature answers the question of whether the text really indicates that the problem was resolved. There are some texts that describe only the problem and give no conclusive evidence of problem resolution. We have tackled this problem by casting it as a sentiment analysis problem. The idea is that if a text contains only symptom descriptions, then the sentiment of such text is usually negative because of the presence of words like ‘fail’, ‘not working’, ‘error’ etc. When a text has a significant focus on actions and possibly outcomes, then the sentiment shifts to being neutral or positive. For sentiment analysis, we have used the AlchemyAPI [1] service. The output is in the form of ‘positive’, ‘negative’ or ‘neutral’.
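The rule based model can be pictured as a set of if-condition checks over the predicted score, the feature values and the sentiment label; the rules below are illustrative stand-ins for Table 9, not the exact rule set, and the sentiment argument abstracts the external sentiment service.

```python
# Sketch of the rule based recommendation model of Sect. 6.
# `features` uses the keys from the feature sketches in Sect. 4;
# `sentiment` is 'positive', 'negative' or 'neutral'.
def recommend(score, features, sentiment):
    recs = []
    if score <= 2:
        recs.append("Overall: rewrite the resolution with more detail.")
    if features["num_words"] < 5:
        recs.append("Text is too terse; describe the steps taken.")
    if not features["has_cause_action_prevention"]:
        recs.append("State the Cause, Action taken and Prevention (C/A/P).")
    if features["entity_operation_density"] == 0:
        recs.append("Mention the IT entities and operations involved.")
    if sentiment == "negative":
        recs.append("Add affirmative evidence that the issue was resolved.")
    return recs
```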

We have developed a prototype service for scoring and recommendation of resolution text. A screenshot of the user interface is shown in Fig. 3. An agent enters the resolution text and submits it for evaluation. The service returns the score predicted by the learned model and provides recommendations using the rule based model described above. The usefulness of the recommendations is evaluated next.

Fig. 3. Service demonstration

7 Evaluation

We now evaluate how good and accurate the proposed quality assessment model for resolution text is. A total of 2000 resolution summaries were manually annotated by domain experts and a common agreement was reached on the scores. These 2000 data points were chosen from six months of ticket data by selecting summaries from all ticket categories predefined in the dataset. No more than three repetitions of the same summary were allowed. The repetitions, wherever they occurred, were kept to check consistency in manual scoring. Out of this annotated set of 2000 resolution summaries, 1000 were used as training data and 1000 as test data.

The accuracy of the model is presented in Table 10. The model achieves an accuracy of 88.2 %, which is a substantial improvement upon the baseline accuracy of 68.5 % obtained by SVM based classification on the unstructured text, as mentioned in Sect. 2. The accuracy for the score value of 2 is low at 65.6 %, which can be attributed to the comparatively smaller set of training samples.

Table 10. Classification accuracy
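For completeness, the overall and per-class figures of Table 10 can be computed from the test-set predictions as sketched below; scikit-learn is an illustrative choice, and the per-class figure is taken to be per-class recall.

```python
# Sketch of the accuracy computation behind Table 10 (Sect. 7).
from sklearn.metrics import confusion_matrix, accuracy_score

def per_class_accuracy(y_true, y_pred, labels=(1, 2, 3, 4, 5)):
    cm = confusion_matrix(y_true, y_pred, labels=list(labels))
    per_class = cm.diagonal() / cm.sum(axis=1)   # recall per score level
    return accuracy_score(y_true, y_pred), dict(zip(labels, per_class))
```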

The individual effects of the features extracted in Sect. 4 were studied and not all were found to contribute in a statistically significant manner to the prediction of the dependent variable. Table 8 provides the list of features that were significant with a p-value threshold of 0.05. Interestingly, the model showed that quality scores for resolution texts are not so much a function of grammatical well-formedness as of other features, which is the reason that state of the art readability measures did not work well. Though the affirmative evidence feature did not play an important role in prediction with \(p=0.334\), it was found useful in recommendations.

We also compared the performance of our model against an existing method for computing quality scores in the IT domain [12]. In [12], a score is arrived at using a weighted linear combination of the proportions of actual technical content relative to the desired technical content and to the total content. The paper also suggests values for the coefficients used in the formula. We implemented this and used the technical content as determined by our domain dictionary for phrases. The objective of this experiment was to see if the formula could be re-used in our system setting. The accuracy obtained was 51 % (rounded off). Last but not least, we compared with an SVM model trained on the extracted features proposed in Sect. 4; the accuracy obtained was 78.3 %.

7.1 How Good Are the Recommendations?

The recommendation model is based on the predicted scores and specific feature values as seen in Table 9. The goodness of recommendations is therefore judged on two parameters: (i) the accuracy of the score prediction model, which is 88.2 % as evaluated earlier, and (ii) the accuracy of the automatically computed feature values, which is discussed next. We found that the features in the Language, Presentation and Domain relevance categories had fairly good accuracy of 89.0 % and above. The accuracy of the Affirmative evidence feature was found to be 81.2 %. On deeper analysis, we found that the positive sentiment accuracy was 91.6 % and the negative sentiment accuracy was 43.5 %. Though texts having negative sentiment formed a small percentage (21.6 %), this is an area for improvement. Evaluating goodness based on accuracy is only one dimension. We also performed controlled experiments to determine the usefulness of the recommendations for the agents. Two agents were asked to provide 30 random samples of resolutions written by them. We used the recommendation service shown in Fig. 3 to generate recommendations and asked the agents to rate their usefulness. In 67 % of the cases, the recommendations were rated very useful and in the remaining cases, they were rated as partially useful. We conclude that the recommendations serve the purpose of helping agents improve text quality.

8 Related Work

Readability scores for evaluating the complexity of English text [2, 14] have received a lot of attention in computational linguistics. Feature based readability assessment [8, 15] has found vocabulary and discourse based features very useful. We used this insight to design IT domain specific features. There has also been a fair body of work on recommending and evaluating the quality of technical content for online forums, posts and blogs [5, 18]. These works emphasize author specific interactions and link navigations that do not have a meaningful mapping to resolution text. For domain based document readability measures, [20] proposes document scope and cohesion using a manually predefined concept ontology, which was ruled out because we wanted a technique that could discover domain knowledge automatically. Some of the notable works that are specifically focused on the IT domain are [12, 17]. The comparison with [12] has already been done in the evaluation section. The work in [17] describes a method to compute the quality score of ticket data as a whole. The quality score is a function of the number of populated structured fields that are marked important, the size of the problem description and the number of domain specific keywords present in the description. The domain dictionary is created manually. In order to test the effectiveness of this method, we applied this formula to the resolution summary text and mapped their grading to our scores. Different variations of the grade mapping were tried and the best accuracy obtained was 36 % (rounded off). The low accuracy can be attributed to the mapping being imperfect and the coefficients being unsuitable for the dataset. Nonetheless, this work provided us a good direction in terms of feature selection. Technical readability has been explored in [10] in the context of ranking; the method used is latent semantic indexing. We tried this technique but, due to the terse text, term cohesion could not be captured well in the latent space. The work in [11] uses NLP techniques to identify actions in tasks in the context of commitments for service engagements. Sentiment analysis of text for affirmative evidence is something that we have not come across yet in any quality score analysis. In somewhat related work, there have been efforts to define good quality code using NLP techniques [3]. In the space of resolution text mining, there have been efforts as seen in [4, 16, 19, 21]. These efforts rely heavily on good quality data and have been shown to be useful in the context of IT automation.

9 Conclusion

We presented a solution to automatically assess the quality of resolution text entered by humans and offer recommendations to improve it. Our solution had an accuracy of 88.2 % in assessing resolution quality (when compared with a gold standard created by human experts). We conclude that good quality resolution text encompasses aspects of text layout, discourse relations (contingency and expansion) and domain vocabulary, and that these aspects can be used to learn an accurate score prediction model. As future work, we plan to extend the system to other types of manually entered data like incident descriptions and change plans.