
1 Introduction

In IT services, the goal is to help customers resolve defects or usage issues for products or services they acquire. Agent-assist and self-assist applications are industry staples that help address the content complexity challenge [2]. These applications leverage AI technology to understand user requests and provide solutions or recommendations for next steps. A key part of understanding user requests involves identification of the relevant key terms in the domain.

The business of IT services is highly challenging, as customer satisfaction is directly influenced by the speed and quality of support resolution. Most challenges stem from the large volume of products, components, configuration contexts, and documentation that have to be at the fingertips of technical support agents. AI-based components are typically trained at scale on product documents, but a considerable performance boost is achieved by explicitly utilizing domain-specific terms, such as names and actions related to products and components. These terms become the key terms or features that drive request and response understanding in those components, as they provide discriminative topical information about the content [22]. Furthermore, key terms in a document are building blocks for constructing a knowledge graph, which provides context information and inference capabilities for search and chat-bots.

It is a challenge to identify all relevant domain-specific key terms across staggering volumes of installation guides, user manuals, and troubleshooting documents. For instance, there are over 100K relevant articles for the IBM Server Power Platform, and over 60K documents for the Storwize storage brand. While the manual approach of identifying key terms is the most accurate, it is not feasible for technical support content. Automation and accuracy of key term extraction are therefore critical solution requirements. In this paper, we address the problem of automatically extracting key terms from IT services content, and propose and evaluate the accuracy of several methods. Domain-specific key term extraction aims to identify lexical expressions that represent important concepts in a given domain. These key terms are building blocks of entity recognition, which in turn affects the quality of cognitive applications built on top of search, chat-bot/dialog systems, and knowledge graph creation.

Key term extraction in the literature has focused on detecting mentions of persons, places, and organizations [15, 20]. Large open knowledge bases like Yago, DBPedia, and WordNet include general concepts, with limited coverage of the technical domain. Figure 1(a) shows a troubleshooting article and its important domain-specific key terms (highlighted with green boxes). The article also contains generic English terms (marked with red boxes) that have minimal relevance in the context of the domain. If the article is to be returned by a search application, then the important key terms in green boxes should be extracted, not the generic terms. Table 1 shows a sample of key terms from the page in Fig. 1(a), their domain importance, and whether they can be extracted using standard noun phrase extraction, Yago, or DBPedia concepts. These repositories contain a large number of generic words and concepts, and the goal is to validate whether they can capture domain-specific key terms natively. Table 1 shows that these approaches miss important terms like drive fault type 3 while reporting non-useful terms like user response and following steps, indicating that they are noisy and not useful for building technical-support-specific domain applications.

Table 1. Key terms extracted from relevant methods

Prior art has addressed the problem of automated extraction of domain-specific key terms in different settings of domain knowledge. Methods are supervised, using a large set of manually created key term annotations [19]; weakly supervised, using only a few seed key terms [22]; or unsupervised, using no annotation at all [9, 11]. Weakly supervised and unsupervised methods use various statistical filters to reduce the large share of false positives. However, these methods do not address specific features of technical support content. Technical documents are loaded with acronyms, error codes and their explanations, executable commands and their outputs. Key terms also tend to be subjective. The dense information in the pages, the different vocabularies used, the subjectivity across documents, and the lower proportion of English dictionary words necessitate a dedicated research effort.

These concrete opportunities for improved extraction of technical key terms have motivated us to design and develop a system for domain-specific key term mining. Our intuition is that, starting from a generic collection of key terms, domain specificity implemented as a set of noise removal methods can help us arrive at a good set of relevant key terms. We propose a novel weakly-supervised technique that uses various linguistic and non-linguistic features to output a relative ranking of key terms for any technical document. The algorithm begins with a generic extraction of candidate terms and refines them with domain glossaries. Domain annotations based on linguistic features, such as action orientation or symptom description, along with the position and frequency of the terms within the documents, are used as indicators of the document theme. We show that the proposed approach improves key term mining by as much as 30%.

2 Motivating Example

A big motivator for mining accurate key terms is automated knowledge graph creation. A Knowledge Graph (KG) consists of entities and relations connecting these entities. KG-based approaches have been used extensively for question answering in the open domain. For example, with entities and relations in a KG, a system can do a better job of query disambiguation and understanding a user's intent, which leads to more accurate answers. KG-based organization of content has also been used for efficient information retrieval and conversation applications in technical support.

2.1 Knowledge Graph Creation

The construction of a Knowledge Graph relies heavily on the extraction and linking of correct entities and relations. Identifying domain-specific key terms is therefore a crucial first step toward this endeavor. For example, as explained in [4], before extracting relations to populate a knowledge base from a large corpus of text documents, the system needs to run Entity Detection and Linking to identify entity mentions, which are essentially the terms that people use to refer to those entities.

In Sect. 2.2, we show an example that illustrates the domain-specific terms important for IT support, which are crucial for forming the entities of a knowledge graph for this domain.

Fig. 1. Error code 1686 & its resolution for IBM Storwize 5500 product

2.2 Example on Term Extraction

A typical technical-support troubleshooting article describes issues that a user may encounter (called symptoms) along with the suggested fixes (called resolution procedures). Figure 1(a) illustrates a troubleshooting document for a particular error code (1686) related to the IBM Storwize product. The article mentions the error code 1686 and the accompanying error message Drive fault type 3 in its title. Following the title is a detailed description of the symptom (in this case, merely a repetition of the error message). The User Response section details the resolution procedure, that is, the steps a user can take to address the symptom.

There are several domain-specific key terms that we need to automatically identify in this text to generate nodes and relations in a knowledge graph. First, the error code 1686 is an important key term. Second, the entire error message (Drive fault type 3) is a key term that appears in text analysis as a noun term (i.e., a sequence of common nouns and cardinal numbers). For a different error message symptom, such as The enclosure identity cannot be read, the subject noun term enclosure identity is a relevant key term.

Other terms of interest are mentioned in the resolution procedure. These terms are typically objects of the actions described in the resolution. For example, in the instruction Reseat the drive, the object drive and the action reseat are important key terms in the technical support domain. Other objects like canister and enclosure are also domain-specific key terms. Similarly, the term sense data is an important term in the storage domain. The example article also contains terms, such as Explanation and User Response, which are not directly relevant for the technical support domain.

We submit that it is necessary to leverage available domain-specific knowledge and linguistic analysis to refine the relevant terms identified by domain-agnostic state-of-the-art language analysis services, such as Watson Natural Language Understanding (NLU). Figure 1(b) shows keywords extracted using the Watson NLU service with confidence scores above our chosen threshold of 0.5. While some of the identified terms, like Drive fault type or sense data, are extremely relevant for the technical support domain, other terms, such as following steps and User response, are not. Furthermore, some of the relevant terms are not identified at all (e.g., error codes) or receive low confidence scores.
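As a small illustration (not the system's actual code), the snippet below sketches how such thresholded keyword extraction could look with the ibm-watson Python SDK; the API key, service URL, version date, and the limit of 50 keywords are placeholders of our own.

# Sketch: keywords from Watson NLU, keeping only those at or above a confidence
# threshold (0.5, as in the example above). Credentials and version are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import Features, KeywordsOptions
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2021-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"),
)
nlu.set_service_url("YOUR_SERVICE_URL")

def nlu_keywords(text, threshold=0.5):
    """Return (keyword, relevance) pairs with relevance >= threshold."""
    result = nlu.analyze(
        text=text,
        features=Features(keywords=KeywordsOptions(limit=50)),
    ).get_result()
    return [(k["text"], k["relevance"])
            for k in result.get("keywords", [])
            if k["relevance"] >= threshold]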

3 Domain Knowledge Driven Term Extraction

There are two possible approaches to extracting domain-specific key terms from text: (1) domain knowledge in the form of glossaries is used as a seed for generating the candidates, followed by an item set expansion approach using word embeddings to create the final set of domain key terms [1]; and (2) candidates are created by generic methods (noun phrases and standard tools like NLU), and then domain glossaries along with word embeddings are used to filter out noise and create the final set of domain key terms. To evaluate the relative merit of the two approaches, we annotated relevant key terms for twenty documents related to a hardware product and obtained precision/recall/F1 scores for both. The noise removal approach had almost twice the precision, slightly lower recall, and an overall 60% better F1 score than the item set expansion approach. Hence, for the rest of the paper, we start with generic terms and refine them using domain knowledge in order to mine the most relevant key terms.

Our method, as illustrated in Fig. 2, comprises the following steps:

  1. Candidate key terms are extracted using generic techniques.

  2. Noise filtering techniques are used to discard key terms not relevant to the domain of technical support service. Our proposal includes the following noise filter criteria:

    • Document-level relevance metric;

    • IT product domain-specific glossaries and word embeddings;

    • IT-support-specific annotations, referring to concepts such as symptom, problem, and resolution.

Filters are applied sequentially, and key terms ranked beyond a threshold are discarded. In the remainder of this section, we discuss each of these steps.

Fig. 2. Major components for key term extraction

3.1 Select Candidate Generic Key Terms

The set of generic candidate key terms comprises noun phrases and adjective phrases [5, 8, 13] that are identified using part-of-speech analysis and pattern matching. The pattern is defined by the regular expression (adjective)*(noun)+. Chunking, lemmatization, and POS tagging are done using the Spacy toolkit. Contiguous candidate words are concatenated to form candidate phrases. In addition, document analysis with generic key term extraction tools, such as Watson NLU [21], is included to generate candidate key terms.
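A minimal sketch of this candidate extraction, assuming an English Spacy pipeline; the model name en_core_web_sm and the inclusion of proper nouns and cardinal numbers alongside common nouns are our assumptions, the latter so that phrases such as drive fault type 3 are retained, consistent with the noun terms of Sect. 2.2.

# Sketch: candidate key terms matching (adjective)*(noun)+ with Spacy.
# Model name and the NOUN/PROPN/NUM tag set are assumptions, not the exact configuration.
import spacy

nlp = spacy.load("en_core_web_sm")

def candidate_terms(text):
    """Return lemmatized candidate phrases matching (ADJ)*(NOUN|PROPN|NUM)+."""
    doc = nlp(text)
    candidates, adjs, nouns = [], [], []
    for token in doc:
        if token.pos_ in ("NOUN", "PROPN", "NUM"):
            nouns.append(token.lemma_.lower())           # extend the noun run
        elif token.pos_ == "ADJ":
            if nouns:                                    # a complete (ADJ)*(NOUN)+ phrase ends here
                candidates.append(" ".join(adjs + nouns))
                adjs, nouns = [], []
            adjs.append(token.lemma_.lower())            # leading adjective of a new phrase
        else:
            if nouns:
                candidates.append(" ".join(adjs + nouns))
            adjs, nouns = [], []
    if nouns:
        candidates.append(" ".join(adjs + nouns))
    return candidates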

3.2 Filter by Document Relevance

Relevant key terms in a document demonstrate a strong correlation with where and how many times they appear in the document. Building on the work proposed in [3], we use an unsupervised method for key term extraction and relevance scoring that incorporates each word's sentence position and its frequency in the document.

Given a document, sentence tokenization and a part-of-speech (POS) filter (using Spacy) are applied to generate key term features. The relevance score for ranking the terms is based on the insight that a key term is highly likely to be important if it occurs frequently and early in the current document; the position of a term is given by the sentence in which it occurs. Each occurrence of a candidate key term is weighted by the inverse of its sentence position in the document, and this position weight is multiplied by the term's running occurrence count to obtain the term's score. For example, if a term is found in the \(2^{nd}\), \(5^{th}\) and \(10^{th}\) sentences, its score is \(\frac{1}{2} \times 1 + \frac{1}{5} \times 2 + \frac{1}{10} \times 3 = 1.2\). The score of each term is divided by the total number of sentences to get a normalized score.

Mathematically, for a document p, let \(\phi (p)\) denote the set of candidate terms, and let freq(c, i) be the number of occurrences of candidate term \(c \in \phi (p)\) up to and including its \(i^{th}\) occurrence. The score S(.) of a key term c in p is

$$ S(c,p) = \frac{\sum _i \frac{1}{pos_i} \times freq(c,i)}{n_{sents}} $$

where \(pos_i\) is the position of the \(i^{th}\) sentence in which c appears, and \(n_{sents}\) is the number of sentences in the document.
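A short sketch of this scoring, assuming the document has already been split into sentences and the candidates come from Sect. 3.1; matching a term by a lowercase substring test is our simplification.

# Sketch of S(c, p): each occurrence of a term contributes its running occurrence
# count divided by the (1-based) sentence position, and the sum is normalized by
# the number of sentences in the document.
def relevance_scores(sentences, candidates):
    n_sents = len(sentences)
    scores = {}
    for term in candidates:
        count, score = 0, 0.0
        for pos, sent in enumerate(sentences, start=1):
            if term in sent.lower():
                count += 1                    # freq(c, i): occurrences so far
                score += count / pos          # weighted by inverse sentence position
        scores[term] = score / n_sents
    return scores

# Example from the text: occurrences in sentences 2, 5 and 10 give
# (1/2)*1 + (1/5)*2 + (1/10)*3 = 1.2, which is then divided by n_sents.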

The well-known Pareto principle [10] is used to select the set of key terms for domain-specific documents. The procedure for Pareto analysis requires

  1. Sorting the key terms in decreasing order of scores,

  2. Cumulating the scores key term by key term,

  3. Normalizing the cumulative scores so that the maximum cumulative normalized score equals 1.

Key terms whose cumulative normalized score exceeds 98% are filtered out. Key terms falling within the first 40% of the cumulative normalized score are considered important and relevant. For the remaining key terms, domain-specific knowledge is used to further remove noisy key terms.
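A sketch of this Pareto-style partition over the scores from Sect. 3.2; the three-way split and the 40%/98% cut-offs follow the text, while everything else is our framing.

# Sketch: partition scored key terms by cumulative normalized score.
# Terms within the first 40% are kept, terms beyond 98% are dropped, and the
# band in between is passed on to the glossary and annotation filters.
def pareto_partition(scores, keep_below=0.40, drop_above=0.98):
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(s for _, s in ranked) or 1.0
    kept, uncertain, dropped = [], [], []
    cumulative = 0.0
    for term, score in ranked:
        cumulative += score / total           # cumulative normalized score in [0, 1]
        if cumulative <= keep_below:
            kept.append(term)                 # clearly relevant
        elif cumulative <= drop_above:
            uncertain.append(term)            # decided by domain knowledge filters
        else:
            dropped.append(term)              # treated as noise
    return kept, uncertain, dropped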

3.3 Filter Based on Domain Glossaries

The presence of a key term in domain glossaries and dictionaries is a good indicator of whether it is relevant for the domain. Glossaries contain important words in the domain and their meanings, and, when available, they serve as good noise removal filters. However, removing all extractions that do not match glossary terms exactly leads to sparsity. To overcome this limitation, we use word vector embeddings to retain extracted terms that are “close to” glossary terms. To create word vectors for potentially multi-word glossary terms, we used the word2term script from [17], running it twice over a large corpus of unlabeled documents, and then applied word2vec after annotating common terms of up to three words as a single token. However, the algorithm does not create vectors for all bigrams or trigrams, only for those that frequently co-occur. Also, some extracted key terms or glossary terms may be longer than three words. Algorithm 1 is therefore used to create vectors for multi-word terms by averaging over all possible segmentations of the term and adding the vectors of the segments within a segmentation.

To compare the distance of an extracted key term to a glossary term, both the L2 distance (the Euclidean, or straight-line, distance between two vectors) and the cosine distance between their vectors are considered, with different values of the distance threshold.

Algorithm 1. Constructing vectors for multi-word terms
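Below is a sketch of this construction and of the glossary-distance filter, assuming a trained gensim KeyedVectors model in which multi-word tokens are joined with underscores; the underscore convention and the 0.4 cosine threshold are our assumptions.

# Sketch of Algorithm 1: a multi-word term's vector is the average, over all
# segmentations into chunks of at most three words, of the sum of the chunk vectors.
# `kv` is assumed to be a gensim KeyedVectors model trained as described above.
import numpy as np

def segmentations(words, max_len=3):
    """All ways to split `words` into contiguous chunks of length <= max_len."""
    if not words:
        yield []
        return
    for k in range(1, min(max_len, len(words)) + 1):
        head = "_".join(words[:k])            # multi-word tokens joined by "_" (assumed convention)
        for rest in segmentations(words[k:], max_len):
            yield [head] + rest

def term_vector(term, kv):
    words = term.lower().split()
    seg_vectors = [np.sum([kv[tok] for tok in seg], axis=0)
                   for seg in segmentations(words)
                   if all(tok in kv for tok in seg)]   # skip segmentations with out-of-vocabulary chunks
    return np.mean(seg_vectors, axis=0) if seg_vectors else None

def close_to_glossary(term, glossary_vecs, kv, cosine_threshold=0.4):
    """Keep `term` if its vector is within the cosine threshold of any glossary term vector."""
    v = term_vector(term, kv)
    if v is None:
        return False
    v = v / (np.linalg.norm(v) + 1e-12)
    return any(1.0 - float(np.dot(v, g)) <= cosine_threshold
               for g in glossary_vecs)        # glossary_vecs assumed pre-normalized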

3.4 Filter Based on Domain Knowledge Annotations

Domains are often expressed by an ontology of important “aspects”. Technical support involves various well-understood attributes, such as a detailed description of the problem (symptom), error codes, and the steps (procedures) that should be taken to fix the problem. Consider the snapshot of a DB2 support document in Fig. 3. Here, the document mentions the symptoms (marked within red boxes), the error code SQL3004N, and the procedures (marked within the blue box). Annotations representing these attributes can be good indicators of key term importance. For example, when a key term occurs in the problem description, its domain relevance is highly likely to be high. Key terms are ranked based on the document annotations, and those below a rank threshold are discarded.

A Semantic Parser [7] is used to extract these attributes from technical support documents using rules built on the grammatical structure of sentences. Two deep parsing components are used for the linguistic analysis of text: English Slot Grammar (ESG) followed by Predicate Argument Structure (PAS) [16].

The method for re-ranking key terms based on domain annotations is presented below, where \(\mathbbm {1}\) is an indicator function, \(\bar{S}\) denotes the average of scores (S) for all key terms in a document. DA is the set of all terms in domain annotations for the document, and G is the set of all glossary terms. The adjusted score, \(\hat{S}\), is increased if the key term is present in domain annotations or glossary, by the amount \(\bar{S}\). \(\hat{S}\) is then used for re-ranking key terms.

$$ \hat{S}(c,p) = \mathbbm {1}[\![ c \in DA ]\!]\bar{S} + \mathbbm {1}[\![ c \in G ]\!]\bar{S} + S(c,p) $$
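A small sketch of this re-ranking, assuming per-document scores S(c, p) from Sect. 3.2 and sets of domain annotation terms and glossary terms.

# Sketch: boost a term's score by the document's mean score once if it occurs in the
# domain annotations (DA) and once if it occurs in the glossary (G), then re-rank.
def rerank(scores, domain_annotation_terms, glossary_terms):
    mean_score = sum(scores.values()) / len(scores) if scores else 0.0
    adjusted = {}
    for term, s in scores.items():
        boost = 0.0
        if term in domain_annotation_terms:
            boost += mean_score               # 1[c in DA] * S-bar
        if term in glossary_terms:
            boost += mean_score               # 1[c in G]  * S-bar
        adjusted[term] = s + boost            # S-hat(c, p)
    return sorted(adjusted.items(), key=lambda kv: kv[1], reverse=True)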

4 Evaluation

The experiments in this section demonstrate that using document relevance and domain knowledge helps remove noise effectively. Two domain datasets are considered, which differ greatly in the type and structure of their IT support content as well as in their technology and terminology. The first dataset comes from a hardware domain of hybrid storage solutions, referred to as Storwize. The second dataset covers a popular middleware product, referred to as DB2. Ground truth knowledge of important key terms for each of the two domains is collected from human annotators across 50 documents. The metrics F1 and MRR (Mean Reciprocal Rank) are used to evaluate the efficacy of the noise removal techniques on these datasets. In addition, a dedicated evaluation framework is used to evaluate the extraction of key terms on a larger set of documents without ground truth.

Fig. 3. Illustration of annotation of attributes on a support document

4.1 Data

The documents collected for DB2 and Storwize are from IBM's Knowledge Center sites. We use a set of 172 DB2 documents and 50 Storwize documents, identified as answers to questions from end users in IT services. Glossaries for Storwize and DB2, with 5K and 2K items respectively, are collected from the same site. Two question-answer datasets are created, comprising those end-user questions along with their accepted answer documents. A subset of these question-answer pairs was annotated with important key terms by domain experts. Table 2 summarizes both datasets, where the DB2WithGT and StorwizeWithGT datasets contain ground-truth-labeled key terms, #_Docs is the number of documents in each dataset, #_Keyterm is the number of ground truth key terms or encompassing terms in each dataset, and Avg_Keyterm is the average number of ground truth key terms per document.

For running our evaluations, we used the ground truth key terms where available, and used key terms from problem titles otherwise.

Word embeddings are built only for the Storwize domain, using 50,000 documents. Details are described in Sect. 4.3.

Table 2. Datasets used in our experiments

4.2 Evaluation Metrics

The following evaluation metrics are used to compare performance of the different experimental setups [14]:

  1. Precision (P), Recall (R), and F1 measure (F1) are computed for each experimental setup. Precision is the fraction of extracted key terms that are correct, Recall is the fraction of ground truth key terms that are extracted, and F1 is the harmonic mean of precision and recall.

  2. Mean Reciprocal Rank (MRR) [12] between the ground truth key terms and the final relevance ranking of the key terms:

    $$ MRR = \frac{1}{|D|}\sum _{d \in D} \sum _i \frac{1}{rank_i} $$

    where D is the set of documents, and \(rank_i\) is the rank of the \(i^{th}\) ground truth key term in document d (a short code sketch of both metrics follows this list).
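The sketch below shows one way both metrics can be computed, assuming extracted and ground truth key terms are already lemmatized and lowercased; it mirrors the formulas above rather than any particular library implementation.

# Sketch of the evaluation metrics used in this section.
def precision_recall_f1(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    p = correct / len(extracted) if extracted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def mean_reciprocal_rank(ranked_by_doc, gold_by_doc):
    """ranked_by_doc: {doc: ranked list of key terms}; gold_by_doc: {doc: set of ground truth terms}."""
    total = 0.0
    for doc, ranking in ranked_by_doc.items():
        gold = gold_by_doc.get(doc, set())
        total += sum(1.0 / (rank + 1)                 # ranks are 1-based
                     for rank, term in enumerate(ranking) if term in gold)
    return total / len(ranked_by_doc)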

4.3 Results and Discussions

The focus of the evaluation presented in this section is to compare the various noise removal methods.

First, we address the efficacy of the individual stages in the domain-specific noise removal pipeline, i.e., generic key term extraction, followed by key term extraction with document relevance, glossaries, and word embeddings. Generic candidate key terms are identified using tools like Spacy and Watson NLU, as discussed in Sect. 3. Spacy's lemmatizer is used to lemmatize both the extracted and the annotated key terms. Precision, recall, and F1 are computed against the ground truth key terms for all the setups.

Table 3 reports the comparative analysis. Here, domain knowledge refers to using both glossaries and domain knowledge annotations. The results show that adding domain knowledge increases precision, thus helping noise removal. However, negative filtering using domain knowledge may reduce the importance of a term to the extent that a key term from the “gold” set is removed, which can lower recall. In terms of F1 score, the use of document relevance with domain knowledge outperforms the other approaches by 28% on the two datasets. For Storwize, additionally using word embeddings improved the F1 score over the baseline by 30%.

The embedding results in Table 3 for Storwize are reported for the best word2vec [17] model. We next show the results of experiments with the two variations of the word2vec model, CBOW and Skipgram, along with several relevant window sizes. Having trained word2vec models on 50,000 Storwize documents, Table 4 shows the performance of the eight best embedding approaches on the ground truth key terms. These models differ in how the word vectors are trained and in the distance metric used for pruning. Here, CBOW-NEG\(\langle \)X\(\rangle \)-L2 denotes word vectors trained with the CBOW model with negative sampling and a window size of X, with L2 distance used for pruning. Similarly, SKIPGRAM-NEG\(\langle \)X\(\rangle \)-COSINE denotes word vectors trained with the Skipgram model with negative sampling and a window size of X, with cosine distance used for pruning. We show embedding results only for Storwize and not for DB2, because the number of documents in the latter domain is too small to learn embeddings.
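A sketch of how such variants can be trained, assuming gensim 4.x and sentences that are already tokenized with multi-word terms joined into single tokens (Sect. 3.3); the vector dimension, negative-sample count, and minimum frequency are our assumptions.

# Sketch: CBOW and Skipgram with negative sampling, for several window sizes.
from gensim.models import Word2Vec

def train_variants(sentences, windows=(5, 10), dim=100):
    models = {}
    for sg, name in ((0, "CBOW"), (1, "SKIPGRAM")):
        for window in windows:
            models[f"{name}-NEG{window}"] = Word2Vec(
                sentences=sentences,
                vector_size=dim,      # "size" in gensim 3.x
                window=window,
                sg=sg,                # 0 = CBOW, 1 = Skipgram
                negative=5,           # negative sampling
                hs=0,
                min_count=5,
                workers=4,
            )
    return models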

Table 3. Performance comparison: with ground truth
Table 4. Performance comparison (storwize): with ground truth

Next, we evaluate the ranking scheme discussed in Sect. 3.4, where key terms are re-ranked based on domain knowledge. In Table 5, KT refers to generic key terms, and DK refers to domain knowledge. Precision, recall, and F1 scores are compared for top@N (N = 2, 4, 6, 8) on the two datasets. The use of domain knowledge increases F1 scores as N gets larger: for N larger than 2, there is a 3–5% increase in F1. The table also reports the MRR scores for the different combinations on these datasets, with similar gains.

Table 5. Ranking performance comparison: precision, recall, F1

Because ground truth is lacking for the majority of domain-specific technical support documents, we designed a new evaluation framework consisting of these steps: (i) we shortlisted a set of end-user questions for the two products together with their corresponding answer document(s), 350 questions for Storwize and 400 for DB2; (ii) subject matter experts were given these questions and asked to annotate the important key terms in them (five experts participated in this task); (iii) we treated the key terms extracted from the questions as “pseudo” ground truth; (iv) we calculated the precision, recall, and F1 measure for each experimental setup.

Table 6 presents the comparative analysis of the different setups. We present the results for the best threshold value of the domain knowledge embeddings used to remove noisy extractions. As shown in the table, the precision is relatively low. This is because, as shown in Table 2, the number of key terms annotated from questions is very small, since in most cases the questions are very short. Compared to questions, the documents are much longer, and the average number of key terms extracted from documents is much larger.

Table 6. Performance comparison: with pseudo ground truth

Based on the results above, we can clearly see the benefit of using domain knowledge to extract and rank relevant key terms for domain-specific documents, thereby removing noisy key terms.

5 Related Work

Domain-specific terms are essential to many knowledge management applications, and the limitations of manual identification have motivated many research efforts.

Supervised methods, such as [19], start from a significantly large volume of content labeled with domain-specific terms and build classification models that decide whether a given term is relevant for the domain. This approach provides the best performance, but the overhead of generating labeled content has redirected research toward unsupervised and weakly supervised methods.

One group of related works aims to discover domain-specific terms given a domain corpus. These works address a problem similar to the one we address in this paper. Wang et al. [22] described an approach that combines a deep learning model with a weakly supervised bootstrapping paradigm to automatically extract domain-specific terms; it boosts the performance of deep learning models with very few training examples. This approach does not explicitly leverage prior knowledge of a domain, as we do. Riloff et al. [18] proposed a mutual bootstrapping method to learn both a semantic lexicon and extraction patterns simultaneously, starting from unannotated training texts and a handful of “seed words”. However, this method does not address the issue of extracting domain-specific terms.

In a different context, domain-specific key terms are identified in order to distinguish across domains [9, 11] and then used for feature extraction in broader text analysis tasks. More specifically, given a corpus that spans multiple domains, domain-specific statistics, such as term frequency and inverse document or domain frequency [9], and entropy impurity [11], are used to determine the set of key terms most representative of a domain. In [23], domain-specific terms are extracted using an iterative bootstrapping method to learn term components from seed terms, and then a maximum forward matching method and domain frequency are used to extract additional components. Our solution uses similar term frequency methods but applies them together with other domain-specific linguistic features in order to reduce the number of false positives.

Frakes et al. [6] have emphasized the need for accurate automatic domain vocabulary selection. The objective of their work is to evaluate various automatic vocabulary extraction metrics (normalized and non-normalized term frequency metrics) for domain analysis against the domain vocabulary provided by subject matter experts. Their analysis confirms that term frequency is one of the most important factors, along with stop word removal (removal of common English words) and stemming (grouping related terms into a common term by removing suffixes and prefixes). Our solution starts with a curated list of terms as input, hence stop word removal is not necessary. In addition to term frequency, our algorithm takes the position of a term into consideration to boost its score.

6 Conclusion and Future Work

In this paper, we presented a weakly-supervised approach to extract key terms from IT services documents, leveraging prior knowledge of the domain. Experimental results show that using domain knowledge effectively improves the quality of terms extracted by generic tools; having domain knowledge, in the form of glossaries and a domain corpus, helps considerably in extracting domain-specific key terms. In the future, we will further refine the key term extraction process by leveraging cross-document features. In addition, we will discover semantic relationships between the extracted terms for the purpose of knowledge graph construction.