1 Introduction

A cognitive agent can be thought of as a virtual agent that can observe, learn, infer and interact. It acquires these capabilities through extensive training in relevant domains using technologies such as machine learning, natural language processing and dialog decision-tree flows. IT support services are increasingly moving towards a self-assist mode by means of such cognitive agents. Such an agent is able to act as the first line of contact for customers who would typically have called a human helpdesk. One of the more common manifestations of such agents is the chatbot, e.g. 'Spoke' [1]. Many such conversational cognitive agents, e.g. 'WSS' [11], have already started making inroads as a frontend for customer support. With the scores of services now available for building chatbots/conversational cognitive agents [18], it has become very easy to design one. The challenge lies in building an agent that is intelligent and a quick learner. Too many 'I don't know' or wrong answers can render the agent useless. Currently, this problem is handled by designing domain-specific agents that hand off to a human agent when they cannot reply satisfactorily [3]. It should be noted that conversational cognitive agents are expected to understand context and have notions of intents.

Conversation systems (services) in cognitive agents are bootstrapped with basic knowledge through initial training. For a conversational agent to be intelligent enough to identify intents, a huge amount of training is needed. There is some work in the areas of active learning [12] and semi-supervised learning [20] aimed at reducing the dependence on labeled data for training. However, none of these techniques leverage feedback from users. Conversational cognitive agents typically support a feedback mechanism; for example, some of them ask users to vote on the answers or rate the experience/response that they got for their query. In addition, there is scope for capturing usage logs, implicitly gauging the engagement of users with the system and treating this as feedback. Such feedback is an extremely important source of self-learning and improvement for cognitive agents. As the agent interacts with users, it should continuously observe, infer and learn from feedback: what it is doing well, which topics it is not handling well, and which topics it does not seem to know about at all. These topics are usually the intents of the user's utterances/queries, and good training data is key to identifying intents and having good quality conversations. However, continuous learning to improve training, and thereby conversations, is a challenging problem and needs utmost care so as not to degrade the existing accuracy. In this paper, we propose a service that enables feedback-based learning in cognitive agents to continuously improve training data for intent classification in conversation services. More specifically, we propose a reinforcement learning [14] based method to derive a learning policy for improving the agent's training data.

The main contributions of this paper are: (i) modeling the learning problem as a feedback-based reinforcement learning problem by appropriately defining rewards and state value functions; (ii) deriving action policies, using this model, that automatically provide improvement suggestions in the form of actionable utterances for continuous learning; (iii) designing automated and manual workflows to act upon the actionable set of utterances and update the training data; (iv) designing the continuous learning framework as a service that uses feedback from user interactions to improve training data by incorporating the policy-based workflows mentioned above; (v) implementing the service workflows and evaluating them against real data. One of the decisions that was crucial in designing the service was whether there should be human intervention to vet the automatically generated suggestions before the training data is modified. Without human intervention, it is difficult to address cases where the feedback adds to the confusion instead of firming up the training data, so we decided to provide both automated and manual workflows.

To the best of our knowledge, this is the first attempt at modeling the continuous learning problem for intents in conversational systems as a reinforcement learning problem and designing it as a service. The rest of the paper is organized as follows. The problem is described in more detail in Sect. 2. Section 3 describes the model, action policies and the workflow algorithm. The components of the continuous learning service are explained in detail in Sect. 4. Section 5 discusses the experiment results, Sect. 6 covers the related work and we conclude with Sect. 7.

Fig. 1. Overview of interactions for feedback-driven learning

2 Problem Overview

Starting with an initial training dataset consisting of utterance-intent pairs used to train the conversation service to predict the intent of a user utterance, the objective is to augment this dataset over time using implicit and explicit feedback, and thereby improve intent classification performance. The problem overview is provided in Fig. 1. The utterance data is a stream consisting of user queries/utterances and agent answers/responses. An utterance is either the main query or a supporting dialog turn that helps understand the main question. The utterances that capture the main question have to be identified from the dialog flow; we assume that this main-utterance identification has been done on the conversation data before it is passed to the learning service. The implicit and explicit feedback associated with the responses is also captured as part of the conversation. For example, if the response is a solution document, then the clicks on the page and the time spent reading the document are implicit indicators of usefulness, while a vote is explicit feedback. User feedback is an important piece of information to leverage for improving a system. Observing and incorporating feedback is, however, one of the most challenging aspects of the problem because the interpretation of feedback is not always straightforward. For example, there are situations where explicit and implicit feedback convey opposite sentiments. This can be attributed to an imperfect implicit feedback model, an unfriendly user interface even though the content was fine, or unmatched user expectations. We do not dwell upon the subjectivity of feedback and model it as per the standard notions of implicit feedback in the literature [5].

As shown in Fig. 1, the learning cycle needs to be continuous, possibly starting with a small training dataset. The training data consists of rows of 2-tuples denoted by <utterance(U), intent(I)> pairs such that each intent has at least a few training utterances. For example, an intent can be 'CreateSpaceInMailbox' and one of the corresponding utterances can be 'How do I clean my inbox?'. Learning manifests as modifications to the training data that improve the prediction model. Modifications are actions such as: adding utterances to an intent as examples to boost confidence, identifying utterances that are candidates for new intents, adding new intent labels, and more. Feedback is the driving force of continuous learning for the cognitive agent. We propose to use reinforcement learning (RL) to learn action policies, that is, rules to modify the training data. Once the action policies are learned, the modification logic can be integrated algorithmically into the learning service. One of the biggest challenges in using an RL method for learning in a conversation system is modeling the states, value functions and rewards using logs and feedback from conversations.

The advantage of taking a feedback-based approach to continuous learning is that the intents that are used more commonly improve considerably over time, and this information cannot be obtained in a better way than through user interactions. Thus, feedback-based learning gives direction to the learning. In many cases, the feedback helps in firming up the confidence automatically as utterances become training examples. Linguistically diverse utterances from users add variety to the training examples, and the dependence on a manual curator is greatly reduced. Another advantage of the feedback-based approach is that the training data can be initialized with a small set of intent-utterance pairs and augmented based on user interactions.

3 Modeling the Continuous Learning Problem Using Reinforcement Learning (RL)

In the following, we give an overview of reinforcement learning and then provide the details of modeling the feedback-based learning problem in cognitive agents as an instance of the SARSA algorithm [14] for reinforcement learning. The output of the reinforcement learning algorithm is a set of optimal action policies for training data improvement, which are then modeled as algorithmic workflows. These workflows are used by the continuous learning service as shown in the next section.

In reinforcement learning [14], there is an agent, called the RL agent, which observes an input state and takes an action determined by a decision policy. Once the action is performed, the agent receives a reward that acts as a reinforcement signal for the goodness of the action. The reward for each state/action pair is recorded. By performing actions and observing the resulting rewards, the policy used to determine the best action for a state can be fine-tuned. Eventually, if enough states are observed, an optimal decision policy (referred to as the action policy henceforth) is generated and the agent performs optimally in that particular environment. The algorithm used for reinforcement learning here is the on-policy algorithm SARSA [14], an iterative method whose update can be represented as:

\(Q(s_t,a_t)\leftarrow Q(s_t,a_t)+\alpha (r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t))\)

where,

\(\alpha \) is the learning rate,

\(\gamma \) is the discount factor, a factor of 0 will make the agent “opportunistic” by only considering current rewards,

\(Q(s_t,a_t)\) : the value of taking action \(a_t\) in state \(s_t\) under a policy at step t,

\(r_t\) is the reward observed after taking action \(a_t\).
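To make the update rule concrete, the following is a minimal sketch of a tabular SARSA update in Python. The tabular Q representation and the class structure are illustrative assumptions for exposition, not the implementation used by the learning service.

```python
from collections import defaultdict

class SarsaAgent:
    """Tabular SARSA agent; states and actions are assumed to be hashable keys."""

    def __init__(self, alpha=1.0, gamma=1.0):
        self.alpha = alpha                 # learning rate (the paper fixes it at 1)
        self.gamma = gamma                 # discount factor (the paper fixes it at 1)
        self.q = defaultdict(float)        # Q(s, a) table keyed by (state, action)

    def update(self, s_t, a_t, r_t, s_next, a_next):
        """One step: Q(s,a) <- Q(s,a) + alpha * (r + gamma * Q(s',a') - Q(s,a))."""
        td_error = r_t + self.gamma * self.q[(s_next, a_next)] - self.q[(s_t, a_t)]
        self.q[(s_t, a_t)] += self.alpha * td_error
        return self.q[(s_t, a_t)]
```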

Fig. 2. Reinforcement learning model for intents in conversation

Having provided the background on reinforcement learning, we now explain the setting of the SARSA algorithm for reinforcement learning in conversational cognitive agents. The setting is shown in Fig. 2. The conversation system is initialized with a model trained on an initial training dataset denoted by Training Data 1. Based on this model, the utterances from users are analyzed for intents and feedback is collected. To make this system learn and improve over time, we define epochs of learning. The RL agent acts at every epoch, and Fig. 2 illustrates the flow for one complete cycle from epoch to epoch. At each epoch, the agent collects data in terms of conversations, that is, user utterances, the output from the model and the user feedback, both explicit and implicit. This is shown by the edge labeled 1. Let \(\mathcal {A}\) denote the possible action atoms, responsible for updating the training data, that can be taken by the RL agent. These action atoms are explained below.

- Add training example: There are situations where the correct intent is predicted with low confidence. In such cases, the suggestion is to add the utterance as a training example for the low-confidence intent.

- Find correct/alternate intent: If there is confusion between intents for an utterance, then the correct one should be chosen and the utterance should be added as a training example for that intent.

- Add new intent: This action is suggested when no existing intent in the corpus matches the intent of the utterance. This action type augments the training data so that the agent's knowledge increases.

- Generate more training data: The utterances for which more training data is required are taken and similar utterances from the conversation corpus are found. We use cosine and Jaccard similarity to obtain similar utterances (a sketch of this similarity search appears after this list). In addition, paraphrasing is performed using an LSTM [9] (out of scope of this paper). We also maintain a dictionary of acronyms in order to find similar utterances, e.g. 'ooo' for 'out of office'.

- Report problem with solution quality: This actionable deals with cases where the intent has been identified correctly but the user is not satisfied with the solution provided. Such situations are flagged as potential cases of the solution quality not being up to the mark.
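The following is a hedged sketch of the similarity search used by the 'generate more training data' atom, combining Jaccard and cosine similarity over bag-of-words tokens with acronym expansion. The tokenization and the thresholds are illustrative assumptions; only the use of cosine/Jaccard similarity and the acronym dictionary (e.g. 'ooo') come from the text.

```python
import math
from collections import Counter

ACRONYMS = {"ooo": "out of office"}   # example entry taken from the text

def tokens(utterance):
    """Lowercase, split and expand known acronyms."""
    out = []
    for w in utterance.lower().split():
        out.extend(ACRONYMS.get(w, w).split())
    return out

def jaccard(a, b):
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def cosine(a, b):
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * \
           math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def find_similar_utterances(seed, corpus, jac_th=0.5, cos_th=0.6):
    """Return corpus utterances similar to the seed (thresholds are illustrative)."""
    seed_toks = tokens(seed)
    return [u for u in corpus
            if jaccard(seed_toks, tokens(u)) >= jac_th
            or cosine(seed_toks, tokens(u)) >= cos_th]
```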

Let there be \(\mathcal {U}\) utterances in an epoch. Each \(u\in \mathcal {U}\) can potentially trigger actions of a type in \(\mathcal {A}\). The actions taken in an epoch form the aggregated set of actions over the utterance set \(\mathcal {U}\); let this action set be denoted by \(A_t\). The RL agent takes these actions, updates the training data to become Training Data 2 and moves to the next epoch of conversations, as shown by the edge labeled 2. The improvement in intent classification of the utterances in the past epoch using the updated training data is the reward for taking the actions. This is illustrated by edges 3 and 4 in Fig. 2: edge 3 shows that the newly trained model is used to predict intents for the utterances seen in epoch 1, and edge 4 shows the resulting improvement in prediction accuracy feeding back into the state as a reward for the actions taken. The cycle then repeats for the next epoch with the model based on Training Data 2.

We now define the state, Q-function and rewards at epoch t for our learning problem, in order to model the SARSA algorithm and learn the optimal action policy.

  • State, \(s_t\): a state at epoch t is defined as (Training_Data_t, {<Utterance(\(U_t\)), Intent(\(I_t\)):Confidence(\(C_t\)), Feedback(\(F_t\))>})

  • Value \(Q(s_t,A_t)\) : let \(A_t\) denote the set of actions consisting of atoms from \(\mathcal {A}\) taken in state \(s_t\). Then, the Q value of taking those actions in state \(s_t\) is defined as the cross-validation accuracy of the resultant training data.

  • Reward, \(r_t\) : The improvement in label prediction accuracy for the current epoch t using the updated training data.

  • \(\alpha \) and \(\gamma \) are fixed at 1.

A state consists of (training model, conversation history), where the conversation history is a collection of tuples such that each tuple contains <the utterance asked, the corresponding predicted intent, the confidence value and the feedback received>. The RL model is now used to learn the action policy. The aim of the policy being learned is to combine feedback from users with the model confidence in order to improve overall accuracy and user satisfaction. We now describe the action policy learned using the SARSA algorithm.
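As a concrete illustration, the state and reward described above could be represented as follows; the names and types are our own assumptions and only mirror the <utterance, intent, confidence, feedback> tuples and the accuracy-improvement reward defined above.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Interaction:
    utterance: str          # U_t: what the user asked
    intent: str             # I_t: intent predicted by the current model
    confidence: float       # C_t: classifier confidence for the prediction
    feedback: float         # F_t: combined explicit/implicit feedback in [-1, 1]

@dataclass
class State:
    training_data: Dict[str, List[str]]                          # intent -> example utterances
    history: List[Interaction] = field(default_factory=list)     # epoch's conversations

def reward(accuracy_before: float, accuracy_after: float) -> float:
    """r_t: improvement in intent prediction accuracy on the epoch's utterances."""
    return accuracy_after - accuracy_before
```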

3.1 SARSA Algorithm for Learning Action Policies

The SARSA algorithm was implemented as follows in our setup. The goal of the algorithm is to learn the best action policies, that is, which actions in a state lead to the best results. To come up with the state-action combinations, we started with the following basic guideline policy for actions: (i) for every intent-utterance pair with negative feedback, check whether finding the correct intent is the suitable action or whether solution quality is the issue; (ii) for every intent-utterance pair with negative feedback, check whether assigning a new intent is the most appropriate action, and generate more training data in case a new intent is suggested; (iii) for positive feedback with confidence below a threshold, the action is to add the utterance as a training example. The algorithm steps are listed below:

  1. Compute the training data accuracy in state \(s_t\), denoted training_acc\({_t}\). Initially \(t=1\).

  2. Run the system for a duration and collect feedback for epoch t. In-house teams were used to carry out this step.

  3. Manually analyze the utterances that got negative feedback. Take the best possible action for each utterance and record the details as a tuple <Utterance, Intent_1, Confidence, Feedback, Action, Intent_2, {Training Examples}>. Intent_2 is populated in case of the find-correct-intent or new-intent actions. Training Examples are populated in case the action is to generate more training examples.

  4. The analysis of utterances that got positive feedback is done automatically, as there is only one possible action. Update the same tuple structure as in step 3.

  5. Finalize the updates to the training data for the next epoch. Determine the training data accuracy with the updates, denoted training_acc\({_{t+1}}\).

  6. Run the model trained on the updated training data (with accuracy training_acc\({_{t+1}}\)) on the utterances of epoch t as test data and compute the accuracy. Check how many of the utterances that were acted upon are now predicted correctly as per manual expert judgment. This step computes the rewards for the learning made in this epoch.

  7. Repeat steps 2 to 6 for epochs \(t+1\), \(t+2\) and so on, until we gain confidence in the state-action combinations.

Fig. 3. Action policy flow learned by the RL agent

The action policies are learned manually by mapping, for each combination of confidence and feedback values, the type of action taken the majority of the time across utterance-intent pairs. The action policies thus learned, which are effectively state-action possibilities, are illustrated in Fig. 3. The figure shows that only a subset of \(\mathcal {U}\) is selected as actionable utterances based on negative/positive feedback and high/low confidence; thresholds are used to decide what counts as high/low and positive/negative. The actionable utterances are either subjected to a direct action, e.g. in the case of positive feedback with low confidence, or they are subjected to analyses, namely confusion analysis and new-intent analysis, to decide the type of action. We note that these decision-making analyses are partially automated, and the same is true for the actions. This is marked in Fig. 3 with green indicating a fully automated process and grey indicating that manual intervention is needed at some point to complete the analysis/action; a mix of the two colors indicates partial automation. For example, finding the correct intent is partially automated. The analyses for deciding actions are explained below.

Intent confusion analysis: When an intent is predicted by the model with high confidence but users end up giving negative feedback, the utterance becomes actionable. This is either a case where the model got confused and made a wrong prediction for the utterance, or the solution quality is bad. This is disambiguated as follows:

(i) If the explicit feedback is negative while the intent was predicted with high confidence and the user spent some time going through the corresponding solution, then it is considered a case of a bad solution document.

(ii) If the above condition does not hold, then the intents being confused are derived using the following automated procedure: (a) find the utterances in the data that are similar to the one identified for intent confusion analysis; (b) the intents corresponding to these similar utterances form the probable set of confusing intents. The action for finding the correct intent is then triggered as follows. If the size of the confusing intent set is two, then the other intent is chosen as the correct intent and the utterance is added as a training example automatically. If the size is more or less than two, then the decision on the correct intent is made manually. Note that confusion can arise due to similarity in the training utterances of different intents. For the manual decision, it is useful to check the probable sources of confusion: (a) intents with a very fine level of distinction that comes out through a few keywords in the utterances; (b) very similarly structured utterances constructed for different intents, e.g. 'how to set up my printer' and 'how to set up my account'. A sketch of this disambiguation is given below.
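The sketch below is a hedged rendering of this disambiguation logic. It reuses find_similar_utterances from the earlier similarity sketch and assumes a predictions map from each utterance to its (predicted intent, confidence) pair; it is our reading of the analysis, not a verbatim reproduction of the paper's Algorithm 1.

```python
def confusion_analysis(u, predicted_intent, feedback, time_spent,
                       corpus, predictions, time_threshold=10.0):
    """Disambiguate 'bad solution' from 'wrong intent' for a high-confidence,
    negatively rated utterance u. predictions: utterance -> (intent, confidence)."""
    # (i) the user read the solution yet rated it negatively: flag solution quality
    if feedback < 0 and time_spent >= time_threshold:
        return ("report_solution_quality", predicted_intent)

    # (ii) otherwise derive the set of intents being confused from similar utterances
    similar = find_similar_utterances(u, corpus)
    confusing = {predictions[s][0] for s in similar if s in predictions}
    if len(confusing) == 2 and predicted_intent in confusing:
        # exactly two candidates: pick the intent other than the predicted one
        correct = (confusing - {predicted_intent}).pop()
        return ("auto_add_training_example", correct)
    # more or fewer than two candidates: the correct intent is chosen manually
    return ("manual_find_correct_intent", confusing)
```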

New intent analysis: If there is negative feedback for an utterance and the existing model is also not able to predict an intent with high confidence for any similar utterance, then the utterance becomes an actionable candidate for a new intent. When a candidate for a new intent is found, the possibility of a matching intent in the existing intent corpus is analyzed manually. Depending on the findings, there are two possible actions: either a matching intent is selected by the expert, in which case a training example is added, or no appropriate intent is found in the corpus, in which case a fresh intent is curated by the expert. Note that the manual selection from the existing intent corpus is needed to ensure that no similar intents get added.

3.2 Implementing the Action Policy

Having learned the policy, we now present the algorithm for implementing the action policies in the conversational agent. This is shown in Algorithm 1.

Algorithm 1

The algorithm clearly distinguishes the manual and automated workflows. Automated actions are denoted with auto_action and manual actions with manual_action. The algorithm is run in each epoch to obtain the actionable utterances, the ones for which actions got executed, and the result of the actions on the training data. F(u) is the feedback function, a weighted combination of explicit and implicit feedback normalized to a value in [-1, 1]. C(u) is the confidence value of the intent prediction for the utterance u. There are two thresholds, \(th_1\) and \(th_2\). The threshold \(th_1\) is on the confidence value; any value above this threshold is considered high confidence, and we used a value of 0.85. The second threshold \(th_2\) is on the implicit feedback in terms of time spent, for which we used 10 seconds. The method findSimilarUtterances(u) finds the utterances that are textually similar to u based on Jaccard and cosine similarity. The function intentSet(S) outputs the set of intents that have been predicted for the utterances in the set S. The rest of the action statements follow the logic explained under 'intent confusion analysis' and 'new intent analysis' in Sect. 3.1.
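The following sketch reconstructs the policy dispatch described above under the quoted thresholds (\(th_1\) = 0.85 on confidence, \(th_2\) = 10 s of time spent). It reuses the confusion_analysis and find_similar_utterances helpers from the earlier sketches; the control flow is our reading of Sect. 3.1 and Fig. 3, not the paper's Algorithm 1 verbatim.

```python
TH1_CONFIDENCE = 0.85    # th_1: predictions above this are treated as high confidence
TH2_TIME_SPENT = 10.0    # th_2: seconds spent on the solution (implicit feedback)

def action_for(u, corpus, predictions, feedback, time_spent):
    """Map one utterance to a policy action.

    predictions: utterance -> (predicted_intent, confidence); feedback: F(u) in [-1, 1].
    """
    intent, confidence = predictions[u]

    if feedback > 0 and confidence < TH1_CONFIDENCE:
        # correct intent but weak confidence: add u as a training example for it
        return ("auto_add_training_example", intent)

    if feedback < 0 and confidence >= TH1_CONFIDENCE:
        # negative feedback despite a confident prediction: confusion vs. solution quality
        return confusion_analysis(u, intent, feedback, time_spent,
                                  corpus, predictions, TH2_TIME_SPENT)

    if feedback < 0 and confidence < TH1_CONFIDENCE:
        similar = find_similar_utterances(u, corpus)
        if all(predictions.get(s, ("", 0.0))[1] < TH1_CONFIDENCE for s in similar):
            # neither u nor anything like it is predicted confidently: new intent candidate
            return ("manual_new_intent_analysis", similar)

    # remaining combinations (e.g. positive feedback with high confidence) need no action
    return ("no_action", None)
```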

Fig. 4. Continuous learning service design

4 Service Design for Continuous Learning

The user interacts with the cognitive agent by asking a query. The conversation service used by the agent is designed as a classifier that classifies the intent corresponding to the user query from the existing set of intents. The classifier is trained by providing a few manually annotated example queries for each intent in the system. As users query the system, their interactions are stored in a feedback database, and the information for similar queries is grouped together; similar queries are identified based on the Jaccard similarity index. At each epoch, the learning service reads from the feedback database and generates a set of suggestions based on the encoded action policy, taking into account the user feedback that has been captured. The policy module has three components: New Intent Candidates, Confusion Analysis and New Intent Analysis. Based on these components, policy-based actions are decided and passed on to the action module. The action module then updates the training data using both automated and manual actions. The details of how the learning policy recommendations are used to modify the training data are given in Algorithm 1. The service can be configured to bypass the manual (Subject Matter Expert (SME)) route. Accuracy analysis is performed on the new training set to ensure that the system has not degraded; once this is ensured, the training data is updated and the classifier is retrained on it. There is scope for the training data to be vetted by a human expert (SME) before the classifier is retrained; this functionality may be required because user feedback can be inconclusive and thus add confusion to the system. The proposed continuous learning service can be used with any conversation service; we use Watson Conversation Service [18] as a proof of concept. The service is independent of different notions of feedback as long as the feedback can be cast as explicit and implicit values normalized to the range [-1, 1]. We adhere to the standard notions of implicit feedback [5] in the design of the service, and additionally allow the implicit feedback value to be supplied as an input to the system to handle special cases.
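As an illustration of how such a normalized feedback value F(u) might be produced, the sketch below combines an up/down vote with click and time-spent signals into [-1, 1]. The weights and the mapping of the implicit signals are purely illustrative assumptions; the paper treats the implicit feedback value as an input to the system.

```python
def combined_feedback(explicit_vote, clicked, time_spent,
                      w_explicit=0.7, w_implicit=0.3, time_cap=60.0):
    """Combine explicit and implicit signals into F(u) in [-1, 1].

    explicit_vote: +1 (up), -1 (down) or 0 (no vote); clicked: whether the user
    opened the solution; time_spent: seconds spent reading it.
    """
    explicit = float(explicit_vote)
    if not clicked:
        implicit = -1.0                               # never opened the solution
    else:
        # scale time spent to [-1, 1], capping at time_cap seconds
        implicit = 2.0 * min(time_spent, time_cap) / time_cap - 1.0
    value = w_explicit * explicit + w_implicit * implicit
    return max(-1.0, min(1.0, value))
```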

Table 1. Utterance Analysis
Table 2. Evaluation of Action Policies

5 Experiment Results

Having learned the policy, we implemented the service. We now evaluate the continuous learning obtained with the implementation of the action policies in the conversational agent. While certain policy flows could be automated, others require manual intervention, for which a domain expert was engaged. The data used for the experiments to determine the policy consists of real user conversations over two months. The service portal, trained on an initial set of questions and intents, was deployed and the conversation logs were collected. The training data had about 127 intents and 1417 utterances. The portal had an up/down vote provision which users exercised optionally. The implicit feedback was captured through the links clicked on the solutions and the time spent going through the solution documents. The user logs were sampled at two-week intervals. Table 1 shows the number of main utterances asked in each epoch, how many of them were identified as actionable by Algorithm 1 and subjected to manual and automated actions, and, in the final column, how many were finally acted upon. The ratio a:m indicates that 'a' actions were automated and 'm' were analyzed manually. It can be seen from the table that some epochs have higher numbers of utterances identified for analysis. This is primarily due to two reasons: users asked questions that could not be predicted with high confidence, or that were predicted wrongly, in that epoch. Such epochs provide a better opportunity for learning than epochs where the utterances could be answered well by the existing training data, of which epoch 2 is an example. It can also be seen that actions are taken on only a subset of the utterances that were analyzed. This is attributed to the manual selection that is part of the action policy. We had many instances of the revise-solution-quality case, and these result in the actual actioned utterances being fewer than the candidate actionable ones. Some of the recommendations do not make it into the training data because the negative feedback was just noise rather than a wrong prediction. Some utterances were out of scope or nonsensical (e.g. 'What time did you wake up?') and hence were rejected; for example, if the agent is supposed to answer questions on WebSphere issues but some utterances ask questions on Oracle DB issues, these are rejected. No pattern was observed in the ratio of these causes for rejected actionable utterances.

We updated the training data, used cross-validation to find the training data accuracy, then took the next two weeks of user utterances as test data against the updated model and manually rated the predictions. The accuracy of the test data predictions is captured as the reward. Table 2 shows the value of the rewards and the training data accuracy at each epoch. The last epoch, 4, does not have a reward computed since it is considered the terminal state for the learning agent. Table 2 shows that the rewards vary from 5% to 11%. An interesting observation was that adding a training example did not always push the confidence of similar utterances above the threshold value of 0.85. This was attributed to structural similarity of the utterance with utterances of other intents. For example, 'how do I setup an X' and 'how do I setup Y' were sometimes confused, resulting in either low confidence or an incorrect prediction. Hence, the percentage improvement is not a numeric function of the number of actions taken. The table also shows how the training data was modified in each epoch. It can be seen that a new intent was added in epoch 3, and that the number of utterances increases in each epoch by differing amounts. The accuracy of the training data varies a little, within an acceptable range, and the last epoch ended with an overall improvement in accuracy. Note that the value of 76.7% at epoch 0 is the cross-validation accuracy with the initial training data, while the accuracy of 76.6% at epoch 1 is the accuracy once the training data had been modified based on the actions taken after analyzing the utterances of epoch 0. The column called 'reward on actions' captures the improvement in intent predictions as a result of the actions taken on the training data. We observe that the rewards are higher when the number of actions taken is higher, which is a validation of our action policy. Interestingly, training accuracy can decline even with fewer actions taken, as observed in epoch 3: the utterances of epoch 2 resulted in very few actionables, yet the impact on the resultant training data used in epoch 3 was negative. This is contrasted by epoch 4, which showed a minor increase in accuracy in spite of the small number of actionables in epoch 3. Overall, the fact that the accuracy did not drop by more than 1% is a validation of the robustness of the action policies, largely attributable to the quality of the recommendations and the manual decisions. We plan to completely automate the action policy as part of future work.

Another analysis was done to see how much of the potential improvement is captured using the feedback-based action policies proposed here. It is possible that users do not provide enough feedback, due to which the policies are unable to identify actionable utterances even though there are utterances which could have been worked upon. In this analysis, more than 75% of such cases were captured across epochs. Thus, the feedback coverage was very good and we covered most of the important cases for improvement in the training data. A few cases of possible new intents were not captured because no feedback was provided even though the predictions had low confidence. To handle such cases, we plan to augment the approach with unsupervised learning in future: the utterances remaining after those acted upon using the policies will be checked for prediction confidence, and the ones with low confidence will be subjected to clustering. The cluster output will be subjected to manual scrutiny to decide whether there is a case for a new intent.

Observations: Feedback-based reinforcement learning for continuous learning comes with its own performance challenges. The quality of feedback plays a crucial role in our model: if users give random feedback, then the training performance can deteriorate, as we observed in Table 1, and in the absence of feedback there is not much training that can be performed. We observed that utterances that were asked repeatedly across epochs and were acted upon eventually got their intents predicted correctly with high confidence. On a positive note, the manual effort in improving training data went down considerably with our approach, in some cases from 20 hours to 4 hours as reported by the experts. This is primarily because the effort is reduced to just decision making, as opposed to first analyzing, evaluating and then deciding.

6 Related Work

We now present the learning techniques from the literature that are relevant to our work. Self-training [20] is one such technique. In self-training, a classifier is first trained on a small amount of labeled data and then used to classify the unlabeled data. Typically, the most confident unlabeled points, together with their predicted labels, are added to the training set; the classifier is re-trained and the procedure repeated. Note that the classifier uses its own predictions to teach itself. The procedure is also called self-teaching or bootstrapping (not to be confused with the statistical procedure of the same name). One can imagine that a classification mistake can reinforce itself; some algorithms try to avoid this by unlearning unlabeled points if the prediction confidence drops below a threshold. Our approach is more robust and functionally richer in comparison.

Active learning [12] is increasingly being explored and used to build training datasets efficiently. Active learning algorithms select examples for labeling in a sequential, data-adaptive fashion, as opposed to passive learning algorithms that rely on preselected training data. The key to active learning is adaptive data collection. Most experimental work in active learning with real-world data is simulated by letting the algorithm adaptively select a small number of labeled examples from a large labeled dataset. This requires a large labeled dataset to begin with, which limits the scope and scale of such experimental work. Current relabeling-based active learning approaches [8] relabel based on impact and may end up altering the existing training data far more than desired.

Particularly for dialog systems and conversation agents, [16, 19] exploit a combination of active and semi-supervised learning for better training. As the classifier labels the unlabeled utterances, the ones with high confidence are automatically added to the existing training data and the ones with low confidence are selected for active learning, to be labeled manually and then added to the training data. Recognizing the importance of learning from the unlabeled user queries that are logged, [2, 4, 6, 7] exploit them by employing a semi-supervised learning approach to increase the training data; they model click-graphs to infer labels for the unlabeled user queries using proximity measures. Our work differs in two aspects. Firstly, previous works do not take into account the user feedback recorded on these queries. Secondly, their focus is mainly on identifying more examples to expand the training data and improve classification; we extend our work beyond this through two measures: a confusion analysis that identifies misclassified utterances based on user feedback, and a new intent analysis that identifies whether a new intent has to be added to the system to address some of the user queries. [13] uses reinforcement learning for an intent classification task by incorporating user feedback, but the task is limited to identifying the correct intent from the set of existing intents and does not consider the possibility of adding new intents to the system. For the sake of completeness, we also mention other works [10, 17] that use reinforcement learning in dialog systems; their main aim is to improve the system by identifying the optimal dialog sequence that engages the user, and the focus is not on the intent classification task. [15] is another similar work in this domain.

7 Conclusions

As cognitive systems mature in basic functionality, the need for the continuous learning service proposed in this work becomes inevitable. In this paper, we have focused on the modeling and implementation of continuous learning for intent classification in conversational agents and have shown promising results. Our experiment results for the service performance, detailed in Sect. 5, show that (i) training data updates can be made very efficiently; (ii) the impact of updates on the cross-validation accuracy of the training data is gradual, which is desirable; (iii) the training data expands with new intents and utterances over time, leading to marked improvement in intent prediction accuracy; (iv) interestingly, accuracy need not be directly proportional to the number of acted-upon utterances; (v) noise in feedback, or the absence of feedback, is a challenge that impacts the actionable vs. acted-upon ratio, as shown in the evaluations. As part of future work, we plan to extend continuous learning (fully/semi-automated) to other aspects of conversation systems such as dialog flows, and to make continuous learning services an integral component of any conversation service.