1 Introduction

Machine learning and AI algorithms are pervasive, underlying many services we consume, such as recommender systems, voice assistants, driving assistants and smart homes. In some high-stakes application fields, predictions of AI systems can have important consequences for our lives, such as in medical diagnosis and credit scoring. In this context, explainable AI (xAI) becomes vital as it seeks to make ML models and ML-based decision-making processes transparent and understandable, thus enhancing trust, accountability, and fairness in AI outcomes. By clarifying the behavior of otherwise opaque AI systems, xAI promotes better understanding and acceptance by different stakeholders (Gunning and Aha 2019; Meske et al. 2022). For instance, this is important to allow consumers to act on their rights and be able to contest or challenge an automated decision (Wachter et al. 2017).

Recently, research in xAI has gained significant interest and focuses on elucidating the decision-making processes of AI algorithms (Xu et al. 2019; Guidotti et al. 2019; Arrieta et al. 2020). While traditional xAI approaches often provide static, one-shot explanations, there is a growing interest in interactive, dialogue-based systems (Lakkaraju et al. 2022). This approach frames explanation as a conversation (Hilton 1990) and simulates human-like interactions, allowing for a dynamic exchange where users can ask questions and receive tailored, relevant explanations in real-time (Miller 2019).

Dialogue-based approaches to xAI have been recognized as having the following advantages over static, one-shot xAI systems:

  1. They can effectively address user queries by tailoring responses to the specific question and user. This targeted approach prevents information overload, ensuring that users receive relevant and concise answers (Kuźba and Biecek 2020).

  2. They can leverage the conversational context to provide the most relevant information and let the user guide the explanation process (Sokol and Flach 2018).

  3. Dialogue systems can be rooted in argumentation and demonstrate a marked improvement in user comprehension of scientific topics. By engaging users in structured, logical discussions, these systems enhance the depth and clarity of understanding in complex subjects (De Vries et al. 2002; Shaheen et al. 2020).

  4. They enable information to be presented in a layered approach, beginning with less complex details and offering users the flexibility to delve deeper into the subject matter, thus facilitating a tailored learning experience (Finzel et al. 2021).

Although these important benefits make dialogue-based approaches an interesting paradigm to explore, there is a lack of a systematic review to identify strengths, guide research, and optimize performance and user satisfaction.

To close this gap, we provide a systematic review that aims to answer the following key questions:

  Q1. What are the main targeted user groups?

  Q2. What are the main use cases?

  Q3. What are the objectives of the systems?

  Q4. What dialogue capabilities do these systems possess?

  Q5. Which questions do they address and how do they answer them?

  Q6. What approaches are used to evaluate dialogue-based xAI systems?

  Q7. What theoretical frameworks are dialogue-based xAI systems based on?

This review focuses on answering the above questions, and in particular on summarizing and evaluating the applications, components, and goals addressed by dialogue-based xAI solutions, as well as how they are evaluated. We specifically investigate how these systems facilitate natural language conversations, support interactive question-and-answer sessions, and adapt explanations to user inquiries. The goal is to assess the current state of these systems to identify blind spots as well as future avenues towards the goal of conversational xAI.

Adhering to PRISMA guidelines (Moher et al. 2009), we created targeted search queries for conversational xAI systems within three key computer science databases. Following the filtering process outlined in Fig. 2, 15 studies were selected out of 1339 for comprehensive analysis and comparison.

Our main findings reveal that recent systems implement one of two types of dialogue management: one type considers only the current explanation while allowing follow-up questions, and the other considers the entire conversation and initial setup as context. While most approaches condition their explanatory behavior on the interaction history, only a few systems feature adaptation to the specific user. Moreover, in Sect. 4.7 we identify several desirable functionalities drawn from theoretical work on explanatory dialogue frameworks, and we observe that most systems focus on a single functionality; this suggests an opportunity for more integrated approaches. The most common user inquiries center around understanding AI decisions, predominantly "Why" a prediction was made. Additionally, we observe a trend towards custom-developing individual components, which highlights the potential for a more unified approach that can help advance the field.

By reviewing the literature on dialogue-based xAI and answering the key questions above, we make the following contributions that can guide further research on xAI:

  • We identify five important dimensions for comparing and assessing dialogue-based xAI systems: (i) use cases and target audience; (ii) architectural components; (iii) dialogue capabilities; (iv) questions addressed and answers provided; and (v) evaluation procedure.

  • We identify “model users”, such as domain experts or other users interested in trusting the model and gaining knowledge from it, as the primary user group addressed.

  • We find that the main objectives of these systems are Interactivity and Trustworthiness.

  • We provide a systematization of the main types of questions addressed, observing a significant bias toward answering Why questions, without observing a dominant response method for providing answers to them.

  • We identify the main theoretical frameworks adopted for the realization of dialogue-based xAI systems and discuss the extent to which current systems adhere to the suggested properties of these frameworks. We identify three key areas along which the type of dialogue supported by these systems can be differentiated: dialogue dynamics and user interaction, response adaptability and complexity management, and goal definition.

  • We present a general meta-level architecture and core components for the implementation of dialogue-based xAI systems, distilled from the specific architectures proposed in the papers analyzed. This can guide and help to systematize research in the field.

Taken together, we think that the insights provided here can both support researchers and practitioners in the field of xAI in the adoption of dialogue-based methods and guide them in the process of choosing architectures and dialogue paradigms, defining objectives and evaluation scenarios, and identifying research questions to study.

2 Related work

Our systematic literature review is unique in addressing dialogue-based xAI approaches, a topic not comprehensively covered in existing literature. Figure 1 illustrates the positioning and overlap of our review within related work. It highlights our intersection with interface design in the xAI domain, as well as our focus on a task-specific NLP system for explaining machine learning models. The closest related work in the xAI field is by Chromik and Butz (2021), which focuses on interactive xAI. Their research centers on defining interaction concepts and outlining design principles for interactive explainability interfaces. While their approach, like ours, acknowledges the limitations of static explanations, it does not focus on conversational systems in detail or their applications. Reviews in human–computer interaction (HCI) and xAI similarly overlook the analysis of xAI dialogue systems, opting instead for a broader, more conceptual analysis of user perspectives, such as target audiences and the nature of explanations.

Fig. 1

Clustering of related work and its intersection with our review. Our work is positioned between interface design in xAI and task-specific dialogue systems

Our review, in contrast, specifically explores the applications and architectures of current dialogue-based xAI systems, establishing a crucial foundation for the development of future dialogue-based xAI systems.

2.1 Reviews in explainable AI

In recent years, several reviews have examined the current state of research in explainable AI (xAI) (Adadi and Berrada 2018; Arrieta et al. 2020). Approaches for xAI have been categorized by their purpose and target audience (e.g., affected users or model users), their data modality (e.g. natural language, images, tabular data, knowledge graphs), their explanation method (e.g. post-hoc explanation of single predictions versus interpretable, transparent models for all predictions), and their evaluation setup (e.g. automatic vs. manual). A commonality among classical xAI surveys (like Adadi and Berrada 2018; Arrieta et al. 2020) is their focus on one-shot explanation techniques, which provide a single explanation without allowing the user to contest it or seek further clarification, unlike holistic explanatory systems such as dialogue-based approaches.

Other reviews have focused on the specific needs of users in the xAI context, e.g. Eiband et al. (2021). This review integrates user mindsets and involvement, addressing and structuring assumptions about user interactions. However, it does not go into detail about specific explanatory systems and approaches.

The more theoretical review on interactive xAI by Chromik and Butz (2021) classified various interaction styles and provided brief descriptions about the motivation for each style, without delving into specific systems and implementations. In contrast, our analysis delves into specific systems, showcasing various possibilities for building conversational xAI systems. Our results highlight key use cases, target users and their questions, evaluation methods, and a common meta-level architecture.

With growing interest in interactive, dialogue-based xAI systems, numerous independent approaches have emerged. Our review surveys these systems, distinguishing itself from related work by focusing specifically on holistic dialogue-based xAI systems, rather than individual explanation algorithms, general xAI research, or theoretical classifications of interaction types.

2.2 Reviews of dialogue systems

Reviews on advances in NLP. Literature reviews for dialogue systems cover a diverse range of topics. Surveys of conversational agents provide both general overviews of the field and in-depth analyses of new technical approaches (Ramesh et al. 2017; Hussain et al. 2019; Ni et al. 2022; Caldarini et al. 2022). These reviews categorize and highlight advancements in NLP techniques, such as the Transformer architecture (Vaswani et al. 2017), which power modern conversational systems. Some reviews focus on specific types of conversational systems, like recommender systems (Jannach et al. 2021).

In contrast to traditional chatbot reviews, our analysis addresses the unique requirements of conversational xAI systems. We explore how these systems provide answers using xAI explanations, representing a fundamentally different paradigm from the traditional chatbots surveyed, which are pretrained on datasets to discuss specific topics.

Domain-specific chatbot reviews. Surveys often examine dialogue systems within specific domains, such as medicine (Laranjo et al. 2018; Vaidyam et al. 2019; Montenegro et al. 2019) and education (Paladines and Ramirez 2020), as well as systems designed for open-domain purposes (Huang et al. 2020). These reviews typically emphasize effectiveness, domain-specific techniques, and state-of-the-art practices rather than technical details. For instance, educational chatbots focus on the impact of pedagogical strategies (Paladines and Ramirez 2020).

Our review on dialogue-based explanations in xAI differs from these domain-specific reviews in several key aspects. Domain-specific chatbots often address static topics using a predefined knowledge base, with user questions answered by end-to-end NLP systems. In contrast, the systems analyzed in xAI require real-time computation of explanations during the conversation, and the current topic of discussion changes with the dataset or specific instances under examination. Our review targets conversational systems designed to bridge the gap between user interest in understanding machine learning model decisions and the explanatory techniques needed to generate those answers. We explore how these systems facilitate user understanding of complex ML decisions through dialogical explanations and identify the components employed in state-of-the-art systems.

Reviews on chatbots optimized for human-like conversations. Lastly, there are surveys of dialogue systems that focus on the specific characteristics of human-like communication (see Van Pinxteren et al. 2020; Chaves and Gerosa 2021; Rheu et al. 2021), which emphasize communication behavior, social characteristics, and consistent chatbot personalities.

In contrast, the conversational systems we examine prioritize the identification of relevant questions for different user groups and the provision of the best answers using current xAI and NLP techniques, rather than mimicking human-like interactions. However, studies on human-like dialogue enhancements can inspire future developments in conversational xAI.

3 Methodology

3.1 Literature search

Our study adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines (Moher et al. 2009). In alignment with these guidelines, we initiated our literature search by developing targeted search queries for academic databases. These queries were designed to identify articles related to conversational xAI systems, incorporating key terms associated with interactive dialogue systems (such as dialogue, chat, conversation) and explainable AI (explain, xAI, interpret). The specific search queries we employed are detailed in Appendix A.1. The search was conducted on 2023-02-02.

Our search strategy included Scopus, a popular meta-level research database, as well as two renowned computer science publishers, namely ACM and IEEE. Scopus indexes a range of relevant computer science databases, such as ACM, DBLP Computer Science Bibliography, IEEE Xplore, and SpringerLink, making it an appropriate choice for xAI research. By supplementing our search with research directly from ACM and IEEE, we aimed to achieve high coverage of relevant studies. Figure 2 depicts the filtering steps applied to the initial set of records. The first step was duplicate removal: since Scopus references different journals, we expected duplicates among the 1339 records when including ACM and IEEE. After removing the duplicates, we were left with 1177 records to screen for relevant titles. Next, we applied the inclusion criteria described in Sect. 3.2, resulting in 318 records. In a subsequent step, we screened the abstracts of these results, leading to 87 studies, of which 15 met the inclusion criteria mentioned below and were thus selected for analysis.

Fig. 2

Process of our literature search steps according to PRISMA (Moher et al. 2009) guidelines

3.2 Inclusion and exclusion criteria

Table 1 presents the inclusion and exclusion criteria that we used to screen the study items. These criteria were applied in two stages to the articles initially identified. During the first stage, only the titles and abstracts of the studies were reviewed. To minimize the possibility of false negatives, any study that met any of the criteria in Table 1 was included in the second phase, where the full texts were scanned for eligibility.

During the title and abstract screening process of the initial 1177 articles, we excluded studies that were not related to xAI explanations, such as those related to linguistics or explanations in contexts other than xAI. Unclear cases were included in the full paper scans. All four annotators thoroughly scanned the remaining 87 studies and applied the inclusion and exclusion criteria from Table 1, resulting in a final set of 15 distinct studies.

Table 1 Inclusion and exclusion criteria

4 Analysis and results

Our study provides a comprehensive analysis of 15 dialogue-based xAI studies, spanning target users, use cases, objectives, addressed questions and answers, and alignment to theoretical frameworks. In the following subsections, we provide a detailed description of the use cases and target audiences, along with the different types of objectives. An overview of the different properties that distinguish the systems can be seen in Fig. 3.

Fig. 3

Properties identified for comparing dialogue-based xAI systems (grey boxes indicate that this property is missing in all analyzed studies)

4.1 What are the main targeted user groups?

We start by analyzing the different target groups (Q1) of explainable AI systems, according to the stakeholders defined in Arrieta et al. (2020) (see Table 2). While one study considers multiple target users (Braines et al. 2019), this is an exception.

Most studies analyzed emphasize the need to bridge the gap between machine learning model explanations and the model users of the predictions or recommendations. In particular, 12 out of 15 studies address model users of ML predictions as the main user group. This user group is characterized by an interest in the model’s decisions albeit with limited knowledge of xAI. These studies aim to bridge the gap between the technical xAI explanations developed by researchers and the understanding of model users with limited technical backgrounds. While many studies do not specify the model users’ backgrounds in detail, referring to them as “lay users” or “domain experts”, some explicitly target audiences such as fire chiefs (Valmeekam et al. 2021) or data scientists (Kuźba and Biecek 2020).

The strength of dialogical explanations, allowing natural language questions and a back-and-forth discussion of an issue, makes them particularly well suited to this target group, given its limited technical understanding. However, the groups of Regulators and Managers could also profit from conversational xAI systems, and studies investigating their unique questions and requirements are missing among the analyzed studies.

In addition to the model users of AI systems, affected users also seek understandable ML predictions. Their common goals include understanding their own situation to improve it (recourse) and verifying the fairness of the decision. The three studies targeting affected users focus on individuals whose loan applications were declined (Sokol and Flach 2018), users of smart home systems (Houzé et al. 2020), and patients seeking to understand their medical diagnoses (Shaheen et al. 2020).

In terms of user groups that are not addressed by the analyzed studies, we find data scientists, developers, and product owners; regulatory entities/agencies; and managers and executive board members. However, the latter two groups would also benefit from non-technical explanations and dialogical access to thoroughly interrogate machine learning models and assess regulatory compliance. Given that data scientists, developers, and product owners possess technical knowledge and machine learning related analytical skills, dialogical interfaces should offer them more complex or deep querying capabilities.

Table 2 Stakeholder groups and goals by Arrieta et al. (2020), abbreviation, and study count
Table 3 Use cases with datasets, data type, target audiences, and objectives of explainable dialogue systems ordered by publication year
Fig. 4

Network showing the connection between target audiences in the middle, the data types the systems use on the left and the objectives of the systems on the right

4.2 What are the main use cases?

Today’s conversational xAI systems address a wide range of scenarios, with almost every paper tackling a unique use case. Most of the analyzed papers focus on toy domains, highlighting opportunities to explore more realistic applications in future research. In the following, we discuss all 15 selected papers with respect to use cases to answer Q2 (see Table 3), grouping them into four broad areas: “Robotics, manufacturing, and computer vision”, “IT operations and emergency response”, “Machine learning, fake news detection, and recommendation”, and “Healthcare and smart home”.

Robotics, manufacturing, and computer vision. Sklar and Azhar (2018) focus on path-finding robots within a treasure hunt game, where the robots can explain their decisions in argumentative dialogues using persuasive reasoning, inquiries, and active information-seeking. The approach by Arnold et al. (2021) allows robots to explain the actions they take with respect to the goals and constraints given to them. Braines et al. (2019) tackle the problem of traffic congestion detection systems and provide conversational explanations tailored to different user groups. Viros Martin and Selva (2020) develop a virtual assistant to support the design of earth observing systems and employ explainability techniques for various components of the system. Akula et al. (2022) address the topic of explaining image classifications in natural language dialogues.

IT operations and emergency response. In high-pressure situations, such as IT operations and firefighting, proactive explanation systems that actively provide explanations rather than waiting for user queries are crucial. Gao et al. (2021) describe a chatbot that proactively presents AI model insights to IT operations teams. Similarly, Valmeekam et al. (2021) provide plan suggestions for a firefighting scenario where the system initiates the conversation and reacts to user feedback.

Machine learning, fake news detection, and recommendation. A number of approaches employ small toy datasets from the machine learning literature, such as the UCI Machine Learning Repository. Kuźba and Biecek (2020) explain machine learning predictions on the Titanic dataset within an open-ended dialogue. Sokol and Flach (2018) explain credit scoring decisions on the German credit dataset. Finzel et al. (2021) employ a small taxonomy of living beings to showcase their approach to translate outputs from inductive logic programming into comprehensible natural language explanations. Malandri et al. (2023) utilize multiple datasets from the UCI Machine Learning Repository for their system that adapts responses to the user knowledge. Chi and Liao (2022) developed a system to identify fake news on social media that responds to queries in natural language and explains its reasoning. Hernandez-Bocanegra and Ziegler (2023) propose a dialogue system to explain hotel recommendations.

Healthcare and smart home. Shaheen et al. (2020) propose treatment plans for multi-condition patients. Their solution involves a Satisfiability Modulo Theories (SMT) Solver (Moura and Bjørner 2011) to recommend optimal treatment paths, coupled with explainable dialogues to explain the suggestions. Houzé et al. (2020) explain unexpected behaviors by smart home systems.

4.3 What are the objectives pursued by xAI systems?

The xAI literature outlines various objectives for developing xAI methods and systems, with Arrieta et al. (2020) providing a comprehensive overview. To answer Q3, we aligned these established xAI goals with the explicitly stated objectives among the studies we examined. Arrieta et al. (2020) in particular introduce the following goals:

  1. Trustworthiness: Foster trust in AI models, ensuring that their actions align with their intended purpose. [trust]

  2. Causality: Find causal relationships among data variables, aiding in distinguishing causation from mere correlation. [causal]

  3. Transferability: Understand a model's limits and how it works to estimate whether it is appropriate to transfer it to a different task or domain. [transfer]

  4. Informativeness: Provide insights into the decision-making process of models, offering information about the problem being solved and the internal workings of the model to avoid misconceptions. [internal working, internal operation, internal reasoning]

  5. Confidence: Assess the robustness and stability of a model, ensuring reliable interpretations and dependable performance. [robustness, stability, stable, reliable, reliability]

  6. Fairness: Promote fairness and ethical considerations in AI, advocating for equitable outcomes, and ensuring ethical and unbiased use of models. [fair]

  7. Accessibility: Allow model users to get involved in the model development cycle. [accessible, access]

  8. Interactivity: Support user interaction with models, enhancing user engagement. [interactive, interaction]

  9. Privacy Awareness: Understand and mitigate privacy concerns related to AI models, ensuring confidentiality and responsible data handling. [privacy, personal data]

We scanned the studies for any mentions of the above goals, including their abbreviations and the different forms noted in brackets next to each goal. In our study set, Interactivity and Trustworthiness thus emerged as the most frequently highlighted goals, appearing in 14 and 9 studies, respectively. Conversely, Transferability, Accessibility, and Privacy Awareness were not referenced in any of the studies. We detail the objectives of all the studies in the column “Objectives” in Table 3, and show a diagram of Audience and Objectives in Fig. 4.
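Keyword matching of this kind can be expressed in a few lines. The following Python sketch illustrates, in simplified form, how such keyword lists can be matched against study texts; the example passage is invented, only three of the nine goals are included, and the snippet is not the exact script used for this review.

```python
import re

# Keyword lists corresponding to a subset of the goals above.
goal_keywords = {
    "Trustworthiness": ["trust"],
    "Interactivity": ["interactive", "interaction"],
    "Confidence": ["robustness", "stability", "stable", "reliable", "reliability"],
}

def goals_mentioned(text: str) -> set:
    """Return the goals whose keywords occur in the given study text."""
    lowered = text.lower()
    return {
        goal for goal, keywords in goal_keywords.items()
        if any(re.search(r"\b" + re.escape(k), lowered) for k in keywords)
    }

# Invented example passage, for illustration only.
sample = "Our interactive explanation dialogue is designed to increase user trust in the model."
print(sorted(goals_mentioned(sample)))  # ['Interactivity', 'Trustworthiness']
```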

4.4 What dialogue capabilities do the examined systems have?

Effective dialogue management is crucial for facilitating natural interactions in dialogue-based xAI systems. Table 10 details all studies and their dialogue management techniques, while Fig. 7 visualizes the frequency of each technique. In order to categorize the different approaches to dialogue management, we focus on the following categorization:

  1. Context-based interpretation: This approach focuses on understanding and interpreting user inputs by considering the context of the ongoing conversation. It allows systems to adapt based on prior exchanges and the established context.

  2. State-driven dialogues: In this method, the dialogue system navigates through various predefined states based on user inputs and context. The transitions between these states are determined by rules, ensuring the conversation progresses logically and coherently.

  3. Computational argumentation: Grounded in argumentative principles, this strategy frames conversations as dialogue games. Participants adhere to a set of rules, enabling structured arguments and counterarguments to develop.

  4. Stateless intent interpretation: This approach processes each user input independently, without retaining information from previous messages. At most, an ambiguous new message may be conditioned on the immediately preceding one. It is suitable for scenarios where each query is distinct and does not require information from past messages.

  5. Scripted sequences: This approach is the most straightforward among the ones considered, in the sense that user inputs are matched directly to predefined responses or actions. It is particularly suitable for scenarios with limited variability in interactions, guiding users through a set sequence of options or steps.

Our analysis reveals that the two most prevalent dialogue management approaches, context-based interpretation and stateless intent interpretation, are each employed in three studies. While the latter can deduce incomplete requests only from the most recent user intent, the former utilize longer dialogue histories to interpret user requests. The next most prevalent technique is computational argumentation, referenced not only in implemented systems but also in conceptual studies like those by Chi and Liao (2022) and Shaheen et al. (2020). Our review identified a single instance of a state-driven approach (Arnold et al. 2021), a pioneering system utilizing stateless intent interpretation (Sokol and Flach 2018), and a single example of a scripted sequences method (Gao et al. 2021). While context-based interpretation seems to be the most adaptive technique, it may not always be necessary, and some controlled use cases can benefit from a scripted sequences approach.

With respect to Context-based interpretation, Kuźba and Biecek (2020) developed a dialogue manager using context variables to interpret ambiguous queries, allowing for follow-up questions without re-specifying the context. This enables answering queries like “What if the person had been older” based on previously established context. While this approach is powerful, adapting it to new datasets requires defining new entities and intents. Viros Martin and Selva (2020) integrated FlowQA and Dialogue-to-Action into their systems, enabling the recall of recent conversational contexts. This capability aids in addressing vague follow-up queries. Moreover, Malandri et al. (2023) combined rule-based Dialogue State Tracking with a tailored Dialogue Policy module, facilitating dynamic response decisions based on current context and conversation state.
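To illustrate how such context variables can work, the sketch below keeps a minimal dialogue state so that an elliptical follow-up like “What if the person had been older” can be resolved against the previously discussed instance. This is a simplified Python illustration rather than the implementation of any cited system; names such as `DialogueContext` and `handle_what_if` are our own.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional

@dataclass
class DialogueContext:
    """Minimal conversational state for context-based interpretation."""
    instance: Dict[str, Any] = field(default_factory=dict)  # instance under discussion
    last_feature: Optional[str] = None                       # feature mentioned most recently

def handle_what_if(context: DialogueContext,
                   model_predict: Callable[[Dict[str, Any]], str],
                   feature: Optional[str] = None,
                   new_value: Any = None) -> str:
    """Answer a 'What if' follow-up by editing the remembered instance and re-predicting."""
    feature = feature or context.last_feature          # fall back on the established context
    if feature is None or not context.instance:
        return "Which instance and feature should I change?"  # ask for clarification instead
    hypothetical = dict(context.instance)
    hypothetical[feature] = new_value
    context.last_feature = feature                      # keep state for further follow-ups
    return f"If {feature} were {new_value}, the model would predict '{model_predict(hypothetical)}'."

# Example: a toy model and a follow-up that relies entirely on the stored context.
toy_model = lambda x: "survived" if x["age"] < 18 or x["sex"] == "female" else "did not survive"
ctx = DialogueContext(instance={"age": 22, "sex": "male", "class": 3}, last_feature="age")
print(handle_what_if(ctx, toy_model, new_value=60))  # "What if the person had been older?"
```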

Regarding state-driven dialogues, Arnold et al. (2021) allow the user to introduce their own rules to change the specifications of the robot and ask questions about its behavior in counterfactual scenarios. Here, the intermediate results are stored to allow quick computation of replies to follow-up questions.

A recurring theme is the use of dialogues based on Computational argumentation as a means of explaining AI systems. Computational argumentation is the representation of the reasoning process as an argumentative dialogue or debate, with specified rules and argument structures to evaluate and convey the rationale behind decisions or beliefs (Cyras et al. 2021). Along these lines, Hernandez-Bocanegra and Ziegler (2023) present a dialogue system for hotel recommendations that provides explanations based on an argumentative framework. In robotics, Sklar and Azhar (2018) model the robot’s reasoning as structured arguments and counterarguments. Their approach, built around selectable questions, dynamically updates the robot’s beliefs about both the user and the system throughout the conversation.
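As a rough illustration of this argumentative view, the sketch below represents a tiny explanation exchange as arguments with an attack relation and computes which arguments remain ultimately defended (a grounded-style labelling). The arguments, attacks, and code are a didactic toy, not the formalism used by the cited systems.

```python
# Toy abstract argumentation: arguments and an attack relation; we compute the
# grounded extension (arguments that are ultimately defended) by iterative labelling.
arguments = {
    "A": "The loan was declined because income is below the threshold.",
    "B": "But the applicant has a guarantor, so income alone should not decide.",
    "C": "The guarantor's liability is capped, so it does not offset the low income.",
}
attacks = {("B", "A"), ("C", "B")}  # B attacks A, C attacks B

def grounded_extension(args, attacks):
    attackers = {a: {x for (x, y) in attacks if y == a} for a in args}
    status = {a: None for a in args}  # None = undecided, True = accepted, False = rejected
    changed = True
    while changed:
        changed = False
        for a in args:
            if status[a] is not None:
                continue
            if all(status[b] is False for b in attackers[a]):
                status[a] = True          # every attacker is defeated -> accept
                changed = True
            elif any(status[b] is True for b in attackers[a]):
                status[a] = False         # attacked by an accepted argument -> reject
                changed = True
    return {a for a, s in status.items() if s is True}

print(sorted(grounded_extension(arguments, attacks)))  # ['A', 'C']: C defeats B, which reinstates A
```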

One of the earliest dialogue-based xAI systems in our review was proposed by Sokol and Flach (2018), following a stateless intent interpretation approach. The system uses three explanations to answer distinct user questions about a selected instance. Similarly, Braines et al. (2019) proposed an approach that allows the user to question the current system response with a “Why” question, without using additional context. Lastly, Valmeekam et al. (2021) design an approach where users can challenge the plan suggested by a system via alternative plans or “foils”, and the system crafts an explanation contrasting the plans.

Lastly, the study by Gao et al. (2021) uses a scripted interaction sequence in a controlled setting. The authors designed a concise dialogue consisting of three questions with binary response options followed by an open question to gather feedback.

4.5 Which questions do the examined systems address and how do they answer them?

When examining the questions handled by current systems in xAI, it is valuable to consider both the nature of the questions and the types of answers provided. In our analysis to address Q5, we follow the taxonomy proposed by Liao et al. (2020) (see Sect. 4.5.1 for a description of the questions). Table 4 shows the coverage of question types in each system and Fig. 5 the frequency of the question types across all systems. We note that nearly all studies focus on “Why” questions, aiming to elucidate a model’s prediction. “How (global)” and “Why not” questions are also frequently targeted, while queries related to Input and Output are less common. Notably, no system addresses “How to be that” or “How to still be this” questions. In terms of answer types, particularly among systems utilizing model-agnostic xAI techniques, the most prevalent method is Local Feature Importance, which explains individual predictions based on feature importance. Other important types of explanations include Counterfactual and Contrastive Explanations, with only a single system implementing Global Feature Importance. An emerging trend is observed in more recent systems, such as the one developed by Malandri et al. (2023), towards incorporating a broader range of answer types. Regarding the answer modalities, most systems rely on text templates to express the computed explanation. A minority of systems employ images (Braines et al. 2019; Valmeekam et al. 2021) and one system provides interactive plots and buttons (Kuźba et al. 2019). We discuss the types of questions and answers in more detail in the following subsections.

Table 4 User question categorization following Liao et al. (2020)
Fig. 5

Pie chart showing the frequency of addressed question types from Liao et al. (2020) across the systems

4.5.1 Question types

While multiple studies extend previous collections (see Kuźba and Biecek 2020 and Malandri et al. 2023), we refer to Liao et al. (2020)’s general question taxonomy for a comprehensive first overview, which encompasses:

  • Input: Questions regarding training data

  • Output: Questions regarding system output

  • Performance: Questions regarding model accuracy and errors

  • How (global): Questions regarding the system’s prediction process

  • Why: Questions regarding reasons behind a specific prediction

  • Why Not: Questions regarding the rationale for a prediction’s absence

  • What If: Questions regarding the projected outcome for a changed instance

  • How to be that: Questions regarding key features or minimal-difference examples to achieve a different prediction

  • How to still be this: Questions regarding features or rules that do not change the current prediction

From our analysis in Table 4, a dominant focus on “Why” questions can be observed among the systems, being addressed in 12 systems. Following are the “Why not” and “How (global)” questions, occurring in 4 systems. A subset of systems also delves into “What if” questions. On the lower end, only one system included questions related to the performance category.

Understanding robot decision-making is demonstrated by Sklar and Azhar (2018). The system permits users to inspect a robot’s strategy during collaborative games, allowing for questions that probe the robot’s intentions and alternative actions. Similarly, Arnold et al. (2021) focus on creating safe robots whose actions are transparent to users. These robots are able to provide a rationale (“Why”) for their specifications and behaviors, as well as “What if” scenarios illustrating potential behavior shifts based on altered specifications.

Delving into the intricacies of algorithm decisions, Finzel et al. (2021) present a system that explains classifications using an Inductive Logic Programming (ILP) approach. By employing a multi-level approach, their system caters both to broad, overarching questions, such as the significance of a term, and to local explanations concerning specific examples. Users seeking a deeper understanding can drill down further into each classification component. When the conversation reaches foundational facts, the system offers a visual representation for clarity.

In the healthcare domain, Shaheen et al. (2020) explain machine learning for treatment plan recommendations. In their approach, users can explore the underlying reasons for specific medication advice and further probe alternatives. These “Why” and “Why not” inquiries aid users in understanding the suggestions provided by the system and allow them to drill down on specific information.

Specializing in the accommodation services sector, Hernandez-Bocanegra and Ziegler (2023) provide insights into hotel recommendations derived from reviews. This system equips users with explanations behind specific hotel recommendations via aggregation of reviews and answers direct factual queries about a specific aspect of a hotel.

It is worth noting that although similar questions are addressed in the systems, a range of different methods are employed to answer them. Therefore, in the next sub-section we explore which model-agnostic methods are used to answer specific questions. We omit this analysis for model-specific methods, since these are tailored to their unique contexts and are difficult to compare.

4.5.2 Answer types

While the taxonomy of Liao et al. (2020) offers a general overview of question types, it falls short in connecting these questions to the distinct methods used for answering them. This gap becomes evident when realizing that a broad category, such as “Why” questions, can encompass diverse queries, such as probing a specific classification or drawing comparisons between instances. To bridge this gap, Table 5 contrasts the model-agnostic methods used to address these questions in the various studies and shows which questions they answer. As the table shows, “Why” questions are answered by different systems through local feature importance explanations, counterfactual explanations, or contrastive explanations. A notable trend we observed is the predominance of local feature importance explanations, which were employed in four of the five studies analyzed. This is followed by the application of counterfactual and contrastive explanations in two studies. Global feature importance explanations emerge as the least utilized approach.

Analyzing the spectrum of xAI methods, it is evident that Malandri et al. (2023) showcase a rich repertoire for explaining black-box classifiers. The authors use LIME (Ribeiro et al. 2016) for local explanations, SHAP (Lundberg and Lee 2017) for both local and global explanations, and FoilTree (Waa et al. 2018) for contrastive explanations. For “What If” questions, they adjust the instance to match the hypothetical scenario described in the question and make a new model inference to report the answer.
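To show how a “Why” question can be routed to a local feature importance method and rendered as text, the sketch below trains a simple classifier, uses LIME to extract the features that drove a single prediction, and fills a text template. This is our own minimal illustration (assuming the `lime` and `scikit-learn` packages are installed), not the pipeline of any specific surveyed system.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from lime.lime_tabular import LimeTabularExplainer

# Train a black-box model on a toy dataset.
data = load_breast_cancer()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

# LIME explains one prediction by fitting a local surrogate model around the instance.
explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
instance = data.data[0]
explanation = explainer.explain_instance(instance, model.predict_proba, num_features=3)

# Turn the numeric weights into a templated "Why" answer, as most surveyed systems do.
predicted = data.target_names[model.predict(instance.reshape(1, -1))[0]]
reasons = "; ".join(f"{feat} (weight {weight:+.2f})" for feat, weight in explanation.as_list())
print(f"The model predicted '{predicted}' mainly because of: {reasons}.")
```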

Kuźba and Biecek (2020) also focus on explaining black box ML classifications. They leverage BreakDown Charts (Gosiewska and Biecek 2019) to answer “Why” questions and highlight influential features. Moreover, they use Ceteris Paribus analysis (Kuźba et al. 2019) to answer What-if questions and provide a broader context of the model’s behavior.

In the domain of decision support systems, Valmeekam et al. (2021) employ contrastive or minimally complete explanations (Chakraborti et al. 2017) to explain the suggested plans. They specifically target “Why” and “Why not” questions and allow the user to propose alternative plans to solicit feedback.

Lastly, focusing on multi-AI service architectures, Braines et al. (2019) employ different explanation techniques regarding the global workings of the composite AI system that do not require xAI techniques, and use local feature importance in the form of LIME explanations for image classification explanations.

Table 5 Model-agnostic answer types and the questions they answer

In the process of explaining machine learning models in conversations, most systems leverage text templates to put the computed numerical results into a human-readable textual form. Apart from textual answers, some studies include plots or images to enrich the conversation. The full overview is given in Table 10.

Notably, Kuźba and Biecek (2020) and Finzel et al. (2021) do not rely solely on text but also intersperse their explanations with illustrative aids like plots and visual representations. The former even goes a step further by incorporating interactive buttons that prompt users for subsequent actions, potentially deepening their understanding and guiding them to related topics. In a unique approach, Finzel et al. (2021) craft explanatory trees, illustrating the decision-making of the ILP model. As users ask questions, the tree is traversed to retrieve relevant explanations. Lastly, the system can also show images on the last layer of the tree.

As a concluding observation, we note that none of the systems we identified make use of free or adaptive text (as indicated by the grey box for Free/Adaptive Text in Fig. 3). Numerous studies have pointed to free-form text generation as an important area for future development.

4.6 What approaches are used to evaluate dialogue-based xAI systems?

Dialogical xAI systems have evolved rapidly, with research exploring a wide range of questions from fundamental system capabilities to user interactions and preferences. Our analysis for Q6 shows that the research questions were answered mainly through user studies with exploratory evaluations (Kuźba and Biecek 2020; Hernandez-Bocanegra and Ziegler 2023; Valmeekam et al. 2021) and subjective questionnaires (Hernandez-Bocanegra and Ziegler 2023; Gao et al. 2021; Malandri et al. 2023). Only one study, Sklar and Azhar (2018), employed an objective evaluation task to determine whether providing explanation capabilities aids human-robot decision-making. The results indicated no significant improvement in performance. However, subjective evaluations revealed a positive perception of providing explanations in dialogues (Gao et al. 2021), an increase in perceived trust, transparency, and effectiveness (Hernandez-Bocanegra and Ziegler 2023), and a preference for textual over graphical explanation representations (Malandri et al. 2023). We have summarized the main hypotheses, research questions and evaluation strategies of the systems examined in Table 6.

Table 6 Overview of studies that conducted an evaluation of their dialogue system

The study by Sokol and Flach (2018) focuses on whether providing counterfactual explanations outperforms an inherently interpretable model in terms of explanation quality and informativeness. They planned to conduct the study at a conference, yet the evaluation is not included in the paper we reviewed.

At a similar foundational level, Gao et al. (2021) focus on understanding whether chatbots can be genuinely helpful in aiding users’ understanding of AI systems. While their findings lean towards a positive impact, the feedback from the domain experts they interviewed indicated a longing for more nuanced and flexible dialogue options beyond the binary decision of accepting or rejecting explanation suggestions.

Several studies have produced insightful findings via the exploration of user behavior in relation to system functionality. Kuźba and Biecek (2020) analyzed 621 dialogues, identifying prevalent questions users had about ML model decisions. Meanwhile, Hernandez-Bocanegra and Ziegler (2023) found that users were often more intrigued by domain-specific details than by the complexities of their recommender system. On the topic of system functionality and user preferences, Malandri et al. (2023) examine responses to different explanatory methods. Their results showcase a clear user preference toward textual explanations over graphical ones, and they also introduced a novel clarification dialogue that was well-received across the board.

Delving into decision-making scenarios in the logistics domain, Valmeekam et al. (2021) employ the RADAR-X framework to explore users’ motivation to seek contrastive explanations and their behavior when working with alternative plans (foils). Participants undertook tasks such as plan selection and system suggestion evaluation. During the interactions, they could ask for explanations, either about why the system presented its plan or about why their chosen plan was not the top suggestion. Out of 35 participants, 32 actively sought explanations, with a notable 62% specifically requesting contrastive explanations. This outcome confirmed their hypothesis, further supported by findings highlighting a user preference for explanations presented in segmented formats.

Lastly, Sklar and Azhar (2018) investigated the question of whether providing answers to “Why” queries within a dialogue game could enhance user performance. Although their results did not show a marked performance improvement, it was noteworthy that less than a third of participants felt the need to request explanations.

These varied evaluations underscore the richness of questions and experiments within dialogical xAI systems. As the field evolves, understanding useful and correct evaluation procedures remains paramount to testing and comparing different approaches.

4.7 What theoretical frameworks are dialogue-based xAI systems based on?

Our analysis, addressing Q7, reveals that many of the examined studies frame explanatory interactions in terms of the theory of dialogue games (Walton 2016). Understanding communication as a dialogue game involves seeing dialogues as a series of turns taken by two or more players, where each turn consists of moves known as locutions, speech acts, or utterances. Furthermore, a dialogue game is split into different stages and characterized by a specific goal and a related success criterion. Building on this idea, a dialogue protocol is a pre-established set of rules that the players need to follow. The framework in Walton (2016) for explanation dialogues defines their primary goal as facilitating the transfer of understanding between individuals, and the success of a dialogue is gauged by this transfer. The dialogue game is divided into three stages: the opening stage (establishing an agreement to engage and adhere to dialogue rules), the explanation stage (initiating with a request for an explanation and a subsequent response, with potential for further queries and answers), and the closing stage (assessing the success of the explanation).

Walton differentiates between three types of dialogues: explanation, argumentation, and clarification. Each type has unique objectives: argumentation seeks to prove a disputed claim, explanation aims to resolve a lack of understanding about an event, and clarification focuses on resolving a misunderstanding of a message. This distinction is crucial for transitioning effectively between dialogue types, especially in the closing stage, where the success of the explanation is evaluated. Moreover, Walton highlights the importance of not only measuring the success of a dialogue in terms of transfer of understanding but also considering the accuracy of the explanation. This consideration led to the introduction of a truth condition and a methodology for evaluating competing explanations, thus shifting towards an examination dialogue that focuses on the correctness of the explanation.

Building on Walton (2016)'s dialogue framework, several protocols have been developed for AI-system explanations. Madumal et al. (2019) introduce a protocol for explanation dialogues that incorporates elements of argumentation, grounded in empirical studies of real dialogues. Their approach is formalized through an Agent Dialogue Framework (ADF), and it differs from Walton's in that it does not specify a precise question type to initiate the dialogue; any question may thus start the explanation process. These differences aside, both Madumal et al. (2019) and Walton (2016) propose general protocols for explanation dialogues that can be applied to different contexts.
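To make the structure of such a dialogue game concrete, the following is a minimal sketch of a three-stage explanation protocol in the spirit of Walton (2016): an opening stage, an explanation stage with request, response, and follow-up moves, and a closing stage in which success is assessed. The stage names and move labels are our simplification, not a faithful encoding of either Walton's or Madumal et al.'s protocol.

```python
from enum import Enum, auto

class Stage(Enum):
    OPENING = auto()      # agree to engage and to follow the dialogue rules
    EXPLANATION = auto()  # explanation requests, responses, and follow-up questions
    CLOSING = auto()      # assess whether understanding was transferred

class ExplanationDialogue:
    """Toy dialogue game enforcing which locutions are legal in each stage."""
    LEGAL_MOVES = {
        Stage.OPENING: {"agree"},
        Stage.EXPLANATION: {"request_explanation", "provide_explanation",
                            "follow_up", "signal_understanding"},
        Stage.CLOSING: {"confirm_success", "reject"},
    }

    def __init__(self):
        self.stage = Stage.OPENING
        self.transcript = []

    def move(self, speaker: str, locution: str, content: str = "") -> None:
        if locution not in self.LEGAL_MOVES[self.stage]:
            raise ValueError(f"'{locution}' is not a legal move in stage {self.stage.name}")
        self.transcript.append((speaker, locution, content))
        # Stage transitions: agreeing opens the explanation stage; signalling
        # understanding moves the game to the closing (success-assessment) stage.
        if locution == "agree":
            self.stage = Stage.EXPLANATION
        elif locution == "signal_understanding":
            self.stage = Stage.CLOSING

d = ExplanationDialogue()
d.move("explainee", "agree")
d.move("explainee", "request_explanation", "Why was my loan application declined?")
d.move("explainer", "provide_explanation", "The debt-to-income ratio exceeded the threshold.")
d.move("explainee", "signal_understanding")
d.move("explainee", "confirm_success")
print(d.stage.name, len(d.transcript))  # CLOSING 5
```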

Understanding explanations about AI systems as a natural communication between humans and machines raises a key theoretical challenge: defining what exactly constitutes a dialogue in this context. We found five studies that either propose a theoretical framework for dialogue-based xAI or, stemming from such a framework, propose a protocol for explanation dialogues. In this section, we survey these theoretical frameworks in order to obtain suggestions for xAI dialogue models. Table 7 presents a basic overview of those papers. Table 8 outlines the dialogue types of the frameworks that present a protocol, while Table 9 categorizes the suggested properties identified in the framework studies.

Table 7 General overview
Table 8 Dialogue protocols
Table 9 Suggested properties in dialogue-based xAI systems

Following the theoretical framework of dialogue games, we can characterize the systems along the following three dimensions: (i) supported dialogue types, (ii) ability to adapt to the user, and (iii) explicitness of the dialogue goal. We discuss the systems along these lines in more detail below.

Supported dialogue types. Some systems, such as those proposed by Hernandez-Bocanegra and Ziegler (2023) and Malandri et al. (2023), incorporate mechanisms to clarify ambiguous user requests. However, while Madumal et al. (2019) advocate for systems that can detect false premises in questions, we found no system with this feature. Allowing users to challenge or inquire further about provided explanations makes the dialogues more human-like. For example, Braines et al. (2019) propose a system where users can question and follow up on an explanation. Furthermore, both Sklar and Azhar (2018) and Hernandez-Bocanegra and Ziegler (2023) advocate for argumentation-based dialogues, fostering dynamic exchanges where users can question the system's suggestions. Inspired by the protocol of Madumal et al. (2019), Malandri et al. (2023) incorporate the possibility of clarification dialogues into their dialogue system, allowing users to question the explanation provided.

“Drill-down dialogues” empower users to probe deeper into explanations by navigating between different levels of detail. Such systems, showcased in studies by Finzel et al. (2021), Viros Martin and Selva (2020), and Shaheen et al. (2020), facilitate the exploration of explanations from high-level overviews to intricate specifics. While each study approaches the drill-down technique with its unique nuances, their core methodology remains similar. For instance, Finzel et al. (2021) employ an “explanatory tree” to guide users through interconnected drill-down explanations, starting with a top-level answer. Similarly, Viros Martin and Selva (2020) provide users with an initial broad answer, such as an overview of costs, but offer paths to focus on specifics, such as the costs of individual program elements. Following this trend, Chi and Liao (2022) introduce users to a first explanation and subsequently suggest avenues for more in-depth exploration. In the same vein, the system proposed by Shaheen et al. (2020) begins with a high-level response about medication predictions and then allows users to iteratively probe deeper with “Why” questions. Despite their individual differences, all these systems fundamentally emphasize empowering users to delve deeper into explanations on demand.
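A minimal sketch of the drill-down idea, loosely inspired by the explanatory trees of Finzel et al. (2021): each node holds an explanation at one level of detail, and a “why” follow-up descends to the node's children. The data structure and example content are illustrative only.

```python
from dataclasses import dataclass, field

@dataclass
class ExplanationNode:
    """One level of an explanation; children hold finer-grained justifications."""
    text: str
    children: list["ExplanationNode"] = field(default_factory=list)

root = ExplanationNode(
    "The animal was classified as a mammal.",
    [
        ExplanationNode("It is warm-blooded.",
                        [ExplanationNode("Its body temperature is regulated internally.")]),
        ExplanationNode("It nurses its young."),
    ],
)

def drill_down(node: ExplanationNode, path: list) -> str:
    """Follow a sequence of 'why' selections from the top-level answer downwards."""
    for index in path:
        if not node.children:
            return node.text + " (no finer-grained explanation available)"
        node = node.children[index]
    return node.text

print(drill_down(root, []))       # top-level answer
print(drill_down(root, [0]))      # why? -> first supporting reason
print(drill_down(root, [0, 0]))   # why? -> underlying fact
```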

Ability to adapt to the user. Some systems need an initial phase where users provide contextual information, enhancing the subsequent interaction quality. For example, the treatment plan suggestion system (Shaheen et al. 2020) requires users to answer specific health-related questions before engaging in the main dialogue. Similarly, in the hotel recommendation system (Hernandez-Bocanegra and Ziegler 2023), users need to pick their favorite aspects of a hotel to prime the system to prioritize those in the conversation. Lastly, systems like those of Malandri et al. (2023) and Braines et al. (2019) are designed to accommodate various user categories rather than focusing on individual users. These systems can differentiate between information accessibility permissions and levels of understanding.

Explicitness of dialogue goal. Following Walton (2016), dialogues usually end once their objectives are met. In the studies we examined, these objectives often align with the broader goals of the study, even if they are not always explicitly mentioned. The few studies that state a dialogue objective do not differentiate it from the main objective of the study. For example, Shaheen et al. (2020) define the dialogue objective as justifying a node's role in the SMT solver's path, while the overarching aim is to enhance user trust and understanding. Similarly, while Malandri et al. (2023) focus on answering user questions in their dialogues, their main goal is to clarify the ML model for users. In a collaborative scenario involving a human and a robot, Sklar and Azhar (2018) center their dialogue on jointly planning the next steps. Notably, the end of these dialogues is not necessarily marked by user comprehension, but often by user disengagement.

Overall, while different systems implement single components of the suggested properties, there are a few properties that are poorly addressed in our analyzed studies. No system implements an explainer that can question the explainee’s request. Apart from asking for clarification when the user provides insufficient information, the systems do not show an ability to identify false assumptions and clarify or question those. Furthermore, there is rarely a stated goal of the dialogue and there are no functionalities to check whether such a goal is reached.

5 Meta-architecture

In examining various dialogue-based xAI systems, we identified a recurring architectural pattern with several key components, each playing a crucial role in communication and functionality. Figure 6 presents an overview of this architecture, to which the majority of the analyzed systems at least partly conform. Figure 7 shows a tree map, visualizing the most common techniques for each component in the architecture. The following list summarizes the different components, which are discussed in more depth in the indicated sub-sections.

  1. Communication/User Interface: This is the primary touchpoint for the user, allowing them to pose questions either through natural language or by using UI elements. Post-computation, this component also converts the retrieved answer into a format that is easily comprehensible to the user. (See Sect. 5.1)

  2. Input Understanding: Here, the user's query undergoes translation into a machine-readable format. This translation can be facilitated by rule-based systems or through the application of machine learning techniques. (See Sect. 5.2)

  3. Dialogue Management: This component manages the interaction dynamics. Depending on its sophistication, the dialogue manager might follow fixed sequences, apply conditional rules, or even maintain a track record of user-system interactions to determine the most suitable subsequent action. (See Sect. 4.4)

  4. Answer Retrieval: Upon receiving an action from the dialogue manager, the system computes the response. The methodology deployed varies with the nature of the query. For instance, questions about the dataset might invoke an exploratory data analysis (EDA) module, while inquiries about the ML model's decisions would require the expertise of an xAI module. Questions about model training, error metrics, or broader ones like the general workings of the model might require the system to access stored information. (See Sect. 4.5)

In the following, each of these components is analyzed in detail.
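To make the information flow explicit, the sketch below renders this meta-architecture as minimal Python interfaces: a user utterance passes through input understanding and dialogue management to answer retrieval and back to the user interface. The component names and method signatures are our own abstraction of the recurring pattern, not taken from any single surveyed system.

```python
from dataclasses import dataclass
from typing import Protocol, List

@dataclass
class ParsedInput:
    intent: str    # e.g. "why", "what_if", "performance"
    entities: dict  # e.g. {"feature": "age", "value": 60}

class InputUnderstanding(Protocol):
    def parse(self, utterance: str) -> ParsedInput: ...

class DialogueManager(Protocol):
    def next_action(self, parsed: ParsedInput, history: List[ParsedInput]) -> str: ...

class AnswerRetrieval(Protocol):
    def answer(self, action: str, parsed: ParsedInput) -> str: ...

class CommunicationUI(Protocol):
    def render(self, answer: str) -> None: ...

def handle_turn(utterance: str,
                nlu: InputUnderstanding,
                dm: DialogueManager,
                retrieval: AnswerRetrieval,
                ui: CommunicationUI,
                history: List[ParsedInput]) -> None:
    """One dialogue turn through the four meta-architecture components."""
    parsed = nlu.parse(utterance)              # Input Understanding
    action = dm.next_action(parsed, history)   # Dialogue Management
    answer = retrieval.answer(action, parsed)  # Answer Retrieval (e.g. an xAI or EDA module)
    ui.render(answer)                          # Communication / User Interface
    history.append(parsed)
```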

Fig. 6

Meta-architecture of systems. Grey boxes such as “Communication/UI” represent components of the system. The arrows indicate information flow with the information type in rounded boxes above them

Table 10 Overview of studies that suggest a system architecture ordered by publication year. Studies that did not provide information on the system architecture are omitted

5.1 Communication and user interface

Systems in the realm of conversational xAI predominantly use custom-developed chat windows, with a minority using existing chat frameworks. Apart from text-based interactions, some systems employ other UI functionalities, such as interactions through buttons (e.g. Gao et al. 2021).

Leveraging existing tools and platforms, Sokol and Flach (2018) use Google's DIY AI Voice Kit to receive the user question via text or speech. Similarly, the study by Kuźba and Biecek (2020) harnesses the capabilities of Dialogflow, a versatile tool known for rendering rich content such as buttons and charts. Dialogflow also facilitates voice-based conversations and allows multimodal input. Furthermore, the Watson Assistant Service can be used as a UI interface, as demonstrated by Gao et al. (2021). Watson's compatibility and ease of integration with platforms such as Slack allow for quick chatbot development. In this system, users can accept or dismiss suggested explanations via buttons and use text inputs to provide feedback on further questions.

Rather than using existing solutions, Sklar and Azhar (2018), Viros Martin and Selva (2020), Malandri et al. (2023), and Hernandez-Bocanegra and Ziegler (2023) take a different approach by crafting custom user interfaces with chat windows, avoiding dependencies on and registrations with external services.

Apart from a standalone dialogue interface, some studies integrate a chat window into existing applications to allow users to interact and ask questions. The research by Arnold et al. (2021) integrates the dialogue system within the natural language pipeline of the robotic architecture DIARC (Distributed Integrated Affect, Reflection, Cognition) (Scheutz et al. 2007), illustrating the fusion of dialogue functionalities with robotics. Similarly, the study by Valmeekam et al. (2021) extends the functionalities of the existing RADAR decision-making tool by introducing a novel plan-editing component that makes it possible to propose a new plan through either speech or UI components. This inclusion offers users the possibility to propose alternative plans and to seek deeper explanations within a familiar environment. Lastly, Sklar and Azhar (2018) use a custom user interface where the chat window is part of the decision-making tool ArgHRI (Azhar and Sklar 2016).

5.2 User input understanding

Accurately interpreting a user’s question is a crucial yet challenging task. It is commonly addressed with one of three approaches: Natural Language Understanding (NLU) methods, realized via external frameworks or self-trained models; Controlled Natural Language (CNL); or structured query processing and direct interpretation of user interface (UI) elements. The majority of systems use NLU to interpret user intent, with many relying on external frameworks instead of developing their own models (Sokol and Flach 2018; Kuźba and Biecek 2020; Hernandez-Bocanegra and Ziegler 2023; Malandri et al. 2023). Though less common, structured query processing and UI-based interaction show potential in certain use cases. Among the analyzed systems, only one used an approach based on Controlled Natural Language (Braines et al. 2019). An overview is given in Table 10.

Many systems employ external frameworks to leverage Natural Language Understanding (NLU) for interpreting user input. By using pretrained NLU components, there is no need for training or dataset collection; it suffices to define intents and subsequent actions, thereby saving both time and manual effort. For instance, the system by Sokol and Flach (2018) uses Google’s DIY AI Voice Kit to interpret the user question and transform it into a counterfactual statement that can be queried. The RASA framework, utilized in the dialogue system by Malandri et al. (2023), is an open-source platform for building conversational AI; its NLU model recognizes entities and categorizes user intents through a combination of conditional random fields and transformers.

Fig. 7 Tree map showing the distribution of the implemented system component types

A group of studies has focused on training their own NLU models for interpreting user input. Creating a suitable training dataset is challenging and often requires manual labor, which might include user studies to gather relevant intents and assign them to suitable actions such as xAI methods.

The approach by Kuźba and Biecek (2020) integrates an NLU component that combines machine learning techniques with rule-based taggers to process user queries and extract entities. They trained the intent classifier using 40 intents and 874 training sentences.
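To illustrate what such a self-trained NLU component might look like, the following sketch trains a small intent classifier and combines it with a simple rule-based tagger for entity extraction. It is a minimal illustration with an invented intent inventory and vocabulary, not the authors’ actual implementation.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny invented training set; real systems use far more data (e.g. 874 sentences for 40 intents).
sentences = [
    "why did the model predict this", "explain this decision",
    "what if the income were higher", "change income and predict again",
    "how important is each feature", "which features matter most",
]
intents = ["why", "why", "what_if", "what_if", "feature_importance", "feature_importance"]

# ML part: TF-IDF features with a linear classifier for intent recognition.
intent_clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
intent_clf.fit(sentences, intents)

# Rule-based part: extract feature names mentioned in the query via a fixed vocabulary.
KNOWN_FEATURES = {"income", "age", "credit history"}
def extract_entities(query: str) -> list[str]:
    return [f for f in KNOWN_FEATURES if re.search(rf"\b{f}\b", query.lower())]

query = "what if my income were twice as high?"
print(intent_clf.predict([query])[0], extract_entities(query))
```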

Similarly, Dialog-To-Action (Guo et al. 2018) and FlowQA (Huang et al. 2019), as employed by Viros Martin and Selva (2020), represent state-of-the-art dialogue context approaches. The former is tailored for task-based dialogues and the latter for clarification questions, and both are capable of retaining recent information as conversational context. While the authors do not detail the training of their NLU component, a suitable Question Answering (QA) dataset is implied to be necessary, given that both underlying approaches were developed on QA datasets.

Similarly, Hernandez-Bocanegra and Ziegler (2023) designed a system that uses the BERT model (Devlin et al. 2019) for intent classification and the NLTK toolkit (Wagner 2010) for slot filling, two crucial NLP tasks in dialogue systems. While intent classification categorizes a user’s needs from a query, slot filling identifies specific entities, like hotel names, within that query. The authors collected 1,806 user-generated questions about recommendations to train their system.
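The following sketch illustrates the two tasks with off-the-shelf Hugging Face pipelines; the query, candidate intents, and models are stand-ins chosen for illustration and are not those fine-tuned in the cited study.

```python
from transformers import pipeline

# Hypothetical hotel-recommendation query; models are public defaults, not the study's own.
query = "Why did you recommend the Grand Plaza Hotel in Berlin?"

# Intent classification: a zero-shot classifier stands in for a fine-tuned BERT intent model.
intent_clf = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
candidate_intents = ["why recommendation", "compare hotels", "show alternatives"]
print(intent_clf(query, candidate_labels=candidate_intents)["labels"][0])

# Slot filling: an off-the-shelf NER pipeline extracts entities such as the hotel name and city.
ner = pipeline("ner", aggregation_strategy="simple")
print([(e["word"], e["entity_group"]) for e in ner(query)])
```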

Instead of taking an NLU approach, Braines et al. (2019) utilize a Controlled Natural Language (CNL) for interpreting user intent. The CNL has a restricted grammar and vocabulary that makes it clear and simple enough to be easily processed by machines while also remaining human-readable. While this method bypasses the complexity of developing NLU models, it necessitates the crafting of specific conceptual models or ontologies within the CNL framework to suit new domains.
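As a rough illustration of how a controlled natural language can be machine-processed without an NLU model, the following toy grammar (our own invention, unrelated to the CNL used by Braines et al.) matches restricted sentence templates against the user input and maps them to machine-readable queries.

```python
import re

# Toy CNL grammar: each template maps a restricted sentence form to a machine-readable query.
# The templates and vocabulary are invented for illustration only.
TEMPLATES = [
    (re.compile(r"^why is the (?P<item>[a-z ]+) classified as (?P<label>[a-z ]+)\?$"),
     lambda m: ("explain_classification", m["item"], m["label"])),
    (re.compile(r"^what is known about the (?P<concept>[a-z ]+)\?$"),
     lambda m: ("describe_concept", m["concept"])),
]

def interpret(cnl_sentence: str):
    """Parse a sentence of the controlled language or reject it as out of grammar."""
    s = cnl_sentence.strip().lower()
    for pattern, build in TEMPLATES:
        match = pattern.match(s)
        if match:
            return build(match)
    raise ValueError("Sentence is outside the controlled grammar")

print(interpret("Why is the image classified as a stop sign?"))
```

Extending such a grammar to a new domain requires crafting new templates and vocabulary, which mirrors the need, noted above, to model new domains within the CNL framework.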

Moving from the nuances of language interpretation to the specifics of query structuring, the focus shifts to structured query processing techniques. In this domain, Arnold et al. (2021) use a TLDL parser to convert user queries into predicates, which are then translated into a Violation Enumeration Language (VEL) format. Similarly, Finzel et al. (2021) base their approach on the Prolog environment (Sterling and Shapiro 1986), demonstrating another facet of structured query handling.

Shifting the lens from language and query processing to user interface design, conversational AI also encompasses systems that favor structured UI approaches. For instance, Sklar and Azhar (2018) utilize an interface where users select questions via buttons, and Gao et al. (2021) also guide users predominantly through button-based interactions. In the same vein, Valmeekam et al. (2021) emphasize UI components for the primary mode of user interaction. These examples illustrate a trend towards simplifying user input methods, moving away from complex text-based query interpretation to more straightforward, interactive UI designs.

6 Critical discussion

In this section, we first summarize our key findings before moving to a more critical discussion of the limitations of existing approaches. Our findings can be summarized as follows:

  • We identify five dimensions to compare dialogue-based xAI approaches: (i) use cases and target audience; (ii) objectives, (iii) dialogue capabilities; (iv) questions addressed and answers provided; (v) evaluation procedure.

  • Our answers to Q1–Q3 are presented in Table 3. The majority of the studies focus on model users, while only a handful of studies develop systems for affected users. The applications are diverse, encompassing fields such as robotics, manufacturing, computer vision, IT operations, emergency response, machine learning, fake news detection, and recommendation systems.

  • To answer Q3, we used a list of common xAI objectives by Arrieta et al. (2020) as key terms to search for in the studies and identified Interactivity and Trustworthiness as the top stated objectives, with Transferability, Accessibility and Privacy Awareness not being mentioned at all.

  • To answer Q4, we identified five possible dialogue management strategies: context-based interpretation, state-driven dialogues, computational argumentation, stateless intent interpretation, and scripted sequences. Most of the studies employ either context-based interpretation, where the whole conversation history serves as context for further messages, or stateless intent interpretation, where only the current prediction or explanation is in focus.

  • Analyzing studies that describe the questions and answers, we answer Q5 with a categorization based on the taxonomy introduced by Liao et al. (2020) and provide an overview in Table 4. We observe a prominent focus on the “Why” question type, followed by “Why not” and “How (global)”, and only a small subset addressing the “What if” question type. The least addressed questions were related to the Input, Output, and Performance of the ML model.

  • Looking at the studies that performed an evaluation of their system, we answer Q6 by providing an overview of the research question, evaluation procedure, sample size, and results in Table 6. We observe that the research questions were mainly addressed with qualitative or exploratory results, with only one study performing an objective evaluation.

  • As our analysis for Q7, we identified key theoretical backgrounds for explanatory dialogues, primarily dialogue games such as that of Walton (2016), its extension by Madumal et al. (2019), and a use case-specific protocol in Shaheen et al. (2020). From this analysis, we highlighted three critical areas: dialogue dynamics and user interaction, response adaptability and complexity management, and goal definition and evaluation. Section 4.7 details the extent to which these theoretical properties are present in current systems. While some systems exhibit drill-down and argumentative dialogues, other properties are lacking and could guide future research.

  • Based on an analysis of the systems that propose a specific architecture and describe the components, we provide a meta-architecture presented in Fig. 6. Most architectures can be mapped to this meta-architecture and the main components are divided into user communication, input understanding, dialogue management, and answer retrieval.

The studies analyzed represent significant progress towards building general-purpose systems to analyze dialogue and user questions in xAI, as well as targeted systems for specific domain users. They investigate specific research questions related to the usefulness of dialogues and the questions users ask, comparing xAI approaches and dialogue-specific capabilities such as clarification dialogues.

Nevertheless, the current state of dialogue-based explanatory systems reveals several shortcomings and opens up multiple avenues for future work. Firstly, while many studies identify Interactivity and Trustworthiness as important objectives, the systems developed often do not explicitly identify any intended goal of the dialogue, which makes it difficult to determine when a dialogue is successful. Moreover, Accessibility was not in focus in any of the studies we identified, even though it is among the goals for the target group of model developers (Arrieta et al. 2020). Furthermore, the target groups of regulators, managers, and model developers are not adequately addressed, being mentioned only in a single study that tried to cater to many target users in theory but did not implement a working prototype (Braines et al. 2019).

The answers provided by recent work are largely based on a template-filling approach, which limits how well an answer can be adapted to the specific user question and detracts from the perception of a natural dialogue. While it is often stated as a goal for future work that natural language generation techniques could be used to align the answer with the specific user question, it is an open question whether the impact of doing so would be meaningful. Almost every system targets a “Why” question to explain a system’s decision, which is in line with the prominent objectives Trustworthiness and Informativeness. We observe a significant gap: systems are unable to address questions regarding the data distribution, as well as “How is it the case that” and “How is it still the case” questions; there is potential here for future work. Apart from local and global model explanations, other scopes of analysis such as regional explanations (Dandl et al. 2023) could be used to answer a broader set of questions regarding subgroups. Another limitation is that the analyzed systems cannot question the user’s request or detect incorrect assumptions, missing an opportunity for a more adaptive and accurate dialogue.

Research on personalization in dialogue-based xAI is still in its infancy, with the focus being on differentiating between stakeholder groups, while some use cases might need a dialogue tailored to the needs of a specific individual. Finally, system evaluations often rely on subjective and exploratory methods to address their research questions. However, there is evidence suggesting a mismatch between subjective understanding and objective comprehension (Buçinca et al. 2020). This disparity underscores the need for objective measures in evaluations, similar to those used in non-dialogical xAI studies, to enable more accurate comparisons between different explanatory approaches. For example, Sklar and Azhar (2018) showed that in a human-robot decision-making task, performance was not improved by the provided explanations, a finding that only became apparent through the use of objective measures.

6.1 Recommendations for future work

We identified Interactivity and Trustworthiness as the primary objectives explored in current research. Future work should broaden the scope by considering additional objectives, such as Accessibility. Furthermore, existing research has primarily focused on model users and affected users. Future studies should also address a more comprehensive range of target groups, including regulators, managers and developers.

We have noted that research on personalization in the area of dialogue-based xAI is still at an early stage. This suggests that future work should further consider how explanations can be personalized or “localized” to the preferences of particular geographical segments, groups or regions.

Beyond these open directions that directly follow from our analysis, we would like to point towards further areas for future work.

Above all, new dialogue management approaches are needed so that the process of explanation is guided by some objective related to the level of understanding of the user. This means that future research will need to investigate different objectives to optimize in a dialogue-based xAI setting. One obvious objective is to maximize the user’s understanding, which in turn would require the development of metrics by which the level of understanding can be measured or approximated in real-time during the interaction. This seems key to yielding systems that can adapt the explanation to the (evolving) level of the user’s understanding. This touches on the general question of what a dialogue management approach suited to such explanatory contexts might look like. Most likely, such dialogue management systems will have to rely on different principles than current state-of-the-art systems that retrieve information to fill a limited number of pre-defined slots. An important related question is how to define the speech acts or dialogue moves that a dialogue system could use to effectively move the explanatory dialogue further towards achieving the overall objectives.
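To make this direction more tangible, the following purely speculative sketch shows a dialogue manager that greedily selects the next dialogue move based on an estimated gain in user understanding. The candidate moves and hard-coded gain estimates are placeholders that future research would need to ground in real-time measurements of understanding.

```python
from dataclasses import dataclass, field

@dataclass
class DialogueState:
    understanding: float = 0.2          # estimated level of user understanding in [0, 1]
    moves_taken: list = field(default_factory=list)

# Candidate dialogue moves with a hypothetical predicted gain in understanding.
# In a real system these gains would be estimated from user signals, not hard-coded.
CANDIDATE_MOVES = {
    "give_simple_example":     lambda s: 0.25 if s.understanding < 0.5 else 0.05,
    "show_feature_importance": lambda s: 0.15,
    "drill_down_on_feature":   lambda s: 0.20 if "show_feature_importance" in s.moves_taken else 0.02,
    "ask_clarifying_question": lambda s: 0.10 if s.understanding < 0.3 else 0.01,
}

def next_move(state: DialogueState) -> str:
    """Greedy policy: pick the move with the highest predicted understanding gain."""
    return max(CANDIDATE_MOVES, key=lambda m: CANDIDATE_MOVES[m](state))

state = DialogueState()
for _ in range(3):
    move = next_move(state)
    gain = CANDIDATE_MOVES[move](state)
    state.moves_taken.append(move)
    state.understanding = min(1.0, state.understanding + gain)
    print(move, round(state.understanding, 2))
```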

An important point for future work will also be to develop approaches by which misalignments, misunderstandings, biases and incorrect assumptions made by a user can be identified and resolved. This would facilitate a more informed and accurate understanding of the system’s explanations. Most importantly, further research is needed on how to interpret user signals during the interaction in relation to the evolving explanation. Interpreting users’ speech, gestures, facial expressions, signals of doubt etc. is key to understanding whether the current explanation satisfies their needs and provides a basis for adapting appropriately. An open question here would also be how to implement a reasoning component that can deduce information needs from what the user says, and that can recognize knowledge gaps effectively. This will let the system decide how an explanation is to be adapted to meet the user’s needs and close knowledge gaps.

A crucial question is how to facilitate an explanatory interaction in real-time in which explanations are recomputed or adapted in reaction to users’ signalling of understanding, doubts, uncertainty, etc. A balance needs to be found between quickly providing pre-computed answers for certain standard information needs and computing highly adapted and contextualized, but more time-intensive answers. While the first approach might be fast, it might lack contextual adequacy and suitability. The second approach might be slower but yield highly relevant and contextually adequate explanations. Finding a balance between both approaches to ensure a real-time interaction seems an important avenue for future work.

Explanations might be provided in different (complementary) modalities including speech, text, diagrams, etc. Further research should investigate strategies for determining which type of explanation best fits a given information need. Systems should have the ability to select the most appropriate explanation type based on the specific user and use case during the interaction. Moreover, although the categorization into static, interactive, and conversational appears to encompass most current xAI systems, it also raises the question of establishing a well-defined taxonomy.

Finally, an important goal is to investigate and understand the role that new technologies and architectures can play in the design of systems that go beyond template-based approaches towards the realization of more flexible, adaptive and context-aware interactions. We anticipate that large language models could play an important role here, but many questions would need to be answered, such as how the context and model can be encoded for the LLM, and which prompting strategies are needed for generating contextually relevant, informative, consistent, coherent and valid explanations.
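As one possible starting point, the following sketch assembles an LLM prompt from model metadata, the output of an xAI method, and the conversation history. All field names and the prompt format are our own assumptions, and whether such prompts yield faithful and consistent explanations remains an open research question.

```python
def build_explanation_prompt(model_card: dict, xai_output: dict,
                             history: list[tuple[str, str]], question: str) -> str:
    """Assemble an LLM prompt from model metadata, xAI results, and the dialogue so far.
    All field names here are assumptions for illustration, not a proposed standard."""
    context = [
        f"Model: {model_card['task']} ({model_card['type']}), trained on {model_card['data']}.",
        "Feature attributions for the current prediction:",
        *[f"- {feat}: {score:+.2f}" for feat, score in xai_output["attributions"].items()],
        "Conversation so far:",
        *[f"{speaker}: {utterance}" for speaker, utterance in history],
    ]
    instructions = ("Answer the user's question using only the information above. "
                    "If the information is insufficient, say so instead of guessing.")
    return "\n".join(context + [instructions, f"User: {question}", "Assistant:"])

prompt = build_explanation_prompt(
    model_card={"task": "credit scoring", "type": "gradient-boosted trees", "data": "loan applications"},
    xai_output={"attributions": {"income": -0.42, "credit_history": 0.31}},
    history=[("User", "Why was my application rejected?"),
             ("Assistant", "The low income had the largest negative influence.")],
    question="What would I need to change to get approved?",
)
print(prompt)
```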

Beyond engineering questions, there are questions related to the interfaces to other disciplines, including, but not limited to sociology and psychology. An open question is to what extent a dialogue-based approach fosters a more participatory approach and thus higher acceptance levels of generated explanations. A further question is whether it can be demonstrated that dialogue-based explanations are in fact a good approach for empowering users to contest or challenge systems’ decisions by providing sufficient relevant information; this is an important part of the rationale for xAI in general.

While we have mentioned only a small set of potential directions and avenues for future work, we believe that there are many more that could be explored towards realizing a full-fledged dialogue-based approach to xAI. In any case, the questions we have identified clearly underscore the interdisciplinary nature of conversational xAI research, encompassing technical, cognitive, and user experience aspects, and highlight the possibilities for innovative approaches to address these challenges. It is thus likely that questions such as the above should be addressed by interdisciplinary teams.

7 Conclusion

Dialogue-based xAI systems are increasingly recognized as valuable tools in the AI domain, primarily for their role in enhancing understanding between complex AI models and end-users. In this literature review, we have taken a systematic look at these systems, discussing their architecture, functionality, methodologies, as well as evaluation techniques and capabilities.

In particular, we have provided five key dimensions along which dialogue-based xAI systems can be compared and assessed: use cases and target audience, dialogue capabilities, questions addressed and answers provided, evaluation procedure, and the adherence to properties derived from theoretical work on dialogue-based explanation systems. Following this five-dimensional classification, we have provided a systematic overview of use cases, objectives, targeted audiences and main types of questions addressed by existing dialogue-based xAI systems. Most importantly, we have identified theoretical frameworks that can be used to define the type of dialogue carried out by the system, and we have presented a general meta-level architecture and core components for the implementation of dialogue-based xAI systems. Finally, we have highlighted shortcomings of current systems that can be addressed in future work.

Our review focused on publications from the ACM Digital Library, Scopus, and the IEEE libraries. While this selection encompasses a considerable range of high-quality databases, it carries the limitation of possibly overlooking pertinent information from non-peer-reviewed sources and other databases. Further studies can build upon ours by including more sources and extending the framework, as well as documenting the shift brought about by the emergence of foundation models.

In our literature filtering process, we omitted studies not centered on explaining AI systems since we wanted to capture the current efforts of the xAI community, which may have led us to overlook Human-Computer Interaction research with potentially valuable properties.

With the analysis and discussion provided in this systematic review, developers of future xAI systems can make informed decisions about which areas to emphasize based on their system goals. Additionally, by pointing out the commonly addressed questions and those that are less frequently tackled, this review can guide future systems to consider a broader range of user queries, if they align with the intended use case. The evaluation techniques and questions that we discussed can also serve as reference points for assessing the efficacy of new systems relative to their primary research objectives.

Overall, with this review, we hope to provide a systematic overview of the current landscape of dialogue-based xAI systems that can inform the choices of researchers and practitioners regarding architectures, paradigms, objectives, evaluation procedures as well as choices of further research questions.