1 Introduction
Many real-world problems are multimodal: from the early research on audio-visual speech recognition [
21] to the recent interest in language, vision, and video understanding [
21] for applications such as multimedia [
52,
64], affective computing [
51,
75], robotics [
43,
47], finance [
34], dialogue [
73], human-computer interaction [
20,
66], and healthcare [
24,
98]. The impact of multimodal models on these real-world applications has inspired recent research in visualizing and understanding their internal mechanics [
15,
55,
68,
69,
71] as a step towards building interpretable, interactive, and reliable multimodal interfaces [
32,
35,
67]. However, modern parameterizations of multimodal models are typically black-box neural networks [
48,
57]. How can we enable users to accurately visualize and understand the internal modeling of multimodal information and interactions for effective human-AI collaboration?
As a step towards human-centric interpretations of multimodal models, this paper performs a set of detailed user studies to evaluate how users leverage interpretation tools to understand multimodal models. Specifically, we build upon a set of existing and proposed interpretation tools: gradient-based feature attribution [
30,
61,
78], higher-order gradients, EMAP [
33], DIME [
58], sparse concept models [
96], unifying them into a general-purpose toolkit,
MultiViz, for visualizing and understanding multimodal models.
MultiViz first scaffolds the problem of interpretability into 4 stages: (1)
unimodal importance: identifying the contributions of each modality towards downstream modeling and prediction, (2)
cross-modal interactions: uncovering the various ways in which different modalities can relate to each other and the types of new information possibly discovered as a result of these relationships, (3)
multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4)
multimodal prediction: how decision-level features are composed to make a prediction for a given task. Through extensive user studies, we show that
MultiViz helps users (1) gain a deeper understanding of model behavior as measured via a proxy task of model simulation, (2) assign interpretable language concepts to previously uninterpretable features, and (3) perform error analysis on model misclassifications. Finally, using takeaways from error analysis, we present a case study of human-in-the-loop model debugging. We release
MultiViz datasets, models, and code at
https://github.com/pliang279/MultiViz.
2 Related Work
Interpretable ML aims to further our understanding of and trust in ML models, enable model debugging, and use these insights for joint decision-making between stakeholders and AI [
1,
17,
18,
26]. Interpreting multimodal models is of particular interest in the HCI [
41,
65,
68], multimedia [
11], user interface [
50,
80,
97], and mobile interface [
85,
95] communities since interactions between humans and computer interfaces are naturally multimodal (e.g., via voice [
72], gestures [
62], touch [
28], and even taste [
50]). We categorize related work in interpreting multimodal models into:
Unimodal importance: Several approaches have focused on building interpretable components for unimodal importance through soft [
71] and hard attention mechanisms [
16]. When aiming to explain black-box multimodal models, related work relies primarily on gradient-based visualizations [
8,
22,
82] and feature attributions (e.g., LIME [
78], Shapley values [
61]) to highlight regions of the image that the model attends to.
Cross-modal interactions: Recent work investigates the activations of pretrained transformers [
13,
49], performs diagnostic experiments through specially curated inputs [
23,
46,
70,
89], or trains auxiliary explanation modules [
40,
71]. Particularly related to our work is EMAP [
33] for disentangling the effects of unimodal (additive) contributions from cross-modal interactions in multimodal tasks, as well as M2Lens [
94], an interactive system to visualize multimodal models for sentiment analysis through both unimodal and cross-modal contributions, in addition to other quantification [
69] and visualization [
86] tools for multimodal interactions.
Representation and prediction: Existing approaches have used language syntax (e.g., the question in VQA) to compose higher-level features [
2,
4,
93]. Similarly, logical statements have been integrated with neural networks for interpretable logical reasoning [
27,
87]. However, these are typically restricted to certain modalities or tasks. Finally, visualizations have also uncovered several biases in models and datasets (e.g., unimodal biases in VQA questions [
3,
12] or gender biases in image captioning [
32]). We believe that
MultiViz will enable the identification of biases across a wider range of modalities and tasks.
3 Multimodal Visualization
We assume multimodal datasets take the form
\(\mathcal {D} = \lbrace (\mathbf {x}_1, \mathbf {x}_2, y)_{i=1}^n \rbrace = \lbrace (x_1^{(1)}, x_1^{(2)},..., x_2^{(1)}, x_2^{(2)},..., y)_{i=1}^n \rbrace\), with boldface
x denoting the entire modality, each
x1,
x2 indicating modality atoms (i.e., fine-grained sub-parts of modalities that we would like to analyze, such as individual words in a sentence, object regions in an image, or time-steps in time-series data), and
y denoting the label. These datasets enable us to train a multimodal model
\(\hat{y} = f(\mathbf {x}_1, \mathbf {x}_2; \theta)\) which we are interested in visualizing. We scaffold the problem of interpreting
f into
unimodal importance,
cross-modal interactions,
multimodal representations, and
multimodal prediction. Each of these stages provides complementary information on the decision-making process (see Figure
1). We now describe each step in detail:
Unimodal importance (U) aims to understand the contributions of each modality towards prediction. It builds upon ideas of gradients [
8,
22,
82] and feature attributions (e.g., LIME [
78], Shapley values [
These approaches take in a modality of interest x and return importance weights across the atoms x(1), x(2), ... of modality x.
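To make this stage concrete, the following is a minimal sketch of gradient-times-input attribution for one modality, assuming a differentiable PyTorch classifier mapping (x1, x2) to class logits; the function and argument names are illustrative and not part of the MultiViz API.

```python
import torch

def unimodal_saliency(model, x1, x2, target_class):
    """Gradient-times-input attribution for modality x1 (illustrative sketch).

    Assumes x1 has shape (batch, num_atoms, feature_dim) and model(x1, x2)
    returns class logits of shape (batch, num_classes)."""
    x1 = x1.clone().detach().requires_grad_(True)
    logits = model(x1, x2)
    score = logits[:, target_class].sum()          # scalar score for the class of interest
    grad_x1, = torch.autograd.grad(score, x1)      # d(score) / d(x1)
    return (grad_x1 * x1).sum(dim=-1)              # per-atom importance weights
```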
Cross-modal interactions (C) describe how atoms from different modalities relate to each other and the types of new information discovered as a result of these relationships. Formally, a function
f captures statistical non-additive interactions between 2 unimodal atoms
x1 and
x2 if and only if
f cannot be decomposed into a sum of unimodal subfunctions
g1,
g2 such that
f(
x1,
x2) =
g1(
x1) +
g2(
x2) [
25,
83,
91,
92]. Using this definition, we include (1) EMAP [
33] which decomposes
f(
x1,
x2) =
g1(
x1) +
g2(
x2) +
g12(
x1,
x2) into strictly unimodal representations
g1,
g2, and cross-modal representation
\(g_{12} = f - \mathbb {E}_{x_1} (f) - \mathbb {E}_{x_2} (f) + \mathbb {E}_{x_1,x_2} (f)\) to quantify the degree of global cross-modal interactions across an entire dataset and (2) DIME [
58] which extends EMAP using feature visualization on each disentangled representation locally (per datapoint). We also propose a higher-order gradient-based approach by identifying that a function
f exhibits interactions iff
\(\mathbb{E}_{x_1,x_2} \left[ \frac{\partial^2 f(x_1,x_2)}{\partial x_1 \partial x_2} \right]^2 > 0\). Taking a second-order gradient (extending first-order gradient-based approaches [
31,
78,
99]) zeros out the unimodal terms and isolates the interaction terms.
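As a rough sketch of this higher-order gradient check (not the exact MultiViz implementation), the snippet below uses PyTorch double backpropagation to differentiate the x1-gradient with respect to x2; the model and argument names are again illustrative.

```python
import torch

def second_order_interaction(model, x1, x2, target_class):
    """Differentiate the x1-gradient with respect to x2. For an additive model
    f(x1, x2) = g1(x1) + g2(x2), this quantity is identically zero, so nonzero
    (squared, averaged) values indicate cross-modal interactions."""
    x1 = x1.clone().detach().requires_grad_(True)
    x2 = x2.clone().detach().requires_grad_(True)
    score = model(x1, x2)[:, target_class].sum()   # scalar class score
    # Keep the graph so the first-order gradient itself can be differentiated.
    grad_x1, = torch.autograd.grad(score, x1, create_graph=True)
    second = torch.autograd.grad(grad_x1.sum(), x2, allow_unused=True)[0]
    if second is None:                             # purely additive model: no interaction
        second = torch.zeros_like(x2)
    return second.pow(2)                           # squared interaction strength per element
```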
Multimodal representations aim to understand how information is represented at the feature representation level. Specifically, given a trained multimodal model
f, define the matrix
\(M_z \in \mathbb {R}^{N \times d}\) as the penultimate layer of
f representing (uninterpretable) deep feature representations. For the
ith datapoint,
z =
Mz(
i) collects a set of individual feature representations
\(z_{1}, z_{2},..., z_{d} \in \mathbb {R}\).
Local representation analysis (Rℓ) informs the user on parts of the original datapoint that activate feature
zj (via unimodal or cross-modal visualizations with respect to feature
zj).
Global representation analysis (Rg) provides the user with the top
k datapoints that also maximally activate feature
zj, which is especially useful in helping humans assign interpretable language concepts to each feature by looking at similarly activated input regions across datapoints (e.g., the concept of color in Figure
1, right).
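The global analysis step reduces to a simple retrieval over stored activations. Below is a minimal sketch, assuming the penultimate-layer activations have been stacked into a NumPy array M_z of shape (N, d); the helper name is ours.

```python
import numpy as np

def top_k_activating(M_z, feature_idx, k=5):
    """Global representation analysis (R_g) sketch: return indices of the k
    datapoints whose activation of feature `feature_idx` is largest; these are
    the candidates shown to users for concept assignment."""
    return np.argsort(-M_z[:, feature_idx])[:k]

# Hypothetical usage, with M_z collected from a forward pass over the dataset:
# top_indices = top_k_activating(M_z, feature_idx=42, k=5)
```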
Multimodal prediction (P): Finally, the prediction step takes the set of feature representations
z1,
z2,...,
zd and composes them to form higher-level abstract concepts suitable for a task. We approximate the prediction process with a sparse linear combination of penultimate layer features [
96]. Given the penultimate layer
\(M_z \in \mathbb {R}^{N \times d}\), we fit a linear model
\(\mathbb {E}\left(Y|X=x \right) = M_z \beta\) (bias
β0 omitted for simplicity) and solve for sparsity using
\(\hat{\beta } = \mathop{arg\,min}_{\beta } \frac{1}{2N} \Vert M_z \beta - y \Vert _2^2 + \lambda _1 \Vert \beta \Vert _1 + \lambda _2 \Vert \beta \Vert _2^2\). The resulting understanding starts from the set of learned weights with the highest non-zero coefficients
βtop = {
β(1),
β(2),...} and corresponding ranked features
ztop = {
z(1),
z(2),...}.
βtop tells the user how features
ztop are composed to make a prediction, and
ztop can then be visualized with respect to unimodal and cross-modal interactions using the representation stage.
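The sparse linear approximation above can be fit with an off-the-shelf elastic net, as in the sketch below; the mapping from (λ1, λ2) onto scikit-learn's (alpha, l1_ratio) parametrization and all variable names are illustrative choices rather than the exact setup used in the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_prediction_weights(M_z, y, lambda_1=0.01, lambda_2=0.01, k=5):
    """Fit a sparse linear head on penultimate-layer features M_z (N x d) against
    per-class scores y (N,), then return the coefficients and the top-k ranked
    feature indices (z_top)."""
    # scikit-learn's ElasticNet minimizes
    #   1/(2N)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||_2^2,
    # so lambda_1 = alpha*l1_ratio and lambda_2 = 0.5*alpha*(1-l1_ratio).
    alpha = lambda_1 + 2.0 * lambda_2
    l1_ratio = lambda_1 / alpha
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)  # bias omitted, as above
    model.fit(M_z, y)
    beta = model.coef_                               # learned weights, shape (d,)
    z_top = np.argsort(-np.abs(beta))[:k]            # indices of top-ranked features
    return beta, z_top
```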
3.1 MultiViz visualization interface
We summarize the included approaches for visualizing each step of the multimodal process in Table
1. Figure
6 also shows an illustration of the overall code structure available in
MultiViz, spanning various multimodal data loaders, recent multimodal models, visualization methods, and visualization tools.
The final
MultiViz interface combines outputs from all stages through an interactive and human-in-the-loop API. We show the overall
MultiViz interface in Figure
5 using an example from the VQA dataset [
5]. This interactive API enables users to choose multimodal datasets and models using the control panel on the left side. The control panel also shows all information about the data point (original image and question in the case of VQA) as well as the ground truth (“GT”) label and the predicted (“Pred”) class. Clicking the model prediction will show visualizations with respect to that specific class label, with an
Overview Page showing general unimodal importance, cross-modal interactions, and prediction weights, as well as a
Feature Page for local and global analysis of user-selected features. Specifically, the right side of the
Overview Page shows the top 5 features with the highest weights for a selected prediction class (the weights are shown as numbers on the lines). By clicking the circle on the graph representing each feature, the user can access Rℓ and Rg visualizations of that specific feature via the
Feature Page. Please see Appendix
B for more examples.
4 Experiments
Our experiments are designed to verify the usefulness and complementarity of the 4
MultiViz stages. We start with a model simulation experiment to test the utility of each stage towards overall model understanding (Section
4.1). We then dive deeper into the individual stages by testing how well
MultiViz enables representation interpretation (Section
4.2) and error analysis (Section
4.3), before presenting a case study of model debugging from error analysis insights (Section
4.4). We showcase the following selected experiments and defer results on other datasets to Appendix
B.
Setup: We use a large suite of datasets from MultiBench [
54] which span real-world fusion [
6,
37,
100], retrieval [
74], and QA [
29,
38] tasks. For each dataset, we test a corresponding state-of-the-art model:
MulT [
90],
LRTF [
56],
LF [
9],
ViLT [
42],
CLIP [
76],
CNN-LSTM-SA [
38],
MDETR [
39], and
LXMERT [
88]. These cover models both pretrained and trained from scratch. We summarize all 6 datasets and 8 models and provide details in Appendix
B. Participation in all human studies was fully voluntary and without compensation. There were no participant risks involved, and we obtained consent from all participants prior to each short study. This line of research is aligned with similar IRB-exempt annotation studies at our institution. The authors manually took notes on all results and feedback in such a manner that the identity of the human subjects cannot readily be ascertained, directly or through identifiers linked to the subjects. Participants were neither the authors nor members of the authors' research groups, but they all hold or are working towards a graduate degree in a STEM field and have knowledge of ML models. None of the participants knew about this project before their session, and each participant only interacted with the setting they were involved in, so it is not possible to manipulate users to achieve desired outcomes.
4.1 Model simulation
We first design a model simulation experiment to determine if
MultiViz helps users of multimodal models gain a deeper understanding of model behavior. If
MultiViz indeed generates human-understandable explanations, humans should be able to accurately simulate model predictions given these explanations only, as measured by correctness with respect to actual model predictions and annotator agreement (Krippendorff’s alpha [
44]). To investigate the utility of each stage in
MultiViz, we design a human study to see how accurately humans can simulate model predictions based only on model analysis results and visualizations. For each dataset (VQA 2.0, MM-IMDb, CMU-MOSEI), we divide 15 total human annotators into 5 groups of 3, with each group receiving one of the five settings below, and then compute average accuracy and inter-rater agreement within each group (a metric sketch follows the list of settings).
(1)
U: Users are only shown the unimodal importance (U) of each modality towards label y.
(2)
U + C: Users are also shown cross-modal interactions (C) highlighted towards label y.
(3)
U + C + Rℓ: Users are also shown local analysis (Rℓ) of unimodal and cross-modal interactions of top features ztop = {z(1), z(2),...} maximally activating label y.
(4)
U + C + Rℓ + Rg: Users are additionally shown global analysis (Rg) through similar datapoints that also maximally activate top features ztop for label y.
(5)
MultiViz (U + C + Rℓ + Rg + P): The entire MultiViz method by further including visualizations of the final prediction (P) stage: sorting top ranked feature neurons ztop = {z(1), z(2),...} with respect to their coefficients βtop = {β(1), β(2),...} and showing these coefficients to the user.
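Below is a minimal sketch of how simulation correctness and inter-rater agreement could be computed per group, assuming human answers are encoded as class ids and using the third-party krippendorff package; the data layout is illustrative.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

def simulation_metrics(human_preds, model_preds):
    """human_preds: (num_annotators, num_datapoints) array of simulated class ids.
    model_preds: (num_datapoints,) array of the model's actual predictions."""
    accuracy = (human_preds == model_preds[None, :]).mean()      # correctness w.r.t. the model
    alpha = krippendorff.alpha(reliability_data=human_preds,
                               level_of_measurement="nominal")   # inter-rater agreement
    return accuracy, alpha
```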
Quantitative results: We show these results in Table
2 and find that having access to all stages in
MultiViz leads to significantly higher model simulation accuracy on
VQA 2.0, along with the lowest variance and the most consistent agreement between annotators. On fusion tasks with
MM-IMDb and
CMU-MOSEI, we also find that including each visualization stage consistently leads to higher correctness and agreement. More importantly, humans are able to simulate model predictions, regardless of whether the model made the correct prediction or not.
We also conducted
qualitative interviews to determine what users found useful in
MultiViz: (1) Users reported finding the local and global representation analysis particularly useful: global analysis with other datapoints that also maximally activate a feature representation was important for identifying similar concepts and assigning them to multimodal features. (2) Between Overview (
U + C) and Feature (
Rℓ + Rg + P) visualizations, users found Feature visualizations more useful in
\(31.7\%\),
\(61.7\%\), and
\(80.0\%\) of the time under settings (3), (4), and (5) respectively, and found the Overview page more useful on the remaining datapoints. This means that for each stage, there exists a significant fraction of datapoints where that stage is most needed. (3) While it may be possible to determine the model's prediction with a subset of stages, users reported that having more stages confirming the same prediction made them much more confident in their answer, which is quantitatively substantiated by the higher accuracy, lower variance, and higher agreement in human predictions. For
MM-IMDb, we were especially surprised to find that including
the C stage actually helped, since MM-IMDb did not seem to be a task that relies heavily on cross-modal interactions. We also include additional experiments and visualizations in Appendix
B.1.
4.2 Representation interpretation
We now take a deeper look to check that MultiViz generates accurate explanations of multimodal representations. Using local and global representation visualizations, can humans consistently assign interpretable natural-language concepts to previously uninterpretable features? Using VQA 2.0, we perform a representation interpretation experiment, where we give human annotators visualizations of a particular representation feature and ask them to describe what concept they think that feature represents. We recruit 15 human annotators (with the same qualifications as those in the model simulation experiment) and divide them into 3 groups of 5. Each group is given a different setting (with different amounts of MultiViz visualizations available):
(1)
Rℓ: local visualization analysis of unimodal and cross-modal interactions in z only for the given local datapoint.
(2)
Rℓ + Rg (no viz): including both local visualization analysis of the given datapoint as well as global analysis through retrieving, but not visualizing, similar datapoints that also maximally activate feature z.
(3)
Rℓ + Rg: we further add visualizations of highlighted unimodal and cross-modal interactions of global datapoints, resulting in the full MultiViz functions.
We gave the same 13 representation features to all 15 human annotators, where the first feature serves as an example and the other 12 are the ones we actually record for the experiment. The instructor first explains to each annotator what each visualization means and then walks through the first feature together with the annotator. The annotator then writes down a concept for each of the other 12 features on their own. We also ask each annotator to rate, on a scale of 1-5, how confident they are that the feature indeed represents this concept.
Quantitative results: Since there are no ground-truth labels for feature concepts, we rely on annotator confidence (1-5 scale) and annotator agreement [
44] as a proxy for accuracy. Once we have collected all 180 annotations (15 annotators each on 12 features), we manually cluster them into 29 distinct concepts. For example, annotations like "things to wear", "t-shirts", and "clothes" all belong to the "clothes" concept; all color-related annotations belong to the "colors" concept; and "material question", "made-of question", and "material of object" all belong to the "material" concept. We then compute an inter-rater agreement score for each feature within each group of 5 annotators using Krippendorff’s alpha with 29 possible categories. We report inter-rater agreement and average confidence in Table
3.
As shown in Table
3, when annotators are given both local and global visualizations, they are able to assign concepts more consistently (higher inter-rater agreement) and more confidently (higher average confidence scores). Under setting 3, with full MultiViz visualizations of feature representations, the 5 annotators agreed completely with each other on 7 out of 12 features, which is notable given the large number of possible concepts annotators could assign to each feature. This shows that our visualizations (Rℓ and Rg) help humans better understand what concept (if any) each representation feature captures, and that the Rg examples and visualizations are especially helpful.
Qualitative interviews: We show examples of human-assigned concepts in Figure
7 (more in Appendix
B.2). Note that the 3 images in each box of Figure
7 (even without feature highlighting) do constitute a visualization generated by
MultiViz, as they belong to data instances that maximize the value of the feature neuron (i.e., Rg in stage 3, multimodal representations). Without
MultiViz, feature interpretation would require combing through the entire dataset. Participants also noted that the feature visualizations made them much more confident in their decisions when the highlights matched the concept. Taking as an example Figure
7 (top left), the visualizations serve to highlight what the model’s feature neuron is learning (i.e., a person holding sports equipment) rather than what category of datapoint it is. If the visualization were different (e.g., highlighting the ground), then users would have to conclude that the feature neuron is capturing ‘
outdoor ground’ rather than ‘
sports equipment’. Similarly, for text highlights (Figure
7 top right), without using
MultiViz to highlight ‘
counter’, ‘
countertop’, and ‘
wall’, along with the image cross-modal interactions corresponding to these entities, one would not be able to deduce that the feature asks about material: it could also represent ‘
what’ questions, or ‘
household objects’, and so on. These conclusions can only be reliably deduced with all MultiViz stages.
4.3 Error analysis
We further examine a case study of error analysis on trained models. We task 10 human users with using MultiViz to highlight the errors that a multimodal model exhibits, categorizing each error into one of 3 stages:
(1)
Unimodal perception error: The model fails to recognize certain unimodal features or aspects. (For example, in Figure
8 top left example, the FRCNN object detector was unable to recognize the thin red streak as an object).
(2)
Cross-modal interaction error: The model fails to capture important cross-modal interactions such as aligning words in question with relevant parts or detected objects in image. (For example, in Figure
8, first example in the middle column, the model erroneously aligns "creamy" with the piece of carrot).
(3)
Prediction errors: The model is able to perceive correct unimodal features and their cross-modal interactions, but fails to reason through them to produce the correct prediction. (For example, in Figure
8 top right example, the model was able both to identify the chair perfectly with the object detector and to associate it with the word "chair" in the question (as shown by second-order gradient analysis), but it was still unable to reason correctly over this information to predict the correct answer).
For each of the 2 datasets used in this experiment (VQA and CLEVR), the 10 human annotators are divided into 2 groups of 5, one group for each setting: (1) under the MultiViz setting, for each datapoint, the annotator is given access to the full MultiViz webpage as well as live second-order gradients (i.e., the annotator may request the second-order gradient for a specific subset of words in the question and will be presented with the result); (2) under the No Viz setting, the annotator is given nothing but the original datapoint, the correct answer, and the predicted answer. Each annotator classifies each datapoint into one of the three categories above and also rates their confidence in categorizing the error on a scale of 1-5.
Using 20 datapoints per setting, these experiments with 10 users on 2 datasets and 2 models involve roughly 15 total hours of users interacting with
MultiViz. From Table
4, we find that
MultiViz enables humans to consistently categorize model errors into one of the 3 stages. We show examples that human annotators assigned to the different error categories in Figure
8 (more in Appendix
B.3). Out of the 23 total errors, human annotators reported that on average 8.8 of them are category 1 (unimodal perception errors), 6.8 are category 2 (cross-modal interaction errors), and 7.4 are category 3 (prediction errors). This suggests that the majority of errors in LXMERT are still caused by misunderstanding basic unimodal concepts and cross-modal alignments rather than by high-level reasoning over the perceived information, and that possible future directions for improving the model pipeline are to use better unimodal encoders (than FRCNN) and to find ways to encourage the model to align visual and textual concepts correctly.
4.4 A case study in model debugging
Following error analysis, we investigate in more depth one of the errors of a pretrained
LXMERT model fine-tuned on
VQA 2.0. Specifically, we first found the top 5 penultimate-layer neurons that are most activated on erroneous datapoints. Inspecting these neurons through
MultiViz, users found that 2/5 neurons were consistently related to questions asking about color, which highlighted the model’s failure to identify color correctly (especially
blue). The model has an accuracy of only
\(5.5\%\) amongst all
blue-related points (i.e., either have
blue as correct answer or predicted answer), and these failures account for
\(8.8\%\) of all model errors. We show these examples in Figure
9: observe that the model is often able to capture unimodal and cross-modal interactions perfectly, but fails to identify color at prediction.
Curious as to the source of this error, we looked deeper into the source code for the entire pipeline of
LXMERT, including that of its image encoder, Faster R-CNN [
77]
We in fact uncovered a bug in the data preprocessing for Faster R-CNN in the popular Hugging Face repository that swapped the image storage format between RGB and BGR, which was responsible for these errors. This presents a concrete use case of
MultiViz: through visualizing each stage, we were able to (1) isolate the source of the bug (at prediction and not unimodal perception or cross-modal interactions), and (2) use representation analysis to localize the bug to the specific color concept.
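To illustrate the kind of preprocessing bug involved (a simplified illustration, not the actual Hugging Face or Faster R-CNN code), the sketch below shows how an RGB/BGR channel-order mismatch silently swaps color information:

```python
import numpy as np

def reverse_channels(image):
    """Flip the last (channel) axis of an HxWx3 array, converting RGB <-> BGR."""
    return image[..., ::-1]

# Caffe-derived Faster R-CNN backbones typically expect BGR input; if an RGB image
# is passed through unconverted (or converted twice), a blue pixel is read as red.
blue_rgb = np.array([[[0, 0, 255]]], dtype=np.uint8)   # pure blue in RGB
print(reverse_channels(blue_rgb))                      # [[[255 0 0]]] -- seen as red
```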
5 Conclusion
This paper proposes MultiViz, a comprehensive method for analyzing and visualizing multimodal models. MultiViz scaffolds the interpretation problem into 4 modular stages of unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction, before providing existing and newly proposed analysis tools in each stage. MultiViz is designed to be modular (encompassing existing analysis tools and encouraging research towards understudied stages), general (supporting diverse modalities, models, and tasks), and human-in-the-loop (providing a visualization tool for user-centric model interpretation, error analysis, and debugging), qualities which we strive to uphold by ensuring public access and regular updates based on community feedback.
5.1 Limitations and Future Directions
We are aware of some directions in which MultiViz can still be improved and outline these for future work:
(1)
Number of prediction classes: For complex tasks like VQA 2.0, where there are over three thousand prediction classes, there will be many sparse output weights, so it is difficult to find globally related datapoints for representation analysis. On the other hand, for VQA 2.0 subsets with ‘yes/no’ answer choices (i.e., too few classes), we found that the activated final-layer features overlap too much to visualize reliably, and we have to extend MultiViz to rely on intermediate-layer features instead. MultiViz works best with a moderate number of prediction classes, such as in multimodal emotion recognition, multiple-choice question answering, and similar tasks.
(2)
Model requirements: Currently, the two requirements on models are that they produce categorical outputs (classification) and that gradients can be easily computed via autograd. For regression, we can discretize the output space into categorical outputs. The second requirement means that we cannot currently support architectures with discrete steps [
60] that prevent gradient flow. We plan to extend
MultiViz via approximate gradients such as perturbation or policy gradients to handle these cases.
(3)
User studies: We spent considerable time finding and training users, including a training video before each study session. Future work can explore more standardized ways of human-in-the-loop interpretation and debugging of multimodal models, and we hope that MultiViz can provide the initial data, models, tools, and evaluation as a step in this direction.
(4)
Evaluating interpretability remains a challenge [
14,
19,
36,
81,
84]. Model interpretability (1) is highly subjective across different population subgroups [
7,
45], (2) requires high-dimensional model outputs as opposed to low-dimensional prediction objectives [
71], and (3) has desiderata that change across research fields, populations, and time [
63]. We plan to continuously expand
MultiViz through community inputs for new metrics to evaluate interpretability methods. Some metrics we have in mind include those for measuring faithfulness, as proposed in recent work [
14,
19,
36,
59,
81,
84].
(5)
Finally, we plan to engage real-world stakeholders to evaluate the usefulness of these multimodal interpretation tools: stakeholders in the healthcare domain to evaluate interpretability on the MIMIC dataset, and those in the affective computing domain to evaluate interpretability on the CMU-MOSEI dataset. We also refer the reader to recent work examining the issues surrounding real-world deployment of interpretable machine learning [
10,
17,
45].
Acknowledgments
This material is based upon work partially supported by the NSF (Awards #1722822 and #1750439), NIH (Awards #R01MH125740, #R01MH096951, and #U01MH116925), Meta, and BMW of North America. PPL is partially supported by a Facebook PhD Fellowship and a CMU’s Center for Machine Learning and Health Fellowship. RS is supported in part by ONR award N000141812861 and DSTA. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF, NIH, Facebook, CMU’s Center for Machine Learning and Health, ONR, or DSTA, and no official endorsement should be inferred. We thank Jane Hsieh and the anonymous reviewers for valuable feedback on the paper. Finally, we would also like to acknowledge NVIDIA’s GPU support.