1 Introduction
Many real-world problems are multimodal: from the early research on audio-visual speech recognition [
21] to the recent interest in language, vision, and video understanding [
21] for applications such as multimedia [
52,
64], affective computing [
51,
75], robotics [
43,
47], finance [
34], dialogue [
73], human-computer interaction [
20,
66], and healthcare [
24,
98]. The impact of multimodal models on these real-world applications has inspired recent research in visualizing and understanding their internal mechanics [
15,
55,
68,
69,
71] as a step towards building interpretable, interactive, and reliable multimodal interfaces [
32,
35,
67]. However, modern parameterizations of multimodal models are typically black-box neural networks [
48,
57]. How can we enable users to accurately visualize and understand the internal modeling of multimodal information and interactions for effective human-AI collaboration?
As a step towards human-centric interpretations of multimodal models, this paper performs a set of detailed user studies to evaluate how users leverage interpretation tools to understand multimodal models. Specifically, we build upon a set of existing and proposed interpretation tools: gradient-based feature attribution [
30,
61,
78], higher-order gradients, EMAP [
33], DIME [
58], sparse concept models [
96], unifying them into a general-purpose toolkit,
MultiViz, for visualizing and understanding multimodal models.
MultiViz first scaffolds the problem of interpretability into 4 stages: (1)
unimodal importance: identifying the contributions of each modality towards downstream modeling and prediction, (2)
cross-modal interactions: uncovering the various ways in which different modalities can relate to each other and the types of new information possibly discovered as a result of these relationships, (3)
multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4)
multimodal prediction: how decision-level features are composed to make a prediction for a given task. Through extensive user studies, we show that
MultiViz helps users (1) gain a deeper understanding of model behavior as measured via a proxy task of model simulation, (2) assign interpretable language concepts to previously uninterpretable features, and (3) perform error analysis on model misclassifications. Finally, using takeaways from error analysis, we present a case study of human-in-the-loop model debugging. We release
MultiViz datasets, models, and code at
https://github.com/pliang279/MultiViz.
2 Related Work
Interpretable ML aims to further our understanding of and trust in ML models, enable model debugging, and use these insights for joint decision-making between stakeholders and AI [
1,
17,
18,
26]. Interpreting multimodal models is of particular interest in the HCI [
41,
65,
68], multimedia [
11], user interface [
50,
80,
97], and mobile interface [
85,
95] communities since interactions between humans and computer interfaces are naturally multimodal (e.g., via voice [
72], gestures [
62], touch [
28], and even taste [
50]). We categorize related work in interpreting multimodal models into:
Unimodal importance: Several approaches have focused on building interpretable components for unimodal importance through soft [
71] and hard attention mechanisms [
16]. When aiming to explain black-box multimodal models, related work relies primarily on gradient-based visualizations [
8,
22,
82] and feature attributions (e.g., LIME [
78], Shapley values [
61]) to highlight regions of the image that the model attends to.
Cross-modal interactions: Recent work investigates the activations of pretrained transformers [
13,
49], performs diagnostic experiments through specially curated inputs [
23,
46,
70,
89], or trains auxiliary explanation modules [
40,
71]. Particularly related to our work is EMAP [
33] for disentangling the effects of unimodal (additive) contributions from cross-modal interactions in multimodal tasks, as well as M2Lens [
94], an interactive system to visualize multimodal models for sentiment analysis through both unimodal and cross-modal contributions, in addition to other quantification [
69] and visualization [
86] tools for multimodal interactions.
Representation and prediction: Existing approaches have used language syntax (e.g., the question in VQA) to compose higher-level features [
2,
4,
93]. Similarly, logical statements have been integrated with neural networks for interpretable logical reasoning [
27,
87]. However, these are typically restricted to certain modalities or tasks. Finally, visualizations have also uncovered several biases in models and datasets (e.g., unimodal biases in VQA questions [
3,
12] or gender biases in image captioning [
32]). We believe that
MultiViz will enable the identification of biases across a wider range of modalities and tasks.
3 Multimodal Visualization
We assume multimodal datasets take the form
\(\mathcal {D} = \lbrace (\mathbf {x}_1, \mathbf {x}_2, y)_{i=1}^n \rbrace = \lbrace (x_1^{(1)}, x_1^{(2)},..., x_2^{(1)}, x_2^{(2)},..., y)_{i=1}^n \rbrace\), with boldface
x denoting the entire modality, each
x1,
x2 indicating modality atoms (i.e., fine-grained sub-parts of modalities that we would like to analyze, such as individual words in a sentence, object regions in an image, or time-steps in time-series data), and
y denoting the label. These datasets enable us to train a multimodal model
\(\hat{y} = f(\mathbf {x}_1, \mathbf {x}_2; \theta)\) which we are interested in visualizing. We scaffold the problem of interpreting
f into
unimodal importance,
cross-modal interactions,
multimodal representations, and
multimodal prediction. Each of these stages provides complementary information on the decision-making process (see Figure
1). We now describe each step in detail:
Unimodal importance (U) aims to understand the contributions of each modality towards prediction. It builds upon ideas of gradients [
8,
22,
82] and feature attributions (e.g., LIME [
78], Shapley values [
These approaches take in a modality of interest x and return importance weights across the atoms x(1), x(2), ... of modality x.
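To make this stage concrete, the following is a minimal sketch of gradient-times-input attribution for one modality, assuming a differentiable PyTorch classifier mapping (x1, x2) to class logits; the function and argument names are illustrative and not part of the MultiViz API.

```python
import torch

def unimodal_saliency(model, x1, x2, target_class):
    """Gradient-times-input attribution for modality x1 (illustrative sketch).

    Assumes x1 has shape (batch, num_atoms, feature_dim) and model(x1, x2)
    returns class logits of shape (batch, num_classes)."""
    x1 = x1.clone().detach().requires_grad_(True)
    logits = model(x1, x2)
    score = logits[:, target_class].sum()          # scalar score for the class of interest
    grad_x1, = torch.autograd.grad(score, x1)      # d(score) / d(x1)
    return (grad_x1 * x1).sum(dim=-1)              # per-atom importance weights
```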
Cross-modal interactions (C) describe how atoms from different modalities relate to each other and the types of new information discovered as a result of these relationships. Formally, a function
f captures statistical non-additive interactions between 2 unimodal atoms
x1 and
x2 if and only if
f cannot be decomposed into a sum of unimodal subfunctions
g1,
g2 such that
f(
x1,
x2) =
g1(
x1) +
g2(
x2) [
25,
83,
91,
92]. Using this definition, we include (1) EMAP [
33] which decomposes
f(
x1,
x2) =
g1(
x1) +
g2(
x2) +
g12(
x1,
x2) into strictly unimodal representations
g1,
g2, and cross-modal representation
\(g_{12} = f - \mathbb {E}_{x_1} (f) - \mathbb {E}_{x_2} (f) + \mathbb {E}_{x_1,x_2} (f)\) to quantify the degree of global cross-modal interactions across an entire dataset and (2) DIME [
58] which extends EMAP using feature visualization on each disentangled representation locally (per datapoint). We also propose a higher-order gradient-based approach by identifying that a function
f exhibits interactions iff
\(\mathbb{E}_{x_1,x_2} \left[ \frac{\partial^2 f(x_1,x_2)}{\partial x_1 \partial x_2} \right]^2 > 0\). Taking a second-order gradient (extending first-order gradient-based approaches [
31,
78,
99]) zeros out the unimodal terms and isolates the interaction terms.
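As a rough sketch of this higher-order gradient check (not the exact MultiViz implementation), the snippet below uses PyTorch double backpropagation to differentiate the x1-gradient with respect to x2; the model and argument names are again illustrative.

```python
import torch

def second_order_interaction(model, x1, x2, target_class):
    """Differentiate the x1-gradient with respect to x2. For an additive model
    f(x1, x2) = g1(x1) + g2(x2), this quantity is identically zero, so nonzero
    (squared, averaged) values indicate cross-modal interactions."""
    x1 = x1.clone().detach().requires_grad_(True)
    x2 = x2.clone().detach().requires_grad_(True)
    score = model(x1, x2)[:, target_class].sum()   # scalar class score
    # Keep the graph so the first-order gradient itself can be differentiated.
    grad_x1, = torch.autograd.grad(score, x1, create_graph=True)
    second = torch.autograd.grad(grad_x1.sum(), x2, allow_unused=True)[0]
    if second is None:                             # purely additive model: no interaction
        second = torch.zeros_like(x2)
    return second.pow(2)                           # squared interaction strength per element
```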
Multimodal representations aim to understand how information is represented at the feature representation level. Specifically, given a trained multimodal model
f, define the matrix
\(M_z \in \mathbb {R}^{N \times d}\) as the penultimate layer of
f representing (uninterpretable) deep feature representations. For the
ith datapoint,
z =
Mz(
i) collects a set of individual feature representations
\(z_{1}, z_{2},..., z_{d} \in \mathbb {R}\).
Local representation analysis (Rℓ) informs the user on parts of the original datapoint that activate feature
zj (via unimodal or cross-modal visualizations with respect to feature
zj).
Global representation analysis (Rg) provides the user with the top
k datapoints that also maximally activate feature
zj, which is especially useful in helping humans assign interpretable language concepts to each feature by looking at similarly activated input regions across datapoints (e.g., the concept of color in Figure
1, right).
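The global analysis step reduces to a simple retrieval over stored activations. Below is a minimal sketch, assuming the penultimate-layer activations have been stacked into a NumPy array M_z of shape (N, d); the helper name is ours.

```python
import numpy as np

def top_k_activating(M_z, feature_idx, k=5):
    """Global representation analysis (R_g) sketch: return indices of the k
    datapoints whose activation of feature `feature_idx` is largest; these are
    the candidates shown to users for concept assignment."""
    return np.argsort(-M_z[:, feature_idx])[:k]

# Hypothetical usage, with M_z collected from a forward pass over the dataset:
# top_indices = top_k_activating(M_z, feature_idx=42, k=5)
```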
Multimodal prediction (P): Finally, the prediction step takes the set of feature representations
z1,
z2,...,
zd and composes them to form higher-level abstract concepts suitable for a task. We approximate the prediction process with a sparse linear combination of penultimate layer features [
96]. Given the penultimate layer
\(M_z \in \mathbb {R}^{N \times d}\), we fit a linear model
\(\mathbb {E}\left(Y|X=x \right) = M_z \beta\) (bias
β0 omitted for simplicity) and solve for sparsity using
\(\hat{\beta } = \mathop{arg\,min}_{\beta } \frac{1}{2N} \Vert M_z \beta - y \Vert _2^2 + \lambda _1 \Vert \beta \Vert _1 + \lambda _2 \Vert \beta \Vert _2^2\). The resulting understanding starts from the set of learned weights with the highest non-zero coefficients
βtop = {
β(1),
β(2),...} and corresponding ranked features
ztop = {
z(1),
z(2),...}.
βtop tells the user how features
ztop are composed to make a prediction, and
ztop can then be visualized with respect to unimodal and cross-modal interactions using the representation stage.
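The sparse linear approximation above can be fit with an off-the-shelf elastic net, as in the sketch below; the mapping from (λ1, λ2) onto scikit-learn's (alpha, l1_ratio) parametrization and all variable names are illustrative choices rather than the exact setup used in the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def sparse_prediction_weights(M_z, y, lambda_1=0.01, lambda_2=0.01, k=5):
    """Fit a sparse linear head on penultimate-layer features M_z (N x d) against
    per-class scores y (N,), then return the coefficients and the top-k ranked
    feature indices (z_top)."""
    # scikit-learn's ElasticNet minimizes
    #   1/(2N)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||_2^2,
    # so lambda_1 = alpha*l1_ratio and lambda_2 = 0.5*alpha*(1-l1_ratio).
    alpha = lambda_1 + 2.0 * lambda_2
    l1_ratio = lambda_1 / alpha
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False)  # bias omitted, as above
    model.fit(M_z, y)
    beta = model.coef_                               # learned weights, shape (d,)
    z_top = np.argsort(-np.abs(beta))[:k]            # indices of top-ranked features
    return beta, z_top
```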
3.1 MultiViz visualization interface
We summarize the included approaches for visualizing each step of the multimodal process in Table
1. Figure
6 also shows an illustration of the overall code structure available in
MultiViz, spanning various multimodal data loaders, recent multimodal models, visualization methods, and visualization tools.
The final
MultiViz interface combines outputs from all stages through an interactive and human-in-the-loop API. We show the overall
MultiViz interface in Figure
5 using an example from the VQA dataset [
5]. This interactive API enables users to choose multimodal datasets and models using the control panel on the left side. The control panel also shows all information about the data point (original image and question in the case of VQA) as well as the ground truth (“GT”) label and the predicted (“Pred”) class. Clicking the model prediction will show visualizations with respect to that specific class label, with an
Overview Page showing general unimodal importance, cross-modal interactions, and prediction weights, as well as a
Feature Page for local and global analysis of user-selected features. Specifically, the right side of the
Overview Page shows the top 5 features with the highest weights for a selected prediction class (the weights are shown as numbers on the lines). By clicking the circle on the graph representing each feature, the user can access Rℓ and Rg visualizations of that specific feature via the
Feature Page. Please see Appendix
B for more examples.
4 Experiments
Our experiments are designed to verify the usefulness and complementarity of the 4
MultiViz stages. We start with a model simulation experiment to test the utility of each stage towards overall model understanding (Section
4.1). We then dive deeper into the individual stages by testing how well
MultiViz enables representation interpretation (Section
4.2) and error analysis (Section
4.3), before presenting a case study of model debugging from error analysis insights (Section
4.4). We showcase the following selected experiments and defer results on other datasets to Appendix
B.
Setup: We use a large suite of datasets from MultiBench [
54] which span real-world fusion [
6,
37,
100], retrieval [
74], and QA [
29,
38] tasks. For each dataset, we test a corresponding state-of-the-art model:
MulT [
90],
LRTF [
56],
LF [
9],
ViLT [
42],
CLIP [
76],
CNN-LSTM-SA [
38],
MDETR [
39], and
LXMERT [
88]. These cover models both pretrained and trained from scratch. We summarize all 6 datasets and 8 models and provide details in Appendix
B. Participation in all human studies was fully voluntary and without compensation. There were no participant risks involved, and we obtained consent from all participants prior to each short study. This line of research is aligned with similar IRB-exempt annotation studies at our institution. The authors manually took notes on all results and feedback in such a manner that the identity of the human subjects cannot readily be ascertained, directly or through identifiers linked to the subjects. Participants were neither the authors nor members of the authors' research groups, but they all hold or are working towards a graduate degree in a STEM field and have knowledge of ML models. None of the participants knew about this project before their session, and each participant only interacted with the setting they were involved in, so it is not possible to manipulate users to achieve desired outcomes.
4.1 Model simulation
We first design a model simulation experiment to determine if
MultiViz helps users of multimodal models gain a deeper understanding of model behavior. If
MultiViz indeed generates human-understandable explanations, humans should be able to accurately simulate model predictions given these explanations only, as measured by correctness with respect to actual model predictions and annotator agreement (Krippendorff’s alpha [
44]). To investigate the utility of each stage in
MultiViz, we design a human study to see how accurately humans can simulate model predictions based only on model analysis results and visualizations. For each dataset (VQA 2.0, MM-IMDb, CMU-MOSEI), we divide 15 total human annotators into 5 groups of 3, with each group receiving one of the five settings below, and then compute average accuracy and inter-rater agreement within each group (a metric sketch follows the list of settings).
(1)
U: Users are only shown the unimodal importance (U) of each modality towards label y.
(2)
U + C: Users are also shown cross-modal interactions (C) highlighted towards label y.
(3)
U + C + Rℓ: Users are also shown local analysis (Rℓ) of unimodal and cross-modal interactions of top features ztop = {z(1), z(2),...} maximally activating label y.
(4)
U + C + Rℓ + Rg: Users are additionally shown global analysis (Rg) through similar datapoints that also maximally activate top features ztop for label y.
(5)
MultiViz (U + C + Rℓ + Rg + P): The entire MultiViz method by further including visualizations of the final prediction (P) stage: sorting top ranked feature neurons ztop = {z(1), z(2),...} with respect to their coefficients βtop = {β(1), β(2),...} and showing these coefficients to the user.
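Below is a minimal sketch of how simulation correctness and inter-rater agreement could be computed per group, assuming human answers are encoded as class ids and using the third-party krippendorff package; the data layout is illustrative.

```python
import numpy as np
import krippendorff  # third-party package: pip install krippendorff

def simulation_metrics(human_preds, model_preds):
    """human_preds: (num_annotators, num_datapoints) array of simulated class ids.
    model_preds: (num_datapoints,) array of the model's actual predictions."""
    accuracy = (human_preds == model_preds[None, :]).mean()      # correctness w.r.t. the model
    alpha = krippendorff.alpha(reliability_data=human_preds,
                               level_of_measurement="nominal")   # inter-rater agreement
    return accuracy, alpha
```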
Quantitative results: We show these results in Table
2 and find that having access to all stages in
MultiViz leads to significantly higher model simulation accuracy on
VQA 2.0, along with the lowest variance and the most consistent agreement between annotators. On fusion tasks with
MM-IMDb and
CMU-MOSEI, we also find that including each visualization stage consistently leads to higher correctness and agreement. More importantly, humans are able to simulate model predictions, regardless of whether the model made the correct prediction or not.
We also conducted
qualitative interviews to determine what users found useful in
MultiViz: (1) Users reported finding the local and global representation analysis particularly useful: global analysis with other datapoints that also maximally activate a feature representation was important for identifying similar concepts and assigning them to multimodal features. (2) Between Overview (
U + C) and Feature (
Rℓ + Rg + P) visualizations, users found Feature visualizations more useful in
\(31.7\%\),
\(61.7\%\), and
\(80.0\%\) of the time under settings (3), (4), and (5) respectively, and found the Overview page more useful on the remaining datapoints. This means that for each stage, there exists a significant fraction of datapoints where that stage is most needed. (3) While it may be possible to determine the model's prediction with a subset of stages, users reported that having more stages confirming the same prediction made them much more confident in their answer, which is quantitatively substantiated by the higher accuracy, lower variance, and higher agreement in human predictions. For
MM-IMDb, we were especially surprised to find that including
the C stage actually helped, since MM-IMDb did not seem to be a task that relies heavily on cross-modal interactions. We also include additional experiments and visualizations in Appendix
B.1.
4.2 Representation interpretation
We now take a deeper look to check that MultiViz generates accurate explanations of multimodal representations. Using local and global representation visualizations, can humans consistently assign interpretable natural-language concepts to previously uninterpretable features? Using VQA 2.0, we perform a representation interpretation experiment, where we give human annotators visualizations of a particular representation feature and ask them to describe what concept they think that feature represents. We recruit 15 human annotators (with the same qualifications as those in the model simulation experiment) and divide them into 3 groups of 5. Each group is given a different setting (with different amounts of MultiViz visualizations available):
(1)
Rℓ: local visualization analysis of unimodal and cross-modal interactions in z only for the given local datapoint.
(2)
Rℓ + Rg (no viz): including both local visualization analysis of the given datapoint as well as global analysis through retrieving, but not visualizing, similar datapoints that also maximally activate feature z.
(3)
Rℓ + Rg: we further add visualizations of highlighted unimodal and cross-modal interactions of global datapoints, resulting in the full MultiViz functions.
We gave the same 13 representation features to all 15 human annotators, where the first feature serves as an example and the other 12 are the ones we actually record for the experiment. The instructor first explains to each annotator what each visualization means and then walks through the first feature together with the annotator. The annotator then writes down a concept for each of the other 12 features on their own. We also ask each annotator to rate, on a scale of 1-5, how confident they are that the feature indeed represents this concept.
Quantitative results: Since there are no ground-truth labels for feature concepts, we rely on annotator confidence (1-5 scale) and annotator agreement [
44] as a proxy for accuracy. Once we have collected all 180 annotations (15 annotators each on 12 features), we manually cluster them into 29 distinct concepts. For example, annotations like "things to wear", "t-shirts", and "clothes" all belong to the "clothes" concept; all color-related annotations belong to the "colors" concept; and "material question", "made-of question", and "material of object" all belong to the "material" concept. We then compute an inter-rater agreement score for each feature within each group of 5 annotators using Krippendorff’s alpha with 29 possible categories. We report inter-rater agreement and average confidence in Table
3.
As shown in Table
3, when annotators are given both local and global visualizations, they are able to assign concepts more consistently (higher inter-rater agreement) and more confidently (higher average confidence scores). Under setting 3, with full MultiViz visualizations of feature representations, the 5 annotators agreed completely with each other on 7 out of 12 features, which is notable given the large number of possible concepts annotators could assign to each feature. This shows that our visualizations (Rℓ and Rg) help humans better understand what concept (if any) each representation feature captures, and that the Rg examples and visualizations are especially helpful.
Qualitative interviews: We show examples of human-assigned concepts in Figure
7 (more in Appendix
B.2). Note that the 3 images in each box of Figure
7 (even without feature highlighting) do constitute a visualization generated by
MultiViz, as they belong to data instances that maximize the value of the feature neuron (i.e., Rg in stage 3, multimodal representations). Without
MultiViz, feature interpretation would require combing through the entire dataset. Participants also noted that the feature visualizations made them much more confident in their decisions when the highlights matched the concept. Taking as an example Figure
7 (top left), the visualizations serve to highlight what the model’s feature neuron is learning (i.e., a person holding sports equipment) rather than what category of datapoint it is. If the visualization were different (e.g., highlighting the ground), then users would have to conclude that the feature neuron is capturing ‘
outdoor ground’ rather than ‘
sports equipment’. Similarly, for text highlights (Figure
7 top right), without using
MultiViz to highlight ‘
counter’, ‘
countertop’, and ‘
wall’, along with the image cross-modal interactions corresponding to these entities, one would not be able to deduce that the feature asks about material: it could also represent ‘
what’ questions, or ‘
household objects’, and so on. These conclusions can only be reliably deduced with all MultiViz stages.
4.3 Error analysis
We further examine a case study of error analysis on trained models. We task 10 human users with using MultiViz to highlight the errors that a multimodal model exhibits, categorizing each error into one of 3 stages:
(1)
Unimodal perception error: The model fails to recognize certain unimodal features or aspects. (For example, in Figure
8 top left example, the FRCNN object detector was unable to recognize the thin red streak as an object).
(2)
Cross-modal interaction error: The model fails to capture important cross-modal interactions such as aligning words in question with relevant parts or detected objects in image. (For example, in Figure
8, first example in the middle column, the model erroneously aligns "creamy" with the piece of carrot).
(3)
Prediction errors: The model is able to perceive correct unimodal features and their cross-modal interactions, but fails to reason through them to produce the correct prediction. (For example, in Figure
8 top right example, the model was able both to identify the chair perfectly with the object detector and to associate it with the word "chair" in the question (as shown by second-order gradient analysis), but it was still unable to reason correctly over this information to predict the correct answer).
For each of the 2 datasets used in this experiment (VQA and CLEVR), the 10 human annotators are divided into 2 groups of 5, one group for each setting: (1) under the MultiViz setting, for each datapoint, the annotator is given access to the full MultiViz webpage as well as live second-order gradients (i.e., the annotator may request the second-order gradient for a specific subset of words in the question and will be presented with the result); (2) under the No Viz setting, the annotator is given nothing but the original datapoint, the correct answer, and the predicted answer. Each annotator classifies each datapoint into one of the three categories above and also rates their confidence in categorizing the error on a scale of 1-5.
Using 20 datapoints per setting, these experiments with 10 users on 2 datasets and 2 models involve roughly 15 total hours of users interacting with
MultiViz. From Table
4, we find that
MultiViz enables humans to consistently categorize model errors into one of the 3 stages. We show examples that human annotators assigned to the different error categories in Figure
8 (more in Appendix
B.3). Out of the 23 total errors, human annotators reported that on average 8.8 of them are category 1 (unimodal perception errors), 6.8 are category 2 (cross-modal interaction errors), and 7.4 are category 3 (prediction errors). This suggests that the majority of errors in LXMERT are still caused by misunderstanding basic unimodal concepts and cross-modal alignments rather than by high-level reasoning over the perceived information, and that possible future directions for improving the model pipeline are to use better unimodal encoders (than FRCNN) and to find ways to encourage the model to align visual and textual concepts correctly.
4.4 A case study in model debugging
Following error analysis, we investigate in more depth one of the errors of a pretrained
LXMERT model fine-tuned on
VQA 2.0. Specifically, we first found the top 5 penultimate-layer neurons that are most activated on erroneous datapoints. Inspecting these neurons through
MultiViz, users found that 2/5 neurons were consistently related to questions asking about color, which highlighted the model’s failure to identify color correctly (especially
blue). The model has an accuracy of only
\(5.5\%\) amongst all
blue-related points (i.e., either have
blue as correct answer or predicted answer), and these failures account for
\(8.8\%\) of all model errors. We show these examples in Figure
9: observe that the model is often able to capture unimodal and cross-modal interactions perfectly, but fails to identify color at prediction.
Curious as to the source of this error, we looked deeper into the source code for the entire pipeline of
LXMERT, including that of its image encoder, Faster R-CNN [
77]
We in fact uncovered a bug in the data preprocessing for Faster R-CNN in the popular Hugging Face repository that swapped the image storage format between RGB and BGR, which was responsible for these errors. This presents a concrete use case of
MultiViz: through visualizing each stage, we were able to (1) isolate the source of the bug (at prediction and not unimodal perception or cross-modal interactions), and (2) use representation analysis to localize the bug to the specific color concept.
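To illustrate the kind of preprocessing bug involved (a simplified illustration, not the actual Hugging Face or Faster R-CNN code), the sketch below shows how an RGB/BGR channel-order mismatch silently swaps color information:

```python
import numpy as np

def reverse_channels(image):
    """Flip the last (channel) axis of an HxWx3 array, converting RGB <-> BGR."""
    return image[..., ::-1]

# Caffe-derived Faster R-CNN backbones typically expect BGR input; if an RGB image
# is passed through unconverted (or converted twice), a blue pixel is read as red.
blue_rgb = np.array([[[0, 0, 255]]], dtype=np.uint8)   # pure blue in RGB
print(reverse_channels(blue_rgb))                      # [[[255 0 0]]] -- seen as red
```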
5 Conclusion
This paper proposes MultiViz, a comprehensive method for analyzing and visualizing multimodal models. MultiViz scaffolds the interpretation problem into 4 modular stages of unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction, before providing existing and newly proposed analysis tools in each stage. MultiViz is designed to be modular (encompassing existing analysis tools and encouraging research towards understudied stages), general (supporting diverse modalities, models, and tasks), and human-in-the-loop (providing a visualization tool for user-centric model interpretation, error analysis, and debugging), qualities which we strive to uphold by ensuring public access and regular updates based on community feedback.
5.1 Limitations and Future Directions
We are aware of some directions in which MultiViz can still be improved and outline these for future work:
(1)
Number of prediction classes: For complex tasks like VQA 2.0, where there are over three thousand prediction classes, there will be many sparse output weights, so it is difficult to find globally related datapoints for representation analysis. On the other hand, for VQA 2.0 subsets with ‘yes/no’ answer choices (i.e., too few classes), we found that the activated final-layer features overlap too much to visualize reliably, and we have to extend MultiViz to rely on intermediate-layer features instead. MultiViz works best with a moderate number of prediction classes, such as in multimodal emotion recognition, multiple-choice question answering, and similar tasks.
(2)
Model requirements: Currently, the two requirements on models are that they produce categorical outputs (classification) and that gradients can be easily computed via autograd. For regression, we can discretize the output space into categorical outputs. The second requirement means that we cannot currently support architectures with discrete steps [
60] that prevent gradient flow. We plan to extend
MultiViz via approximate gradients such as perturbation or policy gradients to handle these cases.
(3)
User studies: We spent considerable time finding and training users, including a training video before each study session. Future work can explore more standardized ways of human-in-the-loop interpretation and debugging of multimodal models, and we hope that MultiViz can provide the initial data, models, tools, and evaluation as a step in this direction.
(4)
Evaluating interpretability remains a challenge [
14,
19,
36,
81,
84]. Model interpretability (1) is highly subjective across different population subgroups [
7,
45], (2) requires high-dimensional model outputs as opposed to low-dimensional prediction objectives [
71], and (3) has desiderata that change across research fields, populations, and time [
63]. We plan to continuously expand
MultiViz through community inputs for new metrics to evaluate interpretability methods. Some metrics we have in mind include those for measuring faithfulness, as proposed in recent work [
14,
19,
36,
59,
81,
84].
(5)
Finally, we plan to engage real-world stakeholders to evaluate the usefulness of these multimodal interpretation tools: stakeholders in the healthcare domain to evaluate interpretability on the MIMIC dataset, and those in the affective computing domain to evaluate interpretability on the CMU-MOSEI dataset. We also refer the reader to recent work examining the issues surrounding real-world deployment of interpretable machine learning [
10,
17,
45].
Acknowledgments
This material is based upon work partially supported by the NSF (Awards #1722822 and #1750439), NIH (Awards #R01MH125740, #R01MH096951, and #U01MH116925), Meta, and BMW of North America. PPL is partially supported by a Facebook PhD Fellowship and a CMU’s Center for Machine Learning and Health Fellowship. RS is supported in part by ONR award N000141812861 and DSTA. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF, NIH, Facebook, CMU’s Center for Machine Learning and Health, ONR, or DSTA, and no official endorsement should be inferred. We thank Jane Hsieh and the anonymous reviewers for valuable feedback on the paper. Finally, we would also like to acknowledge NVIDIA’s GPU support.