DOI: 10.1145/3544549.3585604
Work in Progress

MultiViz: Towards User-Centric Visualizations and Interpretations of Multimodal Models

Published: 19 April 2023

Abstract

Human-computer interaction is inherently multimodal, which has led to substantial interest in building interpretable, interactive, and reliable multimodal interfaces. However, modern multimodal models and interfaces are typically black-box neural networks, which makes it challenging to understand their internal mechanics. How can we visualize their internal workings in order to empower stakeholders to understand model behavior, perform model debugging, and promote trust in these models? Our paper proposes MultiViz, a method for analyzing the behavior of multimodal models via 4 stages: (1) unimodal importance, (2) cross-modal interactions, (3) multimodal representations, and (4) multimodal prediction. MultiViz includes modular visualization tools for each stage before combining outputs from all stages through an interactive and human-in-the-loop API. Through user studies with 21 participants on 8 trained models across 6 real-world tasks, we show that the complementary stages in MultiViz together enable users to (1) simulate model predictions, (2) assign interpretable concepts to features, (3) perform error analysis on model misclassifications, and (4) use insights from error analysis to debug models. MultiViz is publicly available at https://github.com/pliang279/MultiViz, will be regularly updated with new visualization tools and metrics, and welcomes input from the community.
Figure 1: We scaffold the problem of multimodal interpretability and propose MultiViz, a comprehensive analysis method encompassing a set of fine-grained analysis stages: (1) unimodal importance identifies the contributions of each modality, (2) cross-modal interactions uncover how different modalities relate with each other and the types of new information discovered from these relationships, (3) multimodal representations study how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction studies how these features are composed to make a prediction. MultiViz provides an interactive visualization API across multimodal datasets and models, enabling user-centric and human-in-the-loop visualizations of multimodal models for model simulation, representation understanding, error analysis, and model debugging.

1 Introduction

Many real-world problems are multimodal: from the early research on audio-visual speech recognition [21] to the recent interest in language, vision, and video understanding [21] for applications such as multimedia [52, 64], affective computing [51, 75], robotics [43, 47], finance [34], dialogue [73], human-computer interaction [20, 66], and healthcare [24, 98]. In turn, their impact on real-world applications has inspired recent research in visualizing and understanding their internal mechanics [15, 55, 68, 69, 71] as a step towards building interpretable, interactive, and reliable multimodal interfaces [32, 35, 67]. However, modern parameterizations of multimodal models are typically black-box neural networks [48, 57]. How can we enable users to accurately visualize and understand the internal modeling of multimodal information and interactions for effective human-AI collaboration?
As a step towards human-centric interpretations of multimodal models, this paper performs a set of detailed user studies to evaluate how human users use interpretation tools to understand multimodal models. Specifically, we build a general-purpose toolkit, MultiViz, for visualizing and understanding multimodal models upon a set of existing and newly proposed interpretation tools: gradient-based feature attribution [30, 61, 78], higher-order gradients, EMAP [33], DIME [58], and sparse concept models [96]. MultiViz first scaffolds the problem of interpretability into 4 stages: (1) unimodal importance: identifying the contributions of each modality towards downstream modeling and prediction, (2) cross-modal interactions: uncovering the various ways in which different modalities can relate with each other and the types of new information possibly discovered as a result of these relationships, (3) multimodal representations: how unimodal and cross-modal interactions are represented in decision-level features, and (4) multimodal prediction: how decision-level features are composed to make a prediction for a given task. Through extensive user studies, we show that MultiViz helps users (1) gain a deeper understanding of model behavior as measured via a proxy task of model simulation, (2) assign interpretable language concepts to previously uninterpretable features, and (3) perform error analysis on model misclassifications. Finally, using takeaways from error analysis, we present a case study of human-in-the-loop model debugging. We release MultiViz datasets, models, and code at https://github.com/pliang279/MultiViz.

2 Related Work

Interpretable ML aims to further our understanding and trust of ML models, enable model debugging, and use these insights for joint decision-making between stakeholders and AI [1, 17, 18, 26]. Interpreting multimodal models is of particular interest in the HCI [41, 65, 68], multimedia [11], user interface [50, 80, 97], and mobile interface [85, 95] communities since interactions between humans and computer interfaces are naturally multimodal (e.g., via voice [72], gestures [62], touch [28], and even taste [50]). We categorize related work in interpreting multimodal models into:
Unimodal importance: Several approaches have focused on building interpretable components for unimodal importance through soft [71] and hard attention mechanisms [16]. When aiming to explain black-box multimodal models, related work relies primarily on gradient-based visualizations [8, 22, 82] and feature attributions (e.g., LIME [78], Shapley values [61]) to highlight regions of the image to which the model attends.
Cross-modal interactions: Recent work investigates the activations of pretrained transformers [13, 49], performs diagnostic experiments through specially curated inputs [23, 46, 70, 89], or trains auxiliary explanation modules [40, 71]. Particularly related to our work are EMAP [33], which disentangles the effects of unimodal (additive) contributions from cross-modal interactions in multimodal tasks, M2Lens [94], an interactive system for visualizing multimodal sentiment analysis models through both unimodal and cross-modal contributions, and other quantification [69] and visualization [86] tools for multimodal interactions.
Representation and prediction: Existing approaches have used language syntax (e.g., the question in VQA) for compositionality into higher-level features [2, 4, 93]. Similarly, logical statements have been integrated with neural networks for interpretable logical reasoning [27, 87]. However, these are typically restricted to certain modalities or tasks. Finally, visualizations have also uncovered several biases in models and datasets (e.g., unimodal biases in VQA questions [3, 12] or gender biases in image captioning [32]). We believe that MultiViz will enable the identification of biases across a wider range of modalities and tasks.

3 Multimodal Visualization

Figure 2: Examples of cross-modal interactions on CLEVR captured by our proposed second-order gradient method. Left: MDETR model, which picks up the correct cross-modal interaction and predicts the correct answer. Right: CNN+LSTM+SA model, which does not pick up the correct cross-modal interaction and produces an incorrect answer. Observe that visualizing the incorrect interactions identified by the model is a step towards error analysis. For each example, the image on the left is a heatmap of the absolute second-order gradient for each pixel, and the image on the right shows the top 2 bounding boxes with the highest average absolute second-order gradient per pixel (top-1 box in red, top-2 box in blue).
Figure 3: Examples of cross-modal interactions captured by ViLT on Flickr-30k discovered by second-order gradients.
Figure 4: Examples of cross-modal interactions captured by CLIP on Flickr-30k discovered by second-order gradients.
We assume multimodal datasets take the form \(\mathcal{D} = \lbrace (\mathbf{x}_1, \mathbf{x}_2, y)_{i=1}^n \rbrace = \lbrace (x_1^{(1)}, x_1^{(2)}, ..., x_2^{(1)}, x_2^{(2)}, ..., y)_{i=1}^n \rbrace\), with boldface \(\mathbf{x}\) denoting an entire modality, each \(x_1^{(j)}, x_2^{(j)}\) indicating modality atoms (i.e., fine-grained sub-parts of modalities that we would like to analyze, such as individual words in a sentence, object regions in an image, or time-steps in time-series data), and y denoting the label. These datasets enable us to train a multimodal model \(\hat{y} = f(\mathbf{x}_1, \mathbf{x}_2; \theta)\) which we are interested in visualizing. We scaffold the problem of interpreting f into unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction. Each of these stages provides complementary information on the decision-making process (see Figure 1). We now describe each step in detail:
Unimodal importance (U) aims to understand the contributions of each modality towards prediction. It builds upon ideas of gradients [8, 22, 82] and feature attributions (e.g., LIME [78], Shapley values [61]). These approaches take in a modality of interest \(\mathbf{x}\) and return importance weights across the atoms of \(\mathbf{x}\).
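As an illustration of this stage, below is a minimal sketch of gradient-based unimodal importance, assuming a differentiable PyTorch model that maps (text embeddings, image features) to class logits; the function and argument names are illustrative, and MultiViz also supports perturbation-based attributions such as LIME and SHAP, which are not shown here.

```python
import torch

def unimodal_saliency(model, x_text, x_image, target_class):
    """Gradient-based importance of each modality's atoms for one prediction.

    Sketch only: assumes `model` takes (text embeddings, image features) and
    returns a vector of class logits.
    """
    x_text = x_text.clone().requires_grad_(True)    # (num_words, d_text)
    x_image = x_image.clone().requires_grad_(True)  # (num_regions, d_img)

    logits = model(x_text, x_image)
    logits[target_class].backward()

    # Aggregate |gradient * input| over feature dimensions to obtain one
    # importance score per atom (word or image region).
    text_importance = (x_text.grad * x_text).abs().sum(dim=-1)
    image_importance = (x_image.grad * x_image).abs().sum(dim=-1)
    return text_importance, image_importance
```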
Cross-modal interactions (C) describe how atoms from different modalities relate with each other and the types of new information discovered as a result of these relationships. Formally, a function f captures statistical non-additive interactions between 2 unimodal atoms x1 and x2 if and only if f cannot be decomposed into a sum of unimodal subfunctions g1, g2 such that f(x1, x2) = g1(x1) + g2(x2) [25, 83, 91, 92]. Using this definition, we include (1) EMAP [33], which decomposes f(x1, x2) = g1(x1) + g2(x2) + g12(x1, x2) into strictly unimodal representations g1, g2, and cross-modal representation \(g_{12} = f - \mathbb{E}_{x_1}(f) - \mathbb{E}_{x_2}(f) + \mathbb{E}_{x_1,x_2}(f)\) to quantify the degree of global cross-modal interactions across an entire dataset, and (2) DIME [58], which extends EMAP using feature visualization on each disentangled representation locally (per datapoint). We also propose a higher-order gradient-based approach by identifying that a function f exhibits interactions iff \(\mathbb{E}_{x_1,x_2} \left[ \left( \frac{\partial^2 f(x_1,x_2)}{\partial x_1 \partial x_2} \right)^2 \right] > 0\). Taking a second-order gradient (extending first-order gradient-based approaches [31, 78, 99]) zeros out the unimodal terms and isolates the interaction terms.
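The following is a minimal sketch of the second-order gradient idea, again assuming a differentiable PyTorch model; it is not the exact MultiViz implementation, and the names are illustrative.

```python
import torch

def second_order_interaction(model, x_text, x_image, target_class):
    """Detect cross-modal interactions via the mixed second-order gradient.

    If the mixed partial derivative of f with respect to a text atom and an
    image atom is nonzero, the model captures a non-additive interaction
    between them (as stated in the text).
    """
    x_text = x_text.clone().requires_grad_(True)
    x_image = x_image.clone().requires_grad_(True)

    logits = model(x_text, x_image)
    # First-order gradient w.r.t. the text atoms, keeping the graph so that we
    # can differentiate a second time.
    (grad_text,) = torch.autograd.grad(
        logits[target_class], x_text, create_graph=True
    )
    # Differentiate the summed text gradient w.r.t. the image atoms: nonzero
    # entries indicate text-image atom pairs that interact non-additively.
    (mixed_grad,) = torch.autograd.grad(grad_text.sum(), x_image)
    # One interaction score per image atom (e.g., per pixel or region).
    return mixed_grad.abs().sum(dim=-1)
```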
Multimodal representations aim to understand how information is represented at the feature representation level. Specifically, given a trained multimodal model f, define the matrix \(M_z \in \mathbb{R}^{N \times d}\) as the penultimate layer of f representing (uninterpretable) deep feature representations. For the \(i\)th datapoint, the row \(\mathbf{z} = M_z(i)\) collects a set of individual feature representations \(z_1, z_2, ..., z_d \in \mathbb{R}\). Local representation analysis (R) informs the user of the parts of the original datapoint that activate feature zj (via unimodal or cross-modal visualizations with respect to feature zj). Global representation analysis (Rg) provides the user with the top k datapoints that also maximally activate feature zj, which is especially useful in helping humans assign interpretable language concepts to each feature by looking at similarly activated input regions across datapoints (e.g., the concept of color in Figure 1, right).
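A minimal sketch of the global retrieval step (Rg), assuming the penultimate-layer feature matrix \(M_z\) has already been collected over the dataset; names and the choice of NumPy are illustrative, not the full MultiViz pipeline.

```python
import numpy as np

def top_activating_datapoints(penultimate_feats, feature_idx, k=3):
    """Global representation analysis (Rg): retrieve the k datapoints that
    maximally activate a chosen penultimate-layer feature z_j.

    `penultimate_feats` is the N x d matrix M_z of deep features.
    """
    activations = penultimate_feats[:, feature_idx]  # value of z_j per datapoint
    top_idx = np.argsort(-activations)[:k]           # indices of the top-k activations
    return top_idx, activations[top_idx]

# Usage: visualize the retrieved datapoints (and their unimodal/cross-modal
# highlights w.r.t. z_j) to assign a natural-language concept to the feature.
```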
Figure 5: MultiViz provides an interactive visualization API across multimodal datasets and models. The overview page shows general unimodal importance, cross-modal interactions, and prediction weights, while the features page enables local and global analysis of specific user-selected features.
Multimodal prediction (P): Finally, the prediction step takes the set of feature representations z1, z2, ..., zd and composes them to form higher-level abstract concepts suitable for a task. We approximate the prediction process with a sparse linear combination of penultimate layer features [96]. Given the penultimate layer \(M_z \in \mathbb{R}^{N \times d}\), we fit a linear model \(\mathbb{E}\left(Y \mid X = x\right) = M_z \beta\) (bias β0 omitted for simplicity) and solve for sparsity using \(\hat{\beta} = \mathop{\arg\min}_{\beta} \frac{1}{2N} \Vert M_z \beta - y \Vert_2^2 + \lambda_1 \Vert \beta \Vert_1 + \lambda_2 \Vert \beta \Vert_2^2\). The resulting understanding starts from the set of learned weights with the highest non-zero coefficients βtop = {β(1), β(2), ...} and the corresponding ranked features ztop = {z(1), z(2), ...}. βtop tells the user how the features ztop are composed to make a prediction, and ztop can then be visualized with respect to unimodal and cross-modal interactions using the representation stage.
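A sketch of this elastic-net fit using scikit-learn; the hyperparameter values and the binary class indicator `y_class` are illustrative assumptions, not the exact setup used in the paper.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_sparse_prediction_head(M_z, y_class, l1=1e-3, l2=1e-3):
    """Approximate the prediction stage with a sparse linear model over
    penultimate-layer features, using the elastic-net penalty from the text.

    M_z is the N x d feature matrix; y_class is a 0/1 indicator for the class
    being explained.
    """
    # scikit-learn's ElasticNet objective uses alpha * l1_ratio for the L1 term
    # and 0.5 * alpha * (1 - l1_ratio) for the L2 term, so we convert.
    alpha = l1 + 2 * l2
    l1_ratio = l1 / alpha
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=True)
    model.fit(M_z, y_class)

    beta = model.coef_                           # sparse weights over features
    ranked = np.argsort(-np.abs(beta))           # ranked feature indices z_top
    top_features = ranked[np.abs(beta[ranked]) > 0]
    return beta, top_features
```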
Level | Methods
Unimodal importance | Grad [8, 22, 82], LIME [31, 78, 99], SHAP [61, 79]
Cross-modal interactions | Cross-modal {Grad, LIME, SHAP} (new), EMAP [33], DIME [58]
Representation | Local & global analysis (new)
Prediction | Sparse linear model (new)
Table 1: We scaffold the problem of interpreting multimodal models into the stages listed above. For each stage, MultiViz includes existing and newly proposed approaches for visualizing models across modalities and tasks.

3.1 MultiViz visualization interface

We summarize the included approaches for visualizing each step of the multimodal process in Table 1. Figure 6 also shows an illustration of the overall code structure available in MultiViz, spanning various multimodal data loaders, recent multimodal models, visualization methods, and visualization tools.
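To make the structure in Figure 6 concrete, below is a hypothetical sketch of how such modules might compose; all class and function names are illustrative placeholders and do not correspond to the actual MultiViz API.

```python
# Hypothetical sketch of how the module types in Figure 6 might compose.
# All names below are illustrative placeholders, not the actual MultiViz code.

class DatasetWrapper:
    """Dataset class: loads one multimodal datapoint (modalities + label)."""
    def get_datapoint(self, idx):
        raise NotImplementedError

class ModelWrapper:
    """Model class: wraps a trained model, exposing prediction and gradients."""
    def predict(self, datapoint):
        raise NotImplementedError
    def gradient(self, datapoint, target_class):
        raise NotImplementedError

def analyze_datapoint(dataset, model, analyses, visualizers, idx, target_class):
    """Run each analysis script on one datapoint and render its visualization."""
    datapoint = dataset.get_datapoint(idx)
    prediction = model.predict(datapoint)
    views = {}
    for name, analysis in analyses.items():      # e.g., LIME, EMAP, DIME
        result = analysis(model, datapoint, target_class)
        views[name] = visualizers[name](datapoint, result)
    return prediction, views
```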
The final MultiViz interface combines outputs from all stages through an interactive and human-in-the-loop API. We show the overall MultiViz interface in Figure 5 using an example from the VQA dataset [5]. This interactive API enables users to choose multimodal datasets and models using the control panel on the left side. The control panel also shows all information about the data point (original image and question in the case of VQA) as well as the ground truth (“GT”) label and the predicted (“Pred”) class. Clicking the model prediction will show visualizations with respect to that specific class label, with an Overview Page showing general unimodal importance, cross-modal interactions, and prediction weights, as well as a Feature Page for local and global analysis of user-selected features. Specifically, the right side of the Overview Page shows the top 5 features with the highest weights for a selected prediction class (the weights are shown as numbers on the lines). By clicking the circle on the graph representing each feature, the user can access R and Rg visualizations of that specific feature via the Feature Page. Please see Appendix B for more examples.
Figure 6: An illustration of the modules available in MultiViz. Each dataset class provides multimodal dataset loading; each model class is a wrapper for a model and supports functionalities like making predictions and taking gradients; each analysis script performs a certain analysis method (such as LIME); and the visualization tools then output analysis results in a visual medium.

4 Experiments

Our experiments are designed to verify the usefulness and complementarity of the 4 MultiViz stages. We start with a model simulation experiment to test the utility of each stage towards overall model understanding (Section 4.1). We then dive deeper into the individual stages by testing how well MultiViz enables representation interpretation (Section 4.2) and error analysis (Section 4.3), before presenting a case study of model debugging from error analysis insights (Section 4.4). We showcase the following selected experiments and defer results on other datasets to Appendix B.
Research area | QA | Fusion | Fusion
Dataset | VQA 2.0 | MM-IMDb | CMU-MOSEI
Model | LXMERT | LRTF | MulT
Metric | Correctness / Agreement | Correctness / Agreement | Correctness / Agreement
U | 55.0 ± 0.0 / 0.39 | 50.0 ± 13.2 / 0.34 | 71.7 ± 17.6 / 0.39
U + C | 65.0 ± 5.0 / 0.50 | 53.7 ± 7.6 / 0.51 | 76.7 ± 10.4 / 0.45
U + C + R | 61.7 ± 7.6 / 0.57 | 56.7 ± 7.6 / 0.59 | 78.3 ± 2.9 / 0.42
U + C + R + Rg | 71.7 ± 15.3 / 0.61 | 61.7 ± 7.6 / 0.43 | 100.0 ± 0.0 / 1.00
MultiViz | 81.7 ± 2.9 / 0.86 | 65.0 ± 5.0 / 0.60 | 100.0 ± 0.0 / 1.00
Table 2: Model simulation: We tasked 15 human users (3 users for each ablation setting) to simulate model predictions based on evidence visualized by MultiViz. Human annotators with access to all MultiViz stages are able to simulate model predictions (regardless of whether the model's prediction was correct) with high accuracy and annotator agreement, representing a step towards model understanding.
Setup: We use a large suite of datasets from MultiBench [54] which span real-world fusion [6, 37, 100], retrieval [74], and QA [29, 38] tasks. For each dataset, we test a corresponding state-of-the-art model: MulT [90], LRTF [56], LF [9], ViLT [42], CLIP [76], CNN-LSTM-SA [38], MDETR [39], and LXMERT [88]. These cover models both pretrained and trained from scratch. We summarize all 6 datasets and 8 models and provide details in Appendix B. Participation in all human studies was fully voluntary and without compensation. There were no participant risks involved, and we obtained consent from all participants prior to each short study. This line of research is aligned with similar IRB-exempt annotation studies at our institution. The authors manually took notes on all results and feedback, in such a manner that the identity of the human subjects cannot readily be ascertained, directly or through identifiers linked to the subjects. Participants were not the authors nor in the same research groups as the authors, but they all hold or are working towards a graduate degree in a STEM field and have knowledge of ML models. None of the participants knew about this project before their session, and each participant only interacted with the setting they were involved in, so it is not possible to manipulate users to achieve desired outcomes.

4.1 Model simulation

We first design a model simulation experiment to determine if MultiViz helps users of multimodal models gain a deeper understanding of model behavior. If MultiViz indeed generates human-understandable explanations, humans should be able to accurately simulate model predictions given these explanations alone, as measured by correctness with respect to actual model predictions and annotator agreement (Krippendorff’s alpha [44]). To investigate the utility of each stage in MultiViz, we design a human study to see how accurately humans can simulate model predictions based on MultiViz analysis results and visualizations. For each dataset (VQA 2.0, MM-IMDb, CMU-MOSEI), we divide 15 total human annotators into 5 groups of 3, each group getting one of the five settings below, and then compute average accuracy and inter-rater agreement within each group (a sketch of this metric computation follows the list of settings).
(1)
U: Users are only shown the unimodal importance (U) of each modality towards label y.
(2)
U + C: Users are also shown cross-modal interactions (C) highlighted towards label y.
(3)
U + C + R: Users are also shown local analysis (R) of unimodal and cross-modal interactions of top features ztop = {z(1), z(2),...} maximally activating label y.
(4)
U + C + R + Rg: Users are additionally shown global analysis (Rg) through similar datapoints that also maximally activate top features ztop for label y.
(5)
MultiViz (U + C + R + Rg + P): The entire MultiViz method by further including visualizations of the final prediction (P) stage: sorting top ranked feature neurons ztop = {z(1), z(2),...} with respect to their coefficients βtop = {β(1), β(2),...} and showing these coefficients to the user.
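As referenced above, a minimal sketch of the correctness and agreement computation, assuming the `krippendorff` PyPI package for Krippendorff's alpha; array shapes and function names are illustrative.

```python
import numpy as np
import krippendorff  # pip install krippendorff

def simulation_metrics(annotations, model_preds):
    """Score one group of annotators on the model-simulation task.

    `annotations` is an (n_annotators x n_datapoints) array of predicted class
    ids; `model_preds` holds the model's actual predictions per datapoint.
    """
    annotations = np.asarray(annotations)
    model_preds = np.asarray(model_preds)

    # Correctness: how often each annotator matches the model's prediction.
    per_annotator_acc = (annotations == model_preds).mean(axis=1)
    correctness_mean = 100 * per_annotator_acc.mean()
    correctness_std = 100 * per_annotator_acc.std()

    # Agreement: Krippendorff's alpha over the annotators' nominal labels.
    alpha = krippendorff.alpha(reliability_data=annotations,
                               level_of_measurement="nominal")
    return correctness_mean, correctness_std, alpha
```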
Quantitative results: We show these results in Table 2 and find that having access to all stages in MultiViz leads to the highest model simulation accuracy on VQA 2.0, along with the lowest variance and the most consistent agreement between annotators. On the fusion tasks MM-IMDb and CMU-MOSEI, we also find that including each visualization stage consistently leads to higher correctness and agreement. More importantly, humans are able to simulate model predictions regardless of whether the model made the correct prediction or not.
Figure 7: Examples of human-annotated concepts using MultiViz on feature representations. We find that the features separately capture image-only, language-only, and multimodal concepts.
We also conducted qualitative interviews to determine what users found useful in MultiViz: (1) Users reported finding the local and global representation analysis particularly useful: global analysis over other datapoints that also maximally activate a feature representation was important for identifying similar concepts and assigning them to multimodal features. (2) Between Overview (U + C) and Feature (R + Rg + P) visualizations, users found the Feature visualizations more useful \(31.7\%\), \(61.7\%\), and \(80.0\%\) of the time under settings (3), (4), and (5) respectively, and found the Overview page more useful for the remaining datapoints. This means that for each stage, there exists a significant fraction of datapoints where that stage is most needed. (3) While it may be possible to determine the model's prediction with a subset of stages, users reported that having more stages confirming the same prediction made them considerably more confident in their answer, which is quantitatively substantiated by the higher accuracy, lower variance, and higher agreement in human predictions. For MM-IMDb, we were especially surprised to find that including the C stage helped, since MM-IMDb did not seem to be a task that relies much on cross-modal interactions. We also include additional experiments and visualizations in Appendix B.1.

4.2 Representation interpretation

We now take a deeper look to check that MultiViz generates accurate explanations of multimodal representations. Using local and global representation visualizations, can humans consistently assign interpretable concepts in natural language to previously uninterpretable features? Using VQA 2.0, we perform a representation interpretation experiment in which we give human annotators visualizations of a particular representation feature and ask them to describe what concept they think that feature represents. We recruited 15 human annotators (with the same qualifications as those in the model simulation experiment) and divided them into 3 groups of 5. Each group is given a different setting (with different amounts of MultiViz visualizations available):
(1)
R: local visualization analysis of unimodal and cross-modal interactions in z only for the given local datapoint.
(2)
R + Rg (no viz): including both local visualization analysis of the given datapoint as well as global analysis through retrieving, but not visualizing, similar datapoints that also maximally activate feature z.
(3)
R + Rg: we further add visualizations of highlighted unimodal and cross-modal interactions of global datapoints, resulting in the full MultiViz functions.
We gave the same 13 representation features to all 15 human annotators, where the first feature serves as an example and only the other 12 are recorded for the experiment. The instructor first explains to each annotator what each visualization means and then goes over the first feature with them. The annotator then writes down a concept for the other 12 features on their own. We also ask each annotator to rate, on a scale of 1-5, how confident they are that the feature indeed represents the concept they assigned.
Quantitative results: Since there are no ground-truth labels for feature concepts, we rely on annotator confidence (1-5 scale) and annotator agreement [44] as a proxy for accuracy. Once we have collected all 180 annotations (15 annotators each on 12 features), we manually cluster them into 29 distinct concepts. For example, annotations like "things to wear", "t-shirts", and "clothes" all belong to the "clothes" concept; all color-related annotations belong to the "colors" concept; and "material question", "made-of question", and "material of object" all belong to the "material" concept. We then compute the inter-rater agreement score on each feature within each group of 5 annotators using Krippendorff’s alpha with 29 possible categories. We report inter-rater agreement and average confidence in Table 3.
Research area | QA
Dataset | VQA 2.0
Model | LXMERT
Metric | Confidence / Agreement
R | 1.74 ± 0.52 / 0.18
R + Rg (no viz) | 3.67 ± 0.45 / 0.60
R + Rg | 4.50 ± 0.43 / 0.69
Table 3: Across 15 human users (5 users for each of the 3 settings), we find that users are able to consistently assign concepts to previously uninterpretable multimodal features using both local and global representation analysis.
As shown in Table 3, when annotators are given both local and global visualizations, they assign concepts more consistently (higher inter-rater agreement) and more confidently (higher average confidence score). Under setting 3, with full MultiViz visualizations of feature representations, the 5 annotators agreed completely on 7 out of 12 features, which is notable given the large number of possible concepts that could be assigned to each feature. This shows that the R and Rg visualizations help humans understand what concept (if any) each representation feature captures, and that the Rg examples and visualizations are especially helpful.
Qualitative interviews: We show examples of human-assigned concepts in Figure 7 (more in Appendix B.2). Note that the 3 images in each box of Figure 7 (even without feature highlighting) do constitute a visualization generated by MultiViz, as they belong to data instances that maximize the value of the feature neuron (i.e., Rg in stage 3, multimodal representations). Without MultiViz, feature interpretation would require combing through the entire dataset. Participants also noted that feature visualizations made them considerably more confident in their decision when the highlights matched the concept. Taking the top-left example of Figure 7, the visualizations serve to highlight what the model’s feature neuron is learning (i.e., a person holding sports equipment), rather than what category of datapoint it is. If the visualization were different (e.g., the ground), then users would have to conclude that the feature neuron is capturing ‘outdoor ground’ rather than ‘sports equipment’. Similarly, for text highlights (Figure 7, top right), without using MultiViz to highlight ‘counter’, ‘countertop’, and ‘wall’, along with the image cross-modal interactions corresponding to these entities, one would not be able to deduce that the feature asks about material: it could also represent ‘what’ questions, ‘household objects’, and so on. These conclusions can only be reliably deduced with all MultiViz stages.
Figure 8: Examples of human-annotated error analysis using MultiViz on multimodal models. Using all stages provided in MultiViz enables fine-grained classification of model errors (e.g., errors in unimodal processing, cross-modal interactions, and predictions) for targeted debugging.

4.3 Error analysis

We further examine a case study of error analysis on trained models. We task 10 human users to use MultiViz and highlight the errors that a multimodal model exhibits by categorizing these errors into one of 3 stages:
(1)
Unimodal perception error: The model fails to recognize certain unimodal features or aspects. (For example, in the top-left example of Figure 8, the FRCNN object detector was unable to recognize the thin red streak as an object.)
(2)
Cross-modal interaction error: The model fails to capture important cross-modal interactions, such as aligning words in the question with relevant regions or detected objects in the image. (For example, in the first example of the middle column of Figure 8, the model erroneously aligns "creamy" with the piece of carrot.)
(3)
Prediction errors: The model is able to perceive the correct unimodal features and their cross-modal interactions, but fails to reason through them to produce the correct prediction. (For example, in the top-right example of Figure 8, the model was able to identify the chair with the object detector and associate it with the word "chair" in the question (as shown by second-order gradient analysis), but was still unable to reason over this information to predict the correct answer.)
Research area | QA | QA
Dataset | CLEVR | VQA 2.0
Model | CNN-LSTM-SA | LXMERT
Metric | Confidence / Agreement | Confidence / Agreement
No viz | 2.72 ± 0.15 / 0.05 | 2.15 ± 0.70 / 0.14
MultiViz | 4.12 ± 0.45 / 0.67 | 4.21 ± 0.62 / 0.60
Table 4: Across 10 human users (5 users for each of the 2 settings), we find that users are also able to categorize model errors into the stage in which they occur when given full MultiViz visualizations.
For each of the 2 datasets used in this experiment (VQA and CLEVR), 10 human annotators are divided into 2 groups of 5, one group per setting: (1) under the MultiViz setting, for each data point the annotator is given access to the full MultiViz webpage as well as live second-order gradients (i.e., the annotator may request the second-order gradient for a specific subset of words in the question and is shown the result); (2) under the No Viz setting, the annotator is given nothing but the original data point, the correct answer, and the predicted answer. Each annotator classifies each point into one of the three categories above and rates their confidence in categorizing the error on a scale of 1-5.
Using 20 datapoints per setting, these experiments with 10 users on 2 datasets and 2 models involve roughly 15 total hours of users interacting with MultiViz. From Table 4, we find that MultiViz enables humans to consistently categorize model errors into one of 3 stages. We show examples that human annotators classified into different error types in Figure 8 (more in Appendix B.3). Out of the 23 total errors, human annotators reported that on average 8.8 are category 1 (unimodal perception errors), 6.8 are category 2 (cross-modal interaction errors), and 7.4 are category 3 (prediction errors). This suggests that the majority of LXMERT's errors are still caused by misunderstanding basic unimodal concepts and cross-modal alignments rather than by high-level reasoning over the perceived information, and that one possible direction for improving the pipeline is to use stronger unimodal encoders (than FRCNN) and to encourage the model to align visual and textual concepts correctly.

4.4 A case study in model debugging

Figure 9: A case study on model debugging: we task 3 human users to use MultiViz visualizations and highlight the errors that a pretrained LXMERT model fine-tuned on VQA 2.0 exhibits, and find 2 penultimate-layer neurons highlighting the model’s failure to identify color (especially blue). Targeted localization of the error to this specific stage (prediction) and concept (blue) via MultiViz enabled us to identify a bug in the Hugging Face LXMERT repository.
Following error analysis, we investigate one of the errors of a pretrained LXMERT model fine-tuned on VQA 2.0 in more depth. Specifically, we first found the top 5 penultimate-layer neurons that are most activated on erroneous datapoints. Inspecting these neurons through MultiViz, users found that 2/5 neurons were consistently related to questions asking about color, which highlighted the model's failure to identify color correctly (especially blue). The model has an accuracy of only \(5.5\%\) amongst all blue-related points (i.e., points that have blue as either the correct or the predicted answer), and these failures account for \(8.8\%\) of all model errors. We show these examples in Figure 9: observe that the model is often able to capture unimodal and cross-modal interactions perfectly, but fails to identify color at prediction.
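A minimal sketch of this neuron-localization step, assuming access to the penultimate-layer feature matrix and the model's predictions; the names are illustrative and the ranking criterion (mean activation on erroneous datapoints) follows the description above.

```python
import numpy as np

def top_error_neurons(penultimate_feats, preds, labels, k=5):
    """Find the penultimate-layer neurons most activated on misclassified points.

    `penultimate_feats` is the N x d feature matrix M_z; neurons are ranked by
    their mean activation over erroneous datapoints.
    """
    errors = preds != labels
    mean_err_activation = penultimate_feats[errors].mean(axis=0)
    return np.argsort(-mean_err_activation)[:k]  # indices of the top-k neurons
```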
Curious as to the source of this error, we looked deeper into the source code of the entire LXMERT pipeline, including that of its image encoder, Faster R-CNN [77]. We in fact uncovered a bug in the data preprocessing for Faster R-CNN in the popular Hugging Face repository, which swapped the image storage format between RGB and BGR and was responsible for these errors. This presents a concrete use case of MultiViz: through visualizing each stage, we were able to (1) isolate the source of the bug (at prediction, and not unimodal perception or cross-modal interactions), and (2) use representation analysis to localize the bug to the specific color concept.
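To illustrate the class of bug involved (this snippet is illustrative only, not the actual Hugging Face Faster R-CNN preprocessing code): OpenCV loads images in BGR channel order, and forgetting the conversion silently swaps red and blue, which is exactly the kind of color-concept failure localized above.

```python
import cv2
import numpy as np

def load_image_rgb(path):
    """Load an image for a detector that expects RGB channel order.

    cv2.imread returns an H x W x 3 array in BGR order; converting to RGB
    avoids the channel-swap bug described above.
    """
    img_bgr = cv2.imread(path)
    img_rgb = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    return img_rgb.astype(np.float32)
```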

5 Conclusion

This paper proposes MultiViz, a comprehensive method for analyzing and visualizing multimodal models. MultiViz scaffolds the interpretation problem into 4 modular stages of unimodal importance, cross-modal interactions, multimodal representations, and multimodal prediction, before providing existing and newly proposed analysis tools in each stage. MultiViz is designed to be modular (encompassing existing analysis tools and encouraging research towards understudied stages), general (supporting diverse modalities, models, and tasks), and human-in-the-loop (providing a visualization tool for user-centric model interpretation, error analysis, and debugging), qualities we strive to maintain through public access and regular updates based on community feedback.

5.1 Limitations and Future Directions

We are aware of some directions in which MultiViz can still be improved and outline these for future work:
(1)
Number of prediction classes: For complex tasks like VQA 2.0, where there are over three thousand prediction classes, there are many sparse output weights, making it difficult to find related datapoints globally for representation analysis. On the other hand, for VQA 2.0 subsets with ‘yes/no’ answer choices (i.e., too few classes), we found that the activated final-layer features overlap too much to visualize reliably, and we would have to extend MultiViz to rely on intermediate-layer features. MultiViz works best with a moderate number of prediction classes, such as those in multimodal emotion recognition, multiple-choice question answering, and others.
(2)
Model requirements: Currently, the two requirements on models are that they have categorical outputs (classification) and that gradients can easily be computed via autograd. For regression, we can discretize the output space into categorical outputs. The second requirement means that we cannot currently support architectures with discrete steps [60] that prevent gradient flow. We plan to extend MultiViz via approximate gradients such as perturbation or policy gradients to handle these cases.
(3)
User studies: We spent significant time finding and training users, including a training video before each study session. Future work can explore more standardized ways of human-in-the-loop interpretation and debugging of multimodal models, and we hope that MultiViz can provide the initial data, models, tools, and evaluation as a step in this direction.
(4)
Evaluating interpretability remains a challenge [14, 19, 36, 81, 84]. Model interpretability (1) is highly subjective across different population subgroups [7, 45], (2) requires high-dimensional model outputs as opposed to low-dimensional prediction objectives [71], and (3) has desiderata that change across research fields, populations, and time [63]. We plan to continuously expand MultiViz through community inputs for new metrics to evaluate interpretability methods. Some metrics we have in mind include those for measuring faithfulness, as proposed in recent work [14, 19, 36, 59, 81, 84].
(5)
Finally, we have plans for engagement with real-world stakeholders to evaluate the usefulness of these multimodal interpretation tools. We plan to engage these stakeholders in the healthcare domain to evaluate interpretability on the MIMIC dataset and those in the affective computing domain to evaluate interpretability on the CMU-MOSEI dataset. We also refer the reader to recent work examining the issues surrounding real-world deployment of interpretable machine learning [10, 17, 45].

Acknowledgments

This material is based upon work partially supported by the NSF (Awards #1722822 and #1750439), NIH (Awards #R01MH125740, #R01MH096951, and #U01MH116925), Meta, and BMW of North America. PPL is partially supported by a Facebook PhD Fellowship and a CMU’s Center for Machine Learning and Health Fellowship. RS is supported in part by ONR award N000141812861 and DSTA. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF, NIH, Facebook, CMU’s Center for Machine Learning and Health, ONR, or DSTA, and no official endorsement should be inferred. We thank Jane Hsieh and the anonymous reviewers for valuable feedback on the paper. Finally, we would also like to acknowledge NVIDIA’s GPU support.

Appendix

A MultiViz Visualization Tool

A.1 The MultiViz website

Figure 10: An example of MultiViz webpage for VQA (Overview page). Best viewed zoomed in and in color.
Figure 11: An example of MultiViz webpage for VQA (Features page). Best viewed zoomed in and in color.
Figure 12: An example of MultiViz webpage for VQA (Overview page). Best viewed zoomed in and in color.
We also created a visualization website accompanying MultiViz which organizes visualizations of all stages for a particular datapoint of specific dataset-model pairs. The URL of the webpage is available at https://github.com/pliang279/MultiViz.
Figure 10 is an example webpage for a data point in VQA. On the left, there is a control panel that allows users to switch between different datasets and instances (i.e., data points), and the two boxes below show all information about the data point (image and question in the case of VQA) as well as the ground truth ("GT") label and the predicted ("Pred") label. On the right side, we have a graph showing a simplified version of the sparse linear model: we only show the top 5 features with the highest weights for each label (the weights are shown as numbers on the lines). Note that we show both the correct and the predicted labels in the graph (so if the model got the answer wrong, there will be two labels under "classes", as in Figure 12, and clicking on each label navigates to a webpage that shows visualizations with respect to that specific label). In the middle tab, titled Main View, we show the visualizations from the U and C stages. In the case of VQA, we present unimodal LIME as the U stage visualization (first column under Main View) and DIME as the C stage visualization (second and third columns under Main View). We call this webpage the Overview page. For each of the top five representation features shown in the graph, the user can access R and Rg visualizations of that feature by clicking on the circle representing it in the graph, which opens a Features page like Figure 11. Under Main View, we include local analysis visualizations (unimodal LIME with respect to the feature in the case of VQA) on the top and global analysis visualizations on the bottom. To return to the Overview page, the user can press the label circle under "classes" in the graph again.
We also show additional example webpages for CMU-MOSEI (Figure 13 and Figure 14, with first-order gradients for the U stage and second-order gradients for the C stage). Note that we only ran the U stage for the MIMIC LF model because its cross-modal interactions are negligible (second-order gradients are all zero) and there are too few representation features to fit sparse linear models.
We have also used modified versions of these webpages to conduct all our experiments with human annotators. See Appendix B for details.
Figure 13: An example of MultiViz webpage for CMU-MOSEI (Overview page). Best viewed zoomed in and in color.
Figure 14: An example of MultiViz webpage for CMU-MOSEI (Features page). Best viewed zoomed in and in color.

B Additional Experiments and Details

In this section, we provide additional details on the experiments and results on several other multimodal datasets.

B.1 Model simulation

B.1.1 VQA 2.0.
In this experiment, we perform model simulation on the VQA 2.0 dataset with pretrained LXMERT (https://huggingface.co/unc-nlp/lxmert-vqa-uncased). We randomly selected 22 points from the validation split of the VQA dataset under the following criteria: (1) it is not a yes/no question and (2) the answer to the question is not infrequent (i.e., it occurs at least 220 times over 220K+ validation points). For each point, we run MultiViz analysis and visualization: for the U stage we run LIME on each modality; for the C stage we run DIME; for R we run LIME with respect to the representation feature on this data point; for Rg we run LIME on each modality with respect to the representation feature on 3 examples that maximally activate the feature; and for P we show the top 5 representation features with the highest weights with respect to the predicted class in a sparse linear model trained on the training set of VQA. The webpage for each datapoint is organized into an Overview page (containing U and C), five Features pages (R and Rg for each of the top 5 representation features), and a "graph" on the right showing P. An example Overview page is shown in Figure 15 and an example Features page is shown in Figure 16. In settings (1)-(4), we use versions of the webpage with certain stages removed (for example, Figure 17 is the webpage for setting (2), showing only U and C).
Within each of the five groups, on each of the 22 points, human annotators are asked to predict what the model (LXMERT) predicts, given a website containing some or all of the stages of analysis visualizations (depending on the group's setting). In addition, they are given an answer sheet (see Figure 18) with 4 answer choices for each data point, and they must select the choice they think LXMERT most likely predicted as the answer. Before each annotator starts, they are taught how to interpret each analysis visualization, and then the instructor goes over 2 points with the annotators as examples; the annotators finish the remaining 20 points on their own, and only those 20 points count towards the data collected in the experiment. We then compute average accuracy and the inter-rater agreement score (Krippendorff's alpha) within each group. In addition, groups under settings (3), (4), and (5) are asked whether they found the Overview or Features pages more helpful.
Figure 15: Simulation experiment for VQA: MultiViz website Overview page showing LIME and DIME explanations. Best viewed zoomed in and in color.
Figure 16: Simulation experiment for VQA: MultiViz website on a specific representation feature showing forwards and backwards analysis (a Features page). Best viewed zoomed in and in color.
Figure 17: Simulation experiment for VQA: Setting 2 webpage with only LIME and DIME explanations. Best viewed zoomed in and in color.
Figure 18: Simulation experiment for VQA: Multiple choice answer sheet given to the annotators.
B.1.2 MM-IMDb.
In this experiment, we perform model simulation on the MM-IMDb dataset with the LRTF model from MultiBench [54]. We randomly selected 21 points from the test split of the MM-IMDb dataset. The original MM-IMDb dataset is designed for multi-label classification, but for simplicity we only take the label with the highest prediction probability from LRTF as the predicted class, effectively treating it as a single-label classification task during analysis, visualization, and the model simulation experiment. For each point, we run MultiViz analysis and visualization: for the U stage we show first-order gradient analysis on image and text; for the C stage we perform second-order gradient analysis on the top ten words with maximum first-order gradient; for R we show first-order gradients on image and text with respect to each representation feature; for Rg, for each representation feature we present the 3 data points that maximally activate the feature, along with first-order gradient visualizations for each; and for the P stage we show the "graph" on the right that ranks the top 5 representation features from the sparse linear model analysis along with their respective weights. The webpage organization is the same as for VQA, with the Overview page and Features pages.
Within each of the five groups, on each of the 21 points, human annotators are asked to predict what the model (LRTF) predicts, given a website containing some or all of the stages of analysis visualizations (depending on the group's setting). In addition, we give human annotators 10 possible movie classes that the model could predict for these 21 points ("Drama/Romance", "Crime", "Sci-Fi", "Comedy", "Thriller", "Western", "Action", "War", "Documentary", "Horror"). Note that in reality some of these categories are not mutually exclusive, but we intentionally designed the experiment this way to see if human annotators could determine the model's prediction by looking at the specific properties of the movie's poster or description that the model focused on during prediction. Before each human annotator starts, they are taught how to interpret each analysis visualization, and then the instructor goes over the first point with the annotator as an example; the annotator finishes the remaining 20 points on their own, and only those 20 points count towards the data collected in the experiment. We then compute average accuracy and the inter-rater agreement score (Krippendorff's alpha) within each group.
B.1.3 CMU-MOSEI.
In this experiment, we perform model simulation on the CMU-MOSEI dataset with the MulT model from MultiBench [54]. We randomly selected 20 points from the test split of the CMU-MOSEI dataset. The original CMU-MOSEI dataset is designed for 7-way sentiment classification (-3 to +3), but we follow the preprocessing in MultiBench and convert it into a binary classification problem (where -1, -2, -3 are "Negative" and 0, 1, 2, 3 are "Positive"). For each point, we run MultiViz analysis and visualization: for the U stage we show first-order gradient analysis on image, audio, and text (for image and audio, we compute the gradient of each feature at each timestep, resulting in a 2D heatmap, while for text we have a 1D heatmap), and we also show a processed video where we add bounding boxes around the visual features the model picked up (such as facial landmarks, facial expressions, lip movements, eye gaze, etc.); for the C stage we perform second-order gradient analysis with selected words on image and audio; for R we show first-order gradients on image, audio, and text with respect to each representation feature; for Rg, for each representation feature we present the 3 data points that maximally activate the feature, along with first-order gradient visualizations for each; and for the P stage we show the "graph" on the right that ranks the top 5 representation features from the sparse linear model analysis along with their respective weights. The webpage organization is the same as for VQA, with the Overview page and Features pages.
Within each of the five groups, on each of the 20 points, human annotators are asked to predict what the model (MulT) predicts given a website containing some or all of the stages of analysis visualizations (depending on the group’s setting). Before each human annotator starts, they are taught how to interpret each analysis visualization, and the annotator needs to finish the 20 points on their own. We then compute average accuracy and inter-rater agreement score (Krippendorff’s alpha) within each group.
As shown in Table 2, in general, human annotators were better able to predict the model's predictions when they were given more information, as the groups that got more information almost always ended up with both higher average accuracy and higher inter-rater agreement. Moreover, human annotators achieved perfect accuracy and agreement in settings (4) and (5) on CMU-MOSEI, showing that including global analysis Rg provides enough information to simulate model predictions.

B.2 Representation interpretation

We now take a deeper look to check that MultiViz generates accurate explanations of multimodal representations. Using local and global representation visualizations, can humans consistently assign interpretable concepts in natural language to previously uninterpretable features?
B.2.1 VQA 2.0.
For the VQA 2.0 dataset, we perform a representation interpretation experiment in which we give human annotators visualizations of a particular representation feature and ask them to describe what concept they think that feature represents. We recruited 15 human annotators (with the same qualifications as those in the model simulation experiment) and divided them into 3 groups of 5. Each group is given a different setting (with different amounts of MultiViz visualizations available):
(1)
R: R only, i.e., one random example and a unimodal LIME explanation of the example with respect to this feature. See Figure 19 for an example.
(2)
R + Rg (no viz): In addition to R with LIME, we also provide Rg (the top 3 examples that maximize the feature's value and the top 3 examples that minimize the feature's value), but without LIME visualizations for Rg. See Figure 20 for an example.
(3)
R + Rg: Same as setting 2, but we also provide Unimodal LIME visualizations for all examples in Rg.
Figure 19: Example of R example with Unimodal LIME explanation given to annotators in the representation feature interpretation experiment.
Figure 20: Example of Rg examples without Unimodal LIME explanation given to annotators under Setting 2 together with R visualizations in the representation feature interpretation experiment. Note that the left 3 examples are the ones that minimize the feature’s value, while the right 3 examples are the ones that maximize the feature’s value.
Figure 21: An example of visualizations given to users for cases R, R + Rg (no viz), and R + Rg for feature interpretation (Section 4.2 in the main paper), along with actual feature concepts annotated by the users.
A concrete example: In Figure 21, we show a concrete example of human annotators using MultiViz to assign concepts to feature representations in multimodal models trained on VQA 2.0. We show the information provided to users in each of the 3 ablation cases as part of the experiment, along with the actual user annotations from the user study:
(1)
In R, we only provide the original seed datapoint and show visualizations of unimodal and cross-modal interactions with respect to a feature z for that datapoint. Using just local information, annotators struggle to identify the concept captured by the feature z, with disagreement between ‘mirror’, ‘brushing teeth’, ‘bathroom’, ‘material’, and ‘none’, each with relatively low confidence. Indeed, all of these concepts are present in the image and question, which makes it hard to choose a precise one.
(2)
In R + Rg (no viz), we provide both the original seed datapoint (local analysis) and 2 similar datapoints that also maximally activate the feature z (global analysis), for 3 datapoints in total. Using both local and global information, users are better able to identify the commonalities between all 3 datapoints, which all activate feature z, leading to 3/5 users identifying the concept as ‘asking about material’. However, the remaining 2 users answered ‘household objects/components’, which is another valid concept shared across those datapoints.
(3)
In R + Rg, we show both local and global analysis (so 3 datapoints in total), in addition to the visualizations of unimodal and cross-modal interactions with respect to a feature z for all datapoints. With all pieces of information, all 5/5 users identified the concept as ‘asking about material’. Providing visualizations helps to resolve ambiguity in feature interpretation: the text importance identifies words like ‘counter’, ‘countertop’, and ‘wall’, along with the image cross-modal interactions highlighting these entities, which leads to high agreement and confidence among annotators in identifying the ‘material’ concept.
Figure 22: Examples of human-annotated concepts using MultiViz on feature representations. We find that the features separately capture image-only, language-only, and multimodal concepts.
Figure 23: More examples of human-annotated concepts using MultiViz on feature representations. We find that the features separately capture image-only, language-only, and multimodal concepts.

B.3 Error analysis

In this section, we conduct an experiment to see if human annotators will be able to categorize the reasons why the model fails to predict the correct answer.
B.3.1 Setup.
We present three categories of errors:
(1)
Unimodal perception error: The model fails to recognize certain unimodal features or aspects. (For example, in the top-left example of Figure 8, the FRCNN object detector was unable to recognize the thin red streak as an object.)
(2)
Cross-modal interaction error: The model fails to capture important cross-modal interactions, such as aligning words in the question with relevant regions or detected objects in the image. (For example, in the first example of the middle column of Figure 8, the model erroneously aligns "creamy" with the piece of carrot.)
(3)
Prediction errors: The model is able to perceive the correct unimodal features and their cross-modal interactions, but fails to reason through them to produce the correct prediction. (For example, in the top-right example of Figure 8, the model was able to identify the chair with the object detector and associate it with the word "chair" in the question (as shown by second-order gradient analysis), but was still unable to reason over this information to predict the correct answer.)
For each of the 2 datasets used in this experiment (VQA and CLEVR), we recruited 10 human annotators and divided them into 2 groups of 5, one group per setting: (1) under the MultiViz setting, for each data point the annotator is given access to the full MultiViz webpage as well as live second-order gradients (i.e., the annotator may request the second-order gradient for a specific subset of words in the question and is shown the result); (2) under the No Viz setting, the annotator is given nothing but the original data point, the correct answer, and the predicted answer. Each annotator classifies each point into one of the three categories above and rates their confidence in categorizing the error on a scale of 1-5.
B.3.2 VQA 2.0.
In this experiment, we perform error analysis on VQA 2.0 with LXMERT. We first randomly selected 24 data points that the model got wrong, and then asked 10 human annotators to categorize each point into one of the 3 categories above (5 annotators under the MultiViz setting and 5 annotators under the No Viz setting). The webpages that the human annotators under the MultiViz setting see are the same as the ones described in Appendix A. In addition, since the LXMERT prediction pipeline is differentiable with respect to the objects detected by the FRCNN object detector but not with respect to each pixel of the original image, the annotators under the MultiViz setting are also given all bounding boxes of objects detected by FRCNN and shown which ones have the highest second-order gradient with respect to the specific words they picked.
During the experiment, the instructor first informs the annotators what each of the 3 error categories means and then explains each part of the visualizations they are given (if under the MultiViz setting). The instructor then goes over the first data point with the annotators, who categorize the remaining 23 points on their own; only those 23 points' annotations count towards the final result.
The results of the VQA error analysis experiment are shown in Table 3. On average, human annotators are substantially more confident in categorizing each error and agree with each other far more often when given MultiViz compared to No Viz, showing that MultiViz indeed helps humans identify the types of errors within a multimodal model. In addition, annotators in the MultiViz setting report that they can tell whether the model perceives unimodal information correctly via the unimodal importance (U) stage analysis together with the bounding boxes produced by FRCNN, and they found the on-request second-order gradients on specific words to be the most helpful of the cross-modal (C) stage visualizations (such as DIME) for determining whether the model found the correct cross-modal interactions. The data point in Figure 24 is a good example.
Figure 24: Examples of second-order gradient requests made by human annotators during error analysis. Left: input image. Middle: all bounding boxes detected by the FRCNN image encoder. Right: top 3 bounding boxes with the highest second-order gradient with respect to the word "pizza" (red is top 1, blue is top 2, green is top 3). In this data point, we can clearly see that the model detects the pizza (it is included in a bounding box by FRCNN) and associates the pizza in the image with the word "pizza" in the question (as shown by second-order gradient analysis). Therefore, through MultiViz visualizations, all 5 human annotators agreed that this point is a category 3 prediction error. Best viewed zoomed in and in color.
Figure 25: An example of the visualizations given to users for error analysis of incorrect predictions made by trained models, categorized into one of 3 stages: failures in (1) unimodal perception, (2) capturing cross-modal interactions, and (3) prediction with perceived unimodal and cross-modal information (Section 4.3 in the main paper), along with the actual error categories annotated by the users.
A concrete example: in Figure 25, we show a concrete example of human annotators using MultiViz to perform error analysis on incorrect predictions made by trained models, categorizing each failure into one of 3 stages: (1) unimodal perception, (2) capturing cross-modal interactions, and (3) prediction with perceived unimodal and cross-modal information. We show the information provided to users in each of the 2 ablation cases, along with the actual annotations from the user study:
(1) No Viz provides the user with no information. Note that there are no intermediate stages we can ablate, since errors can occur at any stage; removing any stage from MultiViz by definition cripples its ability to detect errors at that stage. Users nevertheless use their intuition to make their best guess about which stage the model is likely to fail at: for example, if some odd object seems hard to detect, users tend to guess unimodal error, and if the prediction involves complex reasoning that is hard even for humans, they tend to guess prediction error.
(2) MultiViz provides the user with the unimodal importance and cross-modal interactions visualized for the incorrectly predicted datapoint. In the top example, users can tell that the unimodal importance on ‘cheese’ and ‘pizza’ is correct, and that the image-text interaction correctly highlights the bounding box around the pizza; hence it is a prediction error, which all users agree on. In the bottom example, users can see that while ‘man’ and ‘jeans’ are correctly highlighted unimodally, none of the image-text interactions highlight the bounding box around the man’s jeans, so they agree on a cross-modal interaction error.
B.3.3 CLEVR.
In this experiment, we perform error analysis on CLEVR with the CNN+LSTM+SA model. We first randomly selected 11 data points that the model got wrong and asked 10 human annotators to categorize each point into one of the 3 categories above (5 annotators in the MultiViz setting and 5 in the No Viz setting). The webpage shown to annotators in the MultiViz setting is the same as the one described in Appendix A. In addition, annotators in the MultiViz setting can request second-order gradient analysis on specific words or phrases they pick, and are shown both the pixel-wise heatmap and the top 2 bounding boxes with the highest average absolute gradient per pixel (following the same procedure described in the Appendix). See the bottom half of Figure 2 for an example of the second-order gradient analysis of CNN+LSTM+SA.
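For CNN+LSTM+SA, where gradients are available per pixel, a per-box score could plausibly be obtained by averaging the absolute gradient inside each candidate box, as in the illustrative sketch below; the variable names (`heatmap`, `boxes`) are hypothetical placeholders.

```python
# Minimal sketch of turning a pixel-wise second-order gradient heatmap into
# the "top 2 bounding boxes" shown to annotators. `heatmap` is an (H, W)
# array of absolute gradients and `boxes` a list of (x1, y1, x2, y2) tuples.
import numpy as np

def top_boxes(heatmap: np.ndarray, boxes, k: int = 2):
    scores = []
    for (x1, y1, x2, y2) in boxes:
        region = heatmap[y1:y2, x1:x2]
        # Average absolute gradient per pixel inside the box, so large boxes
        # are not favored simply for covering more of the image.
        scores.append(region.mean() if region.size else 0.0)
    order = np.argsort(scores)[::-1][:k]
    return [(boxes[i], scores[i]) for i in order]
```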
As in the VQA study, the instructor first explains what each of the 3 categories of errors means and then explains each part of the visualizations the annotators are given (in the MultiViz setting). The instructor then goes over the first data point together with the annotators; the annotators categorize the remaining 10 points on their own, and only those 10 annotations count towards the final result.
The results of the CLEVR error analysis experiment are also shown in Table 3. As with VQA, annotators are on average much more confident in categorizing each error and agree with each other far more often when given MultiViz compared to No Viz, again showing that MultiViz helps humans identify the types of errors within a multimodal model.
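As an illustration of how the confidence and agreement numbers reported in Table 3 could be aggregated from raw annotations, the sketch below computes mean confidence and a simple pairwise agreement rate; the `annotations` structure is hypothetical, and the agreement statistic actually used in the paper may differ.

```python
# annotations[setting][point_id] is assumed to be a list of
# (category, confidence) pairs, one per annotator.
from itertools import combinations

def summarize(annotations, setting):
    confidences, agreements = [], []
    for labels in annotations[setting].values():
        confidences.extend(conf for _, conf in labels)
        # Fraction of annotator pairs that picked the same error category.
        pairs = list(combinations([cat for cat, _ in labels], 2))
        agreements.append(sum(a == b for a, b in pairs) / len(pairs))
    return {
        "mean_confidence": sum(confidences) / len(confidences),
        "pairwise_agreement": sum(agreements) / len(agreements),
    }

# e.g. compare summarize(annotations, "multiviz") vs. summarize(annotations, "no_viz")
```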
Error breakdown: out of the 10 errors, human annotators on average assigned 6 to category 2 (cross-modal interaction error). This suggests that the major weakness of CNN+LSTM+SA is aligning phrases in the text with the objects they refer to, which is expected: CNN+LSTM+SA is a late-fusion model, and late fusion is known to be weak at capturing low-level cross-modal interactions.
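A small companion sketch, reusing the same hypothetical `annotations` structure from above, shows how such a per-category breakdown could be tallied:

```python
# Average number of points assigned to each error category, assuming every
# point is labeled by the same number of annotators.
from collections import Counter

def error_breakdown(annotations, setting):
    per_category = Counter()
    n_annotators = 1
    for labels in annotations[setting].values():
        n_annotators = len(labels)
        per_category.update(cat for cat, _ in labels)
    return {cat: count / n_annotators for cat, count in per_category.items()}
```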

Footnote
2. We used the popular Hugging Face implementation at https://huggingface.co/unc-nlp/lxmert-vqa-uncased.

Supplementary Material

Supplemental Materials (3544549.3585604-supplemental-materials.zip)
MP4 File (3544549.3585604-video-figure.mp4)
Video Figure
MP4 File (3544549.3585604-video-preview.mp4)
Video Preview
MP4 File (3544549.3585604-talk-video.mp4)
Pre-recorded Video Presentation

