
1 Introduction

In the last four decades, the growing computational power and the adoption of new sensors have increased the data volume and dimensionality in many domains [5]. More recently, AI systems have become ubiquitous and assume different roles: they can act as recommendation systems in multiple contexts (e.g., suggesting new things to buy, new content to watch), they can work as personal assistants (e.g., answering queries, notifying when it is time to leave for the next appointment), they can tag images (e.g., organizing photos into different albums, detecting people in surveillance footage), etc. Whilst their contributions are clear and present in our daily lives, the reasoning behind them is not so transparent and may need additional explanations and interactions (such as: why is the system recommending something? based on what dimensions of the object? based on what dimensions of the user? how can the user tweak the recommendation?). As Lipton states in his “Prototypical call to arms” [8]:

I work with medical data. We work with doctors and they’re interested in predicting risk of mortality, recognizing cancer in radiologic scans, and spotting diagnoses based on electronic health record data. We can train a model, and it can even give us the right answer. But we can’t just tell the doctor “my neural network says this patient has cancer!”. The doctor just won’t accept that! They want to know why the neural network says what it says. They want an explanation. They need interpretable models.

This need for interpretability created new challenges for developers and designers from different communities – visual analytics, information visualization, machine learning, data science, among others. Visualizing multidimensional data and exploring the objects’ similarities can help with the explainability of an AI system.

On the one hand, dimensionality is a well-known problem in the visualization and statistical analysis communities, having a huge impact on data understanding. On the other hand, the identification of similar objects is not yet commonly explored in AI systems, although it is useful to understand data and make relevant observations. Similarity is a key concept in psychology, playing a central role in cognition and transfer-learning theories [9]. Many human activities hinge on similarity to construct new ideas and/or fill in missing information by inference from similar objects. For instance, when doctors are studying treatments for a given patient, they try to find the most similar cases to ground their decisions.

In this work, we discuss the visual inspection of high-dimensional objects as complementary to machine learning techniques. We try to shed light on the question of how to visually compare high-dimensional objects and find similar ones.

In the following section, we present some background regarding the concept of explainable AI. In Sect. 3 we describe RAVA (Reservoir Analogues Visual Analytics), a system that employs machine learning and visual analytics techniques to empower geoscientists in the task of finding similar reservoirs. We conclude with Sect. 4 discussing some future steps in this research.

2 Background

Recommendation systems traditionally use three kinds of input (Fig. 1; a minimal data sketch follows the list):

  1. Knowledge about the objects: the objects’ properties and characteristics.

  2. Knowledge about the users: the users’ profiles, preferences, and behaviors.

  3. Knowledge about the relation between users and objects: usually collected from a user action, which can be:

     • Explicit: the action itself establishes the relation, generally expressed as some sort of rating (like/dislike, number of stars, etc.).

     • Implicit: the action is interpreted by the system as an indicator (e.g., watched a movie, bought a product, listened to a song).
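To make these three inputs concrete, the following is a minimal Python sketch of how they could be represented; the class names, fields, and example values are hypothetical and not taken from any particular system:

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class Item:
    """Knowledge about the objects: properties and characteristics."""
    item_id: str
    properties: Dict[str, List[str]]  # e.g., {"genres": ["Comedy"], "starring": [...]}

@dataclass
class User:
    """Knowledge about the users: profile, preferences, behaviors."""
    user_id: str
    profile: Dict[str, str]           # e.g., {"age_group": "25-34", "language": "en"}

@dataclass
class Interaction:
    """Knowledge about the relation between a user and an object."""
    user_id: str
    item_id: str
    kind: str                         # "explicit" (a rating) or "implicit" (watched, bought, ...)
    value: Optional[float] = None     # e.g., number of stars for explicit feedback

# Illustrative usage
catalog = [Item("bk99", {"genres": ["Comedy"], "starring": ["Andy Samberg"]})]
users = [User("u1", {"language": "en"})]
log = [Interaction("u1", "bk99", kind="implicit"),             # watched it
       Interaction("u1", "bk99", kind="explicit", value=5.0)]  # rated it
```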

Fig. 1. A simple overview of a recommendation system.

When giving a recommendation, these systems usually display a brief textual explanation referencing a previous user action, but provide no actual insight into the recommendation itself. For example, when Netflix suggests content, it provides a brief explanation (“Because you watched something”), but no clue whatsoever regarding why the items were suggested and how they relate to the previously watched content.

Fig. 2. Example of Netflix’s recommendation.

Figure 2 shows an example of such a recommendation. Users (may) recognize their past behavior (“watched Brooklyn Nine-Nine”, a comedy TV show about a police precinct), but, going through the list, the recommendations are not clear. The system may suggest content of a different type (“Popstar: Never Stop Never Stopping” is a movie) or even a different genre (“Punisher” is a violent TV show). Users may try to build an explanation (“Popstar” has the same lead actor, the first two rows are comedies), but the system does not confirm such hypotheses.

Even if users try to drill down into the available information, they may be unsuccessful. For example, Table 1 shows the information provided for the target TV show (“Brooklyn Nine-Nine”) and two other recommendations, highlighting the similarities in bold. It is possible to notice that the “Black Mirror: Bandersnatch” suggestion does not have any explicit similarity with “Brooklyn Nine-Nine” regarding the content’s properties (“starring”, “genres”, and “labels”), while “Punisher” has a single one (both are “Crime TV Shows”).

Table 1. Data displayed by Netflix for some titles

The user is left wondering whether there are properties that are not being displayed (e.g., more of the known dimensions – “starring”, “genres”, and “labels” – or even other, unknown dimensions – number of awards, viewership, etc.) – in which case the system could highlight the similarities – or whether the system is using other data sources (e.g., knowledge about the user’s profile and previous ratings). Looking at data from different users (Table 2), the user’s data seems to play a role, but it is still not clear how.

Table 2. Netflix recommendation order for different users. Rows in bold highlight the same content from Table 1.
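To make the property comparison of Table 1 concrete, the following is a minimal sketch of how such an overlap could be quantified with Jaccard similarity over the displayed fields; the metadata values below are illustrative placeholders, not Netflix’s actual catalog data:

```python
def jaccard(a, b):
    """Jaccard similarity between two sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def property_overlap(x, y, fields=("starring", "genres", "labels")):
    """Average Jaccard similarity over the displayed property fields."""
    return sum(jaccard(x.get(f, []), y.get(f, [])) for f in fields) / len(fields)

# Illustrative placeholder metadata (not the actual catalog entries)
target    = {"genres": ["TV Comedies", "Crime TV Shows"], "labels": ["Witty"]}
candidate = {"genres": ["Crime TV Shows"], "labels": ["Violent"]}

print(property_overlap(target, candidate))  # small but non-zero: one shared genre
```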

AI systems build on recommendation systems by taking the users’ context and goals into account. Users, therefore, do not simply want to “watch something”, but to “watch something in this context” (e.g., while in transit, at different times of the day), and the system should act accordingly (e.g., giving different recommendations, adjusting video quality). As the complexity of the system increases, we must try to increase its explainability. This opens new ways for users to provide more valuable input and tweak the recommendations to their liking, improving the overall performance of AI systems.

Lipton [7] categorizes the techniques for model interpretability into two main categories: transparency and post-hoc explanations (Fig. 3). Transparency is related to the model’s inner workings (as opposed to a “black box”/“opaque” component), looking at the entire model (simulatability), individual components (decomposability), and the training algorithm (algorithmic transparency). Due to this focus on the model, these techniques are better suited to developers of AI systems, since they rely on knowledge of the model itself. Post-hoc explanation techniques, focusing on the results of the model, can be better understood by end users, since they do not introduce a new object (the AI model) to deal with. They take many approaches, but the most common ones are text explanations, visualizations (of learned representations or models), and explanations by example. Analogues play a crucial role in the last approach, since what drives the explanation hypothesis is the similarity between the presented examples.
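As an illustration of explanation by example, the following minimal sketch (assuming scikit-learn and a purely synthetic feature matrix) retrieves the known cases most similar to a new instance and presents them alongside it, which is the same mechanism that analogues rely on:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic feature matrix of already-labeled cases (rows = cases)
X_train = np.random.rand(100, 8)
labels = np.random.choice(["benign", "malignant"], size=100)

nn = NearestNeighbors(n_neighbors=3).fit(X_train)

def explain_by_example(x_new):
    """Return the 3 most similar known cases as a post-hoc explanation."""
    dist, idx = nn.kneighbors(x_new.reshape(1, -1))
    return [(labels[i], float(d)) for i, d in zip(idx[0], dist[0])]

print(explain_by_example(np.random.rand(8)))
```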

Fig. 3. Techniques to enable or comprise interpretations according to Lipton [7].

Medin et al. [9] observed that there is a body of research that associates the concept of similarity with context and task: to define two objects as similar, it is important to define in which aspects. For instance, going back to the doctor example, patient A can be considered similar to patient B in a diabetes study (they share the same age, physical conditions, and habits) but the comparison is useless in a uterus disorder study (they are of different genders). AI systems, therefore, should consider the user’s current context and task when selecting the most suitable set of attributes to compare two objects from the multidimensional data.
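A minimal sketch of how such task dependence could be encoded, using per-task attribute weights; the attribute names, weights, and records below are hypothetical:

```python
# Hypothetical per-task attribute weights: the same two patients can be
# similar for one study and not comparable for another.
TASK_WEIGHTS = {
    "diabetes_study":  {"age": 1.0, "physical_condition": 1.0, "habits": 1.0, "gender": 0.0},
    "uterus_disorder": {"age": 0.5, "physical_condition": 0.2, "habits": 0.1, "gender": 1.0},
}

def task_similarity(a, b, task):
    """Weighted agreement between two patient records for a given task."""
    weights = TASK_WEIGHTS[task]
    total = sum(weights.values())
    score = sum(w * (1.0 if a.get(attr) == b.get(attr) else 0.0)
                for attr, w in weights.items())
    return score / total

patient_a = {"age": 54, "physical_condition": "good", "habits": "sedentary", "gender": "F"}
patient_b = {"age": 54, "physical_condition": "good", "habits": "sedentary", "gender": "M"}

print(task_similarity(patient_a, patient_b, "diabetes_study"))   # 1.0: similar for this study
print(task_similarity(patient_a, patient_b, "uterus_disorder"))  # much lower: gender dominates
```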

Once similarity is numerically established by the AI system, we face new challenges: the presentation of the similarity (how to visualize that the objects are similar) and its validation (whether the user’s notion of similarity matches the AI system’s). They are closely intertwined, since the representation of the similarity may affect how users perceive it, and the users’ perception should guide the representation. Several studies have tackled these problems, analyzing the notion of similarity [6, 10], different visualizations [3, 4], and the order and arrangement of dimensions [1, 12].

3 RAVA: Reservoir Analogues Visual Analytics

Given the importance of similarity in common tasks and its usefulness for AI system explainability, we have developed RAVA – Reservoir Analogues Visual Analytics. RAVA [2, 11] is a system built to empower geoscientists in the task of finding reservoirs similar to a target one with incomplete information. Geoscientists have to make business-critical decisions – such as the “go/no go” of new acquisitions – with incomplete data, due to errors (e.g., malfunctioning measurement tools) or unavailable data (e.g., during the bidding phase for new areas to explore). A common strategy, therefore, is to make use of the known information available from similar reservoirs to better estimate the unknown information, drawing a more complete picture of the target area.

RAVA integrates the knowledge of geoscientists with AI. On the one hand, it applies machine learning algorithms to go through extensive reservoir characterization datasets in a timely manner, providing results that would otherwise remain unseen. On the other hand, it enables geoscientists to visually explore such datasets, retrieve analogues, estimate unknown parameters, and, thus, make better-informed decisions. To achieve this, RAVA follows the simple UI workflow illustrated in Fig. 4.
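The actual algorithms used by RAVA are not detailed here, but the general idea can be sketched as follows: rank reservoirs by a weighted distance over the parameters known for the target, then estimate the target’s missing parameters from its closest analogues. The parameter names, the distance, and the averaging scheme below are illustrative assumptions, not RAVA’s implementation:

```python
import math

def weighted_distance(target, candidate, weights):
    """Distance over the numeric parameters known to both reservoirs."""
    acc, wsum = 0.0, 0.0
    for p, w in weights.items():
        if target.get(p) is not None and candidate.get(p) is not None:
            acc += w * (target[p] - candidate[p]) ** 2
            wsum += w
    return math.sqrt(acc / wsum) if wsum else float("inf")

def find_analogues(target, database, weights, k=3):
    """Return the k reservoirs closest to the target."""
    ranked = sorted(database, key=lambda r: weighted_distance(target, r, weights))
    return ranked[:k]

def estimate_missing(target, analogues):
    """Fill the target's unknown parameters with the analogues' mean values."""
    estimated = dict(target)
    for p in {p for a in analogues for p in a}:
        if estimated.get(p) is None:
            values = [a[p] for a in analogues if a.get(p) is not None]
            if values:
                estimated[p] = sum(values) / len(values)
    return estimated

# Illustrative reservoirs with hypothetical (normalized) parameters
db = [
    {"porosity": 0.22, "permeability": 0.61, "depth": 0.35},
    {"porosity": 0.20, "permeability": 0.58, "depth": 0.40},
    {"porosity": 0.05, "permeability": 0.10, "depth": 0.90},
]
target = {"porosity": 0.21, "permeability": None, "depth": 0.37}

analogues = find_analogues(target, db, weights={"porosity": 1.0, "depth": 0.5}, k=2)
print(estimate_missing(target, analogues))  # permeability estimated from the two analogues
```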

Fig. 4. RAVA main UI workflow.

Users start in the database exploration page (illustrated in Fig. 5). It shows the database information on a map, highlighting the geographical distribution of reservoirs (a circle at each reservoir’s position) and parameters (given a selected parameter, each circle is colored according to the parameter value, or gray if the reservoir has no information for that parameter). Besides selecting a parameter to visualize on the map (Fig. 5a), the user may also select a basin (using the dropdown on the left, Fig. 5b) or a reservoir (by clicking on a circle in the map).
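A minimal matplotlib sketch of this map encoding, with hypothetical coordinates and parameter values (the real page uses an interactive map and a full base layer):

```python
import matplotlib.pyplot as plt

# Hypothetical reservoirs: (longitude, latitude, selected-parameter value or None)
reservoirs = [(-40.1, -22.3, 0.18), (-39.8, -21.9, 0.25), (8.5, -1.2, None)]

known   = [(lon, lat, v) for lon, lat, v in reservoirs if v is not None]
unknown = [(lon, lat) for lon, lat, v in reservoirs if v is None]

fig, ax = plt.subplots()
# Reservoirs with a known value are colored by it; the others are drawn in gray.
sc = ax.scatter([r[0] for r in known], [r[1] for r in known],
                c=[r[2] for r in known], cmap="viridis")
if unknown:
    ax.scatter([r[0] for r in unknown], [r[1] for r in unknown], color="gray")
fig.colorbar(sc, label="selected parameter")
ax.set_xlabel("longitude")
ax.set_ylabel("latitude")
plt.show()
```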

Fig. 5. RAVA database exploration page.

The left panel changes according to the current selection, offering more insights about the parameters’ distribution, a list of related papers, and the list of previous experiments (Fig. 6). The “Overall” tab allows the visualization of multiple parameters, whilst the “Parameters” tab focuses on the currently selected parameter. If neither a basin nor a reservoir is selected (column “No selection” in Fig. 6), the “Overall” tab shows the distribution of the whole database and the “Parameters” tab shows the distribution of the currently selected parameter grouped by continent. If a basin is selected, the “Overall” tab compares the distribution of the reservoirs in the selected basin to the whole database, and the “Parameters” tab shows more details (such as mean, variance, minimum, and maximum values) of the distribution of the currently selected parameter in the selected basin. Finally, if a reservoir is selected, the “Overall” tab compares the reservoir’s known values (the red dot) with the whole database distribution, and the “Parameters” tab changes to display the parameter information associated with the reservoir.
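The summaries behind these panels are simple to derive from a tabular database; a minimal pandas sketch with hypothetical column names and values:

```python
import pandas as pd

# Hypothetical reservoir table with one column per parameter
df = pd.DataFrame({
    "basin":     ["Campos", "Campos", "Santos", "Niger Delta"],
    "continent": ["South America", "South America", "South America", "Africa"],
    "porosity":  [0.22, 0.20, 0.18, 0.25],
})

# No selection: distribution of the selected parameter grouped by continent
print(df.groupby("continent")["porosity"].describe())

# Basin selected: mean, variance, minimum, and maximum of the parameter in that basin
in_basin = df[df["basin"] == "Campos"]["porosity"]
print(in_basin.agg(["mean", "var", "min", "max"]))
```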

Fig. 6. Detail of RAVA’s left panel depending on the current selection.

With a reservoir selected, users may start a new experiment to find analogues, using the selected reservoir as the “target”. In the experiment configuration page (Fig. 7), the user selects the parameters and their weights to be used in the experiment (the list on the left side). There are also different parameter templates that the user may choose as a starting point for the configuration, loading a preset list of parameters and weights. This configuration is closely related to the user’s current goal, and s/he may experiment with different configurations to see how the AI is affected, exploring different results and scenarios.
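A minimal sketch of what such a configuration could look like as data, starting from a preset template that the user then adjusts; the template names, parameters, and weights are hypothetical:

```python
# Hypothetical parameter templates: preset parameter/weight lists for different goals
TEMPLATES = {
    "production_forecast":    {"porosity": 1.0, "permeability": 1.0, "depth": 0.5},
    "fluid_characterization": {"api_gravity": 1.0, "viscosity": 1.0, "depth": 0.2},
}

def build_configuration(template_name, overrides=None):
    """Start from a template and apply the user's weight adjustments."""
    config = dict(TEMPLATES[template_name])
    config.update(overrides or {})
    return config

# The user starts from a template and then raises the weight of one parameter
experiment_config = build_configuration("production_forecast", {"depth": 1.0})
print(experiment_config)  # {'porosity': 1.0, 'permeability': 1.0, 'depth': 1.0}
```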

Fig. 7. RAVA experiment configuration page.

A reservoir visualization of the target reservoir occupies the central part of the page, encoding the target reservoir’s parameter values. It uses colors to group the parameters into four main groups (“Petroleum Systems”, “Fluids”, “Petrophysics”, and “Production”). The visualization can be split into two parts: the radial chart – encoding numeric parameters – and the list visualization – encoding categorical parameters. The radial chart shows each numerical parameter in its own “slice”, displaying a mnemonic of its name in the outer circle and its value in the middle of the slice. The list visualization on the right side has a row for each categorical parameter. Each row has a small square indicating whether the parameter is known (colored square) or unknown (empty square), the parameter value, and the parameter mnemonic.
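A minimal matplotlib sketch of the radial encoding for the numeric parameters, using hypothetical mnemonics, normalized values, and group colors (the categorical list and RAVA’s actual styling are omitted):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical numeric parameters: (mnemonic, normalized value, group color)
params = [("POR", 0.7, "tab:blue"), ("PERM", 0.4, "tab:blue"),
          ("API", 0.9, "tab:orange"), ("VISC", 0.3, "tab:orange")]

angles = np.linspace(0, 2 * np.pi, len(params), endpoint=False)
width = 2 * np.pi / len(params)

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for angle, (name, value, color) in zip(angles, params):
    # One "slice" per parameter, filled proportionally to its value
    ax.bar(angle, value, width=width, color=color, edgecolor="white")
    ax.text(angle, 1.05, name, ha="center")                 # mnemonic on the outer circle
    ax.text(angle, value / 2, f"{value:.2f}", ha="center")  # value in the middle of the slice
ax.set_ylim(0, 1.15)
ax.set_yticks([])
ax.set_xticks([])
plt.show()
```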

Fig. 8. RAVA experiment results page.

After an experiment runs successfully, the user may go to the experiment results page (Fig. 8). The left panel shows the experiment configuration parameters, displaying the target reservoir’s known values and the estimated values for the missing parameters. At the bottom, a small map shows the target reservoir with a marker and the selected analogues. The right column shows the list of analogues, ordered by similarity (the number in gray below each analogue reservoir’s name). Each analogue is displayed using the same reservoir visualization from the experiment configuration page. Finally, the central part can toggle between two views. It starts with the target reservoir visualization (not shown in Fig. 8), showing both the known and estimated values. If the user selects a parameter, the visualization can be toggled to a probability distribution graph of the selected parameter amongst the chosen analogues (beside the name of each analogue there is a checkbox to select it).
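A minimal sketch of the toggled distribution view, plotting a selected parameter over the analogues the user has checked; the parameter name and values are hypothetical, and RAVA’s actual chart may differ:

```python
import matplotlib.pyplot as plt

# Hypothetical porosity values of the analogues the user checked
selected_analogues = {"Analogue A": 0.21, "Analogue B": 0.24,
                      "Analogue C": 0.19, "Analogue D": 0.23}
values = list(selected_analogues.values())

fig, ax = plt.subplots()
ax.hist(values, bins=5, density=True, alpha=0.7)                 # empirical distribution
ax.axvline(sum(values) / len(values), linestyle="--", label="mean of analogues")
ax.set_xlabel("porosity")
ax.set_ylabel("density")
ax.legend()
plt.show()
```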

With these tools, users can analyze the experiment results. The map allows users to evaluate the location of the target reservoir and its analogues. More than representing the distance between them, the location itself carries valuable geographical information. For example, reservoirs on the west coast of Africa are traditionally considered similar to the ones on the east coast of Brazil due to their geological formation (they were close together at some point in time). Using the same representation for the target reservoir and the analogues allows the user to visually evaluate the similarity and validate the calculated value. Finally, by allowing the selection of a subset of analogues to visualize the probability distribution function, we empower users to express their notion of similarity and indicate which results are actually relevant.

4 Final Remarks

Despite its simple UI workflow and few pages, RAVA is a promising solution to empower geoscientists in finding similar reservoirs. It enables users to visually explore the database and interact with the AI system given their current context, acting both on its input – the experiment configuration – and on its output – the experiment results. Following Lipton’s [7] classification, we are exploring the AI system’s decomposability (allowing the user to input different configurations) and providing explanations by example (providing visualizations and tools to compare the analogues).

As future work, we plan to explore new opportunities for the user to interact with the AI system. For example, the underlying AI algorithm comprises a sequence of steps that we plan to open to user interference and customization, generating new exploration paths. One such step is the estimation of unknown values using machine learning techniques at the beginning of the process. We want to enable users to tweak these estimations, thus affecting the final list of analogues. Another example is to help users select the experiment configuration, providing different parameter templates according to the user’s goals.

We also aim to study different visualizations and evaluate the perceived notion of similarity. The current visualization was chosen under a different set of requirements (to be used on a tabletop) and may not be the best suited one. We plan to compare different visualizations and interaction techniques to see which performs better. Moreover, we will study how users perceive the similarity and find ways to integrate this user feedback into the AI system’s suggestions.

Finally, RAVA was developed with a focus on the reservoir problem, but the underlying tools and techniques are much broader than that. We plan to apply this framework to different contexts and establish an analogues platform to be used in different scenarios.