1 Introduction

An important goal in Exploratory Data Analysis (EDA) [10] is to gain insight into different relations in the data. Knowledge of relations is essential for successful application of data mining and machine learning methods. Relations can be investigated efficiently using interactive visual EDA software that presents the user with different views of a dataset, thus leveraging natural human pattern recognition skills to allow the user to discover interesting relations in the data.

Recently, an iterative data mining paradigm [1,2,3,5] has been presented and also realised in software [7,8,9], with the emphasis that the user wants to find patterns that are subjectively interesting given what she or he currently knows about the data. The system shows the user maximally informative views, i.e., views that contrast the most with the user’s current knowledge. As the user explores the data and discovers patterns of relations in the data, these patterns are fed back into the system and taken into account during further exploration, so that the user is only shown views displaying currently unknown relations.

Although the knowledge of the user is taken into account, one important problem still remains: by design, the user cannot know beforehand which views of the data differ the most from her or his present knowledge. Thus, exploration of the most informative views might seem somewhat random to the user, and the views shown might be surprising but not necessarily relevant for the task at hand. The user can have specific ideas (hypotheses) concerning relations in the data at the start of the exploration, and such ideas typically also develop further during the exploration process. It is hence essential to be able to focus the exploration process on answering specific questions. This is realised in our novel EDA paradigm, termed Human-Guided Data Exploration (HGDE) [6].

In this paper we present tiler, a software tool that realises the HGDE paradigm for efficient interactive visual EDA. tiler aims to be an easy-to-use tool for exploring relations in datasets by allowing the user to focus the exploration on investigating different hypotheses. tiler is an MIT-licensed R package available from https://github.com/aheneliu/tiler.

2 Human-Guided Data Exploration

We provide here a high-level description of the key concepts in the HGDE framework; for a complete discussion and theoretical details we refer to [6].

The goal of the user is to discover relations between the attributes in the data by a comparison of hypotheses, which can be viewed as a comparison of two distributions with the same known marginal distributions. A permutation-based scheme is used to obtain samples from the distributions, i.e., we permute the given data under a set of constraints defined by the hypotheses. The constraints represent the relations which are assumed to be known about the data: one extreme is the unconstrained case of independent column-wise permutations (preserving only the marginals), while the other extreme is the fully constrained case where only the identity permutation satisfies the constraints. In general, the constraints are formulated in terms of tiles: tuples of the form \(t = (R, C)\), where \(R \subseteq [N]=\{1,\ldots ,N\}\) and \(C \subseteq [M]\) are subsets of the rows (items) and columns (attributes) of an \(N \times M\) data matrix. A tile constrains permutations so that all items in a tile are permuted together, i.e., there is a single permutation for a tile operating on each \(c\in C\), thus preserving the relations inside \(t\).
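The tile-constrained permutation described above can be illustrated with a minimal sketch. This is not the implementation used in tiler (which is written in R); the function name `permute_with_tiling`, the 0-based indices, and the list-of-rows data layout are assumptions made for illustration. The sketch assumes the tiles do not overlap, draws one permutation per tile that is shared by all columns of the tile, and permutes the cells of each column that are not covered by any tile among themselves.

```python
import random


def permute_with_tiling(data, tiling, rng):
    """Return a permuted copy of `data` (a list of equal-length rows).

    `tiling` is a list of (rows, cols) pairs with 0-based indices and is
    assumed to be non-overlapping (hypothetical helper, for illustration).
    """
    n, m = len(data), len(data[0])
    out = [row[:] for row in data]
    covered = [set() for _ in range(m)]  # rows covered by a tile, per column

    for rows, cols in tiling:
        rows = sorted(rows)
        shuffled = rows[:]
        rng.shuffle(shuffled)  # a single permutation for the whole tile
        for c in cols:
            for dst, src in zip(rows, shuffled):
                out[dst][c] = data[src][c]
            covered[c].update(rows)

    # Cells outside every tile: permute each column independently.
    for c in range(m):
        free = [r for r in range(n) if r not in covered[c]]
        shuffled = free[:]
        rng.shuffle(shuffled)
        for dst, src in zip(free, shuffled):
            out[dst][c] = data[src][c]
    return out


# Column 1 is always ten times column 0; column 2 is unrelated.
data = [[i, 10 * i, (i * 7) % 5] for i in range(8)]
sample = permute_with_tiling(data, [(range(8), [0, 1])], random.Random(0))
```

With the tile covering columns 0 and 1 of all rows, every row of `sample` still satisfies the relation between those two columns, while column 2 is shuffled freely; each column remains a permutation of the original column, so the marginals are preserved.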

Hypotheses are represented in terms of tilings (non-overlapping sets of tiles). For example, the hypotheses can be that either all the attributes in the original dataset are dependent or they are all independent. These hypotheses can be represented with the following two hypothesis tilings: \(\mathcal {T}_{\mathcal {H}_1}=\{([N],[M])\}\) and \(\mathcal {T}_{\mathcal {H}_2}=\{([N], \{m\})\mid m\in [M]\}\). A correlation between two variables \(i\) and \(j\) in a subset of rows \(R\) could be studied with the following hypothesis tilings: \(\mathcal {T}_{\mathcal {H}_1}=\{(R,\{i,j\})\}\) and \(\mathcal {T}_{\mathcal {H}_2}=\{(R,\{i\}),(R,\{j\})\}\). In the general case, the user can focus on specific data items and specific attribute combinations. Focusing allows the user to concentrate on exploring relations in a subset of the data items and attributes, making the interactive exploration more predictable and allowing specific questions to be answered.
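As an illustration, the hypothesis tilings above can be written down directly as data. The sketch below is only one possible encoding, not the data structure used in tiler: a tile is a pair of frozensets, indices are 0-based rather than the 1-based \([N]\) and \([M]\) of the text, and all function names are hypothetical.

```python
def dependent_tiling(n, m):
    """T_H1 = {([N], [M])}: one tile covering the whole data matrix."""
    return [(frozenset(range(n)), frozenset(range(m)))]


def independent_tiling(n, m):
    """T_H2 = {([N], {m}) | m in [M]}: one full-height tile per column."""
    return [(frozenset(range(n)), frozenset({c})) for c in range(m)]


def pair_tilings(rows, i, j):
    """Tilings for studying the relation of columns i and j in `rows`."""
    rows = frozenset(rows)
    h1 = [(rows, frozenset({i, j}))]
    h2 = [(rows, frozenset({i})), (rows, frozenset({j}))]
    return h1, h2


def is_tiling(tiles):
    """Check that no matrix cell is covered by more than one tile."""
    seen = set()
    for rows, cols in tiles:
        cells = {(r, c) for r in rows for c in cols}
        if seen & cells:
            return False
        seen |= cells
    return True


h1, h2 = pair_tilings(rows=[0, 2, 5], i=1, j=3)
```

Here `h1` keeps columns 1 and 3 together on the selected rows, while `h2` lets them be permuted independently on the same rows, so any difference between samples drawn under the two tilings reflects the relation between these two columns within the selection.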

In tiler, the user is shown an informative projection of two data samples corresponding to the hypotheses and is tasked with comparing these and drawing conclusions. An informative projection is one in which the two samples differ the most. A sample from the distribution corresponding to each hypothesis is obtained by randomly permuting each column in the data, such that the relations between attributes enforced by the tilings are preserved. A tiling hence constrains the permutation of the data. When a user discovers a new pattern, this is added as a constraint (a tile) to both \(\mathcal {T}_{\mathcal {H}_1}\) and \(\mathcal {T}_{\mathcal {H}_2}\), meaning that the relations expressed by this pattern no longer differ between the two hypotheses. This allows the user to iteratively build up an understanding of the relations in the data.

3 System Design

tiler is developed in R (v. 3.4.4) using Shiny (v. 1.0.5) and runs in a web browser. The tool supports the full HGDE framework and the usage of tiler is described in the video at https://youtu.be/fqKLjMwJHnk.

To explore relations between attributes, the user first specifies the hypotheses being compared. The tool implements different modes as shortcuts for typical hypotheses. The explore-mode (the default) corresponds to iterative data exploration where the two hypotheses to be compared are that (i) all attributes in the original dataset are dependent or (ii) they are all independent. In the focus-mode the exploration is focused on investigating all relations within a particular subset of rows and columns (a focus region). The compare-mode implements the general case by allowing the user to specify an arbitrary hypothesis by partitioning the attributes in the focus region into groups.

With tiler, the user visually explores a dataset by comparing two data samples corresponding to the two different hypotheses. The exploration is iterative and the user gradually finds new patterns concerning the relations in the data, which are then added as tiles. Figure 1 shows the main user interface of tiler with the following components:

  • Tool panel allows the mode (explore, focus, or compare) to be selected and contains tools for selection of points as well as creation of tiles and focus tiles. Points can be selected by brushing in the main view, or by selecting the data from a dropdown menu. Previously added tiles can be selected or deleted. The projection in the main view can be changed and the user can show/hide the original data and the two samples corresponding to the combined effect of the user and hypothesis tilings. The user can also update the distributions after addition of new tiles and then request the next most informative view.

  • Main view shows the original data (in black) together with samples (in green and blue) corresponding to the two hypotheses being compared. Points on the same row in the sampled data matrices are connected using lines. These lines indicate how points in the data move around due to the randomisation. Since projection of high-dimensional data to lower dimensions can make interpretation complicated, we have chosen to use 2D axis-aligned projections. The x and y axes are hence directly interpretable on their original scales. We use correlation as the measure of informativeness, as this is often intuitive and easy to interpret, but other distance measures between the two samples being compared can be used too. This measure is used to show the maximally informative view.

  • Selection info shows the five largest classes of the selected points (for data with class attributes). This helps the user in understanding what type of points are currently selected and gives insight into the relations in the data.

  • Navigation is guided by the scatterplot matrix of the five most interesting attributes in the data, in the bottom right corner. The correlations for both samples and their difference according to the correlation-based measure are shown. The scatterplot matrix helps the user to quickly obtain an overview of the data.

  • Tabs provide functions for loading data, listing tiles, and for defining an attribute grouping in the compare mode.
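The correlation-based choice of a 2D axis-aligned view described for the main view can be sketched as follows. The exact informativeness measure used in tiler is not spelled out here, so this sketch assumes one plausible form: the absolute difference between the Pearson correlations of an attribute pair in the two samples. The function names and the list-of-rows layout are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def column(data, c):
    return [row[c] for row in data]


def most_informative_view(sample1, sample2):
    """Return the column pair (i, j) whose correlation differs most
    between the two samples (assumed measure, for illustration)."""
    m = len(sample1[0])

    def score(pair):
        i, j = pair
        r1 = pearson(column(sample1, i), column(sample1, j))
        r2 = pearson(column(sample2, i), column(sample2, j))
        return abs(r1 - r2)

    return max(combinations(range(m), 2), key=score)


# Columns 0 and 1 agree in sample1 but are reversed in sample2.
sample1 = [[0, 0, 1], [1, 1, 0], [2, 2, 1], [3, 3, 0]]
sample2 = [[0, 3, 1], [1, 2, 0], [2, 1, 1], [3, 0, 0]]
view = most_informative_view(sample1, sample2)
```

Because only attribute pairs are scored, the selected view is always a scatterplot on two original attributes, which matches the choice of axis-aligned projections; ranking all pairs by the same score would give the kind of ordering needed for a scatterplot matrix of the most interesting attributes.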

Fig. 1. The main user interface of tiler (showing the UCI image segmentation dataset [4]). (Color figure online)