Exploring hierarchical multidimensional data with unified views of distribution and correlation

https://doi.org/10.1016/j.jvlc.2013.02.002

Abstract

Data analysts explore data by inspecting features such as clustering, distribution and correlation. Much existing research has focused on different visualisations for different data exploration tasks. For example, a data analyst might inspect clustering and correlation with scatterplots, but use histograms to inspect a distribution. Such visualisations allow an analyst to confirm prior expectations. For example, a scatterplot may confirm an expected correlation or may show deviations from the expected correlation. In order to better facilitate discovery of unexpected features in data, however, a combination of different perspectives may be needed. In this paper, we combine distributional and correlational views of hierarchical multidimensional data. Our unified view supports the simultaneous exploration of data distribution and correlation. By presenting a unified view, we aim to increase the chances of discovery of unexpected data features, and to provide the means to explore such features in detail. Further, our unified view is equipped with a small number of primitive interaction operators which a user composes to facilitate smooth and flexible exploration.

Highlights

► A visualisation that shows both distribution and correlation is presented.
► Interactive features such as filtering, colouring, hashing, rotating and zooming are described.
► Builds on a parallel tree visualisation for exploring multi-dimensional data.
► Includes support for dimension selection via dimension ranking.

Introduction

With the ubiquitous deployment of computer systems, vast amounts of data are being generated in many domains. A major challenge is to make use of this data in meaningful ways, such as to understand what has happened, what is happening or what may happen. Such data is often multidimensional. Within the business community both manual and automated approaches have emerged to address this challenge. Online analytical processing (OLAP) systems [20] support the interactive exploration of such data when it can be aggregated using hierarchies. Business intelligence (BI) systems extend this by using heuristic and statistical models to find interesting features and make predictions. In order to adapt to a particular dataset, however, such models require tuning and learning. This leads to a significant need for cognitively effective and efficient techniques for interactive exploration of the data. Stasko [8], [30] argues ‘visual representations of data are particularly germane in situations involving unfamiliar data when a person is performing exploratory data analysis’.

Interactive data exploration tools need to support two modes of analysis: confirmation of expected features and discovery of unexpected features. In the first, the analyst aims to confirm an expectation, having in mind particular subsets of the data that should exhibit particular data characteristics such as a distribution that shows an increasing trend or a correlation between data dimensions. The analyst can navigate to predetermined subsets and view the data with an appropriate display, such as a histogram, to confirm an increasing distribution, or a scatterplot to confirm a correlation. In the second mode, the analyst is hoping for a discovery and has no starting subset in mind. In this case, the analyst must start from an overview of the data, being guided by features or variations evident in the views or summaries of the data being presented. Having no particular data characteristic in mind, the choice of which display to use, such as histogram or scatterplot, becomes arbitrary.

The focus of traditional visualisation design is on how well users can perceive different kinds of relationships in the displayed data. With interactive data exploration, however, instead of performing a single task, such as inspecting a fixed visualisation, a user must perform multiple tasks, only some of which involve inspecting visualisations. Such tasks include inspecting an overview, selecting data subsets for filtering, selecting dimensions for view projection, assessing and comparing distributions, observing data clusters and interpreting correlations. The cognitive load of these multiple tasks can be divided into perception, memory and thinking [11], the latter two becoming more important as the number of tasks increases. Smooth animated transitions and fewer context switches can reduce such memory demands while multi-purpose displays that reduce the need for users to choose between visualisations can reduce cognitive load. We consider that tools which provide a mix of different kinds of visualisations for different tasks, which alter display orientation when different data dimensions are selected, are less than ideal because of the excessive cognitive load they impose. We therefore seek to find a more unified approach for exploratory data visualisation.

This paper will show that our structured graph viewer, SGViewer [26], [27], can support the multiple tasks required in data exploration with minimal visual context switches and smooth transitions between selected views. SGViewer can present all data dimensions simultaneously. It uses a single visualisation for multiple inspection and selection tasks, thus minimising the need for context switches. SGViewer presents hierarchical multidimensional data as parallel dimension trees, where each dimension acts as both a display and a filter. Each dimension's data distribution is presented in a tree-structured view, while comparison of distributions is also supported.
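The idea that each dimension acts as both a display and a filter can be illustrated with a minimal sketch. It is not SGViewer's implementation: the records, the dimension names and the helper functions `apply_filters` and `distributions` are hypothetical, and the dimensions are treated as flat rather than hierarchical for brevity.

```python
from collections import Counter

# Hypothetical records: each maps a dimension name to a value.
RECORDS = [
    {"region": "North", "product": "Tea", "year": 2011},
    {"region": "North", "product": "Coffee", "year": 2011},
    {"region": "South", "product": "Tea", "year": 2012},
    {"region": "South", "product": "Tea", "year": 2011},
    {"region": "North", "product": "Tea", "year": 2012},
]


def apply_filters(records, filters):
    """Keep records whose value in every filtered dimension is among the selected values."""
    return [
        r for r in records
        if all(r[dim] in allowed for dim, allowed in filters.items())
    ]


def distributions(records, dimensions):
    """Count records per value in each dimension (what each dimension would display)."""
    return {dim: Counter(r[dim] for r in records) for dim in dimensions}


# Selecting "North" in the region dimension filters the records,
# and every dimension then displays the distribution of the remaining data.
selected = apply_filters(RECORDS, {"region": {"North"}})
views = distributions(selected, ["region", "product", "year"])
```

In this sketch a selection made in one dimension changes the counts shown in all the others, which is the coupling between filtering and display that the paragraph above describes.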

There are three main contributions. The first is a presentation of a visualisation for association and correlation of dimensions based on colour division. Icicle plot-based tree views of dimensions are divided by a colour legend into a collection of aligned mosaic plots that show association for categorical dimensions and correlation for ranked dimensions. The second is a description of how this smoothly integrates with existing SGViewer capabilities for presenting and filtering multidimensional, hierarchical data distributions so that association and correlation can be read for arbitrary data slices. The third is a description of dimension ranking and reordering to support users when there are large numbers of data dimensions. This is done either by ranking dimensions according to their highest pairwise association in order to provide a starting point for exploration, or by ranking dimensions by association with a nominated dimension in order to assist the analysis of any associations or correlations found. These contributions are demonstrated with a small sales example and a larger census example.
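The two ranking strategies can be sketched as follows. This excerpt does not state which association measure SGViewer uses, so the sketch assumes Cramér's V for categorical dimensions; the function names and the small table are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of category labels (0 = none, 1 = perfect)."""
    n = len(xs)
    row = Counter(xs)
    col = Counter(ys)
    joint = Counter(zip(xs, ys))
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0
    chi2 = 0.0
    for a, row_total in row.items():
        for b, col_total in col.items():
            expected = row_total * col_total / n
            observed = joint.get((a, b), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def rank_by_best_pair(table):
    """Order dimensions by their highest association with any other dimension."""
    best = {dim: 0.0 for dim in table}
    for a, b in combinations(table, 2):
        v = cramers_v(table[a], table[b])
        best[a] = max(best[a], v)
        best[b] = max(best[b], v)
    return sorted(best, key=best.get, reverse=True)


def rank_against(table, nominated):
    """Order the other dimensions by their association with a nominated dimension."""
    others = [d for d in table if d != nominated]
    return sorted(
        others,
        key=lambda d: cramers_v(table[nominated], table[d]),
        reverse=True,
    )


# Hypothetical table: one list of category labels per dimension.
TABLE = {
    "region": ["N", "N", "S", "S", "N", "S", "N", "S"],
    "channel": ["web", "web", "shop", "shop", "web", "shop", "web", "shop"],
    "size": ["S", "L", "S", "L", "S", "L", "L", "S"],
}
overall_order = rank_by_best_pair(TABLE)
order_for_region = rank_against(TABLE, "region")
```

The first ranking suggests where to start exploring, while the second helps follow up on a dimension of interest. Any other association measure could be substituted for `cramers_v` without changing the ranking logic.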

Section 2 provides background on measuring and visualising correlation, and presents our key idea for inspecting correlation of multidimensional data through the use of an appropriate colour palette. Section 3 compares our approach with related work in the areas of visualisation techniques for distribution and correlation, statistical graphics, and interfaces that combine different visualisations in a single view. Section 4 presents our primitive interaction operators and then provides an interactive analysis of a student dataset using SGViewer to demonstrate the utility of our unified view of distribution and correlation. We then describe dimension selection assistance via dimension ranking. Section 5 formalises and elaborates on our approach. It relates our visualisation to scatterplots and bar graphs, and details the way in which an underlying dataset is implicitly reordered within each dimension, thereby making it feasible to assess correlations by the data blocks and colours used in the visualisation. It characterises some limitations and trade-offs in order to provide guidance on configuring our correlation visualisation, e.g. selecting dimension scales and the number of colours. It shows how the outliers and their distribution can be supported via an additional dimension.

We conclude with a discussion of colour selection, a summary comparison with parallel coordinates and sets, and note that our visualisation of correlation works for both aggregated and point data.

A novel visualisation of correlation

After reviewing correlation, scatterplots and parallel coordinates, we present a new technique for the visualisation of correlation combined with distribution.

Specialised visualisations

There is a variety of visualisations for distribution and correlation. Histograms show distribution while scatterplots show clustering and correlation, but both of these are limited to two or three dimensions. Scatterplot matrices and parallel coordinates show clustering and correlation in many dimensions but these point-based visualisations are subject to overplotting as datasets become large. Such overplotting can be counteracted through blurring and colouring point clusters in scatterplots,

User interaction

Our interface design is based on a parallel tree chart visualisation equipped with operators and an interactive visualisation, where user interactions can be defined in terms of operator sequences. In this section, we define those operators and show that each view presented by our system can be explained as the composition of one or more of these operators. We demonstrate interface use with a discovery scenario. Finally, we describe dimension selection assistance via dimension ranking and

Data reordering across multiple dimensions

We now present more detail on how data points are blocked into intervals, coloured and reordered to facilitate recognition of correlation in the data. First, we relate our visualisation to the conventional scatterplot and bar graph. Then, we provide a formal specification and discuss some properties. We also consider the effect of axis scale alignment because misaligned scales can hide relatively large deviations from correlation, while aligned scales can exaggerate the effect of minor
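As a rough illustration of blocking and colouring, the following sketch groups points into intervals of one dimension and counts, within each interval, how many points carry each colour, where the colour is the interval of a second dimension. The formal specification is not included in this excerpt, so the equal-width intervals, the function names and the sample points are all assumptions.

```python
def to_interval(value, low, high, n_intervals):
    """Index of the equal-width interval containing value; the top edge joins the last interval."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * n_intervals)
    return min(max(idx, 0), n_intervals - 1)


def colour_blocks(points, n_intervals, n_colours):
    """Block (x, y) points by y interval and count points per colour within each block.

    The colour of a point is the interval of its x value. Listing the counts in
    colour order within a block mirrors reordering the data by colour.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_low, x_high = min(xs), max(xs)
    y_low, y_high = min(ys), max(ys)
    blocks = [[0] * n_colours for _ in range(n_intervals)]
    for x, y in points:
        colour = to_interval(x, x_low, x_high, n_colours)
        block = to_interval(y, y_low, y_high, n_intervals)
        blocks[block][colour] += 1
    return blocks


# Hypothetical data: one perfectly correlated and one perfectly anti-correlated set.
correlated = colour_blocks([(i, 2 * i) for i in range(8)], 4, 4)
anti_correlated = colour_blocks([(i, 14 - 2 * i) for i in range(8)], 4, 4)
```

With correlated data each block is dominated by a single colour and the colours appear in the same order as the blocks. With anti-correlated data the colour order is reversed, and with uncorrelated data the colours would be mixed within every block.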

Discussion

When using SGViewer for data exploration, the choice of the number of colours is crucial for the quality of the visualisation. There is a clear trade-off. As the number increases, more minor crossings are visible. With the use of a sequential colour scheme, minor crossings are seen as small differences in lightness between adjacent areas, while major crossings are noticeable as large differences in colour or intensity between adjacent areas. Although choosing more colours allows more detail to

Conclusion

This paper has described SGViewer and its support for the exploration of hierarchical multidimensional data. Our key contribution is the presentation of a unified view of distribution and correlation that supports investigation of expected correlations, variations from the expected, surprising distributions and outliers, all within a single view. In addition, SGViewer supports filtering and the selection of hierarchical subsets of the multidimensional data. If there are many

Acknowledgement

We wish to thank Dr. Madeleine Cincotta for her generous help in copy-editing the final draft.

References (32)

  • M. Graham et al., Visual exploration of alternative taxonomies through concepts, Ecological Informatics (2007)
  • M. Sifer, Filter co-ordinations for exploring multi-dimensional data, Journal of Visual Languages and Computing (2006)
  • R. Spence et al., The attribute explorer: information synthesis via exploration, Interacting with Computers (1998)
  • C. Ahlberg, B. Shneiderman, Visual information seeking: tight coupling of dynamic query filters with starfield...
  • M. Ankerst, S. Berchtold, D. Keim, Similarity clustering of dimensions for an enhanced visualization of...
  • F. Bendix, R. Kosara, H. Hauser, Parallel sets: visual analysis of categorical data, in: Proceedings of IEEE Symposium...
  • S. Brewer,...
  • S. Bremm et al., Assisted descriptor selection based on visual comparative data analysis, Computer Graphics Forum (2011)
  • Y. Chiricota, F. Jourdan, G. Melancon, Metric-based network exploration and multi-scale scatterplot, in: Proceedings of...
  • R. Dachselt, M. Frisch, M. Weiland, FacetZoom: a continuous multi-scale widget for navigating hierarchical metadata,...
  • J. Fekete et al., The value of information visualization
  • A. Frank, A. Asuncion, UCI Machine Learning Repository, Irvine CA: University of California, School of Information and...
  • M. Friendly, Mosaic displays for multi-way contingency tables, Journal of the American Statistical Association (1994)
  • D. Gillian, N. Cooke, Methods of cognitive analysis for HCI, in: Proceedings of ACM CHI, 1995, pp....
  • H. Hauser, F. Ledermann, H. Doleisch, Angular brushing of extended parallel coordinates, in: Proceedings of IEEE...
  • D. Keim et al., Hierarchical pixel bar charts, IEEE Transactions on Visualization and Computer Graphics (2002)

This paper has been recommended for acceptance by S.-K. Chang.