1 Introduction

We present a visual analytics tool designed to help analytic journalists explore large text corpora. Analytic journalists typically start by getting an overview of the field under investigation, then focus on specific aspects to identify facts and viewpoints that verify, refine, or refute their hypothesis. Text corpora are often modeled as Term \(\times \) Document matrices, from which topics may be extracted using graph modularity-based diagonal biclustering [1]. Word cloud views are popular representations of individual topics and have been extended in many ways. In the use case considered here, the journalist needs to grasp dozens of topics at a glance and assess their relative importance. A good visualization may further ease this task by displaying topic relationships. Once the journalist has identified a topic of interest, the focus shifts to understanding topic variants and identifying the distinctive documents and terms of each. The visualization of overlapping biclusters has been approached in various ways, e.g., transparent overlapping hulls in node-link diagrams, matrix visualizations, and parallel coordinates by Santamaría et al. [5]. BiSet [6] uses chained bipartite graphs enhanced with semantic bundles to represent chained bicluster relationships. These representations fail to convey an overview of a large number of overlapping biclusters while also identifying their common and distinctive terms and documents.

2 Tool Overview

To support the topic mapping task, we apply diagonal biclustering based on graph modularity [1] to the Term \(\times \) Document matrix. The Weighted Topic Map visualization in Fig. 1 is a hybrid Treemap view in which rectangular tiles represent individual topics, tile area encodes topic importance, and topic details are shown as a nested word cloud. The size and color of a term reflect its representativeness of the topic and the number of documents in which it appears. An MDS projection computed from the similarity matrix of the diagonal biclusters generates 2D positions, which are fed to the Weighted Map visualization algorithm [2]. As a result, similar topics are placed in adjacent tiles. When the analyst hovers over a topic, Jaccard similarity is used to display links to the five most similar topics, as shown in Fig. 1. Showing topic relationships aims to soften the hard partitioning imposed by the diagonal biclustering. This overview enables the analyst to discover the main topics and select one for further scrutiny.
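The hover links described above can be illustrated with a minimal Python sketch. It assumes that each topic is summarized by its set of terms and that similarity is computed on those sets; the text does not state whether terms or documents are compared, and the function names and toy topics are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_topics(topics, selected, k=5):
    """Return up to k (topic_id, similarity) pairs, most similar first."""
    scores = [
        (tid, jaccard(topics[selected], terms))
        for tid, terms in topics.items()
        if tid != selected
    ]
    # Sort by decreasing similarity; break ties by topic id for determinism.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Toy topics (assumed data, not taken from the corpus described in the text).
topics = {
    "election": {"clinton", "trump", "vote", "campaign"},
    "debate": {"clinton", "sanders", "debate", "campaign"},
    "climate": {"paris", "climate", "summit"},
}
links = most_similar_topics(topics, "election")
```

With only three toy topics the result holds two links; on a map with dozens of topics the same call would return the five strongest ones.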

Fig. 1. The US presidential election topic is selected from 3,992 online news articles collected between November 2 and November 16, 2015. Five topic variants concerning Hillary Clinton have been sent for comparison (https://youtu.be/xY6mgZyg3jA).

When the analyst selects a topic, Bimax [4], a pattern-based overlapping biclustering algorithm, extracts the topic variants by identifying all maximal combinations of terms shared by a maximal set of documents. While the exhaustiveness of Bimax may serve the needs of the analyst, it produces a very large number of biclusters. To make sense of the numerous Bimax biclusters, we organize them into a hierarchy based on term overlaps using the FPTree algorithm [3]. The resulting term hierarchy is represented as a sunburst visualization (3.1 in Fig. 1). The most common terms have a higher overlap degree and appear closer to the root, while the most distinctive terms are placed further away. Each path from root to leaf represents a unique association of terms grouped by one bicluster. As we move away from the root along a given path, the word combination becomes more specific and retains fewer documents. At the leaf level, only the documents of one bicluster are retained. By exploring this view and the coordinated comparator view (4), the journalist can focus on a specific aspect of a topic and examine all document relationships to identify facts or viewpoints related to the hypotheses under investigation.
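The way shared terms end up near the root can be sketched as a frequency-ordered prefix tree, which is the insertion step of FP-tree construction. This is a simplified illustration and not the tool's implementation: the node layout, the tie-breaking by alphabetical order, and the toy biclusters are assumptions, and each node simply accumulates the documents of the biclusters whose path passes through it.

```python
from collections import Counter


def build_term_hierarchy(biclusters):
    """Insert each bicluster's terms into a prefix tree, most shared terms first.

    biclusters: iterable of (terms, docs) pairs, both given as sets.
    """
    biclusters = list(biclusters)
    # Overlap degree of a term = number of biclusters that contain it.
    freq = Counter(t for terms, _ in biclusters for t in set(terms))
    root = {"term": None, "docs": set(), "children": {}}
    for terms, docs in biclusters:
        # Most shared terms come first, so they sit closer to the root.
        ordered = sorted(set(terms), key=lambda t: (-freq[t], t))
        node = root
        node["docs"] |= set(docs)
        for term in ordered:
            node = node["children"].setdefault(
                term, {"term": term, "docs": set(), "children": {}}
            )
            node["docs"] |= set(docs)
    return root


# Toy biclusters (assumed data): term sets with the ids of their documents.
biclusters = [
    ({"clinton", "email", "server"}, {1, 2}),
    ({"clinton", "email", "fbi"}, {2, 3}),
    ({"clinton", "benghazi"}, {4}),
]
tree = build_term_hierarchy(biclusters)
```

In this toy example "clinton" is shared by all three biclusters and becomes the single child of the root, while "server", "fbi", and "benghazi" appear as leaves, each retaining only the documents of its own bicluster.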

The text of the documents can be read in the Document Detail View. In addition, we provide multiple interaction modes, illustrated in Fig. 2. Hovering over a term in the hierarchy colors all its occurrences in red (3.3 in Fig. 1) and shows the corresponding term sequence on the right (3.2). The comparator view supports the analysis of common and distinctive terms as well as the distribution of documents across the selected topic variants. Multiple sorting strategies are available to facilitate the identification of the most informative terms.

Fig. 2. Interaction modes. (a) The orange biclusters contain any document selected by the clicked node “Israel”. (b) The biclusters not matching the term “Israel” are filtered out. (c) The biclusters colored in blue are sent to the topic variant comparator. (Color figure online)

3 Parameter Setting by the User

The number of Bimax biclusters grows with the size or the density of the diagonal bicluster blocks and can exceed ten thousand. To reduce this number, we allow the user to modify the parameters of Bimax: the minimum number of terms or documents per bicluster (MinT, MinD) and the maximum number of biclusters (MaxB). As Bimax operates on binary matrices, we also enable the user to change the binarization threshold (Thr) applied to the TF-IDF weights. Increasing the threshold keeps, for each document, only the most representative terms and reduces the density and the dimensions of the matrix.
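The effect of the binarization threshold can be illustrated with a short Python sketch. It assumes that a term is kept for a document when its TF-IDF weight is at least Thr (the text does not say whether the comparison is strict) and that documents and terms left without any entry are dropped; the function names and the toy weights are hypothetical.

```python
def binarize(tfidf, thr):
    """Keep, per document, the terms whose TF-IDF weight is >= thr (assumed rule).

    tfidf: dict mapping document id -> dict mapping term -> weight.
    Documents left without any term are dropped from the result.
    """
    binary = {}
    for doc, weights in tfidf.items():
        kept = {term for term, weight in weights.items() if weight >= thr}
        if kept:
            binary[doc] = kept
    return binary


def matrix_size(binary):
    """Return (documents, terms, nonzero entries) of the binary matrix."""
    terms = set().union(*binary.values())
    ones = sum(len(kept) for kept in binary.values())
    return len(binary), len(terms), ones


# Toy TF-IDF weights (assumed data, arbitrary scale).
tfidf = {
    "d1": {"obama": 6.0, "clinton": 4.0, "vote": 2.0},
    "d2": {"obama": 7.0, "vote": 5.0},
    "d3": {"clinton": 3.0, "vote": 1.0},
}
low = matrix_size(binarize(tfidf, thr=2))
high = matrix_size(binarize(tfidf, thr=5))
```

Raising the threshold from 2 to 5 in this toy example removes one document and one term and halves the number of nonzero entries, which leaves Bimax fewer term combinations to enumerate.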

Fig. 3. Number of biclusters as the parameters of Bimax vary.

In Fig. 3, we visualize the effect of varying each parameter separately on the term hierarchy built from the US presidential election topic. After each parameter change, the root node “Obama” is clicked to highlight in orange the distribution of the selected documents. With the default parameters (\(MinT=3\), \(MinD=4\), \(Thr=5\)), only the first levels of the 13,000 biclusters are visible in the sunburst visualization. Increasing either Thr or MinT reduces the dispersion of the documents concerning “Obama”, but varying Thr preserves the variety in the number of terms. As MinD increases, the number of terms tends to decrease, but the documents selected by the node “Obama” remain largely dispersed across the biclusters until the node disappears.