Abstract
We propose a visual analytics tool to support analytic journalists in the exploration of large text corpora. Our tool combines graph modularity-based diagonal biclustering, which extracts high-level topics, with overlapping biclustering, which elicits fine-grained topic variants. A hybrid topic treemap visualization gives the analyst an overview of all topics. Coordinated sunburst and heatmap visualizations let the analyst inspect and compare topic variants and access document content on demand.
1 Introduction
We present a visual analytics tool designed to help analytic journalists explore large text corpora. Analytic journalists typically start by getting an overview of the field under investigation, then focus on specific aspects to identify facts and viewpoints that verify, refine or refute their hypothesis. Text corpora are often modeled by Term \(\times \) Document matrices, from which topics may be extracted using graph modularity-based diagonal biclustering [1]. Word cloud views are popular representations of individual topics and have been extended in many ways. In the use case considered here, the journalist needs to grasp dozens of topics at a glance and appreciate the importance of each topic. A good visualization may further ease this task by displaying topic relationships. Once the journalist has identified a topic of interest, their concern shifts to understanding topic variants and identifying distinctive documents and terms for each. The visualization of overlapping biclusters has been approached in various ways, e.g., with the transparent overlapping hulls in node-link diagrams, matrix visualizations and parallel coordinates of Santamaría et al. [5]. BiSet [6] uses chained bipartite graphs enhanced with semantic bundles to represent chained bicluster relationships. However, these representations do not convey an overview of a large number of overlapping biclusters while also identifying their common and distinctive terms and documents.
2 Tool Overview
To support the topic mapping task, we apply diagonal biclustering based on graph modularity [1] to the Term \(\times \) Document matrix. The Weighted Topic Map visualization in Fig. 1 is a hybrid Treemap view in which rectangular tiles represent individual topics, tile area encodes topic importance, and topic details are shown as a nested word cloud. The size and color of a term reflect its representativeness of the topic and the number of documents in which it appears. An MDS projection computed from the similarity matrix of the diagonal biclusters generates 2D positions, which are fed to the Weighted Map visualization algorithm [2]. As a result, similar topics are placed in adjacent tiles. Jaccard similarity is used to display links to the five most similar topics when the analyst hovers over a topic, as shown in Fig. 1. Showing topic relationships is meant to soften the hard partitioning imposed by the diagonal biclustering. This overview enables the analyst to discover the main topics and select one for further scrutiny.
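The hover interaction described above can be illustrated with a minimal sketch. It assumes that each topic is summarized by its set of terms and that Jaccard similarity is computed on those sets; the text does not say whether terms, documents, or both are used. The function names and the toy topics are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_topics(topics, selected, k=5):
    """Return up to k (topic_id, similarity) pairs, most similar first."""
    scores = [
        (other, jaccard(topics[selected], terms))
        for other, terms in topics.items()
        if other != selected
    ]
    # Sort by decreasing similarity; break ties by topic id for determinism.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Toy data (assumption): each topic is reduced to a small set of terms.
topics = {
    "elections": {"obama", "vote", "campaign", "senate"},
    "economy": {"tax", "budget", "senate", "jobs"},
    "sports": {"match", "team", "goal"},
    "congress": {"senate", "vote", "bill", "budget"},
}

neighbors = most_similar_topics(topics, "elections", k=2)
```

With `k=5`, the same call would return the five links drawn when the analyst hovers over a tile.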
When the analyst selects a topic, Bimax [4], a pattern-based overlapping biclustering algorithm, extracts the topic variants by identifying all maximal combinations of terms shared by a maximal set of documents. While the exhaustiveness of Bimax may serve the needs of the analyst, it produces a very large number of biclusters. To make sense of the numerous Bimax biclusters, we organize them into a hierarchy based on term overlaps using the FPTree algorithm [3]. The resulting term hierarchy is represented as a sunburst visualization (3.1 in Fig. 1). The most common terms have a higher overlap degree and appear closer to the root, while the most distinctive terms are placed further away. Each path, from root to leaf, represents a unique association of terms grouped by one bicluster. As we move away from the root along a given path, the word combination becomes more specific and retains fewer documents. At the leaf level, only the documents of one bicluster are retained. By exploring this view and the coordinated comparator view (4), the journalist can focus on a specific aspect of a topic and examine all document relationships to identify facts or viewpoints related to their hypotheses.
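The hierarchy described above can be approximated by a simple prefix tree. The sketch below is not the FPTree algorithm of [3] in full; it only borrows its insertion order, assuming that the overlap degree of a term is the number of biclusters containing it and that each bicluster is given as a (terms, documents) pair of sets. All names and the toy biclusters are hypothetical.

```python
from collections import Counter


def build_term_hierarchy(biclusters):
    """Build a prefix tree of terms from a list of (terms, docs) set pairs."""
    # Overlap degree of a term (assumption): number of biclusters containing it.
    degree = Counter(t for terms, _ in biclusters for t in terms)
    root = {"term": None, "docs": set(), "children": {}}
    for terms, docs in biclusters:
        # Most shared terms first, so they end up close to the root;
        # alphabetical order breaks ties deterministically.
        ordered = sorted(terms, key=lambda t: (-degree[t], t))
        node = root
        node["docs"] |= docs
        for term in ordered:
            node = node["children"].setdefault(
                term, {"term": term, "docs": set(), "children": {}}
            )
            # A node retains the documents of every bicluster passing through it.
            node["docs"] |= docs
    return root


# Toy biclusters (assumption): three topic variants sharing the term "obama".
biclusters = [
    ({"obama", "campaign", "vote"}, {"d1", "d2"}),
    ({"obama", "campaign", "debate"}, {"d2", "d3"}),
    ({"obama", "senate"}, {"d4"}),
]

root = build_term_hierarchy(biclusters)
obama = root["children"]["obama"]
campaign = obama["children"]["campaign"]
```

In this toy example, "obama" sits directly under the root and retains all four documents, "campaign" retains three, and each leaf retains only the documents of its own bicluster, which mirrors the narrowing described in the text.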
The text of the documents can be read in the Document Detail View. In addition, we provide multiple interaction modes, illustrated in Fig. 2. Hovering over a term in the hierarchy colors all its occurrences in red (3.3 in Fig. 1) and shows the corresponding term sequence on the right (3.2). The comparator view allows the analyst to examine the common and distinctive terms as well as the distribution of documents across the selected topic variants. Multiple sorting strategies are available to help identify the most informative terms.
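The kind of comparison offered by the comparator view can be sketched as follows. This is an illustration only: it assumes that a term is "common" when it appears in every selected variant and "distinctive" when it appears in exactly one, which the text does not define formally. The function name and the toy variants are hypothetical.

```python
from collections import Counter


def compare_variants(variants):
    """variants maps a variant name to a (terms, docs) pair of sets."""
    term_counts = Counter(t for terms, _ in variants.values() for t in terms)
    n = len(variants)
    # Common terms (assumption): present in every selected variant.
    common = {t for t, c in term_counts.items() if c == n}
    # Distinctive terms (assumption): present in exactly one variant.
    distinctive = {
        name: {t for t in terms if term_counts[t] == 1}
        for name, (terms, _) in variants.items()
    }
    # Distribution of documents: which variants each document belongs to.
    doc_membership = {}
    for name, (_, docs) in variants.items():
        for doc in sorted(docs):
            doc_membership.setdefault(doc, []).append(name)
    return common, distinctive, doc_membership


# Toy selection (assumption): two variants of the same topic.
variants = {
    "A": ({"obama", "campaign", "vote"}, {"d1", "d2"}),
    "B": ({"obama", "campaign", "debate"}, {"d2", "d3"}),
}

common, distinctive, doc_membership = compare_variants(variants)
```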
3 Parameter Setting by the User
The number of Bimax biclusters grows with the size or the density of the diagonal bicluster blocks and can exceed ten thousand. To reduce this number, we allow the user to modify the parameters of Bimax: the minimum number of terms or documents per bicluster (MinT, MinD) and the maximum number of biclusters (MaxB). As Bimax operates on binary matrices, we also enable the user to change the binarization threshold (Thr) applied to the TF-IDF weights. Increasing the threshold selects, for each document, the most representative terms and reduces the density and the dimensions of the matrix.
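The role of these four parameters can be illustrated with a small sketch. It is not the Bimax implementation: it assumes a strict comparison against Thr, TF-IDF weights stored as nested dictionaries, and a MaxB cut-off that simply keeps the first biclusters in the order they were found. The function names, the weights and the data layout are hypothetical.

```python
def binarize(tfidf, thr):
    """Keep, for each document, the terms whose TF-IDF weight exceeds thr."""
    # Strict comparison is an assumption; the text only names the threshold.
    return {
        doc: {term for term, weight in weights.items() if weight > thr}
        for doc, weights in tfidf.items()
    }


def filter_biclusters(biclusters, min_t, min_d, max_b):
    """Apply MinT, MinD and MaxB to a list of (terms, docs) set pairs."""
    kept = [
        (terms, docs)
        for terms, docs in biclusters
        if len(terms) >= min_t and len(docs) >= min_d
    ]
    # Assumption: MaxB truncates the list in discovery order.
    return kept[:max_b]


# Toy TF-IDF weights (assumption): two documents, arbitrary scale.
tfidf = {
    "d1": {"obama": 7.2, "vote": 5.5, "the": 0.3},
    "d2": {"obama": 6.1, "vote": 2.0, "senate": 8.4},
}

low_threshold = binarize(tfidf, thr=5)
high_threshold = binarize(tfidf, thr=7)

candidates = [
    ({"obama", "vote", "campaign"}, {"d1", "d2", "d3", "d4"}),
    ({"obama", "vote"}, {"d1", "d2", "d3", "d4", "d5"}),
    ({"obama", "senate", "bill"}, {"d2", "d6"}),
]
selected = filter_biclusters(candidates, min_t=3, min_d=4, max_b=10)
```

Raising `thr` from 5 to 7 leaves a single term per document in this toy example, which shows how a higher threshold makes the binary matrix sparser before the biclustering runs.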
In Fig. 3, we visualize the effect of varying each parameter separately on the term hierarchy built from the U.S. presidential elections topic. After each parameter variation, the root node “Obama” is clicked to highlight in orange the distribution of the selected documents. With the default parameters (\(MinT=3\), \(MinD=4\), \(Thr=5\)), only the first levels of the 13,000 biclusters are visible in the sunburst visualization. Increasing either Thr or MinT reduces the dispersion of the documents concerning “Obama”, but changing Thr preserves the variety in the number of terms. As MinD increases, the number of terms tends to decrease, but the documents selected by the node “Obama” remain largely dispersed across the biclusters until the node disappears.
References
Ailem, M., Role, F., Nadif, M.: Co-clustering document-term matrices by direct maximization of graph modularity. In: Proceedings of the 24th ACM International on CIKM, CIKM 2015, pp. 1807–1810. ACM, NY (2015)
Ghoniem, M., Cornil, M., Broeksema, B., Stefas, M., Otjacques, B.: Weighted maps: treemap visualization of geolocated quantitative data. In: IS&T/SPIE Electronic Imaging, p. 93970G–93970G. Int. Soc. for Optics and Photonics (2015)
Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD 2000, pp. 1–12. ACM, NY (2000)
Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–1129 (2006)
Santamaría, R., Therón, R., Quintales, L.: A visual analytics approach for understanding biclustering results from microarray data. BMC Bioinform. 9(1), 247 (2008)
Sun, M., Mi, P., North, C., Ramakrishnan, N.: BiSet: semantic edge bundling with biclusters for sensemaking. IEEE TVCG PP(99), 1 (2015)
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Médoc, N., Ghoniem, M., Nadif, M. (2016). Exploratory Analysis of Text Collections Through Visualization and Hybrid Biclustering. In: Berendt, B., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9853. Springer, Cham. https://doi.org/10.1007/978-3-319-46131-1_13
DOI: https://doi.org/10.1007/978-3-319-46131-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46130-4
Online ISBN: 978-3-319-46131-1
eBook Packages: Computer Science (R0)