Exploratory Analysis of Text Collections Through Visualization and Hybrid Biclustering

Médoc, Nicolas; Ghoniem, Mohammad; Nadif, Mohamed

doi:10.1007/978-3-319-46131-1_13

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9853)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

We propose a visual analytics tool to support analytic journalists in the exploration of large text corpora. Our tool combines graph modularity-based diagonal biclustering, which extracts high-level topics, with overlapping biclustering, which elicits fine-grained topic variants. A hybrid topic treemap visualization gives the analyst an overview of all topics. Coordinated sunburst and heatmap visualizations let the analyst inspect and compare topic variants and access document content on demand.

1 Introduction

We present a visual analytics tool designed to help analytic journalists explore large text corpora. Analytic journalists typically start by getting an overview of the field under investigation, then focus on specific aspects to identify facts and viewpoints that verify, refine, or refute their hypotheses. Text corpora are often modeled as Term \(\times \) Document matrices, from which topics may be extracted using graph modularity-based diagonal biclustering [1]. Word cloud views are popular representations of individual topics and have been extended in many ways. In the use case considered here, the journalist needs to grasp dozens of topics at a glance and appreciate their relative importance. A good visualization may further ease this task by displaying topic relationships. Once the journalist has identified a topic of interest, the concern shifts to understanding topic variants and identifying the distinctive documents and terms of each. The visualization of overlapping biclusters has been approached in various ways, e.g., transparent overlapping hulls in node-link diagrams, matrix visualizations, and parallel coordinates by Santamaría et al. [5]. BiSet [6] uses chained bipartite graphs enhanced with semantic bundles to represent chained bicluster relationships. These representations fail to convey an overview of a large number of overlapping biclusters while also identifying common and distinctive terms and documents.

2 Tool Overview

To support the topic mapping task, we apply diagonal biclustering based on graph modularity [1] to the Term \(\times \) Document matrix. The Weighted Topic Map visualization in Fig. 1 is a hybrid treemap view in which rectangular tiles represent individual topics, tile area encodes topic importance, and topic details are shown as a nested word cloud. The size and color of a term reflect its representativeness of the topic and the number of documents in which it appears. An MDS projection computed from the similarity matrix of the diagonal biclusters generates 2D positions, which are fed to the Weighted Map visualization algorithm [2]. As a result, similar topics are placed in adjacent tiles. When the analyst hovers over a topic, Jaccard similarity is used to display links to the five most similar topics, as shown in Fig. 1. Showing topic relationships aims to alleviate the hard partitioning imposed by the diagonal biclustering. This overview enables the analyst to discover the main topics and select one for further scrutiny.
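The hover interaction described above can be illustrated with a minimal sketch. It assumes that each topic is represented by its set of terms; the text does not state which sets the Jaccard similarity is computed on, so this representation, the function names, and the toy topics are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_topics(
    topics: Dict[str, Set[str]], selected: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Return the k topics most similar to `selected`, best first.

    Assumption: a topic is described by a set of terms (hypothetical choice).
    Ties are broken alphabetically so the result is deterministic.
    """
    scores = [
        (name, jaccard(topics[selected], members))
        for name, members in topics.items()
        if name != selected
    ]
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:k]


# Toy data, invented for illustration only.
topics = {
    "elections": {"obama", "romney", "vote", "campaign"},
    "economy": {"tax", "budget", "vote", "campaign"},
    "sports": {"match", "league", "goal"},
}
links = most_similar_topics(topics, "elections", k=2)
```

In the tool, `k` would be five and the returned pairs would drive the links drawn between tiles; the sketch only covers the ranking step, not the MDS layout or the Weighted Map algorithm.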

When the analyst selects a topic, Bimax [4], a pattern-based overlapping biclustering algorithm, extracts the topic variants by identifying all maximal combinations of terms shared by a maximal set of documents. While the exhaustiveness of Bimax may serve the needs of the analyst, it produces a very large number of biclusters. To make sense of the numerous Bimax biclusters, we organize them into a hierarchy based on term overlaps using the FPTree algorithm [3]. The resulting term hierarchy is represented as a sunburst visualization (3.1 in Fig. 1). The most common terms have a higher overlap degree and appear closer to the root, while the most distinctive terms are placed further away. Each path from root to leaf represents a unique association of terms grouped by one bicluster. As we move away from the root along a given path, the word combination becomes more specific and retains fewer documents. At the leaf level, only the documents of one bicluster are retained. By exploring this view and the coordinated comparator view (4), the journalist can focus on a specific aspect of a topic and examine all document relationships to identify facts or viewpoints related to their hypotheses.
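The following sketch illustrates how a set of overlapping biclusters could be arranged into such a term hierarchy. It is a simplified prefix tree in the spirit of FP-tree insertion, not the authors' implementation: terms of each bicluster are ordered by decreasing overlap degree (here assumed to be the number of biclusters containing the term) and inserted along a shared path. The biclusters are invented toy data and are given directly rather than computed by Bimax.

```python
from collections import Counter
from typing import FrozenSet, List, Tuple

# A bicluster is modeled as (terms, documents); this pairing is an assumption.
Bicluster = Tuple[FrozenSet[str], FrozenSet[str]]


def build_term_hierarchy(biclusters: List[Bicluster]) -> dict:
    """Build a prefix tree of terms from overlapping biclusters.

    Terms shared by many biclusters end up near the root; each node keeps
    the union of documents of all biclusters passing through it.
    """
    # Overlap degree of a term = number of biclusters that contain it.
    overlap = Counter(term for terms, _ in biclusters for term in terms)
    root = {"term": None, "docs": set(), "children": {}}
    for terms, docs in biclusters:
        node = root
        node["docs"] |= docs
        # Most common terms first; alphabetical order breaks ties.
        for term in sorted(terms, key=lambda t: (-overlap[t], t)):
            node = node["children"].setdefault(
                term, {"term": term, "docs": set(), "children": {}}
            )
            node["docs"] |= docs
    return root


# Toy biclusters, invented for illustration only.
biclusters = [
    (frozenset({"obama", "romney", "debate"}), frozenset({"d1", "d2"})),
    (frozenset({"obama", "romney", "poll"}), frozenset({"d2", "d3"})),
    (frozenset({"obama", "tax"}), frozenset({"d4"})),
]
root = build_term_hierarchy(biclusters)
obama = root["children"]["obama"]
romney = obama["children"]["romney"]
```

Walking from `obama` to `romney` to `debate` narrows the document set from four documents to two, which mirrors the behavior described above: deeper nodes correspond to more specific term combinations that retain fewer documents.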

The text of the documents can be read in the Document Detail View. In addition, we provide multiple interaction modes, illustrated in Fig. 2. Hovering over a term in the hierarchy colors all its occurrences in red (3.3 in Fig. 1) and shows the corresponding term sequence on the right (3.2). The comparator view lets the analyst examine the common and distinctive terms as well as the distribution of documents across the selected topic variants. Multiple sorting strategies are offered to facilitate the identification of the most informative terms.

3 Parameter Setting by the User

The number of Bimax biclusters increases with the size or the density of the diagonal bicluster blocks and can exceed ten thousand. To reduce this number, we allow the user to modify the parameters of Bimax: the minimum number of terms or documents per bicluster (MinT, MinD) and the maximum number of biclusters (MaxB). As Bimax operates on binary matrices, we also enable the user to change the binarization threshold (Thr) applied to the TF-IDF weights. Increasing the threshold selects, for each document, the most representative terms and reduces the density and the dimensions of the matrix.
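The effect of the binarization threshold can be sketched as follows. The sketch assumes that a weight greater than or equal to Thr maps to 1 and that terms or documents left without any 1 are dropped; the exact comparison and pruning rule are not stated in the text, and the matrix values and function names are invented for illustration.

```python
from typing import List

Matrix = List[List[float]]


def binarize(tfidf: Matrix, thr: float) -> List[List[int]]:
    """Map each TF-IDF weight to 1 if it reaches thr, else 0 (assumed rule)."""
    return [[1 if weight >= thr else 0 for weight in row] for row in tfidf]


def density(binary: List[List[int]]) -> float:
    """Fraction of cells equal to 1 in the binary matrix."""
    cells = sum(len(row) for row in binary)
    return sum(map(sum, binary)) / cells if cells else 0.0


def prune_empty(binary: List[List[int]]) -> List[List[int]]:
    """Drop rows (terms) and columns (documents) that contain no 1."""
    rows = [row for row in binary if any(row)]
    if not rows:
        return []
    cols = [j for j in range(len(rows[0])) if any(row[j] for row in rows)]
    return [[row[j] for j in cols] for row in rows]


# Rows are terms, columns are documents; the weights are toy values.
tfidf = [
    [6.0, 1.0, 7.5],
    [2.0, 5.5, 0.5],
    [0.2, 0.4, 1.0],
]
low = binarize(tfidf, thr=1.0)
high = binarize(tfidf, thr=5.0)
```

With these toy values, raising the threshold from 1.0 to 5.0 lowers the density from 6/9 to 3/9 and, after pruning, removes the third term, so the matrix handed to the biclustering step is both sparser and smaller.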

In Fig. 3, we visualize the effect of varying each parameter separately on the term hierarchy built from the U.S. presidential elections topic. After each parameter variation, the root node “Obama” is clicked to highlight in orange the distribution of the selected documents. With the default parameters (\(MinT=3\), \(MinD=4\), \(Thr=5\)), only the first levels of the 13,000 biclusters are visible in the sunburst visualization. Increasing either Thr or MinT reduces the dispersion of the documents concerning “Obama”, but varying Thr preserves the variety in the number of terms. As MinD increases, the number of terms tends to decrease, but the documents selected by the node “Obama” remain largely dispersed across the biclusters until the node disappears.

References

1. Ailem, M., Role, F., Nadif, M.: Co-clustering document-term matrices by direct maximization of graph modularity. In: Proceedings of the 24th ACM International on CIKM, CIKM 2015, pp. 1807–1810. ACM, NY (2015)
2. Ghoniem, M., Cornil, M., Broeksema, B., Stefas, M., Otjacques, B.: Weighted maps: treemap visualization of geolocated quantitative data. In: IS&T/SPIE Electronic Imaging, p. 93970G–93970G. Int. Soc. for Optics and Photonics (2015)
3. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD 2000, pp. 1–12. ACM, NY (2000)
4. Prelić, A., Bleuler, S., Zimmermann, P., Wille, A., Bühlmann, P., Gruissem, W., Hennig, L., Thiele, L., Zitzler, E.: A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics 22(9), 1122–1129 (2006)
5. Santamaría, R., Therón, R., Quintales, L.: A visual analytics approach for understanding biclustering results from microarray data. BMC Bioinform. 9(1), 247 (2008)
6. Sun, M., Mi, P., North, C., Ramakrishnan, N.: BiSet: semantic edge bundling with biclusters for sensemaking. IEEE TVCG PP(99), 1 (2015)

Author information

Authors and Affiliations

LIPADE, University of Paris Descartes, 45, rue des saints pères, 75006, Paris, France
Nicolas Médoc & Mohamed Nadif
ERIN-eScience, Luxembourg Institute of Science and Technology, 41, rue du Brill, 4422, Belvaux, Luxembourg
Nicolas Médoc & Mohammad Ghoniem

Corresponding author

Correspondence to Nicolas Médoc.

Editor information

Editors and Affiliations

Department of Computer Science, KU Leuven, Leuven, Belgium
Bettina Berendt
Deloitte GmbH, München, Germany
Björn Bringmann
Laboratoire Hubert Curien, Jean Monnet University, Saint-Etienne, France
Élisa Fromont
Allianz SE, Munich, Germany
Gemma Garriga
Max-Planck-Institute for Informatics, Saarbrücken, Germany
Pauli Miettinen
Aalto University School of Science, Espoo, Finland
Nikolaj Tatti
Siemens AG & Lud. Max. Univ. of Munich, Munich, Germany
Volker Tresp

Cite this paper

Médoc, N., Ghoniem, M., Nadif, M. (2016). Exploratory Analysis of Text Collections Through Visualization and Hybrid Biclustering. In: Berendt, B., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9853. Springer, Cham. https://doi.org/10.1007/978-3-319-46131-1_13

DOI: https://doi.org/10.1007/978-3-319-46131-1_13
Published: 03 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46130-4
Online ISBN: 978-3-319-46131-1
eBook Packages: Computer Science, Computer Science (R0)
