1 Introduction

Celebrating the Web’s \(30^{th}\) anniversary in 2019, we still observe enormous growth in the Web content being created and, at the same time, made available for consumption. This novel data source is a blessing and a curse at the same time. On the one hand, we benefit from a vast amount of information accessible 24/7 all over the planet. On the other hand, we might be overwhelmed by the sheer amount of data. Efficient and smart approaches are therefore required to help us “digest” this huge quantity of data in a “healthy manner”. Our hypothesis is that the named entities contained in a piece of Web content carry its inherent semantics. To exploit this, we combine named entity disambiguation (e.g., AIDA [6] or DBpedia Spotlight [10]) with freely available knowledge bases (KBs) such as DBpedia [1] or YAGO [5]. On this basis, we have previously introduced semantic fingerprinting [3, 4] as a general-purpose approach towards Web content classification.

In this paper, we introduce the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system as an extension of semantic fingerprinting. CALVADOS is a novel approach that aims at distilling and visualizing semantics of documents by exploiting entity-level analytics for a user-friendly “digestion”. To this end, our demonstration paper makes the following salient contributions:

  • use of semantic fingerprinting to capture content’s (inherent) semantics

  • visualization & exploration of (inter-) dependencies among entities contained

  • provisioning of contextual KB data (e.g., types) supporting data digestion

2 Related Work

Most approaches dealing with the analytics of Web content focus on the automatic classification of text into predefined categories [7]. Document classification using machine learning methods has been studied by [11]. Recent approaches employ deep neural networks, which require a vast amount of training data. Examples include utilizing the word order of the textual data [8] or a hierarchical attention network [12]. In contrast to these works, our approach exploits entity-level semantics to obtain a better document representation (i.e., the semantic fingerprint [3]).

Exploiting entity-level semantics is possible thanks to recent named entity recognition and disambiguation (NERD) systems. Several tools, like AIDA [6] or DBpedia Spotlight [10] (also employed in the CALVADOS system), identify the mentions of named entities in a text and map them onto their canonical entities in a knowledge base (KB), for instance YAGO [5] or DBpedia [1]. There are also recent works, such as EARL [2] and FALCON, which aim to disambiguate entities in short text. Further, information from the linked KBs can be utilized.

To the best of our knowledge, the approach most similar to our work is TagTheWeb [9]. This tool aims at identifying the topics associated with documents. It relies on the knowledge expressed by the taxonomic structure of Wikipedia and generates a fingerprint from the semantic relations between nodes of the Wikipedia Category Graph. Compared to this work, our semantic fingerprint is based on more fine-grained categories. In addition, our tool allows users to semantically compare two documents.

3 Conceptual Approach Overview of CALVADOS

The goal of CALVADOS is to digest the semantics of Web content and provide visualizations that facilitate content consumption. The backbone of the CALVADOS system is semantic fingerprinting, a fine-grained type classification approach based on the hypothesis that “You know a document by the named entities it contains”. The general approach is briefly explained in this section; more details can be found in [4]. In short, the semantics of a document is captured by a semantic fingerprint, i.e., a vector that encodes the core semantics of the document based on the type information of the entities it contains. Ambiguity among named entities is handled by the aforementioned standard NERD systems. The actual fine-grained type prediction via semantic fingerprinting proceeds in two steps. First, the document’s semantic fingerprint is computed. For this purpose, we perform entity-level analytics on the entities contained and, in particular, exploit the type information from the knowledge base YAGO. Second, we employ a random forest classifier to predict the top-level type of the document. Once this type is identified, the system aims to find the most suitable fine-grained sub-type. To do so, the cosine similarity between the fingerprint of the document and the representative vector of each sub-type is computed, and the sub-type with the highest score is selected. For example, an article about a football game can be predicted as an event in the top-level prediction, and further aligned to the more specific type game in the second step. CALVADOS utilizes these semantic fingerprints to semantically analyze and digest Web content.
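The sub-type selection described above can be illustrated with a minimal sketch. It assumes that a fingerprint is a normalized frequency vector over a fixed list of types and that the top-level type has already been predicted (the random forest step is omitted). The exact fingerprint encoding is given in [4]; the function names, the type vocabulary, and all numbers below are hypothetical.

```python
import math
from collections import Counter


def fingerprint(entity_types, type_index):
    """Normalized type-frequency vector (an assumed encoding, not the one from [4]).

    entity_types: one list of type labels per disambiguated entity.
    type_index:   mapping from type label to vector position.
    """
    counts = Counter(t for types in entity_types for t in types if t in type_index)
    total = sum(counts.values())
    vec = [0.0] * len(type_index)
    for label, count in counts.items():
        vec[type_index[label]] = count / total
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def best_subtype(doc_vec, subtype_vectors):
    """Return the sub-type whose representative vector is most similar to doc_vec."""
    return max(subtype_vectors, key=lambda name: cosine(doc_vec, subtype_vectors[name]))


# Hypothetical type vocabulary and entity types of a football article.
type_index = {"club": 0, "contest": 1, "politician": 2, "party": 3}
doc_vec = fingerprint([["club"], ["club", "contest"], ["contest"], ["politician"]], type_index)

# Hypothetical representative vectors for sub-types of the top-level type "event".
event_subtypes = {
    "game": [0.5, 0.5, 0.0, 0.0],
    "election": [0.0, 0.0, 0.5, 0.5],
}
chosen = best_subtype(doc_vec, event_subtypes)
```

With these made-up vectors, the article is aligned to the sub-type game, mirroring the football example in the text.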

4 Demonstration

The goal of CALVADOS is to help users digest documents via entity-level analytics. To this end, entity information is extracted and visualized. In particular, various interactive visualizations are provided:

  • the semantic fingerprint of a document

  • the tag cloud based on the named entities contained

  • statistics about similarity with other types showcasing the document’s “flavor”

  • an opportunity to compare two documents based on their inherent semantics

As such, this demo comprises two use cases of the CALVADOS system (cf. Figure 2 for a screenshot of the Web interface or https://calvados.greyc.fr/ for an online demonstrator). The first use case facilitates content digestion of an individual document (cf. Subsect. 4.1). The second use case allows users to compare the semantics of two different documents (cf. Subsect. 4.2).

The overall pipeline of CALVADOS works in four stages. In the first stage, the user inputs the document, and preprocessing (boilerplate removal, NERD via DBpedia Spotlight, etc.) is performed on the document content. The second stage involves the computation of the semantic fingerprint for the document. Subsequently, in the third stage, the relevant fine-grained types for the document are predicted. Finally, the semantic fingerprint and the predictions generated in the previous stages are visualized to serve the overall goal of simplified content digestion based on semantic distillation. Visualizations are constructed with the JavaScript library Data-Driven Documents (D3.js). Figure 1 depicts an overview of the aforementioned stages.

Fig. 1. Conceptual overview of the CALVADOS pipeline

4.1 Use Case I: Content Digestion via Semantic Distillation

The first use case of CALVADOS is content digestion via semantic distillation. The user can input the content by providing a reference URL to the document or by uploading the raw text itself. CALVADOS performs the entity-level analytics on the document via semantic fingerprinting. As a result, the system offers various visualizations in order to provide user-friendly content consumption. The focal point here is the visualization of the semantic fingerprint, which depicts the associated types based on the underlying type hierarchy. This graphical metaphor allows the user to understand the document’s constituents on a semantic level. For example, a news article about Theresa May comprises a combination of various political parties, administrative districts, skilled workers, etc. Further, CALVADOS also provides an “entity cloud” based on the named entities contained, and highlights those types that are conceptually similar on the entity level. Figure 2a displays a screenshot of the previously mentioned news article in CALVADOS.
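As an illustration of the entity cloud, the following sketch derives display sizes from how often each disambiguated entity occurs in a document. The text does not state how CALVADOS weights the cloud, so the linear scaling, the function name cloud_sizes, the size range, and the sample entities are all assumptions.

```python
from collections import Counter


def cloud_sizes(entities, min_size=12, max_size=36):
    """Map each entity to a font size, scaled linearly by its mention count.

    This weighting is an assumption; the actual scheme is not described in the text.
    """
    counts = Counter(entities)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    sizes = {}
    for entity, count in counts.items():
        if span == 0:
            sizes[entity] = float(max_size)  # all entities are equally frequent
        else:
            sizes[entity] = min_size + (count - lowest) * (max_size - min_size) / span
    return sizes


# Made-up entity mentions for a political news article.
mentions = ["Theresa May", "Theresa May", "Theresa May", "Brexit", "Brexit", "London"]
sizes = cloud_sizes(mentions)
```

In this made-up example, the most frequent entity receives the largest size and the least frequent one the smallest.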

Fig. 2. Screenshots of CALVADOS

4.2 Use Case II: Comparison of Document Semantics

The second use case of CALVADOS is semantic document comparison. To this end, we enable the end user to analyze the overlap and differences between two documents at the semantic level. This is achieved by visualizing the semantic fingerprints of both documents simultaneously as an “overlay”. Further, an entity cloud of the intersecting named entities is provided. Here, it can easily be observed that the semantic fingerprints provide more insight than the plain entity mentions. In addition, information about the most similar types associated with both documents is provided in order to disclose their overall “flavor”. For example, when comparing the previous news article about Theresa May with a Manchester City FC article, there are visible differences: the former is aligned towards political parties, skilled workers, etc., whereas the latter is aligned towards contests, clubs, etc. (cf. Figure 2b). Finally, the quantified similarity based on the semantic fingerprints is indicated as well.
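The comparison can be sketched as follows, assuming fingerprints are stored as mappings from type to weight and that the reported similarity is the cosine similarity of the two fingerprints. The text only confirms cosine similarity for sub-type selection, so its use here is an assumption, as are the function names and all sample values.

```python
import math


def cosine(fp_a, fp_b):
    """Cosine similarity of two sparse fingerprints given as {type: weight} dicts."""
    dot = sum(weight * fp_b.get(label, 0.0) for label, weight in fp_a.items())
    norm_a = math.sqrt(sum(w * w for w in fp_a.values()))
    norm_b = math.sqrt(sum(w * w for w in fp_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compare(fp_a, entities_a, fp_b, entities_b):
    """Collect the data behind the overlay, the shared entity cloud, and the score."""
    all_types = sorted(set(fp_a) | set(fp_b))
    return {
        # Per-type weight pairs, suitable for drawing both fingerprints on top of each other.
        "overlay": {t: (fp_a.get(t, 0.0), fp_b.get(t, 0.0)) for t in all_types},
        # Named entities that occur in both documents.
        "shared_entities": sorted(set(entities_a) & set(entities_b)),
        # Single similarity value for the document pair.
        "similarity": cosine(fp_a, fp_b),
    }


# Made-up fingerprints and entities for a political and a football article.
politics_fp = {"party": 0.5, "skilled worker": 0.3, "administrative district": 0.2}
football_fp = {"club": 0.5, "contest": 0.4, "administrative district": 0.1}
result = compare(
    politics_fp, ["Theresa May", "Conservative Party", "Manchester"],
    football_fp, ["Manchester City F.C.", "Manchester"],
)
```

With these made-up inputs, the two fingerprints overlap only in one type, so the similarity value is low even though the documents share a named entity.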

5 Conclusion and Outlook

In this paper, we introduced CALVADOS, a tool for the semantic analysis and digestion of Web content. CALVADOS lifts document analysis to the entity level and utilizes semantic fingerprints to capture the inherent semantics of Web content. In future work, we will pursue two directions. Since entity-level analytics is language-agnostic in itself, we aim to transfer CALVADOS to other languages (e.g., French). In addition, we will study document similarity on the entity level, e.g., in the context of fake news detection.