CALVADOS: A Tool for the Semantic Analysis and Digestion of Web Contents

Govind; Kumar, Amit; Alec, Céline; Spaniol, Marc

doi:10.1007/978-3-030-32327-1_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11762)

Included in the following conference series:

European Semantic Web Conference

Abstract

Web users these days are confronted with an abundance of information. While this is clearly beneficial in general, there is a risk of “information overload”. Consequently, there is an increasing need for filtering, classifying and/or summarizing Web contents automatically. In order to help consumers efficiently derive the semantics from Web contents, we have developed the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system. CALVADOS raises contents to the entity-level and digests their inherent semantics. In this demo, we present how entity-level analytics can be employed to automatically classify the main topic of a Web content and reveal the semantic building blocks associated with the corresponding document.

1 Introduction

Celebrating the Web’s 30th anniversary in 2019, we still observe a gigantic growth in Web contents being created and, at the same time, being available for consumption. This novel data source is a blessing and a curse at the same time. On the one hand, we benefit from a vast amount of information accessible 24/7 all over the planet. On the other hand, we might be overwhelmed by the sheer amount of data. To this end, efficient and smart approaches are required in order to help us “digest” this huge quantity of data in a “healthy manner”. Our hypothesis, therefore, is that the named entities contained in a Web content carry its inherent semantics. In order to exploit these entities, we combine named entity disambiguation (e.g., AIDA [6] or DBpedia Spotlight [10]) with freely available knowledge bases (KBs) such as DBpedia [1] or YAGO [5]. As a result, we have previously introduced semantic fingerprinting [3, 4] as a general purpose approach towards Web content classification.

In this paper, we introduce the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system as an extension of semantic fingerprinting. CALVADOS is a novel approach that aims at distilling and visualizing semantics of documents by exploiting entity-level analytics for a user-friendly “digestion”. To this end, our demonstration paper makes the following salient contributions:

use of semantic fingerprinting to capture content’s (inherent) semantics
visualization & exploration of (inter-) dependencies among entities contained
provisioning of contextual KB data (e.g., types) supporting data digestion

2 Related Work

Most of the approaches dealing with analytics of Web contents focus on automatic classification of text into predefined categories [7]. Document classification using machine learning methods has been studied by [11]. Recent approaches employ deep neural networks requiring a vast amount of training data. Examples include utilizing the word order of the textual data [8] or a hierarchical attention network [12]. In contrast to these works, our approach exploits entity-level semantics to capture a better document representation (i.e., the semantic fingerprint [3]).

Exploiting entity-level semantics is possible thanks to recent named entity recognition and disambiguation (NERD) systems. Several tools, like AIDA [6] or DBpedia Spotlight [10] (also employed in the CALVADOS system), identify the mentions of named entities in a text and map them onto their canonical entities in a knowledge base (KB), for instance YAGO [5] or DBpedia [1]. There are recent works, such as EARL [2] and FALCON (see Note 1), which aim to disambiguate entities in short text. Further, information from the linked KBs can be utilized.

To the best of our knowledge, the approach most similar to our work is TagTheWeb [9]. This tool aims at identifying topics associated with documents. It relies on the knowledge expressed by the taxonomic structure of Wikipedia, based on the generation of a fingerprint through the semantic relation between nodes of the Wikipedia Category Graph. Compared to this work, our semantic fingerprint is based on more fine-grained categories. Besides, our tool also allows users to semantically compare two documents.

3 Conceptual Approach Overview of CALVADOS

The goal of CALVADOS is to digest the semantics of a Web content and provide visualizations to facilitate content consumption. The backbone of the CALVADOS system is semantic fingerprinting, a fine-grained type classification approach based on the hypothesis that “You know a document by the named entities it contains”. The general approach is briefly explained in this section; more details can be found in [4]. In short, the semantics of a document is captured by a semantic fingerprint, i.e., a vector that encodes the core semantics of the document based on the type information of the entities it contains. The ambiguity among named entities is handled using the aforementioned standard NERD systems. The actual fine-grained type prediction via semantic fingerprinting consists of two steps. First, the document’s semantic fingerprint is computed. For this purpose, we perform entity-level analytics of the entities contained and, in particular, exploit the type information from the knowledge base YAGO. Second, we employ a random forest classifier to predict the top-level type of the document. Once it is identified, the system aims to find the most suitable fine-grained sub-type. In order to do so, the cosine similarity between the fingerprint of the document and the representative vector of each sub-type is computed, and the sub-type with the highest score is selected. For example, an article about some football game can be predicted as an event in the top-level prediction, and further aligned to the more specific type game in the second step. CALVADOS utilizes the aforementioned semantic fingerprints to semantically analyze and digest Web contents.
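The sub-type alignment step can be illustrated with a minimal Python sketch. It assumes, for simplicity, that a fingerprint is a plain vector of type counts; the actual fingerprint construction is described in [4] and may differ. The type vocabulary, the entity-to-type lookup and the representative sub-type vectors are hypothetical values chosen for illustration, and the random forest step is skipped by assuming that the top-level type event has already been predicted.

```python
import math
from collections import Counter

# Hypothetical type dimensions and entity -> type lookup (illustrative only;
# the paper derives type information from YAGO).
TYPE_VOCAB = ["club", "contest", "political_party", "politician"]
ENTITY_TYPES = {
    "Manchester_City_F.C.": ["club"],
    "FA_Cup": ["contest"],
    "Conservative_Party_(UK)": ["political_party"],
    "Theresa_May": ["politician"],
}

# Hypothetical representative vectors for two sub-types of the
# top-level type "event" (assumed to be predicted beforehand).
SUBTYPE_VECTORS = {
    "game": [0.7, 0.7, 0.0, 0.0],
    "election": [0.0, 0.1, 0.7, 0.7],
}


def fingerprint(entities):
    """Count how often each type occurs among the document's entities."""
    counts = Counter(t for e in entities for t in ENTITY_TYPES.get(e, []))
    return [float(counts[t]) for t in TYPE_VOCAB]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_subtype(doc_vector, subtype_vectors):
    """Return the sub-type whose representative vector is most similar."""
    return max(subtype_vectors, key=lambda name: cosine(doc_vector, subtype_vectors[name]))


# A toy "football article": two mentions of a club and one of a contest.
doc_entities = ["Manchester_City_F.C.", "FA_Cup", "Manchester_City_F.C."]
doc_vector = fingerprint(doc_entities)
print(best_subtype(doc_vector, SUBTYPE_VECTORS))  # prints "game"
```

With these toy values, the football document is aligned to game rather than election, mirroring the example given in the text.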

4 Demonstration

The goal of CALVADOS is to help users in digesting documents via entity-level analytics. To this end, entity information is extracted and visualized. In particular, various interactive visualizations are provided:

the semantic fingerprint of a document
the tag cloud based on the named entities contained
statistics about similarity with other types showcasing the document’s “flavor”
an opportunity to compare two documents based on their inherent semantics

As such, this demo comprises two use cases of the CALVADOS system (cf. Figure 2 for a screenshot of the Web interface or https://calvados.greyc.fr/ for an online demonstrator). The first use case facilitates content digestion of an individual document (cf. Subsect. 4.1). The second use case allows users to compare the semantics of two different documents (cf. Subsect. 4.2).

The overall pipeline of CALVADOS works in four stages. In the first stage, the user inputs the document, and preprocessing (boilerplate removal, NERD via DBpedia Spotlight, etc.) on the document content is performed. The following stage then involves the computation of the semantic fingerprint for the concerned document. Subsequently, in the third stage, the relevant fine-grained types for the document are predicted. Finally, the semantic fingerprint and predictions generated in the previous stages are visualized to serve the overall goal of a simplified content digestion based on a semantic distillation. Visualizations are constructed by utilizing the JavaScript library Data-Driven Documents (D3.js, see Note 2). Figure 1 depicts an overview of the aforementioned stages.
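The four stages can be sketched as a chain of small Python functions. This is only an illustration of the data flow, not the CALVADOS implementation: all function names are hypothetical, a regular expression stands in for boilerplate removal, a tiny gazetteer stands in for a NERD service such as DBpedia Spotlight, and picking the most frequent type stands in for the actual classifier.

```python
import json
import re
from collections import Counter

# Hypothetical gazetteer standing in for a NERD service (mention -> entity).
GAZETTEER = {
    "Theresa May": "Theresa_May",
    "Conservative Party": "Conservative_Party_(UK)",
    "Labour Party": "Labour_Party_(UK)",
}

# Hypothetical entity -> type lookup standing in for the knowledge base.
ENTITY_TYPES = {
    "Theresa_May": ["politician"],
    "Conservative_Party_(UK)": ["political_party"],
    "Labour_Party_(UK)": ["political_party"],
}


def preprocess(raw_html):
    """Stage 1: strip markup (crude boilerplate removal) and link entities."""
    text = re.sub(r"<[^>]+>", " ", raw_html)
    return [entity for mention, entity in GAZETTEER.items() if mention in text]


def compute_fingerprint(entities):
    """Stage 2: aggregate the entities' types into a type -> count mapping."""
    return Counter(t for e in entities for t in ENTITY_TYPES.get(e, []))


def predict_type(fingerprint):
    """Stage 3: placeholder prediction, here simply the most frequent type."""
    return fingerprint.most_common(1)[0][0] if fingerprint else None


def to_visualization_payload(fingerprint, predicted):
    """Stage 4: serialize the results as JSON for a visualization front end."""
    return json.dumps(
        {"fingerprint": dict(fingerprint), "predicted_type": predicted},
        sort_keys=True,
    )


def run_pipeline(raw_html):
    """Run all four stages in order on one document."""
    entities = preprocess(raw_html)
    fingerprint = compute_fingerprint(entities)
    return to_visualization_payload(fingerprint, predict_type(fingerprint))


print(run_pipeline("<p>Theresa May met the Conservative Party and the Labour Party.</p>"))
```

Keeping each stage as a separate function makes it possible to swap a stand-in (for instance, the gazetteer) for a real component without touching the other stages.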

4.1 Use Case I: Content Digestion via Semantic Distillation

The first use case of CALVADOS is content digestion via semantic distillation. The user can input the content by providing a reference URL to the document or by uploading the raw text itself. CALVADOS performs the entity-level analytics on the document via semantic fingerprinting. As a result, the system offers various visualizations in order to provide a user-friendly content consumption. The focal point here is the visualization of the semantic fingerprint depicting the associated types based on the underlying type hierarchy. This graphical metaphor allows the user to understand the document’s constituents on a semantic level. For example, a news article about Theresa May (see Note 3) comprises a combination of various political parties, administrative districts, skilled workers, etc. Further, CALVADOS also provides an “entity cloud” based on the named entities contained, and highlights those types that are conceptually similar on the entity-level. Figure 2a displays a screenshot of the previously mentioned news article in CALVADOS.
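How an entity cloud might be weighted can be sketched in a few lines of Python. The text does not state how CALVADOS sizes the entries, so the linear scaling by mention frequency below is an assumption, and the function name, the font-size range and the sample mentions are hypothetical.

```python
from collections import Counter


def entity_cloud(mentions, min_size=12, max_size=36):
    """Map each entity to a font size that grows linearly with its mention count."""
    counts = Counter(mentions)
    if not counts:
        return {}
    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    if span == 0:
        # All entities are mentioned equally often: use one common size.
        return {entity: float(max_size) for entity in counts}
    return {
        entity: min_size + (max_size - min_size) * (n - lowest) / span
        for entity, n in counts.items()
    }


# Hypothetical disambiguated mentions extracted from one document.
mentions = ["Theresa_May"] * 3 + ["House_of_Commons"] * 2 + ["Brexit"]
print(entity_cloud(mentions))
```

The resulting mapping could then be handed to any tag cloud renderer on the front end.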

4.2 Use Case II: Comparison of Documents Semantics

The second use case of CALVADOS is a semantic document comparison. To this end, we enable the end user to analyze the overlap and differences between two documents at the semantic level. This is achieved by visualizing the semantic fingerprints of both documents simultaneously as an “overlay”. Further, an entity cloud of the intersecting named entities is provided. Here, it can easily be observed that the semantic fingerprints provide more insights than the plain entity mentions. In addition, information about the most similar types associated with both documents is provided in order to disclose their overall “flavor”. For example, when comparing the previous news article about Theresa May with a Manchester City FC article, there are visible differences: the former article is aligned towards political parties, skilled workers, etc., whereas the latter is aligned towards contests, clubs, etc. (cf. Figure 2b). Finally, the quantified similarity value based on the semantic fingerprints is indicated as well.
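A comparison of two documents along these lines can be sketched as follows. The sketch assumes that fingerprints are sparse mappings from types to weights and that the similarity value is a cosine similarity; the text does not name the measure used for document comparison, so this choice is an assumption. The two sample documents and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse type -> weight mappings."""
    dot = sum(weight * b.get(type_name, 0.0) for type_name, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def compare(doc_a, doc_b):
    """Report shared entities, shared types and the fingerprint similarity."""
    fp_a, fp_b = doc_a["fingerprint"], doc_b["fingerprint"]
    return {
        "shared_entities": sorted(doc_a["entities"] & doc_b["entities"]),
        "shared_types": sorted(set(fp_a) & set(fp_b)),
        "similarity": cosine(fp_a, fp_b),
    }


# Hypothetical documents: a politics article and a football article.
politics = {
    "entities": {"Theresa_May", "United_Kingdom"},
    "fingerprint": {"political_party": 3, "skilled_worker": 2, "administrative_district": 1},
}
football = {
    "entities": {"Manchester_City_F.C.", "United_Kingdom"},
    "fingerprint": {"contest": 3, "club": 2, "administrative_district": 1},
}

print(compare(politics, football))
```

In this toy case the two documents share one entity and one type, and the low similarity value reflects their different “flavors”.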

5 Conclusion and Outlook

In this paper, we introduced CALVADOS, a tool for the semantic analysis and digestion of Web contents. CALVADOS lifts document analysis to the entity-level and utilizes semantic fingerprints in order to capture the inherent semantics of a Web content. In future work, we will pursue two directions. Since entity-level analytics is language-agnostic in itself, we aim at transferring CALVADOS to other languages (e.g., French). In addition, we will study the aspect of document similarity on the entity-level, e.g., in the context of fake news detection.

Notes

1. https://labs.tib.eu/falcon/.
2. D3.js https://d3js.org/.
3. https://www.bbc.com/news/uk-politics-47627744.

References

1. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: ISWC/ASWC, pp. 722–735 (2007)
2. Dubey, M., Banerjee, D., Chaudhuri, D., Lehmann, J.: EARL: joint entity and relation linking for question answering over knowledge graphs. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11136, pp. 108–126. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00671-6_7
3. Govind, Alec, C., Spaniol, M.: Semantic fingerprinting: a novel method for entity-level content classification. In: Mikkonen, T., Klamma, R., Hernández, J. (eds.) ICWE 2018. Lecture Notes in Computer Science. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91662-0_21
4. Govind, Alec, C., Spaniol, M.: Fine-grained web content classification via entity-level analytics: the case of semantic fingerprinting. J. Web Eng. (JWE) 17(6&7), 449–482 (2019)
5. Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: a spatially and temporally enhanced knowledge base from wikipedia. Artif. Intell. 194, 28–61 (2013)
6. Hoffart, J., et al.: Robust disambiguation of named entities in text. In: Conference on EMNLP, Edinburgh, Scotland, UK, pp. 782–792 (2011)
7. Jindal, R., Malhotra, R., Jain, A.: Techniques for text classification: literature review and current trends. Webology 12, 1–28 (2015)
8. Johnson, R., Zhang, T.: Effective use of word order for text categorization with convolutional neural networks. CoRR abs/1412.1058 (2014)
9. Medeiros, J.F., Pereira Nunes, B., Siqueira, S.W.M., Portes Paes Leme, L.A.: TagTheWeb: using wikipedia categories to automatically categorize resources on the web. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol. 11155, pp. 153–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98192-5_29
10. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: I-Semantics 2011, pp. 1–8. ACM, New York (2011)
11. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
12. Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: NAACL-HLT, pp. 1480–1489 (2016)

Acknowledgements

This work was supported by the RIN RECHERCHE Normandie Digitale research project ASTURIAS contract no. 18E01661. We thank our colleagues for inspiring discussions.

Author information

Authors and Affiliations

Department of Computer Science, Université de Caen Normandie, Campus Côte de Nacre, 14032, Caen Cedex, France
Govind, Amit Kumar, Céline Alec & Marc Spaniol

Corresponding author

Correspondence to Govind.

Editor information

Editors and Affiliations

Kansas State University, Manhattan, KS, USA
Pascal Hitzler
Vienna University of Economics and Business, Vienna, Austria
Sabrina Kirrane
Linköping University, Linköping, Sweden
Olaf Hartig
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Victor de Boer
Leibniz Information Centre for Science and Technology University Library (TIB), Hannover, Germany
Maria-Esther Vidal
University of Bonn, Bonn, Germany
Maria Maleshkova
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
Stefan Schlobach
Jönköping University, Jönköping, Sweden
Karl Hammar
F. Hoffmann-La Roche AG, Basel, Switzerland
Nelia Lasierra
Robert Bosch GmbH, Stuttgart, Germany
Steffen Stadtmüller
Aalborg University, Aalborg, Denmark
Katja Hose
IMEC, Ghent University, Ghent, Belgium
Ruben Verborgh

About this paper

Cite this paper

Govind, Kumar, A., Alec, C., Spaniol, M. (2019). CALVADOS: A Tool for the Semantic Analysis and Digestion of Web Contents. In: Hitzler, P., et al. The Semantic Web: ESWC 2019 Satellite Events. ESWC 2019. Lecture Notes in Computer Science(), vol 11762. Springer, Cham. https://doi.org/10.1007/978-3-030-32327-1_17

DOI: https://doi.org/10.1007/978-3-030-32327-1_17
Published: 10 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32326-4
Online ISBN: 978-3-030-32327-1
eBook Packages: Computer Science (R0)
