Abstract
Web users these days are confronted with an abundance of information. While this is clearly beneficial in general, there is a risk of "information overload". As a consequence, there is an increasing need for automatically filtering, classifying and/or summarizing Web contents. In order to help consumers efficiently derive the semantics from Web contents, we have developed the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system. CALVADOS raises contents to the entity level and digests their inherent semantics. In this demo, we present how entity-level analytics can be employed to automatically classify the main topic of a Web content and to reveal the semantic building blocks associated with the corresponding document.
1 Introduction
Celebrating the Web's \(30^{th}\) anniversary in 2019, we still observe a gigantic growth in Web contents being created and, at the same time, being available for consumption. This data source is a blessing and a curse at the same time. On the one hand, we benefit from a vast amount of information accessible 24/7 all over the planet. On the other hand, we might be overwhelmed by the sheer amount of data. To this end, efficient and smart approaches are required in order to help us "digest" this huge quantity of data in a "healthy manner". Our hypothesis, therefore, is that the named entities contained in a Web content carry its inherent semantics. To exploit them, we combine named entity disambiguation (e.g., AIDA [6] or DBpedia Spotlight [10]) with freely available knowledge bases (KBs) such as DBpedia [1] or YAGO [5]. As a result, we have previously introduced semantic fingerprinting [3, 4] as a general-purpose approach towards Web content classification.
In this paper, we introduce the CALVADOS (Content AnaLytics ViA Digestion Of Semantics) system as an extension of semantic fingerprinting. CALVADOS is a novel approach that aims at distilling and visualizing semantics of documents by exploiting entity-level analytics for a user-friendly “digestion”. To this end, our demonstration paper makes the following salient contributions:
- use of semantic fingerprinting to capture a content's (inherent) semantics
- visualization & exploration of (inter-)dependencies among the entities contained
- provisioning of contextual KB data (e.g., types) supporting data digestion
2 Related Work
Most approaches dealing with the analytics of Web contents focus on the automatic classification of text into predefined categories [7]. Document classification using machine learning methods has been surveyed in [11]. Recent approaches employ deep neural networks, which require vast amounts of training data; examples include exploiting the word order of the textual data [8] or a hierarchical attention network [12]. In contrast to these works, our approach exploits entity-level semantics to capture a better document representation (i.e., the semantic fingerprint [3]).
Exploiting entity-level semantics is possible thanks to recent named entity recognition and disambiguation (NERD) systems. Several tools, such as AIDA [6] or DBpedia Spotlight [10] (the latter also employed in the CALVADOS system), identify mentions of named entities in a text and map them onto their canonical entities in a knowledge base (KB), for instance YAGO [5] or DBpedia [1]. Recent works, such as EARL [2] and FALCON, aim to disambiguate entities in short texts. Furthermore, information from the linked KBs can be utilized.
To the best of our knowledge, the approach most similar to our work is TagTheWeb [9]. This tool aims at identifying the topics associated with documents. It relies on the knowledge expressed by the taxonomic structure of Wikipedia and generates a fingerprint from the semantic relations between nodes of the Wikipedia Category Graph. Compared to this work, our semantic fingerprint is based on more fine-grained categories. Besides, our tool also allows users to semantically compare two documents.
3 Conceptual Approach Overview of CALVADOS
The goal of CALVADOS is to digest the semantics of a Web content and to provide visualizations that facilitate content consumption. The backbone of the CALVADOS system is semantic fingerprinting, a fine-grained type classification approach based on the hypothesis that "you know a document by the named entities it contains". The general approach is briefly explained in this section; more details can be found in [4]. In short, the semantics of a document is captured by a semantic fingerprint, i.e., a vector that encodes the core semantics of the document based on the type information of the entities it contains. The ambiguity among named entities is handled using the aforementioned standard NERD systems. The actual fine-grained type prediction via semantic fingerprinting proceeds in two steps. First, the document's semantic fingerprint is computed. For this purpose, we perform entity-level analytics of the entities contained and, in particular, exploit the type information from the knowledge base YAGO. Second, we employ a random forest classifier to predict the document's top-level type. Once this type is identified, the system determines the most suitable fine-grained sub-type: the cosine similarity between the fingerprint of the document and the representative vector of each sub-type is computed, and the sub-type with the highest score is selected. For example, an article about some football game can be predicted as an event in the top-level prediction and further aligned to the more specific type game in the second step. CALVADOS utilizes these semantic fingerprints to semantically analyze and digest Web contents.
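To make the two-step prediction concrete, the following is a minimal sketch (not the authors' implementation) of how a semantic fingerprint could be built from the KB types of the linked entities and then mapped to a fine-grained type. The type vocabulary, classifier, and sub-type representative vectors are hypothetical placeholders; the actual feature design and trained models are those described in [3, 4].

```python
# Minimal sketch (not the authors' code) of semantic fingerprinting, assuming a
# fingerprint is a normalized vector of KB-type counts over the linked entities.
from collections import Counter
from typing import Dict, List

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical type vocabulary; the real one comes from the YAGO type hierarchy.
TYPE_VOCAB: List[str] = ["Person", "Politician", "Event", "Game", "Organization"]


def semantic_fingerprint(entity_types: List[List[str]]) -> np.ndarray:
    """Aggregate the KB types of all linked entities into one vector."""
    counts = Counter(t for types in entity_types for t in types)
    vec = np.array([counts.get(t, 0) for t in TYPE_VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def predict_fine_grained(fingerprint: np.ndarray,
                         top_level_clf: RandomForestClassifier,
                         subtype_vectors: Dict[str, Dict[str, np.ndarray]]) -> str:
    """Step 1: the random forest predicts the document's top-level type.
    Step 2: cosine similarity against the representative vectors of that
    type's sub-types selects the most similar fine-grained sub-type."""
    top_level = top_level_clf.predict(fingerprint.reshape(1, -1))[0]
    candidates = subtype_vectors[top_level]
    scores = {sub: cosine_similarity(fingerprint.reshape(1, -1),
                                     vec.reshape(1, -1))[0, 0]
              for sub, vec in candidates.items()}
    return max(scores, key=scores.get)
```

In this reading, the football example would first be classified as an event by the random forest and then aligned to the sub-type game because its fingerprint is most similar to the representative vector of that sub-type.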
4 Demonstration
The goal of CALVADOS is to help users in digesting documents via entity-level analytics. To this end, entity information is extracted and visualized. In particular, various interactive visualizations are provided:
- the semantic fingerprint of a document
- the tag cloud based on the named entities contained
- statistics about similarity with other types, showcasing the document's "flavor"
- an opportunity to compare two documents based on their inherent semantics
As such, this demo comprises two use cases of the CALVADOS system (cf. Figure 2 for a screenshot of the Web interface or https://calvados.greyc.fr/ for an online demonstrator). The first use case facilitates content digestion of an individual document (cf. Subsect. 4.1). The second use case allows users to compare the semantics of two different documents (cf. Subsect. 4.2).
The overall pipeline of CALVADOS works in four stages. In the first stage, the user inputs the document, and preprocessing (boilerplate removal, NERD via DBpedia Spotlight, etc.) is performed on the document content. The second stage then computes the semantic fingerprint of the document. Subsequently, in the third stage, the relevant fine-grained types for the document are predicted. Finally, the semantic fingerprint and the predictions generated in the previous stages are visualized to serve the overall goal of simplified content digestion based on semantic distillation. The visualizations are built with the JavaScript library Data-Driven Documents (D3.js). Figure 1 depicts an overview of the aforementioned stages.
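As an illustration of the first three stages, the sketch below shows how entity annotations could be obtained from the public DBpedia Spotlight REST endpoint and handed on to the fingerprinting and prediction steps from the earlier sketch. The endpoint URL, parameters, and response fields reflect the public Spotlight web service as we understand it and are not taken from the paper; `semantic_fingerprint` and `predict_fine_grained` refer to the previous (hypothetical) sketch.

```python
# Sketch of stages 1-3 of the pipeline (our reading, not the CALVADOS code base).
# Endpoint and response fields follow the public DBpedia Spotlight REST API as
# we understand it; adjust to the actual deployment if they differ.
from typing import Callable, Dict, List

import requests

SPOTLIGHT_URL = "https://api.dbpedia-spotlight.org/en/annotate"  # assumed public endpoint


def annotate(text: str, confidence: float = 0.5) -> List[Dict]:
    """Stage 1 (after boilerplate removal): link entity mentions to DBpedia."""
    resp = requests.get(SPOTLIGHT_URL,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    resources = resp.json().get("Resources", [])
    # Each resource carries the canonical entity URI and its KB types.
    return [{"uri": r["@URI"], "types": r.get("@types", "").split(",")}
            for r in resources]


def run_pipeline(text: str,
                 fingerprint_fn: Callable,   # e.g. semantic_fingerprint (sketch above)
                 classify_fn: Callable) -> Dict:
    """Stages 1-3; the returned dict is what stage 4 would render with D3.js."""
    entities = annotate(text)                                    # stage 1: NERD
    fingerprint = fingerprint_fn([e["types"] for e in entities])  # stage 2
    predicted_type = classify_fn(fingerprint)                     # stage 3
    return {"entities": entities,
            "fingerprint": fingerprint.tolist(),
            "predicted_type": predicted_type}
```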
4.1 Use Case I: Content Digestion via Semantic Distillation
The first use case of CALVADOS is content digestion via semantic distillation. The user can provide the content either as a reference URL to the document or by uploading the raw text itself. CALVADOS performs entity-level analytics on the document via semantic fingerprinting. As a result, the system offers various visualizations in order to provide user-friendly content consumption. The focal point here is the visualization of the semantic fingerprint, which depicts the associated types based on the underlying type hierarchy. This graphical metaphor allows the user to understand the document's constituents on a semantic level. For example, a news article about Theresa May comprises a combination of various political parties, administrative districts, skilled workers, etc. Further, CALVADOS also provides an "entity cloud" based on the named entities contained and highlights those types that are conceptually similar on the entity level. Figure 2a displays a screenshot of the previously mentioned news article in CALVADOS.
4.2 Use Case II: Comparison of Documents Semantics
The second use case of CALVADOS is semantic document comparison. To this end, we enable the end user to analyze the overlap and the differences between two documents at the semantic level. This is achieved by visualizing the semantic fingerprints of both documents simultaneously as an "overlay". Further, an entity cloud of the intersecting named entities is provided. Here, it can easily be observed that the semantic fingerprints provide more insights than the plain entity mentions. In addition, information about the most similar types associated with both documents is provided in order to disclose their overall "flavor". For example, when comparing the previous news article about Theresa May with a Manchester City FC article, there are visible differences: the former article is aligned towards political parties, skilled workers, etc., whereas the latter leans towards contests, clubs, etc. (cf. Figure 2b). Finally, the quantified similarity value based on the semantic fingerprints is indicated as well.
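One possible way to realize this comparison (a sketch under our assumptions, not the actual CALVADOS implementation) is to intersect the entity sets of both documents for the shared entity cloud and to take the cosine similarity of their fingerprints as the quantified similarity value:

```python
# Sketch of the semantic comparison in Use Case II (assumed, not the authors' code):
# entity-set overlap feeds the shared entity cloud, and the cosine similarity of
# the two fingerprint vectors yields the quantified similarity value.
from typing import Dict, Set

import numpy as np


def compare_documents(fp_a: np.ndarray, fp_b: np.ndarray,
                      entities_a: Set[str], entities_b: Set[str]) -> Dict:
    denom = np.linalg.norm(fp_a) * np.linalg.norm(fp_b)
    similarity = float(fp_a @ fp_b / denom) if denom > 0 else 0.0
    return {
        "similarity": similarity,                     # quantified fingerprint similarity
        "shared_entities": entities_a & entities_b,   # basis of the joint entity cloud
        "only_a": entities_a - entities_b,            # semantic "flavor" unique to A
        "only_b": entities_b - entities_a,            # semantic "flavor" unique to B
    }
```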
5 Conclusion and Outlook
In this paper, we introduced CALVADOS, a tool for the semantic analysis and digestion of Web contents. CALVADOS lifts document analysis to the entity level and utilizes semantic fingerprints in order to capture the inherent semantics of a Web content. In future work, we aim at two directions. Since entity-level analytics is language-agnostic in itself, we aim at transferring CALVADOS to other languages (e.g., French). In addition, we will study the aspect of document similarity on the entity level, e.g., in the context of fake news detection.
References
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.G.: DBpedia: a nucleus for a web of open data. In: ISWC/ASWC, pp. 722–735 (2007)
Dubey, M., Banerjee, D., Chaudhuri, D., Lehmann, J.: EARL: joint entity and relation linking for question answering over knowledge graphs. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11136, pp. 108–126. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00671-6_7
Govind, Alec, C., Spaniol, M.: Semantic fingerprinting: a novel method for entity-level content classification. In: Mikkonen, T., Klamma, R., Hernández, J. (eds.) ICWE 2018. Lecture Notes in Computer Science. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-91662-0_21
Govind, Alec, C., Spaniol, M.: Fine-grained web content classification via entity-level analytics: the case of semantic fingerprinting. J. Web Eng. (JWE) 17(6&7), 449–482 (2019)
Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell. 194, 28–61 (2013)
Hoffart, J., et al.: Robust disambiguation of named entities in text. In: Conference on EMNLP, Edinburgh, Scotland, UK, pp. 782–792 (2011)
Jindal, R., Malhotra, R., Jain, A.: Techniques for text classification: literature review and current trends. Webology 12, 1–28 (2015)
Johnson, R., Zhang, T.: Effective use of word order for text categorization with convolutional neural networks. CoRR abs/1412.1058 (2014)
Medeiros, J.F., Pereira Nunes, B., Siqueira, S.W.M., Portes Paes Leme, L.A.: TagTheWeb: using Wikipedia categories to automatically categorize resources on the Web. In: Gangemi, A., et al. (eds.) ESWC 2018. LNCS, vol. 11155, pp. 153–157. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-98192-5_29
Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia Spotlight: shedding light on the web of documents. In: I-Semantics 2011, pp. 1–8. ACM, New York (2011)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)
Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: NAACL-HLT, pp. 1480–1489 (2016)
Acknowledgements
This work was supported by the RIN RECHERCHE Normandie Digitale research project ASTURIAS contract no. 18E01661. We thank our colleagues for inspiring discussions.