A Topic Model for the Data Web

Röder, Michael; Kuchelev, Denis; Ngomo, Axel-Cyrille Ngonga

doi:10.1007/978-3-031-47745-4_14

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14382)

Included in the following conference series:

Iberoamerican Knowledge Graphs and Semantic Web Conference

Abstract

The usage of knowledge graphs in industry and at Web scale has increased steadily in recent years. However, the decentralized approach to data creation which underpins the popularity of knowledge graphs also comes with significant challenges. In particular, manually gaining an overview of the topics covered by existing datasets becomes a gargantuan, if not impossible, feat. Several dataset catalogs, portals and search engines offer different ways to interact with lists of available datasets. However, these interactions range from keyword searches to manually created tags, and none of these solutions offers easy access to human-interpretable categories. In addition, most of these approaches rely on metadata instead of the dataset itself. We propose to use topic modeling to fill this gap. Our implementation LODCat automatically creates human-interpretable topics and assigns them to RDF datasets. It does not need any metadata and relies solely on the provided RDF dataset. Our evaluation shows that LODCat can be used to identify the topics of hundreds of thousands of RDF datasets. Our experimental results also suggest that humans agree with the topics that LODCat assigns to RDF datasets. Our code and data are available online.

Notes

1. https://ckan.org/.
2. https://www.kaggle.com/datasets.
3. https://data.europa.eu/en.
4. https://lov.linkeddata.es/dataset/lov.
5. Chang et al. [9] show that there is a difference between these two usages of topics.
6. LODCat is open source at https://github.com/dice-group/lodcat.
7. Due to space limitations, we refer the interested reader to Blei et al. [6] and Hoffman et al. [16] for further details about LDA and the used inference algorithm.
8. The \(C_{V2}\) measure has the same definition as the \(C_V\) measure but uses the \(S^{one}_{all}\) segmentation [31]. The variant showed a better performance in our experiments.
9. https://github.com/sb1992/NETL-Automatic-Topic-Labelling-.
10. The transformation of counts into occurrences in a synthetic document is similar to the logarithmic variant of the approach described by Röder et al. [30].
11. The RDF datasets we use are available at https://figshare.com/s/af7f18a7f3307cc86bdd while the results as well as the Wikipedia-based corpus can be found at https://figshare.com/s/9c7670579c969cfeac05.
12. We use the dump of the English Wikipedia from September 1st 2021.
13. The stop word list can be found online. We will add the link after the review phase.
14. We downloaded the datasets in January 2018.
15. Note that we do not deduplicate the triples across the datasets.
16. We use the Gensim library [29] with hyper parameter optimization. https://radimrehurek.com/gensim/index.html.
17. https://climateknowledgeportal.worldbank.org/.
18. We used LimeSurvey for the questionnaire (https://www.limesurvey.org/). The questionnaire allowed users to skip questions. These skipped questions are not taken into account for the number of answers.
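
The logarithmic transformation mentioned in note 10 can be illustrated with a minimal Python sketch. The exact formula is not stated in the notes, so the mapping round(log2(c)) + 1, the function name synthetic_document, and the example counts are all assumptions made for illustration; this is not LODCat's implementation.

```python
import math
from collections import Counter


def synthetic_document(term_counts):
    """Turn raw term counts into a bag of words with log-damped frequencies.

    Assumption (not taken from the paper): a term that was counted c times
    appears round(log2(c)) + 1 times in the synthetic document.
    """
    document = []
    for term, count in sorted(term_counts.items()):
        if count < 1:
            continue  # terms that never occurred contribute nothing
        occurrences = round(math.log2(count)) + 1
        document.extend([term] * occurrences)
    return document


# Hypothetical counts of words derived from a single RDF dataset.
counts = {"temperature": 1000, "station": 8, "sensor": 1}
doc = synthetic_document(counts)
print(Counter(doc))
```

Under this assumed mapping, a word counted 1000 times ends up only 11 times in the synthetic document, so very frequent terms no longer drown out rare ones when the document is passed to a topic model.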

References

1. Java Platform Standard Ed. 8: Class Math. Website (2014). https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html. Accessed 18 May 2022
2. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets with the VoID vocabulary. W3C Note, W3C, March 2011. http://www.w3.org/TR/2011/NOTE-void-20110303/
3. Asprino, L., Presutti, V.: Observing IoD: its knowledge domains and the varying behavior of ontologies across them. IEEE Access 11, 21127–21143 (2023)
4. Beek, W., Rietveld, L., Bazoobandi, H.R., Wielemaker, J., Schlobach, S.: LOD Laundromat: a uniform way of publishing other people’s dirty data. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 213–228. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_14
5. Bhatia, S., Lau, J.H., Baldwin, T.: Automatic labelling of topics with neural embeddings. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 953–963. The COLING 2016 Organizing Committee, Osaka, Japan, December 2016
6. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
8. Brickley, D., Burgess, M., Noy, N.: Google dataset search: Building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, pp. 1365–1375. WWW 2019, Association for Computing Machinery (2019)
9. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves: How humans interpret topic models. In: Advances in Neural Information Processing Systems, vol. 22, pp. 288–296. Curran Associates, Inc. (2009)
10. Chapman, A., et al.: Dataset search: a survey. Int. J. Very Large Data Bases 29, 251–272 (2020)
11. Cyganiak, R., Reynolds, D.: The RDF data cube vocabulary. W3C Recommendation, January 2014. http://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
12. Devaraju, A., Berkovsky, S.: A hybrid recommendation approach for open research datasets. In: Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 207–211. ACM, UMAP 2018 (2018)
13. Ell, B., Vrandečić, D., Simperl, E.: Labels in the web of data. In: Aroyo, L., et al. (eds.) ISWC 2011. LNCS, vol. 7031, pp. 162–176. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25073-6_11
14. Heindorf, S., et al.: EvoLearner: learning description logics with evolutionary algorithms. In: Proceedings of the ACM Web Conference 2022, pp. 818–828 (2022)
15. Hinneburg, A., Preiss, R., Schröder, R.: TopicExplorer: exploring document collections with topic models. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 838–841. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33486-3_59
16. Hoffman, M., Bach, F., Blei, D.: Online Learning for Latent Dirichlet Allocation. In: Advances in Neural Information Processing Systems. Curran Associates (2010)
17. Ji, S., Pan, S., Cambria, E., Marttinen, P., Philip, S.Y.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 43, 494–512 (2021)
18. Kopsachilis, V., Vaitis, M.: GeoLOD: a spatial linked data catalog and recommender. Big Data Cogn. Comput. 5(2), 17 (2021)
19. Kunze, S., Auer, S.: Dataset retrieval. In: 2013 IEEE Seventh International Conference on Semantic Computing (ICSC), pp. 1–8, September 2013
20. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 1536–1545. HLT 2011, Association for Computational Linguistics, USA (2011)
21. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60. Association for Computational Linguistics (2014)
22. McCrae, J.P.: The Linked Open Data Cloud. Website, May 2021. https://www.lod-cloud.net/. Accessed 24 Aug 2021
23. Mohammadi, M.: (Semi-)automatic construction of knowledge graph metadata. In: The Semantic Web: ESWC 2022 Satellite Events, pp. 171–178 (2022)
24. Ngomo, A.C.N., et al.: LIMES-a framework for link discovery on the semantic web. J. Web Semant. 35, 413–423 (2018)
25. Patni, H.: Linkedsensordata. Website in the web archive, September 2010. https://web.archive.org/web/20190816202119/http://wiki.knoesis.org/index.php/SSW_Datasets. Accessed 11 May 2022
26. Patni, H., Henson, C., Sheth, A.: Linked sensor data. In: 2010 International Symposium on Collaborative Technologies and Systems, pp. 362–370 (2010)
27. Paulheim, H., Hertling, S.: Discoverability of SPARQL endpoints in linked open data. In: Proceedings of the ISWC 2013 Posters & Demonstrations Track, vol. 1035, pp. 245–248. CEUR-WS.org, Aachen, Germany (2013)
28. Pietriga, E., et al.: Browsing linked data catalogs with LODAtlas. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 137–153. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_9
29. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, May 2010
30. Röder, M., Ngonga Ngomo, A.C., Ermilov, I., Both, A.: Detecting similar linked datasets using topic modelling. In: ESWC (2016)
31. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the WSDM (2015)
32. Saxena, A., Tripathi, A., Talukdar, P.: Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)
33. Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: The Semantic Web - ISWC 2014 (2014)
34. Singhal, A., Kasturi, R., Srivastava, J.: DataGopher: context-based search for research datasets. In: Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration, pp. 749–756. IEEE IRI 2014 (2014)
35. Sleeman, J., Finin, T., Joshi, A.: Topic modeling for RDF graphs. In: ISWC (2015)
36. Spahiu, B., Maurino, A., Meusel, R.: Topic profiling benchmarks in the linked open data cloud: issues and lessons learned. Semant. Web 10(2), 329–348 (2019)
37. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: ESWC (2016)
38. Tzitzikas, Y., Manolis, N., Papadakos, P.: Faceted exploration of RDF/S datasets: a survey. J. Intell. Inf. Syst. 48(2), 329–364 (2017)
39. Vandenbussche, P.Y., Atemezing, G.A., Poveda-Villalón, M., Vatant, B.: Linked open vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web. Semant. Web 8(3), 437–452 (2017)

Acknowledgements

This work has been supported by the Ministry of Culture and Science of North Rhine-Westphalia (MKW NRW) within the project SAIL under grant no. NW21-059D.

Author information

Authors and Affiliations

DICE, Paderborn University, Paderborn, Germany
Michael Röder, Denis Kuchelev & Axel-Cyrille Ngonga Ngomo

Corresponding author

Correspondence to Michael Röder.

Editor information

Editors and Affiliations

Autonomous University of Tamaulipas, Ciudad Victoria, Mexico
Fernando Ortiz-Rodriguez
University of La Rioja, Madrid, Spain
Boris Villazón-Terrazas
Autonomous University of Tamaulipas, Ciudad Victoria, Mexico
Sanju Tiwari
University of Zaragoza, Zaragoza, Spain
Carlos Bobed

About this paper

Cite this paper

Röder, M., Kuchelev, D., Ngomo, AC.N. (2023). A Topic Model for the Data Web. In: Ortiz-Rodriguez, F., Villazón-Terrazas, B., Tiwari, S., Bobed, C. (eds) Knowledge Graphs and Semantic Web. KGSWC 2023. Lecture Notes in Computer Science, vol 14382. Springer, Cham. https://doi.org/10.1007/978-3-031-47745-4_14

DOI: https://doi.org/10.1007/978-3-031-47745-4_14
Published: 31 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47744-7
Online ISBN: 978-3-031-47745-4
eBook Packages: Computer Science, Computer Science (R0)
