Abstract
The use of knowledge graphs in industry and at Web scale has increased steadily in recent years. However, the decentralized approach to data creation that underpins the popularity of knowledge graphs also brings significant challenges. In particular, gaining an overview of the topics covered by existing datasets manually is a gargantuan, if not impossible, task. Several dataset catalogs, portals, and search engines offer different ways to interact with lists of available datasets. However, these interactions range from keyword searches to manually created tags, and none of these solutions offers easy access to human-interpretable categories. In addition, most of these approaches rely on metadata instead of the dataset itself. We propose to use topic modeling to fill this gap. Our implementation, LODCat, automatically creates human-interpretable topics and assigns them to RDF datasets. It does not need any metadata and relies solely on the provided RDF dataset. Our evaluation shows that LODCat can be used to identify the topics of hundreds of thousands of RDF datasets. Our experimental results also suggest that humans agree with the topics that LODCat assigns to RDF datasets. Our code and data are available online.
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
Chang et al. [9] show that there is a difference between these two usages of topics.
- 6.
LODCat is open source at https://github.com/dice-group/lodcat.
- 7.
- 8.
The \(C_{V2}\) measure has the same definition as the \(C_V\) measure but uses the \(S^{one}_{all}\) segmentation [31]. This variant performed better in our experiments.
- 9.
- 10.
The transformation of counts into occurrences in a synthetic document is similar to the logarithmic variant of the approach described by Röder et al. [30].
- 11.
The RDF datasets we use are available at https://figshare.com/s/af7f18a7f3307cc86bdd while the results as well as the Wikipedia-based corpus can be found at https://figshare.com/s/9c7670579c969cfeac05.
- 12.
We use the dump of the English Wikipedia from September 1st 2021.
- 13.
The stop word list can be found online. We will add the link after the review phase.
- 14.
We downloaded the datasets in January 2018.
- 15.
Note that we do not deduplicate the triples across the datasets.
- 16.
We use the Gensim library [29] with hyperparameter optimization (https://radimrehurek.com/gensim/index.html).
- 17.
- 18.
We used LimeSurvey for the questionnaire (https://www.limesurvey.org/). The questionnaire allowed users to skip questions. These skipped questions are not taken into account for the number of answers.
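Note 8 distinguishes \(C_V\) and \(C_{V2}\) only by their segmentation of a topic's top words. The following minimal sketch illustrates that difference, assuming the segmentation definitions from [31]: \(S^{one}_{set}\) pairs each word with the complete top-word set, while \(S^{one}_{all}\) pairs each word with all other top words. The function names `one_set` and `one_all` are hypothetical and are not taken from the LODCat code base.

```python
from typing import List, Set, Tuple

# A segment pairs a single word W' with the word set W* it is compared against.
Segment = Tuple[str, Set[str]]


def one_set(top_words: List[str]) -> List[Segment]:
    """S^{one}_{set}: every word is paired with the full top-word set (C_V)."""
    full = set(top_words)
    return [(word, set(full)) for word in top_words]


def one_all(top_words: List[str]) -> List[Segment]:
    """S^{one}_{all}: every word is paired with all *other* top words (C_V2)."""
    full = set(top_words)
    return [(word, full - {word}) for word in top_words]


if __name__ == "__main__":
    # Illustrative top words only; not taken from the paper's topics.
    words = ["graph", "rdf", "dataset"]
    for word, others in one_all(words):
        print(word, sorted(others))
```

The only difference is whether the word itself is part of the set it is compared against; the confirmation measure and aggregation steps of the coherence computation are not shown here.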
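Note 10 mentions a logarithmic transformation of term counts into occurrences in a synthetic document. The exact formula is not given in this text, so the sketch below is only an illustration under a loudly stated assumption: a term with count c appears 1 + floor(log2(c)) times. The logarithm base, the rounding, and the name `synthetic_document` are assumptions made for this example and may differ from the approach in [30] and from LODCat.

```python
import math
from typing import Dict, List


def synthetic_document(term_counts: Dict[str, int]) -> List[str]:
    """Turn raw term counts into a bag of words with log-damped frequencies.

    ASSUMPTION: occurrences = 1 + floor(log2(count)). The source only says
    the transformation is "similar to the logarithmic variant" of [30].
    """
    document: List[str] = []
    for term, count in sorted(term_counts.items()):
        if count < 1:
            continue  # terms that never occur contribute nothing
        occurrences = 1 + math.floor(math.log2(count))
        document.extend([term] * occurrences)
    return document


if __name__ == "__main__":
    # Hypothetical counts of terms extracted from one RDF dataset.
    counts = {"city": 1, "person": 8, "river": 1000}
    print(synthetic_document(counts))
```

The damping keeps very frequent terms from dominating the synthetic document while preserving the ordering of term frequencies.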
References
Java Platform Standard Ed. 8: Class Math. Website (2014). https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html. Accessed 18 May 2022
Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets with the VoID vocabulary. W3C Note, W3C, March 2011. http://www.w3.org/TR/2011/NOTE-void-20110303/
Asprino, L., Presutti, V.: Observing IoD: its knowledge domains and the varying behavior of ontologies across them. IEEE Access 11, 21127–21143 (2023)
Beek, W., Rietveld, L., Bazoobandi, H.R., Wielemaker, J., Schlobach, S.: LOD Laundromat: a uniform way of publishing other people’s dirty data. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 213–228. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_14
Bhatia, S., Lau, J.H., Baldwin, T.: Automatic labelling of topics with neural embeddings. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 953–963. The COLING 2016 Organizing Committee, Osaka, Japan, December 2016
Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Brickley, D., Burgess, M., Noy, N.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, pp. 1365–1375. WWW 2019, Association for Computing Machinery (2019)
Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves: how humans interpret topic models. In: Advances in Neural Information Processing Systems, vol. 22, pp. 288–296. Curran Associates, Inc. (2009)
Chapman, A., et al.: Dataset search: a survey. Int. J. Very Large Data Bases 29, 251–272 (2020)
Cyganiak, R., Reynolds, D.: The RDF data cube vocabulary. W3C Recommendation, January 2014. http://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/
Devaraju, A., Berkovsky, S.: A hybrid recommendation approach for open research datasets. In: Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 207–211. ACM, UMAP 2018 (2018)
Ell, B., Vrandečić, D., Simperl, E.: Labels in the web of data. In: Aroyo, L., et al. (eds.) ISWC 2011. LNCS, vol. 7031, pp. 162–176. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25073-6_11
Heindorf, S., et al.: EvoLearner: learning description logics with evolutionary algorithms. In: Proceedings of the ACM Web Conference 2022, pp. 818–828 (2022)
Hinneburg, A., Preiss, R., Schröder, R.: TopicExplorer: exploring document collections with topic models. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 838–841. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33486-3_59
Hoffman, M., Bach, F., Blei, D.: Online learning for latent Dirichlet allocation. In: Advances in Neural Information Processing Systems. Curran Associates (2010)
Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 43, 494–512 (2021)
Kopsachilis, V., Vaitis, M.: GeoLOD: a spatial linked data catalog and recommender. Big Data Cogn. Comput. 5(2), 17 (2021)
Kunze, S., Auer, S.: Dataset retrieval. In: 2013 IEEE Seventh International Conference on Semantic Computing (ICSC), pp. 1–8, September 2013
Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 1536–1545. HLT 2011, Association for Computational Linguistics, USA (2011)
Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60. Association for Computational Linguistics (2014)
McCrae, J.P.: The Linked Open Data Cloud. Website, May 2021. https://www.lod-cloud.net/. Accessed 24 Aug 2021
Mohammadi, M.: (semi-) automatic construction of knowledge graph metadata. In: The Semantic Web: ESWC 2022 Satellite Events, pp. 171–178 (2022)
Ngomo, A.C.N., et al.: LIMES-a framework for link discovery on the semantic web. J. Web Semant. 35, 413–423 (2018)
Patni, H.: Linkedsensordata. Website in the web archive, September 2010. https://web.archive.org/web/20190816202119/http://wiki.knoesis.org/index.php/SSW_Datasets. Accessed 11 May 2022
Patni, H., Henson, C., Sheth, A.: Linked sensor data. In: 2010 International Symposium on Collaborative Technologies and Systems, pp. 362–370 (2010)
Paulheim, H., Hertling, S.: Discoverability of SPARQL endpoints in linked open data. In: Proceedings of the ISWC 2013 Posters & Demonstrations Track, vol. 1035, pp. 245–248. CEUR-WS.org, Aachen, Germany (2013)
Pietriga, E., et al.: Browsing linked data catalogs with LODAtlas. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 137–153. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_9
Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, May 2010
Röder, M., Ngonga Ngomo, A.C., Ermilov, I., Both, A.: Detecting similar linked datasets using topic modelling. In: ESWC (2016)
Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the WSDM (2015)
Saxena, A., Tripathi, A., Talukdar, P.: Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)
Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: The Semantic Web - ISWC 2014 (2014)
Singhal, A., Kasturi, R., Srivastava, J.: DataGopher: context-based search for research datasets. In: Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration, pp. 749–756. IEEE IRI 2014 (2014)
Sleeman, J., Finin, T., Joshi, A.: Topic modeling for RDF graphs. In: ISWC (2015)
Spahiu, B., Maurino, A., Meusel, R.: Topic profiling benchmarks in the linked open data cloud: issues and lessons learned. Semant. Web 10(2), 329–348 (2019)
Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: ESWC (2016)
Tzitzikas, Y., Manolis, N., Papadakos, P.: Faceted exploration of RDF/S datasets: a survey. J. Intell. Inf. Syst. 48(2), 329–364 (2017)
Vandenbussche, P.Y., Atemezing, G.A., Poveda-Villalón, M., Vatant, B.: Linked open vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web. Semant. Web 8(3), 437–452 (2017)
Acknowledgements
This work has been supported by the Ministry of Culture and Science of North Rhine-Westphalia (MKW NRW) within the project SAIL under grant no. NW21-059D.
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Röder, M., Kuchelev, D., Ngomo, AC.N. (2023). A Topic Model for the Data Web. In: Ortiz-Rodriguez, F., Villazón-Terrazas, B., Tiwari, S., Bobed, C. (eds) Knowledge Graphs and Semantic Web. KGSWC 2023. Lecture Notes in Computer Science, vol 14382. Springer, Cham. https://doi.org/10.1007/978-3-031-47745-4_14
DOI: https://doi.org/10.1007/978-3-031-47745-4_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47744-7
Online ISBN: 978-3-031-47745-4
eBook Packages: Computer Science (R0)