A Topic Model for the Data Web

  • Conference paper
Knowledge Graphs and Semantic Web (KGSWC 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14382)

Abstract

The use of knowledge graphs in industry and at Web scale has increased steadily in recent years. However, the decentralized approach to data creation that underpins the popularity of knowledge graphs also brings significant challenges. In particular, manually gaining an overview of the topics covered by existing datasets becomes a gargantuan, if not impossible, feat. Several dataset catalogs, portals and search engines offer different ways to interact with lists of available datasets. However, these interactions range from keyword searches to manually created tags, and none of these solutions offers easy access to human-interpretable categories. In addition, most of these approaches rely on metadata instead of the dataset itself. We propose to use topic modeling to fill this gap. Our implementation, LODCat, automatically creates human-interpretable topics and assigns them to RDF datasets. It does not need any metadata and relies solely on the provided RDF dataset. Our evaluation shows that LODCat can be used to identify the topics of hundreds of thousands of RDF datasets. Moreover, our experimental results suggest that humans agree with the topics that LODCat assigns to RDF datasets. Our code and data are available online.

Notes

  1. https://ckan.org/.

  2. https://www.kaggle.com/datasets.

  3. https://data.europa.eu/en.

  4. https://lov.linkeddata.es/dataset/lov.

  5. Chang et al. [9] show that there is a difference between these two usages of topics.

  6. LODCat is open source at https://github.com/dice-group/lodcat.

  7. Due to space limitations, we refer the interested reader to Blei et al. [6] and Hoffman et al. [16] for further details about LDA and the used inference algorithm.

  8. The \(C_{V2}\) measure has the same definition as the \(C_V\) measure but uses the \(S^{one}_{all}\) segmentation [31]. The variant showed a better performance in our experiments.

  9. https://github.com/sb1992/NETL-Automatic-Topic-Labelling-.

  10. The transformation of counts into occurrences in a synthetic document is similar to the logarithmic variant of the approach described by Röder et al. [30].

  11. The RDF datasets we use are available at https://figshare.com/s/af7f18a7f3307cc86bdd while the results as well as the Wikipedia-based corpus can be found at https://figshare.com/s/9c7670579c969cfeac05.

  12. We use the dump of the English Wikipedia from September 1st 2021.

  13. The stop word list can be found online. We will add the link after the review phase.

  14. We downloaded the datasets in January 2018.

  15. Note that we do not deduplicate the triples across the datasets.

  16. We use the Gensim library [29] with hyperparameter optimization. https://radimrehurek.com/gensim/index.html.

  17. https://climateknowledgeportal.worldbank.org/.

  18. We used LimeSurvey for the questionnaire (https://www.limesurvey.org/). The questionnaire allowed users to skip questions. These skipped questions are not taken into account for the number of answers.
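
Note 10 mentions transforming word counts into occurrences in a synthetic document using a logarithmic variant. As a rough illustration only, the following Python sketch assumes that a word counted c times contributes 1 + floor(log2(c)) occurrences. This formula, the function name `synthetic_document`, and the sample counts are assumptions made for illustration; they are not taken from LODCat or from Röder et al. [30], whose exact definition may differ.

```python
import math
from collections import Counter


def synthetic_document(word_counts):
    """Build a bag-of-words list from raw word counts with logarithmic damping.

    Assumption: a word that occurs c times contributes 1 + floor(log2(c))
    occurrences; the formula actually used in the paper may differ.
    """
    document = []
    for word, count in sorted(word_counts.items()):
        if count < 1:
            continue  # words that never occur contribute nothing
        occurrences = 1 + math.floor(math.log2(count))
        document.extend([word] * occurrences)
    return document


# Hypothetical counts of label words extracted from one RDF dataset.
counts = {"temperature": 1000, "station": 8, "sensor": 1}
doc = synthetic_document(counts)
print(Counter(doc))
```

Under this assumed damping, a word seen 1000 times contributes 10 occurrences instead of 1000, so very frequent terms would no longer dominate the document that is handed to a topic model.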

References

  1. Java Platform Standard Ed. 8: Class Math. Website (2014). https://docs.oracle.com/javase/8/docs/api/java/lang/Math.html. Accessed 18 May 2022

  2. Alexander, K., Cyganiak, R., Hausenblas, M., Zhao, J.: Describing linked datasets with the VoID vocabulary. W3C Note, W3C, March 2011. http://www.w3.org/TR/2011/NOTE-void-20110303/

  3. Asprino, L., Presutti, V.: Observing LOD: its knowledge domains and the varying behavior of ontologies across them. IEEE Access 11, 21127–21143 (2023)

  4. Beek, W., Rietveld, L., Bazoobandi, H.R., Wielemaker, J., Schlobach, S.: LOD Laundromat: a uniform way of publishing other people’s dirty data. In: Mika, P., et al. (eds.) ISWC 2014. LNCS, vol. 8796, pp. 213–228. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-11964-9_14

  5. Bhatia, S., Lau, J.H., Baldwin, T.: Automatic labelling of topics with neural embeddings. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 953–963. The COLING 2016 Organizing Committee, Osaka, Japan, December 2016

  6. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)

  7. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)

  8. Brickley, D., Burgess, M., Noy, N.: Google dataset search: Building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, pp. 1365–1375. WWW 2019, Association for Computing Machinery (2019)

  9. Chang, J., Gerrish, S., Wang, C., Boyd-Graber, J.L., Blei, D.M.: Reading tea leaves: How humans interpret topic models. In: Advances in Neural Information Processing Systems, vol. 22, pp. 288–296. Curran Associates, Inc. (2009)

  10. Chapman, A., et al.: Dataset search: a survey. Int. J. Very Large Data Bases 29, 251–272 (2020)

  11. Cyganiak, R., Reynolds, D.: The RDF data cube vocabulary. W3C Recommendation, January 2014. http://www.w3.org/TR/2014/REC-vocab-data-cube-20140116/

  12. Devaraju, A., Berkovsky, S.: A hybrid recommendation approach for open research datasets. In: Proceedings of the 26th Conference on User Modeling, Adaptation and Personalization, pp. 207–211. ACM, UMAP 2018 (2018)

  13. Ell, B., Vrandečić, D., Simperl, E.: Labels in the web of data. In: Aroyo, L., et al. (eds.) ISWC 2011. LNCS, vol. 7031, pp. 162–176. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-25073-6_11

  14. Heindorf, S., et al.: EvoLearner: learning description logics with evolutionary algorithms. In: Proceedings of the ACM Web Conference 2022, pp. 818–828 (2022)

  15. Hinneburg, A., Preiss, R., Schröder, R.: TopicExplorer: exploring document collections with topic models. In: Flach, P.A., De Bie, T., Cristianini, N. (eds.) ECML PKDD 2012. LNCS (LNAI), vol. 7524, pp. 838–841. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33486-3_59

  16. Hoffman, M., Bach, F., Blei, D.: Online Learning for Latent Dirichlet Allocation. In: Advances in Neural Information Processing Systems. Curran Associates (2010)

  17. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: representation, acquisition, and applications. IEEE Trans. Neural Netw. Learn. Syst. 43, 494–512 (2021)

  18. Kopsachilis, V., Vaitis, M.: GeoLOD: a spatial linked data catalog and recommender. Big Data Cogn. Comput. 5(2), 17 (2021)

  19. Kunze, S., Auer, S.: Dataset retrieval. In: 2013 IEEE Seventh International Conference on Semantic Computing (ICSC), pp. 1–8, September 2013

  20. Lau, J.H., Grieser, K., Newman, D., Baldwin, T.: Automatic labelling of topic models. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, pp. 1536–1545. HLT 2011, Association for Computational Linguistics, USA (2011)

  21. Manning, C.D., Surdeanu, M., Bauer, J., Finkel, J.R., Bethard, S., McClosky, D.: The Stanford CoreNLP natural language processing toolkit. In: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pp. 55–60. Association for Computational Linguistics (2014)

  22. McCrae, J.P.: The Linked Open Data Cloud. Website, May 2021. https://www.lod-cloud.net/. Accessed 24 Aug 2021

  23. Mohammadi, M.: (semi-) automatic construction of knowledge graph metadata. In: The Semantic Web: ESWC 2022 Satellite Events, pp. 171–178 (2022)

  24. Ngomo, A.C.N., et al.: LIMES-a framework for link discovery on the semantic web. J. Web Semant. 35, 413–423 (2018)

  25. Patni, H.: Linkedsensordata. Website in the web archive, September 2010. https://web.archive.org/web/20190816202119/http://wiki.knoesis.org/index.php/SSW_Datasets. Accessed 11 May 2022

  26. Patni, H., Henson, C., Sheth, A.: Linked sensor data. In: 2010 International Symposium on Collaborative Technologies and Systems, pp. 362–370 (2010)

  27. Paulheim, H., Hertling, S.: Discoverability of SPARQL endpoints in linked open data. In: Proceedings of the ISWC 2013 Posters & Demonstrations Track, vol. 1035, pp. 245–248. CEUR-WS.org, Aachen, Germany (2013)

  28. Pietriga, E., et al.: Browsing linked data catalogs with LODAtlas. In: Vrandečić, D., et al. (eds.) ISWC 2018. LNCS, vol. 11137, pp. 137–153. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00668-6_9

  29. Řehůřek, R., Sojka, P.: Software framework for topic modelling with large corpora. In: Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pp. 45–50. ELRA, May 2010

  30. Röder, M., Ngonga Ngomo, A.C., Ermilov, I., Both, A.: Detecting similar linked datasets using topic modelling. In: ESWC (2016)

  31. Röder, M., Both, A., Hinneburg, A.: Exploring the space of topic coherence measures. In: Proceedings of the WSDM (2015)

  32. Saxena, A., Tripathi, A., Talukdar, P.: Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (2020)

  33. Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of the linked data best practices in different topical domains. In: The Semantic Web - ISWC 2014 (2014)

  34. Singhal, A., Kasturi, R., Srivastava, J.: DataGopher: context-based search for research datasets. In: Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration, pp. 749–756. IEEE IRI 2014 (2014)

  35. Sleeman, J., Finin, T., Joshi, A.: Topic modeling for RDF graphs. In: ISWC (2015)

  36. Spahiu, B., Maurino, A., Meusel, R.: Topic profiling benchmarks in the linked open data cloud: issues and lessons learned. Semant. Web 10(2), 329–348 (2019)

  37. Spahiu, B., Porrini, R., Palmonari, M., Rula, A., Maurino, A.: ABSTAT: ontology-driven linked data summaries with pattern minimalization. In: ESWC (2016)

  38. Tzitzikas, Y., Manolis, N., Papadakos, P.: Faceted exploration of RDF/S datasets: a survey. J. Intell. Inf. Syst. 48(2), 329–364 (2017)

  39. Vandenbussche, P.Y., Atemezing, G.A., Poveda-Villalón, M., Vatant, B.: Linked open vocabularies (LOV): a gateway to reusable semantic vocabularies on the Web. Semant. Web 8(3), 437–452 (2017)

Acknowledgements

This work has been supported by the Ministry of Culture and Science of North Rhine-Westphalia (MKW NRW) within the project SAIL under the grant no NW21-059D.

Author information

Corresponding author

Correspondence to Michael Röder.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Röder, M., Kuchelev, D., Ngomo, AC.N. (2023). A Topic Model for the Data Web. In: Ortiz-Rodriguez, F., Villazón-Terrazas, B., Tiwari, S., Bobed, C. (eds) Knowledge Graphs and Semantic Web. KGSWC 2023. Lecture Notes in Computer Science, vol 14382. Springer, Cham. https://doi.org/10.1007/978-3-031-47745-4_14

  • DOI: https://doi.org/10.1007/978-3-031-47745-4_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47744-7

  • Online ISBN: 978-3-031-47745-4

  • eBook Packages: Computer Science, Computer Science (R0)
