VOYAGE: A Large Collection of Vocabulary Usage in Open RDF Datasets

  • Conference paper
The Semantic Web – ISWC 2023 (ISWC 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14266)

Abstract

Shared vocabularies facilitate data integration and application interoperability on the Semantic Web. An investigation of how vocabularies are practically used in open RDF data, particularly with the increasing number of RDF datasets registered in open data portals, is expected to provide a measurement for the adoption of shared vocabularies and an indicator of the state of the Semantic Web. To support this investigation, we constructed and published VOYAGE, a large collection of vocabulary usage in open RDF datasets. We built it by collecting 68,312 RDF datasets from 517 pay-level domains via 577 open data portals, and we extracted 50,976 vocabularies used in the data. We analyzed the extracted usage data and revealed the distributions of frequency and diversity in vocabulary usage. We particularly characterized the patterns of term co-occurrence, and leveraged them to cluster vocabularies and RDF datasets as a potential application of VOYAGE. Our data is available from Zenodo at https://zenodo.org/record/7902675. Our code is available from GitHub at https://github.com/nju-websoft/VOYAGE.

Notes

  1. https://zenodo.org/record/7902675.
  2. https://doi.org/10.5281/zenodo.7902675.
  3. https://ckan.org/.
  4. http://dataportals.org/.
  5. https://getdkan.org/.
  6. https://data.wu.ac.at/portalwatch/.
  7. https://dev.socrata.com/.
  8. https://jena.apache.org/.
  9. https://github.com/jeffalstott/powerlaw.
  10. http://www.w3.org/2001/XMLSchema#.
  11. http://www.w3.org/1999/02/22-rdf-syntax-ns#.
  12. http://www.w3.org/2000/01/rdf-schema#.
  13. http://www.w3.org/2002/07/owl#.
  14. http://www.w3.org/2004/02/skos/core#.
  15. http://xmlns.com/foaf/0.1/.
  16. http://purl.org/dc/terms/.
  17. http://www.socrata.com/rdf/terms#.
  18. http://purl.org/dc/elements/1.1/.
  19. EDPs that solely consist of terms in the five language-level vocabularies (i.e., xsd, rdf, rdfs, owl, and skos) are excluded from Table 5 and Table 6.
  20. http://www.lexinfo.net/ontology/2.0/lexinfo#.
  21. http://lemon-model.net/lemon#.
  22. http://webdatacommons.org/structureddata/#results-2022-1.
  23. https://zenodo.org/record/2634588.

Acknowledgements

This work was supported by the NSFC (62072224) and the Chang Jiang Scholars Program (J2019032).

Author information

Corresponding author

Correspondence to Gong Cheng.

Ethics declarations

Resource Availability Statement:

VOYAGE is available from Zenodo at https://zenodo.org/record/7902675. It consists of the following files:

  • odps.json: for each of the 577 accessed ODPs, its name, URL, API type, API URL, and the IDs of the RDF datasets collected from it.
  • datasets.json: for each of the 72,088 crawled RDF datasets, its ID, title, description, author, license, dump file URLs, and PLDs.
  • deduplicated_datasets.json: the IDs of the 68,312 deduplicated RDF datasets and whether they are in the LOD Cloud.
  • terms.json: the 62,864 extracted classes and 842,745 extracted properties, and the IDs of the RDF datasets using each term.
  • vocabularies.json: the 50,976 extracted vocabularies, the classes and properties in each vocabulary, and the IDs of the RDF datasets using each vocabulary.
  • edps.json: the 767,976 extracted distinct EDPs and the IDs of the RDF datasets using each EDP.
  • clusters.json: the clusters of vocabularies generated by MV-ITCC and LDA.

All the experiments presented in the paper can be reproduced from the above files. Helper scripts are available from GitHub at https://github.com/nju-websoft/VOYAGE.
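As a rough illustration of how the files described above might be combined, the following Python sketch counts, for each vocabulary, how many deduplicated RDF datasets use it, and then sets aside the five language-level vocabularies named in the notes (xsd, rdf, rdfs, owl, skos). The record layout is an assumption: the keys "iri" and "datasets", the list-of-records shape, and the helper names are hypothetical and are not taken from the VOYAGE files, so they should be checked against the actual JSON before use. The sample data is invented purely so that the sketch runs on its own.

```python
import json
from collections import Counter

# Namespace IRIs of the five language-level vocabularies listed in the notes.
LANGUAGE_LEVEL = {
    "http://www.w3.org/2001/XMLSchema#",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "http://www.w3.org/2000/01/rdf-schema#",
    "http://www.w3.org/2002/07/owl#",
    "http://www.w3.org/2004/02/skos/core#",
}


def load_records(path):
    """Load a JSON file that is ASSUMED to hold a list of record objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def vocabulary_frequency(vocabularies, kept_dataset_ids):
    """Count, per vocabulary, the deduplicated datasets that use it.

    ASSUMPTION: each record looks like {"iri": str, "datasets": [id, ...]}.
    These key names are hypothetical placeholders.
    """
    kept = set(kept_dataset_ids)
    freq = Counter()
    for record in vocabularies:
        used_by = set(record["datasets"]) & kept
        if used_by:
            freq[record["iri"]] = len(used_by)
    return freq


def without_language_level(freq):
    """Drop the five language-level vocabularies from a frequency table."""
    return Counter({iri: n for iri, n in freq.items() if iri not in LANGUAGE_LEVEL})


# Invented in-memory stand-ins for vocabularies.json and
# deduplicated_datasets.json (dataset 4 plays the role of a removed duplicate).
sample_vocabularies = [
    {"iri": "http://xmlns.com/foaf/0.1/", "datasets": [1, 2, 3]},
    {"iri": "http://purl.org/dc/terms/", "datasets": [2, 4]},
    {"iri": "http://www.w3.org/2002/07/owl#", "datasets": [1, 2, 3, 4]},
]
sample_kept_ids = [1, 2, 3]

freq = vocabulary_frequency(sample_vocabularies, sample_kept_ids)
top = without_language_level(freq).most_common()
print(top)
```

With the real files, the sample lists would be replaced by calls such as load_records("vocabularies.json"), with the field accesses adjusted to whatever structure the files actually use.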

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Shi, Q., Wang, J., Pan, J.Z., Cheng, G. (2023). VOYAGE: A Large Collection of Vocabulary Usage in Open RDF Datasets. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14266. Springer, Cham. https://doi.org/10.1007/978-3-031-47243-5_12

  • DOI: https://doi.org/10.1007/978-3-031-47243-5_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-47242-8

  • Online ISBN: 978-3-031-47243-5

  • eBook Packages: Computer Science, Computer Science (R0)
