Abstract
Shared vocabularies facilitate data integration and application interoperability on the Semantic Web. Investigating how vocabularies are used in practice in open RDF data, particularly given the increasing number of RDF datasets registered in open data portals, is expected to provide a measure of the adoption of shared vocabularies and an indicator of the state of the Semantic Web. To support this investigation, we constructed and published VOYAGE, a large collection of vocabulary usage in open RDF datasets. We built it by collecting 68,312 RDF datasets from 517 pay-level domains via 577 open data portals, and we extracted 50,976 vocabularies used in the data. We analyzed the extracted usage data and revealed the distributions of frequency and diversity in vocabulary usage. In particular, we characterized the patterns of term co-occurrence and leveraged them to cluster vocabularies and RDF datasets as a potential application of VOYAGE. Our data is available from Zenodo at https://zenodo.org/record/7902675. Our code is available from GitHub at https://github.com/nju-websoft/VOYAGE.
Notes
- 10. http://www.w3.org/2001/XMLSchema#.
- 11. http://www.w3.org/1999/02/22-rdf-syntax-ns#.
- 12. http://www.w3.org/2000/01/rdf-schema#.
- 13. http://www.w3.org/2002/07/owl#.
- 14. http://www.w3.org/2004/02/skos/core#.
- 17. http://www.socrata.com/rdf/terms#.
- 20. http://www.lexinfo.net/ontology/2.0/lexinfo#.
- 21. http://lemon-model.net/lemon#.
- 22. http://webdatacommons.org/structureddata/#results-2022-1.
Acknowledgements
This work was supported by the NSFC (62072224) and the Chang Jiang Scholars Program (J2019032).
Ethics declarations
Resource Availability Statement:
VOYAGE is available from Zenodo at https://zenodo.org/record/7902675. For each of the accessed 577 ODPs, its name, URL, API type, API URL, and the IDs of RDF datasets collected from it are given in odps.json. For each of the crawled 72,088 RDF datasets, its ID, title, description, author, license, dump file URLs, and PLDs are given in datasets.json. The IDs of the deduplicated 68,312 RDF datasets and whether they are in the LOD Cloud are given in deduplicated_datasets.json. The extracted 62,864 classes, 842,745 properties, and the IDs of RDF datasets using each term are given in terms.json. The extracted 50,976 vocabularies, the classes and properties in each vocabulary, and the IDs of RDF datasets using each vocabulary are given in vocabularies.json. The extracted 767,976 distinct EDPs and the IDs of RDF datasets using each EDP are given in edps.json. The clusters of vocabularies generated by MV-ITCC and LDA are given in clusters.json. All the experiments presented in the paper can be reproduced from the above files, for which some helpful scripts are available from GitHub at https://github.com/nju-websoft/VOYAGE.
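As a rough illustration of how the files listed above might be consumed, the following Python sketch counts, for each vocabulary, the number of datasets using it, and, for each dataset, the number of vocabularies it uses. The record layout and the key names `vocabulary` and `datasets` are assumptions made for this example and are not taken from the published files; the actual structure of vocabularies.json should be checked and the two constants adjusted accordingly. The counts are one simple reading of frequency and diversity, not necessarily the measures used in the paper.

```python
import json
from collections import Counter
from pathlib import Path

# ASSUMPTION: each record looks like
#   {"vocabulary": <namespace IRI>, "datasets": [<dataset ID>, ...]}
# These key names are hypothetical; adapt them to the real vocabularies.json.
VOCAB_KEY = "vocabulary"
DATASETS_KEY = "datasets"


def load_records(path):
    """Read a JSON file that is assumed to hold a list of records."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def vocabulary_frequency(records):
    """Map each vocabulary to the number of distinct datasets using it."""
    return Counter({r[VOCAB_KEY]: len(set(r[DATASETS_KEY])) for r in records})


def dataset_diversity(records):
    """Map each dataset ID to the number of distinct vocabularies it uses."""
    vocabs_per_dataset = {}
    for r in records:
        for ds in set(r[DATASETS_KEY]):
            vocabs_per_dataset.setdefault(ds, set()).add(r[VOCAB_KEY])
    return {ds: len(vocabs) for ds, vocabs in vocabs_per_dataset.items()}


if __name__ == "__main__":
    # Toy records standing in for the real file contents.
    sample = [
        {"vocabulary": "http://www.w3.org/2000/01/rdf-schema#", "datasets": [1, 2, 2]},
        {"vocabulary": "http://www.w3.org/2004/02/skos/core#", "datasets": [2]},
    ]
    print(vocabulary_frequency(sample).most_common())
    print(dataset_diversity(sample))
```

With the real data, `load_records("vocabularies.json")` would replace the toy records, provided the file matches the assumed layout.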
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shi, Q., Wang, J., Pan, J.Z., Cheng, G. (2023). VOYAGE: A Large Collection of Vocabulary Usage in Open RDF Datasets. In: Payne, T.R., et al. The Semantic Web – ISWC 2023. ISWC 2023. Lecture Notes in Computer Science, vol 14266. Springer, Cham. https://doi.org/10.1007/978-3-031-47243-5_12
DOI: https://doi.org/10.1007/978-3-031-47243-5_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-47242-8
Online ISBN: 978-3-031-47243-5
eBook Packages: Computer Science, Computer Science (R0)