ODArchive – Creating an Archive for Structured Data from Open Data Portals

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12507)

Abstract

We present ODArchive, a large corpus of structured data collected from over 260 Open Data portals worldwide, along with curated, integrated metadata. Furthermore, we enrich the harvested datasets with heuristic annotations based on the type hierarchies of existing Knowledge Graphs. We (i) present the underlying distributed architecture that scales up regular harvesting and the monitoring of changes on these portals, (ii) make the corpus available via different APIs, and (iii) analyse the characteristics of tabular data within the corpus. Our APIs can be used to run such analyses regularly or to reproduce experiments from the literature that relied on static corpora that are not publicly available.

Notes

  1. https://ckan.org/, accessed 2020-08-17.

  2. Overall, we monitor or have historically monitored over 260 portals; however, several of those have since gone offline or are so-called “harvesting” portals that merely replicate metadata from other portals. For details, cf. [14].

  3. https://github.com/websi96/datasetarchiver.

  4. https://docs.mongodb.com/manual/sharding/#shard-keys, accessed 2020-05-22.

  5. https://kubernetes.io/, accessed 2020-05-22.

  6. To filter datasets by certain data portals, we enriched the descriptions with information collected in the Portal Watch (https://data.wu.ac.at/portalwatch/): we use arc:hasPortal to add this reference. More sophisticated federated queries could be formulated by including the Portal Watch endpoint [14], which contains additional metadata.

  7. The respective information has been extracted from the most recent DBpedia and Wikidata HDT [4] dumps available at http://www.rdfhdt.org/datasets/.

  8. While this needs further investigation, and obviously more sophisticated matching techniques (substring- or similarity-based), we note that this low percentage seems to hint that the specific textual information in OD tables is not necessarily covered by the more general, encyclopedic knowledge typical of public KGs.

  9. E.g., “Ja” and “Nein” (German for “yes” and “no”) are labels for entities in Wikidata.

  10. https://github.com/ray-project/ray, accessed 2020-08-17.

  11. http://ekzhu.com/datasketch/lshensemble.html, accessed 2020-08-17.
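
Notes 8 and 9 concern matching cell values from tables against entity labels in public KGs. The following is a minimal sketch of what plain exact label matching could look like; it is not the ODArchive implementation. The `LABELS` index, the `ex:` identifiers and the `match_column` helper are hypothetical and only serve the illustration. A real index would be built from the DBpedia and Wikidata dumps.

```python
from typing import Dict, List, Set

# Hypothetical label index: lower-cased label -> set of entity identifiers.
# These entries are made up for illustration; they are not real KG identifiers.
LABELS: Dict[str, Set[str]] = {
    "vienna": {"ex:Vienna"},
    "graz": {"ex:Graz"},
    "ja": {"ex:Ja"},
    "nein": {"ex:Nein"},
}


def match_column(values: List[str], labels: Dict[str, Set[str]]) -> float:
    """Return the fraction of non-empty cells whose normalised text equals a known label."""
    # Normalise by trimming whitespace and lower-casing; drop empty cells.
    cells = [v.strip().lower() for v in values if v and v.strip()]
    if not cells:
        return 0.0
    hits = sum(1 for c in cells if c in labels)
    return hits / len(cells)


cities = ["Vienna", "Graz", "Linz", " vienna "]
flags = ["Ja", "Nein", "Ja", ""]

print(match_column(cities, LABELS))  # 3 of the 4 cells match a label
print(match_column(flags, LABELS))   # every non-empty cell matches a label
```

The `flags` column shows the caveat from note 9: a column of yes/no values matches labels perfectly even though linking it to entities is unlikely to be meaningful, so a match ratio alone is a weak signal.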

References

  1. Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB J. 24(4), 557–581 (2015). https://doi.org/10.1007/s00778-015-0389-y

  2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52

  3. Brickley, D., Burgess, M., Noy, N.F.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019, pp. 1365–1375. ACM (2019). https://doi.org/10.1145/3308558.3313685

  4. Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary RDF representation for publication and exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web 19, 22–41 (2013). http://www.websemanticsjournal.org/index.php/ps/article/view/328

  5. Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of structured data on the web. Commun. ACM 59(2), 44–51 (2016). https://doi.org/10.1145/2844544

  6. Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75–76 (2016). https://doi.org/10.1145/2872518.2889386

  7. Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow. 3(1–2), 1338–1347 (2010). https://doi.org/10.14778/1920841.1921005

  8. Maali, F., Erickson, J.: Data Catalog Vocabulary (DCAT). W3C Recommendation, January 2014. http://www.w3.org/TR/vocab-dcat/

  9. Mitloehner, J., Neumaier, S., Umbrich, J., Polleres, A.: Characteristics of open data CSV files. In: Proceedings - 2016 2nd International Conference on Open and Big Data, OBD 2016 (2016). https://doi.org/10.1109/OBD.2016.18

  10. Nargesian, F., Zhu, E., Pu, K.Q., Miller, R.J.: Table union search on open data. Proc. VLDB Endow. 11(7), 813–825 (2018). https://doi.org/10.14778/3192965.3192973, http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf

  11. Neumaier, S.: Semantic enrichment of open data on the Web - or: how to build an open data knowledge graph. Ph.D. thesis, Technische Universität Wien, Vienna, Austria (2019). https://permalink.catalogplus.tuwien.at/AC15550378

  12. Neumaier, S., Umbrich, J.: Measures for assessing the data freshness in open data portals. In: 2nd International Conference on Open and Big Data, OBD 2016, Vienna, Austria, 22–24 August 2016, pp. 17–24. IEEE Computer Society (2016). https://doi.org/10.1109/OBD.2016.10

  13. Neumaier, S., Umbrich, J., Parreira, J.X., Polleres, A.: Multi-level semantic labelling of numerical values. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 428–445. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_26

  14. Neumaier, S., Umbrich, J., Polleres, A.: Automated quality assessment of metadata across open data portals. J. Data Inf. Qual. 8(1), 2:1–2:29 (2016). https://doi.org/10.1145/2964909

  15. Oulabi, Y., Bizer, C.: Extending cross-domain knowledge bases with long tail entities using web table data. In: Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019 (2019). https://doi.org/10.5441/002/edbt.2019.34

  16. Pollock, R., Tennison, J., Kellogg, G., Herman, I.: Metadata Vocabulary for Tabular Data. W3C Recommendation, December 2015. https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/

  17. Sarma, A.D., et al.: Finding related tables. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 817–828. ACM (2012). https://doi.org/10.1145/2213836.2213962

  18. Umbrich, J., Mrzelj, N., Polleres, A.: Towards capturing and preserving changes on the Web of data. In: CEUR Workshop Proceedings (2015). https://pdfs.semanticscholar.org/971b/178200a0bc14735116ace49a0b164e68a926.pdf

  19. Vrandecic, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014). https://doi.org/10.1145/2629489

  20. Weik, M.H.: Nyquist Theorem, p. 1127. Springer, Boston (2001). https://doi.org/10.1007/1-4020-0613-6_12654

  21. Zhang, S., Balog, K.: Web table extraction, retrieval, and augmentation: a survey. ACM Trans. Intell. Syst. Technol. 11(2), 13:1–13:35 (2020). https://doi.org/10.1145/3372117

  22. Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+. Semantic Web 8(6), 921–957 (2017). https://doi.org/10.3233/SW-160242

Author information

Corresponding author

Correspondence to Thomas Weber.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Weber, T., Mitlöhner, J., Neumaier, S., Polleres, A. (2020). ODArchive – Creating an Archive for Structured Data from Open Data Portals. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_20

  • DOI: https://doi.org/10.1007/978-3-030-62466-8_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-62465-1

  • Online ISBN: 978-3-030-62466-8

  • eBook Packages: Computer Science, Computer Science (R0)
