ODArchive – Creating an Archive for Structured Data from Open Data Portals

Weber, Thomas; Mitöhner, Johann; Neumaier, Sebastian; Polleres, Axel

doi:10.1007/978-3-030-62466-8_20

Conference paper
First Online: 01 November 2020

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12507)

Abstract

We present ODArchive, a large corpus of structured data collected from over 260 Open Data portals worldwide, along with curated, integrated metadata. Furthermore, we enrich the harvested datasets with heuristic annotations using the type hierarchies in existing Knowledge Graphs. We (i) present the underlying distributed architecture that scales up regular harvesting and the monitoring of changes on these portals, (ii) make the corpus available via different APIs, and (iii) analyse the characteristics of tabular data within the corpus. Our APIs can be used to run such analyses regularly or to reproduce experiments from the literature that relied on static corpora that are not publicly available.

Notes

1. https://ckan.org/, accessed 2020-08-17.
2. Overall, we monitor or have monitored over 260 portals; however, several of those have since gone offline or are so-called “harvesting” portals that merely replicate metadata from other portals. For details, cf. [14].
3. https://github.com/websi96/datasetarchiver.
4. https://docs.mongodb.com/manual/sharding/#shard-keys, accessed 2020-05-22.
5. https://kubernetes.io/, accessed 2020-05-22.
6. To filter datasets by data portal, we enriched the descriptions with information collected in the Portal Watch (https://data.wu.ac.at/portalwatch/): we use arc:hasPortal to add this reference. More sophisticated federated queries could be formulated by including the Portal Watch endpoint [14], which contains additional metadata.
7. The respective information was extracted from the most recent DBpedia and Wikidata HDT [4] dumps available at http://www.rdfhdt.org/datasets/.
8. While this needs further investigation, and obviously more sophisticated matching techniques (substring- or similarity-based), we note that this low percentage seems to hint that the specific textual information in OD tables is not necessarily covered by the more general, encyclopedic knowledge typical of public KGs.
9. E.g., “Ja” and “Nein” (German for “yes” and “no”) are labels for entities in Wikidata.
10. https://github.com/ray-project/ray, accessed 2020-08-17.
11. http://ekzhu.com/datasketch/lshensemble.html, accessed 2020-08-17.

References

Abedjan, Z., Golab, L., Naumann, F.: Profiling relational data: a survey. VLDB J. 24(4), 557–581 (2015). https://doi.org/10.1007/s00778-015-0389-y
Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). https://doi.org/10.1007/978-3-540-76298-0_52
Brickley, D., Burgess, M., Noy, N.F.: Google dataset search: building a search engine for datasets in an open web ecosystem. In: The World Wide Web Conference, WWW 2019, San Francisco, CA, USA, 13–17 May 2019, pp. 1365–1375. ACM (2019). https://doi.org/10.1145/3308558.3313685
Fernández, J.D., Martínez-Prieto, M.A., Gutiérrez, C., Polleres, A., Arias, M.: Binary RDF representation for publication and exchange (HDT). Web Semantics: Science, Services and Agents on the World Wide Web 19, 22–41 (2013). http://www.websemanticsjournal.org/index.php/ps/article/view/328
Guha, R.V., Brickley, D., Macbeth, S.: Schema.org: evolution of structured data on the web. Commun. ACM 59(2), 44–51 (2016). https://doi.org/10.1145/2844544
Lehmberg, O., Ritze, D., Meusel, R., Bizer, C.: A large public corpus of web tables containing time and context metadata. In: Proceedings of the 25th International Conference Companion on World Wide Web, pp. 75–76 (2016). https://doi.org/10.1145/2872518.2889386
Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endow. 3(1–2), 1338–1347 (2010). https://doi.org/10.14778/1920841.1921005
Maali, F., Erickson, J.: Data Catalog Vocabulary (DCAT). W3C Recommendation, January 2014. http://www.w3.org/TR/vocab-dcat/
Mitloehner, J., Neumaier, S., Umbrich, J., Polleres, A.: Characteristics of open data CSV files. In: Proceedings - 2016 2nd International Conference on Open and Big Data, OBD 2016 (2016). https://doi.org/10.1109/OBD.2016.18
Nargesian, F., Zhu, E., Pu, K.Q., Miller, R.J.: Table union search on open data. Proc. VLDB Endow. 11(7), 813–825 (2018). https://doi.org/10.14778/3192965.3192973, http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf
Neumaier, S.: Semantic enrichment of open data on the Web - or: how to build an open data knowledge graph. Ph.D. thesis, Technische Universität Wien, Vienna, Austria (2019). https://permalink.catalogplus.tuwien.at/AC15550378
Neumaier, S., Umbrich, J.: Measures for assessing the data freshness in open data portals. In: 2nd International Conference on Open and Big Data, OBD 2016, Vienna, Austria, 22–24 August 2016, pp. 17–24. IEEE Computer Society (2016). https://doi.org/10.1109/OBD.2016.10
Neumaier, S., Umbrich, J., Parreira, J.X., Polleres, A.: Multi-level semantic labelling of numerical values. In: Groth, P., et al. (eds.) ISWC 2016. LNCS, vol. 9981, pp. 428–445. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46523-4_26
Neumaier, S., Umbrich, J., Polleres, A.: Automated quality assessment of metadata across open data portals. J. Data Inf. Qual. 8(1), 2:1–2:29 (2016). https://doi.org/10.1145/2964909
Oulabi, Y., Bizer, C.: Extending cross-domain knowledge bases with long tail entities using web table data. In: Advances in Database Technology - 22nd International Conference on Extending Database Technology, EDBT 2019 (2019). https://doi.org/10.5441/002/edbt.2019.34
Pollock, R., Tennison, J., Kellogg, G., Herman, I.: Metadata Vocabulary for Tabular Data. W3C Recommendation, December 2015. https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/
Sarma, A.D., et al.: Finding related tables. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, 20–24 May 2012, pp. 817–828. ACM (2012). https://doi.org/10.1145/2213836.2213962
Umbrich, J., Mrzelj, N., Polleres, A.: Towards capturing and preserving changes on the Web of data. In: CEUR Workshop Proceedings (2015). https://pdfs.semanticscholar.org/971b/178200a0bc14735116ace49a0b164e68a926.pdf
Vrandecic, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014). https://doi.org/10.1145/2629489
Weik, M.H.: Nyquist Theorem, p. 1127. Springer, Boston (2001). https://doi.org/10.1007/1-4020-0613-6_12654
Zhang, S., Balog, K.: Web table extraction, retrieval, and augmentation: a survey. ACM Trans. Intell. Syst. Technol. 11(2), 13:1–13:35 (2020). https://doi.org/10.1145/3372117
Zhang, Z.: Effective and efficient semantic table interpretation using TableMiner+. Semantic Web 8(6), 921–957 (2017). https://doi.org/10.3233/SW-160242

Author information

Authors and Affiliations

Vienna University of Economics and Business, Vienna, Austria
Thomas Weber, Johann Mitöhner, Sebastian Neumaier & Axel Polleres
Complexity Science Hub Vienna, Vienna, Austria
Axel Polleres

Corresponding author

Correspondence to Thomas Weber.

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, UK
Jeff Z. Pan
University of Liverpool, Liverpool, UK
Valentina Tamma
University of Bari, Bari, Italy
Claudia d’Amato
University of California, Santa Barbara, Santa Barbara, CA, USA
Krzysztof Janowicz
California State University, Long Beach, Long Beach, CA, USA
Bo Fu
Vienna University of Economics and Business, Vienna, Austria
Axel Polleres
Rensselaer Polytechnic Institute, Troy, NY, USA
Oshani Seneviratne
Massachusetts Institute of Technology, Cambridge, MA, USA
Lalana Kagal

About this paper

Cite this paper

Weber, T., Mitöhner, J., Neumaier, S., Polleres, A. (2020). ODArchive – Creating an Archive for Structured Data from Open Data Portals. In: Pan, J.Z., et al. (eds.) The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_20

DOI: https://doi.org/10.1007/978-3-030-62466-8_20
Published: 01 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62465-1
Online ISBN: 978-3-030-62466-8
eBook Packages: Computer Science, Computer Science (R0)
