Abstract
In this lecture we introduce and discuss the challenges of integrating openly available Web data and how to address them. First, while we approach this topic from the viewpoint of Semantic Web research, not all data is readily available as RDF or Linked Data, so we give an introduction to the different data formats prevalent on the Web, namely standard formats for publishing and exchanging tabular, tree-shaped, and graph data. Second, not all Open Data is really completely open, so we discuss issues around licences and terms of usage associated with Open Data, as well as the documentation of data provenance. Third, we discuss (meta-)data quality issues associated with Open Data on the Web and how Semantic Web techniques and vocabularies can be used to describe and remedy them. Fourth, we address issues of searchability and integration of Open Data and discuss to what extent semantic search can help to overcome these. We close by briefly summarizing further issues not covered explicitly herein, such as multi-linguality, temporal aspects (archiving, evolution, temporal querying), and how or whether OWL and RDFS reasoning on top of integrated Open Data could help.
Notes
1. https://www.w3.org/2001/sw/, last accessed 30/03/2017.
2. https://www.w3.org/2013/data/, last accessed 30/03/2017.
3. http://wiki.dbpedia.org/about/facts-figures, last accessed 30/03/2017.
4. http://www.rdfhdt.org/datasets/, last accessed 30/03/2017.
5. Executing the SPARQL query SELECT (count(*) as ?C) WHERE {?S ?P ?O } on https://query.wikidata.org/ gives 1.7B triples, last accessed 30/03/2017.
6. http://wiki.openstreetmap.org/wiki/Planet.osm, last accessed 30/03/2017.
7. That is, within your published RDF graph, use HTTP URIs pointing to other dereferenceable documents that possibly contain further RDF graphs.
8. http://commoncrawl.org/, last accessed 30/03/2017.
9. https://ckan.org/, last accessed 30/03/2017.
10. https://socrata.com/, last accessed 30/03/2017.
11. https://data.humdata.org/, last accessed 27/03/2017.
12. http://opendefinition.org/ofd/, last accessed 30/03/2017.
13. The numbers for the RDF serializations JSON-LD (8 resources) and TTL (55) are vanishingly small.
14. DCAT is a vocabulary commonly used for describing general metadata about datasets. See Sect. 5.2 for mapping and homogenization of metadata descriptions using standard vocabularies.
15. For instance, see Converter Tools on https://project-open-data.cio.gov/, last accessed 24/03/2017.
16. https://developers.google.com/kml/documentation/, last accessed 24/03/2017.
17. http://www.opengeospatial.org/standards/gml, last accessed 24/03/2017.
18. http://www.opengeospatial.org/standards/wfs, last accessed 24/03/2017.
19. https://www.w3.org/TR/xquery-30/, last accessed 24/03/2017.
20. https://www.w3.org/TR/xpath-30/, last accessed 24/03/2017.
21. https://www.w3.org/TR/xslt-30/, last accessed 24/03/2017.
22. https://www.w3.org/XML/Schema, last accessed 24/03/2017.
23. https://www.w3.org/community/rax/, last accessed 24/03/2017.
24. http://lod-cloud.net/state/state_2014/#toc10, last accessed 01/05/2017.
25. http://extensions.ckan.org/extension/harvest/, last accessed 24/03/2017.
26. http://docs.ckan.org/projects/ckanext-spatial/en/latest/, last accessed 24/03/2017.
27. https://joinup.ec.europa.eu/asset/dcat_application_profile/description, last accessed 24/03/2017.
28. Google Research Blog entry, https://research.googleblog.com/2017/01/facilitating-discovery-of-public.html, last accessed 27/01/2017.
29. https://www.w3.org/TR/vocab-dqv/, last accessed 24/03/2017.
30. http://data.wu.ac.at/csvengine, last accessed 24/03/2017.
31. https://www.elastic.co/products/elasticsearch, last accessed 24/03/2017.
32. https://www.opendatanetwork.com, last accessed 24/03/2017.
33. https://dev.socrata.com, last accessed 24/03/2017.
Acknowledgements
The work presented in this paper has been supported by the Austrian Research Promotion Agency (FFG) under the projects ADEQUATe (grant no. 849982) and DALICC (grant no. 855396).
Copyright information
© 2017 Springer International Publishing AG
About this chapter
Cite this chapter
Neumaier, S., Polleres, A., Steyskal, S., Umbrich, J. (2017). Data Integration for Open Data on the Web. In: Ianni, G., et al. Reasoning Web. Semantic Interoperability on the Web. Reasoning Web 2017. Lecture Notes in Computer Science, vol 10370. Springer, Cham. https://doi.org/10.1007/978-3-319-61033-7_1
DOI: https://doi.org/10.1007/978-3-319-61033-7_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-61032-0
Online ISBN: 978-3-319-61033-7
eBook Packages: Computer Science (R0)