A Detailed Analysis of the Quality of Stream-Based Schema Construction on Linked Open Data

  • Conference paper
Semantic Web and Web Science

Part of the book series: Springer Proceedings in Complexity (SPCOM)

Abstract

The continuously increasing volume of linked open data (LOD) poses a challenge for processing this data. The output of an RDF graph traversal (e.g. an LOD crawl) can be used as a linearisation of the data and thus serve as the basis for a stream-based processing approach. SchemEX (Konrath et al., J. Web Semantics 2012, to appear) utilises such an approach to efficiently compute a schema-based index structure for looking up relevant data sources. In this paper we conduct a detailed analysis of the impact of the stream-based approach on the accuracy of the computed schema. We investigate the impact of parameter choices as well as the impact of the analysed data set under several application-motivated metrics. It can be observed that all three factors have an influence on the quality of the schema. In particular, we found that excessive use of blank nodes has a negative impact when using SchemEX to answer complex queries in the deviations. However, stream-based schema approximation is quite accurate. The deviation in the schema elements is at most 10%; the information encoded in the schema deviates by even less than 4%.

Notes

  1. Available from: http://km.aifb.kit.edu/projects/btc-2011/.

  2. Essentially, the rightmost values of the curves correspond to the metric values displayed in the plots above, i.e. to the situation after the complete 20 million triples of the data segment have been processed.

References

  1. Böhm, C., Freitag, M., Heise, A., Lehmann, C., Mascher, A., Naumann, F., Ercegovac, V., Hernandez, M., Haase, P., Schmidt, M.: Govwild: integrating open government data for transparency. In: Proceedings of the 21st International Conference Companion on World Wide Web, pp. 321–324. WWW ’12 Companion. ACM, New York, NY (2012)

  2. Böhm, C., Lorey, J., Naumann, F.: Creating voiD descriptions for web-scale data. Web Semant. Sci. Serv. Agents World Wide Web 9(3), 339–345 (2011)

  3. Gallego, M., Fernández, J., Martínez-Prieto, M., de la Fuente, P.: RDF visualization using a three-dimensional adjacency matrix. In: SemSearch’11: Proceedings of 4th International Semantic Search Workshop (2011)

  4. Goldman, R., Widom, J.: Dataguides: Enabling query formulation and optimization in semistructured databases. In: Jarke, M., Carey, M.J., Dittrich, K.R., Lochovsky, F.H., Loucopoulos, P., Jeusfeld, M.A. (eds.) VLDB’97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25–29, 1997, Athens, Greece. pp. 436–445. Morgan Kaufmann, San Francisco (1997)

  5. Gottron, T., Knauf, M., Scheglmann, S., Scherp, A.: Explicit and implicit schema information on the linked open data cloud: Joined forces or antagonists? Tech. Rep. 06/2012, Institut WeST, Universität Koblenz-Landau (2012)

  6. Gottron, T., Scherp, A., Krayer, B., Peters, A.: Get the Google feeling: Supporting users in finding relevant sources of linked open data at web-scale. In: Semantic Web Challenge, Submission to the Billion Triple Track (2012)

  7. Hausenblas, M., Halb, W., Raimond, Y., Heath, T.: What is the size of the semantic web? In: Proceedings of the International Conference on Semantic Systems (2008)

  8. Heath, T., Bizer, C.: Linked Data: Evolving the Web Into a Global Data Space. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool (2011)

  9. Isele, R., Harth, A., Umbrich, J., Bizer, C.: LDspider: An open-source crawling framework for the web of linked data. In: Poster, International Semantic Web Conference 2010. Shanghai, China (2010)

  10. Konrath, M., Gottron, T., Scherp, A.: SchemEX – web-scale indexed schema extraction of linked open data. In: Semantic Web Challenge, Submission to the Billion Triple Track (2011)

  11. Konrath, M., Gottron, T., Staab, S., Scherp, A.: SchemEX – efficient construction of a data catalogue by stream-based indexing of linked data. Web Semant. Sci. Serv. Agents World Wide Web 16(5), 52–58 (2012). The Semantic Web Challenge, 2011

  12. Maduko, A., Anyanwu, K., Sheth, A., Schliekelman, P.: Graph summaries for subgraph frequency estimation. In: Proceedings of the 5th European Semantic Web Conference on The Semantic Web: Research and Applications, pp. 508–523, ESWC’08. Springer, Berlin, Heidelberg (2008)

  13. Nestorov, S., Abiteboul, S., Motwani, R.: Extracting schema from semistructured data. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 295–306, SIGMOD ’98. ACM, New York, NY (1998)

  14. Nestorov, S., Ullman, J.D., Wiener, J.L., Chawathe, S.S.: Representative objects: Concise representations of semistructured, hierarchical data. In: Proceedings of the Thirteenth International Conference on Data Engineering, pp. 79–90, ICDE ’97. IEEE Computer Society, Washington, DC (1997)

  15. Papakonstantinou, Y., Garcia-Molina, H., Widom, J.: Object exchange across heterogeneous information sources. In: Proceedings of the Eleventh International Conference on Data Engineering, pp. 251–260, ICDE ’95. IEEE Computer Society, Washington, DC (1995)

  16. Wang, Q.Y., Yu, J.X., Wong, K.F.: Approximate graph schema extraction for semi-structured data. In: Proceedings of the 7th International Conference on Extending Database Technology: Advances in Database Technology, pp. 302–316, EDBT ’00. Springer, London (2000)

  17. Yan, X., Han, J.: gspan: Graph-based substructure pattern mining. In: Proceedings of the 2002 IEEE International Conference on Data Mining, p. 721, ICDM ’02. IEEE Computer Society, Washington, DC (2002)

  18. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 335–346, SIGMOD ’04. ACM, New York, NY (2004)

Acknowledgements

The research leading to these results has received partial funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 257859, ROBUST and grant agreement no. 287975, SocialSensor.

Author information

Corresponding author

Correspondence to Thomas Gottron.

Copyright information

© 2013 Springer Science+Business Media New York

About this paper

Cite this paper

Gottron, T., Pickhardt, R. (2013). A Detailed Analysis of the Quality of Stream-Based Schema Construction on Linked Open Data. In: Li, J., Qi, G., Zhao, D., Nejdl, W., Zheng, HT. (eds) Semantic Web and Web Science. Springer Proceedings in Complexity. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6880-6_8

  • DOI: https://doi.org/10.1007/978-1-4614-6880-6_8

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-6879-0

  • Online ISBN: 978-1-4614-6880-6

  • eBook Packages: Computer Science, Computer Science (R0)
