Abstract
Linked Data (LD) overlays the World Wide Web of documents with a Web of Data. Its growing significance is evident in the number of LD repositories available as part of the Linked Open Data (LOD) cloud. At the instance level, LD sources use a combination of terms from various vocabularies, expressed in RDFS/OWL, to describe data and publish it to the Web. However, LD sources do not organise data to conform to a specific structure analogous to a relational schema; instead, data can adhere to multiple vocabularies. Expressing SPARQL queries over LD sources, usually through a SPARQL endpoint presented to the user, requires knowledge of the predicates used, so that user requirements can be expressed as graph patterns. Although LD lowers the barriers to data publication by using a single language (i.e., RDF), sources organise data with different structures and terminologies. This paper describes an approach to automatically derive structural summaries over instance-level data expressed as RDF triples. The technique builds on a hierarchical clustering algorithm that organises RDF instance-level data into groups, which are then used to infer a structural summary over an LD source. The resulting structural summaries are expressed in the form of classes, properties, and relationships. Our experimental evaluation shows good results when applied to different types of LD sources.
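To make the idea of clustering instance-level triples into a structural summary more concrete, the following is a minimal sketch and not the algorithm evaluated in the chapter. It assumes that each subject is characterised only by the set of predicates it uses, that Jaccard distance compares those sets, and that average-linkage agglomerative clustering stops at a fixed threshold. The toy triples, the function names (predicate_sets, cluster, summarise) and the threshold value of 0.7 are all hypothetical choices made for illustration.

```python
from itertools import combinations

# Toy RDF triples (hypothetical data, not taken from the chapter).
TRIPLES = [
    ("ex:alice", "foaf:name", '"Alice"'),
    ("ex:alice", "foaf:mbox", '"alice@example.org"'),
    ("ex:bob", "foaf:name", '"Bob"'),
    ("ex:bob", "foaf:mbox", '"bob@example.org"'),
    ("ex:bob", "foaf:knows", "ex:alice"),
    ("ex:paper1", "dc:title", '"Clustering RDF"'),
    ("ex:paper1", "dc:creator", "ex:alice"),
    ("ex:paper2", "dc:title", '"Linked Data"'),
    ("ex:paper2", "dc:creator", "ex:bob"),
    ("ex:paper2", "dc:date", '"2015"'),
]


def predicate_sets(triples):
    """Map each subject to the set of predicates it appears with."""
    sets = {}
    for subject, predicate, _ in triples:
        sets.setdefault(subject, set()).add(predicate)
    return sets


def jaccard_distance(a, b):
    """Distance in [0, 1]; 0 means identical predicate sets."""
    return 1.0 - len(a & b) / len(a | b)


def cluster(features, threshold=0.7):
    """Average-linkage agglomerative clustering (assumed stopping rule:
    stop once the closest pair of clusters is farther apart than threshold)."""
    clusters = [[s] for s in sorted(features)]

    def linkage(c1, c2):
        total = sum(jaccard_distance(features[x], features[y])
                    for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


def summarise(triples, clusters):
    """Treat each cluster as a class; predicates whose object is another
    clustered subject become relationships, the rest become properties."""
    member_of = {s: idx for idx, c in enumerate(clusters) for s in c}
    classes = [{"members": c, "properties": set()} for c in clusters]
    relationships = set()
    for subject, predicate, obj in triples:
        source = member_of[subject]
        if obj in member_of:
            relationships.add((source, predicate, member_of[obj]))
        else:
            classes[source]["properties"].add(predicate)
    return classes, relationships


if __name__ == "__main__":
    groups = cluster(predicate_sets(TRIPLES))
    inferred_classes, inferred_relationships = summarise(TRIPLES, groups)
    for index, cls in enumerate(inferred_classes):
        print(index, cls["members"], sorted(cls["properties"]))
    print(sorted(inferred_relationships))
```

On this toy input the two person-like subjects and the two paper-like subjects end up in separate groups, and dc:creator is reported as a relationship between the two groups. A real LD source would need a richer notion of similarity and a principled way to choose where to cut the hierarchy.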
Notes
- 1.
- 2. See Vocabulary of Interlinked Datasets: http://www.w3.org/TR/void/.
- 3. For more statistics, see http://www4.wiwiss.fu-berlin.de/lodcloud/state/.
- 4. Is the opposite of a similarity.
- 5. Observing the vocabularies listed by the Linked Open Vocabularies (LOV) project: http://lov.okfn.org/dataset/lov/.
- 6.
Acknowledgement
Klitos Christodoulou has been supported by funding from the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge.
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Christodoulou, K., Paton, N.W., Fernandes, A.A.A. (2015). Structure Inference for Linked Data Sources Using Clustering. In: Hameurlain, A., Küng, J., Wagner, R., Bianchini, D., De Antonellis, V., De Virgilio, R. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XIX. Lecture Notes in Computer Science, vol 8990. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46562-2_1
DOI: https://doi.org/10.1007/978-3-662-46562-2_1
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46561-5
Online ISBN: 978-3-662-46562-2
eBook Packages: Computer Science, Computer Science (R0)