Comparing data summaries for processing live queries over Linked Data

Umbrich, Jürgen; Hose, Katja; Karnstedt, Marcel; Harth, Andreas; Polleres, Axel

doi:10.1007/s11280-010-0107-z

Published: 07 January 2011

World Wide Web, volume 14, pages 495–544 (2011)

Abstract

A growing amount of Linked Data—graph-structured data accessible at sources distributed across the Web—enables advanced data integration and decision-making applications. Typical systems operating on Linked Data collect (crawl) and pre-process (index) large amounts of data, and evaluate queries against a centralised repository. Given that crawling and indexing are time-consuming operations, the data in the centralised index may be out of date at query execution time. An ideal query answering system for querying Linked Data live should return current answers in a reasonable amount of time, even on corpora as large as the Web. In such a live query system source selection—determining which sources contribute answers to a query—is a crucial step. In this article we propose to use lightweight data summaries for determining relevant sources during query evaluation. We compare several data structures and hash functions with respect to their suitability for building such summaries, stressing benefits for queries that contain joins and require ranking of results and sources. We elaborate on join variants, join ordering and ranking. We analyse the different approaches theoretically and provide results of an extensive experimental evaluation.

Author information

Authors and Affiliations

Digital Enterprise Research Institute, National University of Ireland, Galway, Ireland
Jürgen Umbrich, Marcel Karnstedt & Axel Polleres
Max-Planck-Institut für Informatik, Saarbrücken, Germany
Katja Hose
Institute AIFB, Karlsruhe Institute of Technology, Karlsruhe, Germany
Andreas Harth

Corresponding author

Correspondence to Andreas Harth.

About this article

Cite this article

Umbrich, J., Hose, K., Karnstedt, M. et al. Comparing data summaries for processing live queries over Linked Data. World Wide Web 14, 495–544 (2011). https://doi.org/10.1007/s11280-010-0107-z

Received: 15 May 2010
Revised: 30 October 2010
Accepted: 21 December 2010
Published: 07 January 2011
Issue Date: October 2011
DOI: https://doi.org/10.1007/s11280-010-0107-z
