Abstract
A growing amount of Linked Data (graph-structured data accessible at sources distributed across the Web) enables advanced data integration and decision-making applications. Typical systems operating on Linked Data collect (crawl) and pre-process (index) large amounts of data, and evaluate queries against a centralised repository. Because crawling and indexing are time-consuming operations, the data in the centralised index may be out of date at query execution time. An ideal system for querying Linked Data live should return current answers in a reasonable amount of time, even on corpora as large as the Web. In such a live query system, source selection (determining which sources contribute answers to a query) is a crucial step. In this article, we propose to use lightweight data summaries for determining relevant sources during query evaluation. We compare several data structures and hash functions with respect to their suitability for building such summaries, stressing benefits for queries that contain joins and require ranking of results and sources. We elaborate on join variants, join ordering and ranking. We analyse the different approaches theoretically and provide results of an extensive experimental evaluation.
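To make the idea of summary-based source selection concrete, the following is a minimal sketch and not the data structure evaluated in the article. It assumes that each triple is hashed into a fixed grid of buckets and that each bucket records which sources contributed to it. The class name `SourceSummary`, the bucket count, the CRC32-based hash, and the example URIs are all hypothetical choices made for illustration.

```python
import zlib
from collections import defaultdict

BUCKETS = 16  # buckets per dimension; an arbitrary value chosen for this sketch


def bucket(term):
    """Map an RDF term (as a string) to a bucket index, deterministically."""
    return zlib.crc32(term.encode("utf-8")) % BUCKETS


class SourceSummary:
    """Hypothetical hash-based summary: (s, p, o) bucket cell -> set of sources."""

    def __init__(self):
        self.cells = defaultdict(set)

    def add(self, source, s, p, o):
        """Record that `source` contains the triple (s, p, o)."""
        self.cells[(bucket(s), bucket(p), bucket(o))].add(source)

    def candidates(self, s=None, p=None, o=None):
        """Return sources that may match a triple pattern; None marks a variable.

        Hash collisions can yield false positives, but a source that really
        contains a matching triple is never missed.
        """
        key = tuple(None if t is None else bucket(t) for t in (s, p, o))
        found = set()
        for cell, sources in self.cells.items():
            if all(k is None or k == c for k, c in zip(key, cell)):
                found |= sources
        return found


summary = SourceSummary()
summary.add("http://a.example/", "ex:alice", "foaf:knows", "ex:bob")
summary.add("http://b.example/", "ex:bob", "foaf:name", "Bob")

# Sources worth fetching for the pattern (?x foaf:knows ?y)
print(summary.candidates(p="foaf:knows"))
```

Only the sources returned by `candidates` would need to be fetched at query time. Because the summary is approximate, the result is a superset of the truly relevant sources, which trades some unnecessary lookups for a much smaller index.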
Cite this article
Umbrich, J., Hose, K., Karnstedt, M. et al. Comparing data summaries for processing live queries over Linked Data. World Wide Web 14, 495–544 (2011). https://doi.org/10.1007/s11280-010-0107-z
DOI: https://doi.org/10.1007/s11280-010-0107-z