Comparing data summaries for processing live queries over Linked Data

World Wide Web

Abstract

A growing amount of Linked Data—graph-structured data accessible at sources distributed across the Web—enables advanced data integration and decision-making applications. Typical systems operating on Linked Data collect (crawl) and pre-process (index) large amounts of data, and evaluate queries against a centralised repository. Given that crawling and indexing are time-consuming operations, the data in the centralised index may be out of date at query execution time. An ideal query answering system for querying Linked Data live should return current answers in a reasonable amount of time, even on corpora as large as the Web. In such a live query system source selection—determining which sources contribute answers to a query—is a crucial step. In this article we propose to use lightweight data summaries for determining relevant sources during query evaluation. We compare several data structures and hash functions with respect to their suitability for building such summaries, stressing benefits for queries that contain joins and require ranking of results and sources. We elaborate on join variants, join ordering and ranking. We analyse the different approaches theoretically and provide results of an extensive experimental evaluation.
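
The idea of summary-based source selection can be illustrated with a minimal sketch. It is not the article's implementation: the class `SourceSummary`, the bucket count `NUM_BUCKETS`, the use of CRC32 as the hash function, and the example URIs are all assumptions made for illustration. Each term of a triple is hashed into a bucket per position (subject, predicate, object), and each bucket records which sources contributed a term to it. A triple pattern is then answered with the set of sources that might hold matching data; because of hash collisions and the per-position lookup, the result may contain false positives but never omits a source that does match.

```python
from collections import defaultdict
from zlib import crc32

NUM_BUCKETS = 64  # assumed summary resolution per triple position


def bucket(term: str) -> int:
    """Map an RDF term to a bucket using a deterministic hash."""
    return crc32(term.encode("utf-8")) % NUM_BUCKETS


class SourceSummary:
    """Toy summary: for each position (s, p, o), bucket -> set of sources."""

    def __init__(self):
        self.index = [defaultdict(set) for _ in range(3)]
        self.sources = set()

    def add(self, source, triple):
        """Record that `source` provides `triple` (subject, predicate, object)."""
        self.sources.add(source)
        for pos, term in enumerate(triple):
            self.index[pos][bucket(term)].add(source)

    def select(self, pattern):
        """Return sources that may match `pattern`; None marks a variable."""
        candidates = set(self.sources)
        for pos, term in enumerate(pattern):
            if term is not None:
                candidates &= self.index[pos].get(bucket(term), set())
        return candidates


# Hypothetical sources and triples, for illustration only.
summary = SourceSummary()
summary.add("http://example.org/alice", ("ex:alice", "foaf:knows", "ex:bob"))
summary.add("http://example.org/carol", ("ex:carol", "foaf:name", "Carol"))

# Sources worth fetching for the pattern (ex:alice, ?p, ?o).
relevant = summary.select(("ex:alice", None, None))
print(sorted(relevant))
```

A larger `NUM_BUCKETS` reduces collisions, and with them the number of irrelevant sources fetched, at the cost of a bigger summary. For a join, one possible extension along the same lines is to compare the buckets that two patterns produce for the shared variable.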

Author information

Corresponding author

Correspondence to Andreas Harth.

About this article

Cite this article

Umbrich, J., Hose, K., Karnstedt, M. et al. Comparing data summaries for processing live queries over Linked Data. World Wide Web 14, 495–544 (2011). https://doi.org/10.1007/s11280-010-0107-z

  • DOI: https://doi.org/10.1007/s11280-010-0107-z