RDF in the clouds: a survey

  • Regular Paper
  • The VLDB Journal

Abstract

The Resource Description Framework (RDF) pioneered by the W3C is increasingly being adopted to model data in a variety of scenarios, in particular data to be published or exchanged on the Web. Managing large volumes of RDF data is challenging, due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault-tolerance, and elasticity features it provides, enabling the easy deployment of distributed and parallel architectures. In this article, we survey RDF data management architectures and systems designed for a cloud environment, and more generally, those large-scale RDF data management systems that can be easily deployed therein. We first give the necessary background, then describe the existing systems and proposals in this area, and classify them according to dimensions related to their capabilities and implementation techniques. The survey ends with a discussion of open problems and perspectives.

Notes

  1. http://en.wikipedia.org/wiki/Open_data

  2. From now on, we will use the term RDF(S) to refer to both RDF and RDFS.

  3. http://www.w3.org/TR/sparql11-property-paths/

Author information

Corresponding author

Correspondence to Zoi Kaoudi.

About this article

Cite this article

Kaoudi, Z., Manolescu, I. RDF in the clouds: a survey. The VLDB Journal 24, 67–91 (2015). https://doi.org/10.1007/s00778-014-0364-z

  • DOI: https://doi.org/10.1007/s00778-014-0364-z
