
Compressed and queryable self-indexes for RDF archives

  • Regular Paper
  • Knowledge and Information Systems

Abstract

RDF compression and querying are consolidated topics in the Web of Data, with a plethora of solutions for efficiently storing and querying static datasets. However, as RDF data change over time, it becomes necessary to keep different versions of RDF datasets, in what is called an RDF archive. For large RDF datasets, naive techniques for storing these versions lead to significant scalability problems. In this paper, we present v-RDF-SI, one of the first RDF archiving solutions that aims to combine compression with fast querying. In v-RDF-SI, we extend existing RDF representations based on compact data structures to efficiently support version-based queries in compressed space. We present two implementations of v-RDF-SI, named v-RDFCSA and v-HDT, based, respectively, on RDFCSA (an RDF self-index) and HDT (a W3C-supported compressed RDF representation). We experimentally evaluate v-RDF-SI on a public benchmark named BEAR, showing that v-RDF-SI drastically reduces space requirements: it is up to 40 times smaller than the baselines provided by BEAR and 4 times smaller than alternatives based on compact data structures, while yielding significantly faster query times in most cases. On average, the fastest variants of v-RDF-SI outperform the alternatives by almost an order of magnitude.

Data availability

The BEAR dataset used in our experiments (named BEAR-A) is available at https://aic.ai.wu.ac.at/qadlod/bear.html.

Notes

  1. For simplicity, we omit blank nodes (bnodes) here, a particular type of term whose scope is local to the dataset. For our purposes, they can be converted to URIs via skolemization.

  2. A suffix starting at position j from \(S_{id}^{'}\) is defined as the subsequence \(S_{id}^{'}[j, 3n]\).

  3. A wavelet tree WT over a sequence S drawn from an alphabet \(\Sigma =[1,\sigma ]\) represents S implicitly and supports \({\textit{access}}(S,i)\), \({\textit{rank}}_b(S,i)\), and \({\textit{select}}_b(S,j)\), for any \(b\in \Sigma \), in \(O(\log \sigma )\) time.

  4. Available at https://github.com/rdfhdt/hdt-cpp.

  5. Here, a bitmap is a plain sequence of ones and zeroes, with no additional structures to efficiently support \({\textit{rank}}\) and \({\textit{select}}\) operations.

  6. Obtained from https://github.com/fclaude/libcds.

  7. https://www.w3.org/2001/sw/RDFCore/ntriples/.

  8. https://jena.apache.org/documentation/tdb/.
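
The operations named in Notes 2, 3 and 5 can be illustrated with a minimal sketch. It assumes 1-based positions, as in the notation \(S[j, 3n]\), and uses plain linear scans over a Python list, which is what a bitmap or sequence with no additional structures allows; it is not a wavelet tree and not the implementation used in the paper. The function names and the sample sequences are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence


def access(S: Sequence[int], i: int) -> int:
    """Return S[i], with positions numbered from 1 (assumption)."""
    return S[i - 1]


def rank(S: Sequence[int], b: int, i: int) -> int:
    """Count the occurrences of symbol b in the prefix S[1, i]."""
    return sum(1 for symbol in S[:i] if symbol == b)


def select(S: Sequence[int], b: int, j: int) -> Optional[int]:
    """Return the position of the j-th occurrence of b, or None if absent."""
    seen = 0
    for position, symbol in enumerate(S, start=1):
        if symbol == b:
            seen += 1
            if seen == j:
                return position
    return None


def suffix(S: Sequence[int], j: int) -> Sequence[int]:
    """Return the suffix of S starting at position j, i.e. S[j, len(S)]."""
    return S[j - 1:]


# Hypothetical sample data: a sequence over the alphabet [1, 3] and a bitmap.
S = [3, 1, 2, 1, 3, 1]
B = [1, 0, 0, 1, 1, 0]

print(access(S, 3))     # 2
print(rank(S, 1, 4))    # 2
print(select(S, 1, 3))  # 6
print(suffix(S, 5))     # [3, 1]
print(rank(B, 1, 4))    # 2
print(select(B, 0, 2))  # 3
```

Each call above scans the sequence, so it takes time linear in the length of S. Compact structures such as the wavelet tree of Note 3 answer the same queries in logarithmic time while representing S implicitly.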

References

  1. Abeliuk A, Cánovas R, Navarro G (2013) Practical compressed suffix trees. Algorithms 6(2):319–351. https://doi.org/10.3390/a6020319

  2. Ali W, Saleem M, Yao B et al (2022) A survey of RDF stores & SPARQL engines for querying knowledge graphs. VLDB J 31(3):1–26. https://doi.org/10.1007/s00778-021-00711-3

  3. Álvarez-García S, Brisaboa N, Fernández J et al (2015) Compressed vertical partitioning for efficient RDF management. Knowl Inf Syst 44(2):439–474. https://doi.org/10.1007/s10115-014-0770-y

  4. Arndt N, Naumann P, Radtke N et al (2019) Decentralized collaborative knowledge management using git. J Web Semant 54:29–47. https://doi.org/10.1016/j.websem.2018.08.002

  5. Atre M, Chaoji V, Zaki MJ et al (2010) Matrix “bit” loaded: A scalable lightweight join query processor for RDF data. In: Proceedings of the 19th international conference on world wide web (WWW), pp 41–50. https://doi.org/10.1145/1772690.1772696

  6. Bigerl A, Conrads F, Behning C et al (2020) Tentris—A tensor-based triple store. In: Proceedings of the 19th international semantic web conference (ISWC), pp 56–73. https://doi.org/10.1007/978-3-030-62419-4_4

  7. Bizer C, Meusel R, Primpeli A et al (2022) Web data commons-microdata, RDFa, JSON-LD, and microformat data sets. https://webdatacommons.org/structureddata/

  8. Brisaboa N, Ladra S, Navarro G (2014) Compact representation of web graphs with extended functionality. Inf Syst 39(1):152–174. https://doi.org/10.1016/j.is.2013.08.003

  9. Brisaboa N, Cerdeira A, Fariña A et al (2015) A compact RDF store using suffix arrays. In: Proceedings of the 22nd international symposium on string processing and information retrieval (SPIRE). LNCS, vol 9309. Springer, Cham, pp 103–115. https://doi.org/10.1007/978-3-319-23826-5_11

  10. Brisaboa NR, Cerdeira-Pena A, de Bernardo G et al (2017) Compressed representation of dynamic binary relations with applications. Inf Syst 69:106–123. https://doi.org/10.1016/j.is.2017.05.003

  11. Brisaboa NR, Cerdeira-Pena A, de Bernardo G et al (2019) Improved compressed string dictionaries. In: Proceedings of the 28th ACM international conference on information and knowledge management (CIKM). ACM, pp 29–38. https://doi.org/10.1145/3357384.3357972

  12. Brisaboa NR, Cerdeira-Pena A, de Bernardo G et al (2022) Space/time-efficient RDF stores based on circular suffix sorting. J Supercomput 79(5):5643–5683. https://doi.org/10.1007/s11227-022-04890-w

  13. Cerdeira-Pena A, Fariña A, Fernández JD et al (2016) Self-indexing RDF archives. In: Proceedings of the data compression conference (DCC). IEEE, pp 526–535. https://doi.org/10.1109/DCC.2016.40

  14. Chan HL, Hon WK, Lam TW et al (2007) Compressed indexes for dynamic text collections. ACM Trans Algorithms 3(2):21–es. https://doi.org/10.1145/1240233.1240244

  15. Claude F, Navarro G (2009) Practical rank/select queries over arbitrary sequences. In: Proceedings of the 15th international symposium on string processing and information retrieval (SPIRE). LNCS, vol 5280. Springer, Berlin, pp 176–187. https://doi.org/10.1007/978-3-540-89097-3_18

  16. Cordova J, Navarro G (2016) Practical dynamic entropy-compressed bitvectors with applications. In: Proceedings of the 15th international symposium on experimental algorithms (SEA). LNCS, vol 9685, pp 105–117. https://doi.org/10.1007/978-3-319-38851-9_8

  17. Curé O, Blin G et al (2014) Waterfowl: a compact, self-indexed and inference-enabled immutable RDF store. In: Proceedings of the 11th extended semantic web conference (ESWC), LNCS, vol 8465, pp 302–316. https://doi.org/10.1007/978-3-319-07443-6_21

  18. Dong-Hyuk I, Sang-Won L, Hyoung-Joo K (2012) A version management framework for RDF triple stores. Int J Softw Eng Knowl Eng 22(1):85–106. https://doi.org/10.1142/S0218194012500040

  19. Erling O, Mikhailov I (2009) RDF support in the Virtuoso DBMS. In: Networked knowledge - networked media. Studies in computational intelligence, vol 221. Springer, Berlin, pp 7–24. https://doi.org/10.1007/978-3-642-02184-8_2

  20. Fariña A, Brisaboa NR, Navarro G et al (2012) Word-based self-indexes for natural language text. ACM Trans Inf Syst 30(1):article 1. https://doi.org/10.1145/2094072.2094073

  21. Fernández J, Martínez-Prieto M, Gutiérrez C et al (2013) Binary RDF representation for publication and exchange (HDT). J Web Semant 19:22–41. https://doi.org/10.1016/j.websem.2013.01.002

  22. Fernández JD, Martínez-Prieto MA (2018) RDF serialization and archival. Springer, Cham, pp 1–11. https://doi.org/10.1007/978-3-319-63962-8_286-1

  23. Fernández JD, Llaves A, Corcho O (2014a) Efficient RDF interchange (ERI) format for RDF data streams. In: Proceedings of the 13th international semantic web conference (ISWC). LNCS, vol 8797. Springer, Berlin, pp 244–259. https://doi.org/10.1007/978-3-319-11915-1_16

  24. Fernández N, Arias J, Sánchez L et al (2014b) RDSZ: an approach for lossless RDF stream compression. In: Proceedings of the 11th extended semantic web conference (ESWC). LNCS, vol 8465. Springer, Cham, pp 52–67. https://doi.org/10.1007/978-3-319-07443-6_5

  25. Fernández JD, Polleres A, Umbrich J (2015) Towards efficient archiving of dynamic linked open data. In: Proceedings of the first DIACHRON workshop on managing the evolution and preservation of the data web. Co-located with 12th extended semantic web conference (ESWC), pp 34–49. http://ceur-ws.org/Vol-1377/

  26. Fernández JD, Umbrich J, Polleres A et al (2019) Evaluating query and storage strategies for RDF archives. Semant Web J 10(2):247–291. https://doi.org/10.3233/SW-180309

  27. Gomes D, Costa M, Cruz D et al (2013) Creating a billion-scale searchable web archive. In: Proceedings of the 22nd international conference on world wide web (WWW companion). Association for Computing Machinery, New York, pp 1059–1066. https://doi.org/10.1145/2487788.2488118

  28. González R, Grabowski S, Mäkinen V et al (2005) Practical implementation of rank and select queries. In: Poster proceedings of the 4th workshop on efficient and experimental algorithms (WEA). CTI Press and Ellinika Grammata, pp 27–38

  29. Graube M, Hensel S, Urbas L (2014) R43ples: Revisions for triples. In: Proceedings of the 1st workshop on linked data quality (LDQ)

  30. Grossi R, Gupta A, Vitter JS (2003) High-order entropy-compressed text indexes. In: Proceedings of the 14th annual ACM-SIAM symposium on discrete algorithms (SODA). Society for Industrial and Applied Mathematics, USA, pp 841–850. https://doi.org/10.5555/644108.644250

  31. Harris S, Seaborne A (2013) SPARQL 1.1 Query language. W3C Recommendation. http://www.w3.org/TR/sparql11-query/

  32. Hasemann H, Kröller A, Pagel M (2012) RDF provisioning for the internet of things. In: Proceedings of the 3rd IEEE international conference on the internet of things (IOT), pp 143–150. https://doi.org/10.1109/IOT.2012.6402316

  33. Hernández-Illera A, Martínez-Prieto M, Fernández J (2015) Serializing RDF in compressed space. In: Proceedings of the data compression conference (DCC). IEEE Computer Society, USA, pp 363–372. https://doi.org/10.1109/DCC.2015.16

  34. Hernández-Illera A, Martínez-Prieto M, Fernández J et al (2020) iHDT++: improving HDT for SPARQL triple pattern resolution. J Intell Fuzzy Syst 39(2):2249–2261. https://doi.org/10.3233/JIFS-179888

  35. Käfer T, Abdelrahman A, Umbrich J et al (2013) Observing linked data dynamics. In: Proceedings of the 10th extended semantic web conference (ESWC). LNCS, vol 7882. Springer, Berlin, pp 213–227. https://doi.org/10.1007/978-3-642-38288-8_15

  36. Klein M, Fensel D, Kiryakov A et al (2002) Ontology versioning and change detection on the web. In: Proceedings of the 13th international conference on knowledge engineering and knowledge management (EKAW). LNCS, vol 2473. Springer, Berlin, pp 197–212. https://doi.org/10.1007/3-540-45810-7_20

  37. Lhez J, Ren X, Belabbess B et al (2017) A compressed, inference-enabled encoding scheme for RDF stream processing. In: Proceedings of the 14th extended semantic web conference (ESWC). LNCS, vol 10250. Springer, Berlin, pp 79–93. https://doi.org/10.1007/978-3-319-58451-5_6

  38. Mäkinen V, Navarro G (2008) Dynamic entropy-compressed sequences and full-text indexes. ACM Trans Algorithms 4(3):article 32. https://doi.org/10.1145/1367064.1367072

  39. Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948. https://doi.org/10.1137/0222058

  40. Martínez-Prieto MA, Arias Gallego M, Fernández JD (2012) Exchange and consumption of huge RDF data. In: Proceedings of the 9th extended semantic web conference (ESWC). LNCS, vol 7295. Springer, Berlin, pp 437–452. https://doi.org/10.1007/978-3-642-30284-8_36

  41. Martínez-Prieto M, Brisaboa N, Cánovas R et al (2016) Practical compressed string dictionaries. Inf Syst 56:73–108. https://doi.org/10.1016/j.is.2015.08.008

  42. Martínez-Prieto MA, Fernández JD, Hernández-Illera A et al (2018) RDF Compression. Springer, Cham, pp 1–11. https://doi.org/10.1007/978-3-319-63962-8_62-1

  43. Martínez-Prieto MA, Fernández JD, Hernández-Illera A et al (2020) Knowledge graph compression for big semantic data. Springer, Cham, pp 1–13. https://doi.org/10.1007/978-3-319-63962-8_62-2

  44. Meinhardt P, Knuth M, Sack H (2015) Tailr: a platform for preserving history on the web of data. In: Proceedings of the 11th international conference on semantic systems (SEMANTICS). Association for Computing Machinery, New York, pp 57–64. https://doi.org/10.1145/2814864.2814875

  45. Munro JI, Nekrich Y, Vitter JS (2015) Dynamic data structures for document collections and graphs. In: Proceedings of the 34th ACM symposium on principles of database systems (PODS). Association for Computing Machinery, New York, pp 277–289. https://doi.org/10.1145/2745754.2745778

  46. Navarro G (2016) Compact data structures–a practical approach. Cambridge University Press, New York. https://doi.org/10.1017/CBO9781316588284

  47. Navarro G, Providel E (2012) Fast, small, simple rank/select on bitmaps. In: Proceedings of the 11th international conference on experimental algorithms (SEA). LNCS, vol 7276. Springer, Berlin, pp 295–306. https://doi.org/10.1007/978-3-642-30850-5_26

  48. Neumann T, Weikum G (2010) The RDF-3X engine for scalable management of RDF data. VLDB J 19(1):91–113. https://doi.org/10.1007/s00778-009-0165-y

  49. Neumann T, Weikum G (2010) x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases. Proc VLDB Endow 3(1–2):256–263. https://doi.org/10.14778/1920841.1920877

  50. Okanohara D, Sadakane K (2007) Practical entropy-compressed rank/select dictionary. In: Proceedings of the meeting on algorithm engineering & experiments (ALENEX). Society for Industrial and Applied Mathematics, Philadelphia, pp 60–70. https://doi.org/10.5555/2791188.2791194

  51. Pelgrin O, Galárraga L, Hose K (2021) Towards fully-fledged archiving for RDF datasets. Semantic Web J Pre-press 1–24. https://doi.org/10.3233/sw-210434

  52. Pibiri GE, Perego R, Venturini R (2020) Compressed indexes for fast search of semantic data. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2020.2966609

  53. Raman R, Raman V, Rao S (2002) Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets. In: Proceedings of the 13th annual ACM-SIAM symposium on discrete algorithms (SODA). Society for Industrial and Applied Mathematics, USA, pp 233–242. https://doi.org/10.5555/545381.545411

  54. Sadakane K (2003) New text indexing functionalities of the compressed suffix arrays. J Algorithms 48(2):294–313. https://doi.org/10.1016/S0196-6774(03)00087-7

  55. Schreiber G, Raimond Y (2014) RDF Primer. W3C Recommendation. https://www.w3.org/TR/rdf11-primer/

  56. Taelman R, Vander Sande M, Van Herwegen J et al (2019) Triple storage for random-access versioned querying of RDF archives. J Web Semant 54:4–28. https://doi.org/10.1016/j.websem.2018.08.001

  57. Thompson BB, Personick M, Cutcher M (2014) The Bigdata® RDF graph database. In: Linked data management. Chapman and Hall/CRC, chap 8, pp 1–46. https://doi.org/10.1201/b16859

  58. Vander Sande M, Colpaert P, Verborgh R et al (2013) R&Wbase: Git for triples. In: Proceedings of the WWW2013 workshop on linked data on the web (LDOW), vol CEUR-WS 996, LDOW paper 1. CEUR-WS.org, p 5. http://ceur-ws.org/Vol-996

  59. Völkel M, Groza T (2006) Semversion: an RDF-based ontology versioning system. In: Proceedings of the IADIS international conference WWW/Internet (ICWI), pp 195–202. http://www.iadisportal.org/digital-library/semversion-an-rdf-based-ontology-versioning-system

  60. Weiss C, Karras P, Bernstein A (2008) Hexastore: sextuple indexing for semantic web data management. Proc VLDB Endow 1(1):1008–1019. https://doi.org/10.14778/1453856.1453965

Acknowledgements

The first three co-authors are members of CITIC, which, as a Research Center accredited by the Galician University System, is funded by Consellería de Cultura, Educación e Universidades of the Xunta de Galicia, with 80% provided through ERDF Funds (ERDF Operational Programme Galicia 2014–2020) and the remaining 20% by Secretaría Xeral de Universidades [Grant ED431G 2019/01]. The Spanish group is also funded by Xunta de Galicia/FEDER-UE [ED431C 2021/53]; by MICINN [Magist: PID2019-105221RB-C41; FLATCity-POC: PDC2021-121239-C31; SIGTRANS: PDC2021-120917-C21; EXTRA-Compact: PID2020-114635RB-I00; PID2019-105221RB-C41]; by MCIU-AEI/FEDER-UE [BIZDEVOPS: RTI2018-098309-B-C32]; and by Xunta de Galicia/Igape/IG240.2020.1.185.

Author information

Contributions

All authors have contributed equally to this paper, including the conceptualization, investigation, experimentation, and writing of the article.

Corresponding author

Correspondence to Guillermo de Bernardo.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

An early partial version of this article appeared in Proc DCC’16 [13].

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cerdeira-Pena, A., de Bernardo, G., Fariña, A. et al. Compressed and queryable self-indexes for RDF archives. Knowl Inf Syst 66, 381–417 (2024). https://doi.org/10.1007/s10115-023-01967-7

  • DOI: https://doi.org/10.1007/s10115-023-01967-7
