Abstract
RDF compression and querying are well-established topics in the Web of Data, with a plethora of solutions for efficiently storing and querying static datasets. However, as RDF data changes over time, it becomes necessary to keep different versions of RDF datasets, in what is called an RDF archive. For large RDF datasets, naive techniques for storing these versions lead to significant scalability problems. In this paper, we present v-RDF-SI, one of the first RDF archiving solutions that aims to combine compression with fast querying. In v-RDF-SI, we extend existing RDF representations based on compact data structures to efficiently support version-based queries in compressed space. We present two implementations of v-RDF-SI, named v-RDFCSA and v-HDT, based, respectively, on RDFCSA (an RDF self-index) and HDT (a W3C-supported compressed RDF representation). We experimentally evaluate v-RDF-SI on a public benchmark named BEAR, showing that v-RDF-SI drastically reduces space requirements, being up to 40 times smaller than the baselines provided by BEAR and 4 times smaller than alternatives based on compact data structures, while yielding significantly faster query times in most cases. On average, the fastest variants of v-RDF-SI outperform the alternatives by almost an order of magnitude.
Data availability
The BEAR dataset used in our experiments is available at https://aic.ai.wu.ac.at/qadlod/bear.html (named BEAR-A).
Notes
For simplicity, we omit bnodes here, a particular type of node that has only local scope within the dataset. For our purposes, they can be converted to URIs via skolemization.
A suffix starting at position j from \(S_{id}^{'}\) is defined as the subsequence \(S_{id}^{'}[j, 3n]\).
A wavelet tree WT over a sequence S built on an alphabet \(\Sigma =[1,\sigma ]\) represents S implicitly and supports \({\textit{access}}(S,i)\), \({\textit{rank}}_b(S,i)\), and \({\textit{select}}_b(S,j)\), for any \(b\in \Sigma \), in logarithmic time.
Available at https://github.com/rdfhdt/hdt-cpp.
Here, a bitmap is a plain sequence of ones and zeros, with no additional structures to efficiently support \({\textit{rank}}\) and \({\textit{select}}\) operations.
Obtained from https://github.com/fclaude/libcds.
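The \({\textit{access}}\), \({\textit{rank}}\), and \({\textit{select}}\) operations named in the notes above can be pinned down with a naive reference version. The sketch below is only meant to illustrate their semantics under the assumption of 1-based positions; it scans the sequence in linear time and is not a wavelet tree, nor the implementation used in the paper. The function names and the convention of returning 0 when an occurrence does not exist are illustrative choices.

```python
from typing import Sequence


def access(S: Sequence[int], i: int) -> int:
    """Return the symbol stored at 1-based position i of S."""
    return S[i - 1]


def rank(S: Sequence[int], b: int, i: int) -> int:
    """Count the occurrences of symbol b in the prefix S[1..i]."""
    return sum(1 for x in S[:i] if x == b)


def select(S: Sequence[int], b: int, j: int) -> int:
    """Return the 1-based position of the j-th occurrence of b in S.

    Returns 0 if b occurs fewer than j times (an assumed convention).
    """
    seen = 0
    for pos, x in enumerate(S, start=1):
        if x == b:
            seen += 1
            if seen == j:
                return pos
    return 0


if __name__ == "__main__":
    S = [3, 1, 2, 1, 3, 1]
    print(access(S, 3))     # symbol at position 3
    print(rank(S, 1, 4))    # occurrences of 1 in S[1..4]
    print(select(S, 1, 3))  # position of the third 1
```

A wavelet tree answers the same three queries without scanning S, which is where the logarithmic time bound in the note comes from.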
References
Abeliuk A, Cánovas R, Navarro G (2013) Practical compressed suffix trees. Algorithms 6(2):319–351. https://doi.org/10.3390/a6020319
Ali W, Saleem M, Yao B et al (2022) A survey of RDF stores & SPARQL engines for querying knowledge graphs. VLDB J 31(3):1–26. https://doi.org/10.1007/s00778-021-00711-3
Álvarez-García S, Brisaboa N, Fernández J et al (2015) Compressed vertical partitioning for efficient RDF management. Knowl Inf Syst 44(2):439–474. https://doi.org/10.1007/s10115-014-0770-y
Arndt N, Naumann P, Radtke N et al (2019) Decentralized collaborative knowledge management using git. J Web Semant 54:29–47. https://doi.org/10.1016/j.websem.2018.08.002
Atre M, Chaoji V, Zaki MJ et al (2010) Matrix “bit” loaded: A scalable lightweight join query processor for RDF data. In: Proceedings of the 19th international conference on world wide web (WWW), pp 41–50. https://doi.org/10.1145/1772690.1772696
Bigerl A, Conrads F, Behning C et al (2020) Tentris—A tensor-based triple store. In: Proceedings of the 19th international semantic web conference (ISWC), pp 56–73. https://doi.org/10.1007/978-3-030-62419-4_4
Bizer C, Meusel R, Primpel A et al (2022) Web data commons-microdata, RDFa, JSON-LD, and microformat data sets. https://webdatacommons.org/structureddata/
Brisaboa N, Ladra S, Navarro G (2014) Compact representation of web graphs with extended functionality. Inf Syst 39(1):152–174. https://doi.org/10.1016/j.is.2013.08.003
Brisaboa N, Cerdeira A, Fariña A et al (2015) A compact RDF store using suffix arrays. In: Proceedings of the 22nd international symposium on string processing and information retrieval (SPIRE). LNCS, vol 9309. Springer, Cham, pp 103–115. https://doi.org/10.1007/978-3-319-23826-5_11
Brisaboa NR, Cerdeira-Pena A, de Bernardo G et al (2017) Compressed representation of dynamic binary relations with applications. Inf Syst 69:106–123. https://doi.org/10.1016/j.is.2017.05.003
Brisaboa NR, Cerdeira-Pena A, de Bernardo G et al (2019) Improved compressed string dictionaries. In: Proceedings of the 28th ACM international conference on information and knowledge management (CIKM). ACM, pp 29–38. https://doi.org/10.1145/3357384.3357972
Brisaboa NR, Cerdeira-Pena A, de Bernardo G et al (2022) Space/time-efficient RDF stores based on circular suffix sorting. J Supercomput 79(5):5643–5683. https://doi.org/10.1007/s11227-022-04890-w
Cerdeira-Pena A, Fariña A, Fernández JD et al (2016) Self-indexing RDF archives. In: Proceedings of the data compression conference (DCC). IEEE, pp 526–535. https://doi.org/10.1109/DCC.2016.40
Chan HL, Hon WK, Lam TW et al (2007) Compressed indexes for dynamic text collections. ACM Trans Algorithms 3(2):21–es. https://doi.org/10.1145/1240233.1240244
Claude F, Navarro G (2009) Practical rank/select queries over arbitrary sequences. In: Proceedings of the 15th international symposium on string processing and information retrieval (SPIRE). LNCS, vol 5280. Springer, Berlin, pp 176–187. https://doi.org/10.1007/978-3-540-89097-3_18
Cordova J, Navarro G (2016) Practical dynamic entropy-compressed bitvectors with applications. In: Proceedings of the 15th international symposium on experimental algorithms (SEA). LNCS, vol 9685, pp 105–117. https://doi.org/10.1007/978-3-319-38851-9_8
Curé O, Blin G et al (2014) Waterfowl: a compact, self-indexed and inference-enabled immutable RDF store. In: Proceedings of the 11th extended semantic web conference (ESWC), LNCS, vol 8465, pp 302–316. https://doi.org/10.1007/978-3-319-07443-6_21
Dong-Hyuk I, Sang-Won L, Hyoung-Joo K (2012) A version management framework for RDF triple stores. Int J Softw Eng Knowl Eng 22(1):85–106. https://doi.org/10.1142/S0218194012500040
Erling O, Mikhailov I (2009) RDF support in the Virtuoso DBMS. In: Networked knowledge – networked media. Studies in computational intelligence, vol 221. Springer, Berlin, pp 7–24. https://doi.org/10.1007/978-3-642-02184-8_2
Fariña A, Brisaboa NR, Navarro G et al (2012) Word-based self-indexes for natural language text. ACM Trans Inf Syst 30(1):article 1. https://doi.org/10.1145/2094072.2094073
Fernández J, Martínez-Prieto M, Gutiérrez C et al (2013) Binary RDF representation for publication and exchange (HDT). J Web Semant 19:22–41. https://doi.org/10.1016/j.websem.2013.01.002
Fernández JD, Martínez-Prieto MA (2018) RDF serialization and archival. Springer, Cham, pp 1–11. https://doi.org/10.1007/978-3-319-63962-8_286-1
Fernández JD, Llaves A, Corcho O (2014a) Efficient RDF interchange (ERI) format for RDF data streams. In: Proceedings of the 13th international semantic web conference (ISWC). LNCS, vol 8797. Springer, Berlin, pp 244–259. https://doi.org/10.1007/978-3-319-11915-1_16
Fernández N, Arias J, Sánchez L et al (2014b) RDSZ: an approach for lossless RDF stream compression. In: Proceedings of the 11th extended semantic web conference (ESWC). LNCS, vol 8465. Springer, Cham, pp 52–67. https://doi.org/10.1007/978-3-319-07443-6_5
Fernández JD, Polleres A, Umbrich J (2015) Towards efficient archiving of dynamic linked open data. In: Proceedings of the first DIACHRON workshop on managing the evolution and preservation of the data web. Co-located with 12th extended semantic web conference (ESWC), pp 34–49. http://ceur-ws.org/Vol-1377/
Fernández JD, Umbrich J, Polleres A et al (2019) Evaluating query and storage strategies for RDF archives. Semant Web J 10(2):247–291. https://doi.org/10.3233/SW-180309
Gomes D, Costa M, Cruz D et al (2013) Creating a billion-scale searchable web archive. In: Proceedings of the 22nd international conference on world wide web (WWW companion). Association for Computing Machinery, New York, pp 1059–1066. https://doi.org/10.1145/2487788.2488118
González R, Grabowski S, Mäkinen V et al (2005) Practical implementation of rank and select queries. In: Poster proceedings of the 4th workshop on efficient and experimental algorithms (WEA). CTI Press and Ellinika Grammata, pp 27–38
Graube M, Hensel S, Urbas L (2014) R43ples: Revisions for triples. In: Proceedings of the 1st workshop on linked data quality (LDQ)
Grossi R, Gupta A, Vitter JS (2003) High-order entropy-compressed text indexes. In: Proceedings of the 14th annual ACM-SIAM symposium on discrete algorithms (SODA). Society for Industrial and Applied Mathematics, USA, pp 841–850. https://doi.org/10.5555/644108.644250
Harris S, Seaborne A (2013) SPARQL 1.1 Query language. W3C Recommendation. http://www.w3.org/TR/sparql11-query/
Hasemann H, Kröller A, Pagel M (2012) RDF provisioning for the internet of things. In: Proceedings of the 3rd IEEE international conference on the internet of things (IOT), pp 143–150. https://doi.org/10.1109/IOT.2012.6402316
Hernández-Illera A, Martínez-Prieto M, Fernández J (2015) Serializing RDF in compressed space. In: Proceedings of the data compression conference (DCC). IEEE Computer Society, USA, pp 363–372. https://doi.org/10.1109/DCC.2015.16
Hernández-Illera A, Martínez-Prieto M, Fernández J et al (2020) iHDT++: improving HDT for SPARQL triple pattern resolution. J Intell Fuzzy Syst 39(2):2249–2261. https://doi.org/10.3233/JIFS-179888
Käfer T, Abdelrahman A, Umbrich J et al (2013) Observing linked data dynamics. In: Proceedings of the 10th extended semantic web conference (ESWC). LNCS, vol 7882. Springer, Berlin, pp 213–227. https://doi.org/10.1007/978-3-642-38288-8_15
Klein M, Fensel D, Kiryakov A et al (2002) Ontology versioning and change detection on the web. In: Proceedings of the 13th international conference on knowledge engineering and knowledge management (EKAW). LNCS, vol 2473. Springer, Berlin, pp 197–212. https://doi.org/10.1007/3-540-45810-7_20
Lhez J, Ren X, Belabbess B et al (2017) A compressed, inference-enabled encoding scheme for RDF stream processing. In: Proceedings of the 14th extended semantic web conference (ESWC). LNCS, vol 10250. Springer, Berlin, pp 79–93. https://doi.org/10.1007/978-3-319-58451-5_6
Mäkinen V, Navarro G (2008) Dynamic entropy-compressed sequences and full-text indexes. ACM Trans Algorithms 4(3):article 32. https://doi.org/10.1145/1367064.1367072
Manber U, Myers G (1993) Suffix arrays: a new method for on-line string searches. SIAM J Comput 22(5):935–948. https://doi.org/10.1137/0222058
Martínez-Prieto MA, Arias Gallego M, Fernández JD (2012) Exchange and consumption of huge RDF data. In: Proceedings of the 9th extended semantic web conference (ESWC). LNCS, vol 7295. Springer, Berlin, pp 437–452. https://doi.org/10.1007/978-3-642-30284-8_36
Martínez-Prieto M, Brisaboa N, Cánovas R et al (2016) Practical compressed string dictionaries. Inf Syst 56:73–108. https://doi.org/10.1016/j.is.2015.08.008
Martínez-Prieto MA, Fernández JD, Hernández-Illera A et al (2018) RDF Compression. Springer, Cham, pp 1–11. https://doi.org/10.1007/978-3-319-63962-8_62-1
Martínez-Prieto MA, Fernández JD, Hernández-Illera A et al (2020) Knowledge graph compression for big semantic data. Springer, Cham, pp 1–13. https://doi.org/10.1007/978-3-319-63962-8_62-2
Meinhardt P, Knuth M, Sack H (2015) Tailr: a platform for preserving history on the web of data. In: Proceedings of the 11th international conference on semantic systems (SEMANTICS). Association for Computing Machinery, New York, pp 57–64. https://doi.org/10.1145/2814864.2814875
Munro JI, Nekrich Y, Vitter JS (2015) Dynamic data structures for document collections and graphs. In: Proceedings of the 34th ACM symposium on principles of database systems (PODS). Association for Computing Machinery, New York, pp 277–289. https://doi.org/10.1145/2745754.2745778
Navarro G (2016) Compact data structures–a practical approach. Cambridge University Press, New York. https://doi.org/10.1017/CBO9781316588284
Navarro G, Providel E (2012) Fast, small, simple rank/select on bitmaps. In: Proceedings of the 11th international conference on experimental algorithms (SEA). LNCS, vol 7276. Springer, Berlin, pp 295–306. https://doi.org/10.1007/978-3-642-30850-5_26
Neumann T, Weikum G (2010) The RDF-3X engine for scalable management of RDF data. VLDB J 19(1):91–113. https://doi.org/10.1007/s00778-009-0165-y
Neumann T, Weikum G (2010) x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases. Proc VLDB Endow 3(1–2):256–263. https://doi.org/10.14778/1920841.1920877
Okanohara D, Sadakane K (2007) Practical entropy-compressed rank/select dictionary. In: Proceedings of the meeting on algorithm engineering & experiments (ALENEX). Society for Industrial and Applied Mathematics, Philadelphia, pp 60–70. https://doi.org/10.5555/2791188.2791194
Pelgrin O, Galárraga L, Hose K (2021) Towards fully-fledged archiving for RDF datasets. Semant Web J Pre-press:1–24. https://doi.org/10.3233/sw-210434
Pibiri GE, Perego R, Venturini R (2020) Compressed indexes for fast search of semantic data. IEEE Trans Knowl Data Eng. https://doi.org/10.1109/TKDE.2020.2966609
Raman R, Raman V, Rao S (2002) Succinct indexable dictionaries with applications to encoding \(k\)-ary trees and multisets. In: Proceedings of the 13th annual ACM-SIAM symposium on discrete algorithms (SODA). Society for Industrial and Applied Mathematics, USA, pp 233–242. https://doi.org/10.5555/545381.545411
Sadakane K (2003) New text indexing functionalities of the compressed suffix arrays. J Algorithms 48(2):294–313. https://doi.org/10.1016/S0196-6774(03)00087-7
Schreiber G, Raimond Y (2014) RDF Primer. W3C Recommendation. https://www.w3.org/TR/rdf11-primer/
Taelman R, Vander Sande M, Van Herwegen J et al (2019) Triple storage for random-access versioned querying of RDF archives. J Web Semant 54:4–28. https://doi.org/10.1016/j.websem.2018.08.001
Thompson BB, Personick M, Cutcher M (2014) The Bigdata® RDF graph database. In: Linked data management. Chapman and Hall/CRC, chap 8, pp 1–46. https://doi.org/10.1201/b16859
Vander Sande M, Colpaert P, Verborgh R et al (2013) R&Wbase: Git for triples. In: Proceedings of the WWW2013 workshop on linked data on the web (LDOW), vol CEUR-WS 996, LDOW paper 1. CEUR-WS.org, p 5. http://ceur-ws.org/Vol-996
Völkel M, Groza T (2006) Semversion: an RDF-based ontology versioning system. In: Proceedings of the IADIS international conference WWW/Internet (ICWI), pp 195–202. http://www.iadisportal.org/digital-library/semversion-an-rdf-based-ontology-versioning-system
Weiss C, Karras P, Bernstein A (2008) Hexastore: sextuple indexing for semantic web data management. Proc VLDB Endow 1(1):1008–1019. https://doi.org/10.14778/1453856.1453965
Acknowledgements
The first three co-authors are members of CITIC, which, as a Research Center accredited by the Galician University System, is funded by Consellería de Cultura, Educación e Universidades of the Xunta de Galicia, with 80% supported through ERDF Funds (ERDF Operational Programme Galicia 2014–2020) and the remaining 20% by Secretaría Xeral de Universidades [Grant ED431G 2019/01]. The Spanish group is also funded by Xunta de Galicia/FEDER-UE [ED431C 2021/53]; by MICINN [Magist: PID2019-105221RB-C41; FLATCity-POC: PDC2021-121239-C31; SIGTRANS: PDC2021-120917-C21; EXTRA-Compact: PID2020-114635RB-I00; PID2019-105221RB-C41]; by MCIU-AEI/FEDER-UE [BIZDEVOPS: RTI2018-098309-B-C32]; and by Xunta de Galicia/Igape/IG240.2020.1.185.
Author information
Contributions
All authors have contributed equally to this paper, including the conceptualization, investigation, experimentation, and writing of the article.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
An early partial version of this article appeared in Proc DCC’16 [13].
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cerdeira-Pena, A., de Bernardo, G., Fariña, A. et al. Compressed and queryable self-indexes for RDF archives. Knowl Inf Syst 66, 381–417 (2024). https://doi.org/10.1007/s10115-023-01967-7
DOI: https://doi.org/10.1007/s10115-023-01967-7