ABSTRACT
We consider the feasibility of processing billions of RDF triples on a single commodity machine using streaming and sorting techniques, focusing on RDF processing tasks relevant for Linked Data consumption: data filtering and transformation, RDFS inference, owl:sameAs smushing and statistics extraction. To investigate this research question we built RDFpro (RDF Processor), an open source tool that provides streaming and sorting-based processors for the considered tasks and allows their sequential and parallel composition in complex pipelines. An empirical evaluation of RDFpro in four application scenarios (dataset analysis, filtering, merging and massaging) shows the effectiveness of the tool and allows us to answer our research question positively.
Index Terms
- Processing billions of RDF triples on a single machine using streaming and sorting