RDF partitioning for scalable SPARQL query processing

Wang, Xiaoyan; Yang, Tao; Chen, Jinchuan; He, Long; Du, Xiaoyong

doi:10.1007/s11704-015-4104-3

Research Article
Published: 13 August 2015

Frontiers of Computer Science, Volume 9, pages 919–933 (2015)

Xiaoyan Wang^1,2,3,4,
Tao Yang^1,4,
Jinchuan Chen^2,4,
Long He^1,4 &
Xiaoyong Du^1,2,4

Abstract

The volume of RDF data has increased dramatically in recent years, and cloud computing platforms such as Hadoop are considered a good choice for processing queries over huge data sets because of their scalability. Previous work on evaluating SPARQL queries with Hadoop mainly focuses on reducing the number of joins through careful splitting of HDFS files and algorithms for generating Map/Reduce jobs. However, the way RDF data is partitioned also affects system performance. Specifically, a good partitioning solution can greatly reduce or even entirely avoid cross-node joins, and thereby significantly cut the cost of query evaluation. Building on HadoopDB, this work processes SPARQL queries in a hybrid architecture, where Map/Reduce takes charge of the computing tasks and RDF query engines such as RDF-3X store the data and execute join operations. Based on an analysis of query workloads, this work proposes a novel algorithm for automatically partitioning RDF data and an approximate solution for physically placing the partitions so as to reduce data redundancy. It also discusses how to strike a good trade-off between query evaluation efficiency and data redundancy. All of the proposed approaches have been evaluated through extensive experiments over large RDF data sets.
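
To make the notion of a cross-node join concrete, the following is a minimal sketch and not the algorithm proposed in the article, which the abstract does not spell out. It counts how many matches of a two-pattern chain query (`?x p1 ?y . ?y p2 ?z`) require triples stored on different nodes under two placement strategies. The toy triples, the `SUBJECT_NODE` mapping, and the function names are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def count_cross_node_joins(
    triples: List[Triple],
    p1: str,
    p2: str,
    node_of: Callable[[Triple], int],
) -> int:
    """Count matches of `?x p1 ?y . ?y p2 ?z` whose two triples
    are stored on different nodes (i.e. need a cross-node join)."""
    left = [t for t in triples if t[1] == p1]
    right = [t for t in triples if t[1] == p2]
    cross = 0
    for t1 in left:
        for t2 in right:
            # Join condition: object of the first pattern equals
            # subject of the second pattern.
            if t1[2] == t2[0] and node_of(t1) != node_of(t2):
                cross += 1
    return cross


# Hypothetical toy data set.
TRIPLES: List[Triple] = [
    ("alice", "advisor", "bob"),
    ("carol", "advisor", "dave"),
    ("bob", "worksFor", "deptA"),
    ("dave", "worksFor", "deptB"),
]

# Hypothetical fixed assignment of resources to two nodes.
SUBJECT_NODE = {"alice": 0, "bob": 1, "carol": 1, "dave": 0}


def by_subject(t: Triple) -> int:
    """Workload-agnostic placement: every triple goes to its subject's node."""
    return SUBJECT_NODE[t[0]]


def workload_aware(t: Triple) -> int:
    """Illustrative placement that assumes the workload joins `advisor`
    objects with `worksFor` subjects, so `advisor` triples are placed
    on the node of their object instead."""
    key = t[2] if t[1] == "advisor" else t[0]
    return SUBJECT_NODE[key]


if __name__ == "__main__":
    for name, strategy in [("by_subject", by_subject),
                           ("workload_aware", workload_aware)]:
        n = count_cross_node_joins(TRIPLES, "advisor", "worksFor", strategy)
        print(f"{name}: {n} cross-node join(s)")
```

In this toy setting, co-locating the triples that the assumed workload joins removes every cross-node join for that query. It does so by moving triples rather than replicating them, so it does not capture the redundancy side of the trade-off the abstract mentions.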

References

Cui B, Mei H, Ooi B. Big data: the driver for innovation in databases. National Science Review, 2014, 1(1): 27–30
Husain M, McGlothlin J, Masud M, Khan L, Thuraisingham B. Heuristics based query processing for large RDF graphs using cloud computing. IEEE Transactions on Knowledge and Data Engineering, 2011, 23(9): 1312–1327
Myung J, Yeon J, Lee S. SPARQL basic graph pattern processing with iterative MapReduce. In: Proceedings of the 2010 Workshop on Massive Data Analytics on the Cloud. 2010, 1–6
Agrawal S, Narasayya V, Yang B. Integrating vertical and horizontal partitioning into automated physical database design. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2004, 359–370
Pavlo A, Curino C, Zdonik S. Skew-aware automatic database partitioning in shared-nothing parallel OLTP systems. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2012, 61–72
Chang C, Kurc T, Sussman A, Catalyurek U, Saltz J. A hypergraph-based workload partitioning strategy for parallel data aggregation. In: Proceedings of the 11th SIAM Conference on Parallel Processing for Scientific Computing. 2001
Abouzeid A, Bajda-Pawlikowski K, Abadi D, Silberschatz A, Rasin A. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. The Proceedings of the VLDB Endowment, 2009, 2(1): 922–933
Neumann T, Weikum G. RDF-3X: a RISC-style engine for RDF. The Proceedings of the VLDB Endowment, 2008, 1(1): 647–659
Huang J, Abadi D, Ren K. Scalable SPARQL querying of large RDF graphs. The Proceedings of the VLDB Endowment, 2011, 4(11): 1123–1134
Andreev K, Räcke H. Balanced graph partitioning. In: Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures. 2004, 120–124
Guo Y B, Pan Z X, Heflin J. LUBM: a benchmark for OWL knowledge base systems. Web Semantics: Science, Services and Agents on the World Wide Web, 2005, 3(2): 158–182
Yang T, Chen J, Wang X, Chen Y, Du X. Efficient SPARQL Query Evaluation via Automatic Data Partitioning. Technical Report. 2012
Sanghavi S, Shah D, Willsky A. Message passing for maximum weight independent set. IEEE Transactions on Information Theory, 2009, 55(11): 4822–4834
Cormen T, Leiserson C, Rivest R, Stein C. Introduction to Algorithms. New York: The MIT Press, 2001
Getoor L, Taskar B, Koller D. Selectivity estimation using probabilistic models. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2001, 461–472
Wilkinson K, Sayers C, Kuno H, Reynolds D. Efficient RDF storage and retrieval in Jena2. In: Proceedings of the 2nd International Semantic Web Conference. 2003, 131–150
Du F, Chen Y, Du X. Partitioned indexes for entity search over RDF knowledge bases. In: Proceedings of the 17th International Conference on Database Systems for Advanced Applications. 2012, 141–155
Görlitz O, Thimm M, Staab S. SPLODGE: Systematic generation of SPARQL benchmark queries for linked open data. In: Proceedings of International Semantic Web-ISWC. 2012, 116–132
Kim H, Ravindra P, Anyanwu K. Scan-sharing for optimizing RDF graph pattern matching on MapReduce. In: Proceedings of the 5th IEEE International Conference on Cloud Computing. 2012, 139–146
Zeng K, Yang J, Wang H, Shao B, Wang Z. A distributed graph engine for web scale RDF data. In: Proceedings of the 39th International Conference on Very Large Data Bases. 2013, 265–276
Rao J, Zhang C, Megiddo N, Lohman G. Automating physical database design in a parallel database. In: Proceedings of the ACM SIGMOD International Conference on Management of Data. 2002, 558–569

Author information

Authors and Affiliations

School of Information, Renmin University of China, Beijing, 100872, China
Xiaoyan Wang, Tao Yang, Long He & Xiaoyong Du
Key Laboratory of Data Engineering and Knowledge Engineering of Ministry of Education, Renmin University, Beijing, 100872, China
Xiaoyan Wang, Jinchuan Chen & Xiaoyong Du
Information Center, the Supreme People’s Court, Beijing, 100745, China
Xiaoyan Wang
State Key Laboratory of Software Development Environment, Beihang University, Beijing, 100191, China
Xiaoyan Wang, Tao Yang, Jinchuan Chen, Long He & Xiaoyong Du

Corresponding author

Correspondence to Jinchuan Chen.

Additional information

Xiaoyan Wang is currently a PhD candidate at Renmin University of China, China. She received her BS in Computer Science and Technology from Central South University, China and her MS in Computer Application and Technology from Shandong University, China. Her research interests include big data management and storage.

Tao Yang is now a software engineer working on mobile ads serving at Google. She received her MS in Computer Science and Technology from Renmin University of China, China in 2014. Her research interest mainly focuses on RDF data management in distributed systems.

Jinchuan Chen is currently an associate professor at the Key Laboratory of Data Engineering and Knowledge Engineering (Renmin University of China, China), Ministry of Education. He received his BS from the Department of Computer Science and Technology of Beijing Normal University, China in 2001, and his MS from the Institute of Software, Chinese Academy of Sciences, China in 2004. He then obtained his PhD in Computer Science and Technology from the Department of Computing of the Hong Kong Polytechnic University, China in 2009. His research interests mainly focus on uncertain data management and unstructured data management.

Long He is a master's candidate at the School of Information, Renmin University of China, China. He received his BS in Information Security from Hunan University of Science and Technology, China. His research interests include big data management and database systems.

Xiaoyong Du received his BS in Computational Mathematics from Hangzhou University, China in 1983 and his ME in Computer Science from Renmin University of China (RUC), China in 1988. He obtained his PhD in Computer Science from Nagoya Institute of Technology, Japan in 1997. He is currently a professor and Dean of the School of Information at RUC. His current research interests include big data management, intelligent information retrieval, and the semantic web.

About this article

Cite this article

Wang, X., Yang, T., Chen, J. et al. RDF partitioning for scalable SPARQL query processing. Front. Comput. Sci. 9, 919–933 (2015). https://doi.org/10.1007/s11704-015-4104-3

Received: 16 March 2014
Accepted: 30 March 2015
Published: 13 August 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s11704-015-4104-3
