SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join

Ahn, Jinhyun; Im, Dong-Hyuk; Kim, Hong-Gee

doi:10.1007/s11227-015-1459-z

Published: 07 June 2015

Volume 71, pages 3695–3725 (2015)

The Journal of Supercomputing

Jinhyun Ahn¹,
Dong-Hyuk Im² &
Hong-Gee Kim¹,³

Abstract

Linked Data contains large numbers of Resource Description Framework (RDF) triples, and their volume can grow exponentially, which makes SPARQL query processing on a single machine infeasible. To address this scalability issue, SPARQL engines based on the MapReduce framework have been proposed, but these methods are limited in how they evaluate joins. The two-way join-based approach evaluates joins as a sequence of binary multiplications that requires multiple MapReduce jobs, incurring costly disk accesses between jobs. The multi-way join-based approach combines multiple two-way join operations so that the joins are evaluated simultaneously in one MapReduce job; however, the size of the data for that job might increase exponentially for a complex query. In this study, we propose SigMR, a pruning method for multi-way join-based SPARQL query processing in MapReduce. In the proposed approach, a SPARQL query can be evaluated in a single MapReduce job, and the size of the data is reduced dramatically by pruning based on our signature encoding technique, thereby overcoming the weaknesses of the previous approaches. In our experiments, query processing time was lower with our approach than with existing MapReduce-based methods.
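
The idea of pruning before a single-job multi-way join can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the SigMR algorithm: the abstract does not describe the signature encoding, so a Bloom-filter-style bitmask over join keys stands in for it. The toy triples, the star-shaped query, the 64-bit width, and the helper names (`bit_for`, `build_filter`, `map_phase`, `reduce_phase`) are all hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import product

# Made-up RDF triples (subject, predicate, object), for illustration only.
TRIPLES = [
    ("alice", "type", "Student"),
    ("alice", "takesCourse", "db101"),
    ("alice", "memberOf", "cs"),
    ("bob", "type", "Student"),
    ("bob", "takesCourse", "ai201"),
    ("carol", "type", "Professor"),
    ("carol", "memberOf", "cs"),
]

# Star query on ?x: { ?x type Student . ?x takesCourse ?c . ?x memberOf ?d }
# Each pattern is (predicate, required object or None if the object is a variable).
PATTERNS = [("type", "Student"), ("takesCourse", None), ("memberOf", None)]

SIG_BITS = 64  # signature width; an arbitrary choice for this sketch


def bit_for(term):
    """Map a term to a single bit of a SIG_BITS-wide signature."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)


def matches(triple, pattern):
    _, pred, obj = triple
    want_pred, want_obj = pattern
    return pred == want_pred and (want_obj is None or obj == want_obj)


def build_filter(triples, patterns):
    """AND together one signature per pattern over the matching subjects."""
    sigs = [0] * len(patterns)
    for triple in triples:
        for i, pattern in enumerate(patterns):
            if matches(triple, pattern):
                sigs[i] |= bit_for(triple[0])
    combined = (1 << SIG_BITS) - 1
    for sig in sigs:
        combined &= sig
    return combined


def map_phase(triples, patterns, keep_mask=None):
    """Emit (join key, (pattern index, object)); skip keys the mask rules out."""
    for triple in triples:
        if keep_mask is not None and not (keep_mask & bit_for(triple[0])):
            continue  # this subject cannot match every pattern
        for i, pattern in enumerate(patterns):
            if matches(triple, pattern):
                yield triple[0], (i, triple[2])


def reduce_phase(pairs, n_patterns):
    """Group by join key and join all patterns at once (multi-way join)."""
    buckets = defaultdict(lambda: [[] for _ in range(n_patterns)])
    for key, (i, obj) in pairs:
        buckets[key][i].append(obj)
    rows = []
    for key, per_pattern in buckets.items():
        # product() yields nothing when any pattern has no match for this key.
        for combo in product(*per_pattern):
            rows.append((key,) + combo)
    return sorted(rows)


unpruned = list(map_phase(TRIPLES, PATTERNS))
pruned = list(map_phase(TRIPLES, PATTERNS, build_filter(TRIPLES, PATTERNS)))
print(len(unpruned), len(pruned))
print(reduce_phase(pruned, len(PATTERNS)))
```

Because hash collisions can only set extra bits, the mask may let through some subjects that do not join (false positives), but it never drops a subject that matches every pattern, so the pruned and unpruned runs return the same answers while the pruned run shuffles no more intermediate pairs than the unpruned one. In a real MapReduce deployment the mask would have to be computed in a prior pass and shipped to the mappers, which this single-process sketch glosses over.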

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002236). This work was also supported by the ICT R&D program of MSIP/IITP [R0101-15-0054, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services].

Author information

Authors and Affiliations

Biomedical Knowledge Engineering Laboratory, Dental Research Institute, Seoul National University, Seoul, Republic of Korea
Jinhyun Ahn & Hong-Gee Kim
Department of Computer and Information Engineering, Hoseo University, Asan, Republic of Korea
Dong-Hyuk Im
Institute of Human-Environment Interface Biology, Seoul National University, Seoul, Republic of Korea
Hong-Gee Kim

Corresponding author

Correspondence to Dong-Hyuk Im.

About this article

Cite this article

Ahn, J., Im, DH. & Kim, HG. SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join. J Supercomput 71, 3695–3725 (2015). https://doi.org/10.1007/s11227-015-1459-z

Issue Date: October 2015
DOI: https://doi.org/10.1007/s11227-015-1459-z
