Abstract
Linked Data contains a large number of Resource Description Framework (RDF) triples, and this number can grow exponentially, which makes SPARQL query processing on a single machine infeasible. To address this scalability issue, MapReduce-based SPARQL engines have been proposed, but these methods are limited in how they evaluate joins. The two-way join-based approach evaluates joins as a sequence of binary joins that requires multiple MapReduce jobs, which incurs costly disk accesses between jobs. The multi-way join-based approach combines multiple two-way join operations so that the joins can be evaluated simultaneously in one MapReduce job. However, the size of the data handled by that job may increase exponentially for a complex query. In this study, we propose SigMR, a pruning method for multi-way join-based SPARQL query processing in MapReduce. In the proposed approach, a SPARQL query can be evaluated in a single MapReduce job, and the size of the data is reduced dramatically by pruning based on our signature encoding technique, thereby overcoming the weaknesses of the previous approaches. Our experiments show that the proposed approach requires less query processing time than existing MapReduce-based methods.
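To make the idea of pruning before a single-job multi-way join concrete, the following is a minimal in-memory sketch. It is not the SigMR algorithm: the abstract does not describe the signature encoding, so the sketch substitutes a generic Bloom-filter-style bit signature. It also assumes a star-shaped query whose triple patterns all share the subject variable. The names `bit_of`, `build_signature`, `map_phase`, `reduce_phase`, and `run_query`, the 64-bit width, and the sample data are all hypothetical.

```python
from collections import defaultdict
from itertools import product
import zlib

SIG_BITS = 64  # hypothetical signature width, chosen only for this sketch
FULL_SIGNATURE = (1 << SIG_BITS) - 1  # every bit set: nothing gets pruned


def bit_of(term):
    """Map a term to one bit position with a deterministic hash."""
    return 1 << (zlib.crc32(term.encode("utf-8")) % SIG_BITS)


def matches(triple, pattern):
    """Pattern slots starting with '?' are variables; others must be equal."""
    return all(p.startswith("?") or p == t for t, p in zip(triple, pattern))


def build_signature(triples, pattern):
    """OR together the bits of every subject that matches one pattern."""
    sig = 0
    for triple in triples:
        if matches(triple, pattern):
            sig |= bit_of(triple[0])
    return sig


def map_phase(triples, patterns, signature):
    """Emit (subject, (pattern index, triple)), skipping pruned subjects."""
    for triple in triples:
        if not bit_of(triple[0]) & signature:
            continue  # subject cannot satisfy every pattern, so drop it early
        for index, pattern in enumerate(patterns):
            if matches(triple, pattern):
                yield triple[0], (index, triple)


def reduce_phase(pairs, n_patterns):
    """Join all patterns at once: keep subjects that matched every pattern."""
    groups = defaultdict(lambda: defaultdict(list))
    for subject, (index, triple) in pairs:
        groups[subject][index].append(triple)
    results = []
    for by_pattern in groups.values():
        if len(by_pattern) == n_patterns:
            results.extend(product(*(by_pattern[i] for i in range(n_patterns))))
    return results


def run_query(triples, patterns, prune=True):
    """Return (join results, number of pairs the map phase emitted)."""
    signature = FULL_SIGNATURE
    if prune:
        # A subject can join only if its bit is set in every pattern's signature.
        for pattern in patterns:
            signature &= build_signature(triples, pattern)
    pairs = list(map_phase(triples, patterns, signature))
    return reduce_phase(pairs, len(patterns)), len(pairs)


# Hypothetical sample data: only "alice" matches all three patterns.
TRIPLES = [
    ("alice", "type", "Student"),
    ("alice", "memberOf", "cs"),
    ("alice", "takesCourse", "db"),
    ("bob", "type", "Student"),
    ("bob", "memberOf", "cs"),
    ("carol", "takesCourse", "db"),
    ("dave", "memberOf", "math"),
]
PATTERNS = [
    ("?s", "type", "Student"),
    ("?s", "memberOf", "?d"),
    ("?s", "takesCourse", "?c"),
]

results, emitted = run_query(TRIPLES, PATTERNS, prune=True)
```

Because the signature is a hash-based summary, a collision can let a non-joining subject through to the reduce phase, but a subject that does join is never dropped. The pruned and unpruned runs therefore return the same results, and pruning can only reduce the number of pairs shuffled between the map and reduce phases.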
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002236). This work was also supported by the ICT R&D program of MSIP/IITP [R0101-15-0054, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services].
Appendices
Appendix A: Algorithms
![figure a](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figa_HTML.gif)
![figure b](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figb_HTML.gif)
![figure c](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figc_HTML.gif)
![figure d](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figd_HTML.gif)
![figure e](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Fige_HTML.gif)
![figure f](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figf_HTML.gif)
Appendix B: LUBM queries
![figure g](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figg_HTML.gif)
![figure h](http://media.springernature.com/lw685/springer-static/image/art%3A10.1007%2Fs11227-015-1459-z/MediaObjects/11227_2015_1459_Figh_HTML.gif)
Cite this article
Ahn, J., Im, DH. & Kim, HG. SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join. J Supercomput 71, 3695–3725 (2015). https://doi.org/10.1007/s11227-015-1459-z
DOI: https://doi.org/10.1007/s11227-015-1459-z