SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join

The Journal of Supercomputing

Abstract

Linked Data contains a large number of Resource Description Framework (RDF) triples, and the volume can grow exponentially, which makes SPARQL query processing on a single machine infeasible. To address this scalability issue, MapReduce-based SPARQL engines have been proposed, but these methods are limited in how they evaluate joins. The two-way join-based approach evaluates joins as a sequence of binary join operations, which requires multiple MapReduce jobs and incurs costly disk accesses between them. The multi-way join-based approach combines multiple two-way join operations so that the joins are evaluated simultaneously in one MapReduce job. However, the size of the data processed by that job may increase exponentially for a complex query. In this study, we propose SigMR, a pruning method for multi-way join-based SPARQL query processing in MapReduce. With the proposed approach, a SPARQL query can be evaluated in a single MapReduce job, and the size of the data is reduced dramatically by pruning based on our signature encoding technique, thereby overcoming the weaknesses of the previous approaches. In experiments, query processing time was lower with our approach than with existing MapReduce-based methods.
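The idea of pruning input before a single-pass multi-way join can be illustrated with a small in-memory sketch. This is not SigMR's signature encoding, which the abstract does not specify. It assumes a Bloom-filter-style bit signature with one hash function as a stand-in, and simulates the map, shuffle, and reduce phases in plain Python rather than on Hadoop. The names `bit_for`, `signature`, `star_join`, and `SIG_BITS` are hypothetical.

```python
import hashlib
from collections import defaultdict

SIG_BITS = 64  # assumed signature width; not taken from the paper


def bit_for(term):
    """Map an RDF term to a single bit of a SIG_BITS-wide signature."""
    digest = hashlib.md5(term.encode("utf-8")).digest()
    return 1 << (int.from_bytes(digest[:4], "big") % SIG_BITS)


def signature(terms):
    """OR together the bits of all terms (a one-hash Bloom-style filter)."""
    sig = 0
    for term in terms:
        sig |= bit_for(term)
    return sig


def star_join(triples, predicates, prune=True):
    """Join the patterns (?x, p, ?o_p) for every p in predicates on ?x.

    All patterns are joined in one simulated map/shuffle/reduce pass.
    Returns (sorted result rows, number of records sent to the shuffle).
    """
    allowed = None
    if prune:
        # A subject that matches every pattern has its bit set in every
        # per-pattern signature, so it survives the AND of all of them.
        # False positives are possible; false negatives are not.
        allowed = ~0
        for p in predicates:
            allowed &= signature(s for s, pred, _ in triples if pred == p)

    # "Map" and "shuffle": group matching triples by the join key ?x.
    groups = defaultdict(lambda: defaultdict(list))
    shuffled = 0
    for s, p, o in triples:
        if p not in predicates:
            continue
        if allowed is not None and not (allowed & bit_for(s)):
            continue  # pruned before the shuffle
        groups[s][p].append(o)
        shuffled += 1

    # "Reduce": emit the cross product of objects for complete groups.
    results = []
    for s, by_pred in groups.items():
        if all(p in by_pred for p in predicates):
            combos = [[]]
            for p in predicates:
                combos = [c + [o] for c in combos for o in by_pred[p]]
            results.extend((s, *c) for c in combos)
    return sorted(results), shuffled
```

Calling `star_join(triples, ["advisor", "takesCourse"])` on a list of `(subject, predicate, object)` tuples returns the same rows with or without `prune=True`; only the shuffled record count can differ, since triples whose subject cannot match every pattern are dropped before grouping.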


Notes

  1. http://www.w3.org/RDF.

  2. http://www.w3.org/TR/rdf-sparql-query.

  3. http://linkeddata.org.

  4. http://dbpedia.org.

  5. http://swat.cse.lehigh.edu/projects/lubm.


Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002236). This work was also supported by the ICT R&D program of MSIP/IITP [R0101-15-0054, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services].

Author information

Corresponding author

Correspondence to Dong-Hyuk Im.

Appendices

Appendix A: Algorithms

Figures a–f

Appendix B: LUBM queries

Figures g–h


About this article


Cite this article

Ahn, J., Im, DH. & Kim, HG. SigMR: MapReduce-based SPARQL query processing by signature encoding and multi-way join. J Supercomput 71, 3695–3725 (2015). https://doi.org/10.1007/s11227-015-1459-z


  • DOI: https://doi.org/10.1007/s11227-015-1459-z
