ABSTRACT
In light of the challenges of managing Big Data effectively, we are witnessing a gradual shift towards the increasingly popular Linked Open Data (LOD) paradigm. LOD aims to impose a machine-readable semantic layer over structured as well as unstructured data and thereby automate some data analysis tasks that were not originally designed for computers. The convergence of Big Data and LOD is, however, not straightforward: the semantic layer of LOD does not map easily onto large-scale Big Data storage. Meanwhile, the sheer data volumes envisioned by Big Data rule out certain computationally expensive semantic technologies, which become far less efficient than they are on relatively small data sets.
In this paper, we propose a mechanism that allows LOD to take advantage of existing large-scale data stores while retaining its "semantic" nature. We demonstrate how RDF-based semantic models can be distributed across multiple storage servers, and we examine how a fundamental semantic operation can be tuned to meet the requirements of distributed and parallel data processing. Our future work will focus on stress-testing the platform at the scale of tens of billions of triples, as well as on comparative studies of usability and performance against similar offerings.
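To make the idea of distributing an RDF model across storage servers concrete, the following is a minimal sketch, not the mechanism proposed in the paper. It assumes a simple hash-partitioning scheme keyed on the triple's subject, so that all statements about one resource land on the same node. The function names (`node_for`, `partition`, `lookup`), the `ex:` identifiers, and the choice of three nodes are all hypothetical and introduced only for illustration.

```python
import hashlib
from collections import defaultdict


def node_for(subject: str, num_nodes: int) -> int:
    """Map a subject identifier to a node index using a stable hash.

    Assumption: plain modulo hashing. A real deployment would more likely
    use consistent hashing so that adding a node moves few keys.
    """
    digest = hashlib.sha1(subject.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(triples, num_nodes):
    """Group (subject, predicate, object) triples by the node owning the subject."""
    shards = defaultdict(list)
    for s, p, o in triples:
        shards[node_for(s, num_nodes)].append((s, p, o))
    return shards


def lookup(shards, subject, num_nodes):
    """Return all triples about one subject by consulting a single shard."""
    shard = shards.get(node_for(subject, num_nodes), [])
    return [t for t in shard if t[0] == subject]


if __name__ == "__main__":
    # Hypothetical example data; the identifiers are illustrative only.
    triples = [
        ("ex:alice", "ex:knows", "ex:bob"),
        ("ex:alice", "ex:worksFor", "ex:acme"),
        ("ex:bob", "ex:knows", "ex:carol"),
    ]
    shards = partition(triples, num_nodes=3)
    for node, items in sorted(shards.items()):
        print(node, items)
    print(lookup(shards, "ex:alice", num_nodes=3))
```

Partitioning by subject keeps subject-centred lookups local to one server, but queries that join across subjects would still need to contact several nodes; that trade-off is one reason semantic operations need tuning for distributed and parallel processing.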
Index Terms
- Towards big linked data: a large-scale, distributed semantic data storage