Incremental Schema Generation for Large and Evolving RDF Sources

Transactions on Large-Scale Data- and Knowledge-Centered Systems LI

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 13410)

Abstract

The lack of a descriptive schema for an RDF dataset has motivated several research works addressing the problem of automatic schema discovery. The goal of these approaches is to provide the underlying structural schema of a given RDF dataset, either from the existing instances, or using some schema-related declarations if provided. However, as the instances in the RDF dataset evolve, the generated schema may become inconsistent with the dataset. It is therefore necessary to incrementally update the existing schema according to the changes occurring in the dataset over time.

In this paper, we propose a schema discovery approach for massive RDF datasets which incrementally handles both the insertion and the deletion of entities. It is based on a scalable and incremental density-based clustering algorithm which propagates the changes occurring in the dataset to the clusters corresponding to the classes of the schema. Our approach is implemented using big data technologies to scale up to massive data while providing a high-quality clustering result. We present experiments which demonstrate the efficiency of our proposal on both synthetic and real datasets.
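
To make the idea of propagating insertions and deletions into clusters more concrete, the following Python sketch may help. It is not the authors' implementation. It assumes that each entity is described by the set of its properties, that entities are compared with the Jaccard distance, and that clusters are formed DBSCAN-style from a neighbourhood graph that is updated as entities arrive or leave. The class name, the method names, and the parameters eps and min_pts are hypothetical. The sketch runs on a single machine and compares each new entity with all stored ones, so it ignores the scalability aspects that the chapter targets.

```python
from collections import defaultdict


def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


class IncrementalDensityClusters:
    """Hypothetical sketch: keeps an eps-neighbourhood graph up to date."""

    def __init__(self, eps: float, min_pts: int):
        self.eps = eps                      # assumed distance threshold
        self.min_pts = min_pts              # assumed density threshold
        self.props = {}                     # entity id -> frozenset of properties
        self.neighbors = defaultdict(set)   # entity id -> ids within eps (self excluded)

    def insert(self, entity_id, properties):
        # Re-inserting a known id is treated as delete followed by insert.
        if entity_id in self.props:
            self.delete(entity_id)
        new = frozenset(properties)
        # Only neighbourhoods touching the new entity change.
        for other_id, other in self.props.items():
            if jaccard_distance(new, other) <= self.eps:
                self.neighbors[entity_id].add(other_id)
                self.neighbors[other_id].add(entity_id)
        self.props[entity_id] = new

    def delete(self, entity_id):
        # Only the former neighbours of the removed entity are touched.
        for other_id in self.neighbors.pop(entity_id, set()):
            self.neighbors[other_id].discard(entity_id)
        del self.props[entity_id]

    def is_core(self, entity_id) -> bool:
        # The entity counts itself, as in DBSCAN.
        return len(self.neighbors[entity_id]) + 1 >= self.min_pts

    def clusters(self):
        """Derive clusters from the maintained graph; noise is left out."""
        label = {}
        result = []
        for seed in self.props:
            if seed in label or not self.is_core(seed):
                continue
            members = set()
            stack = [seed]
            while stack:
                current = stack.pop()
                if current in label:
                    continue
                label[current] = len(result)
                members.add(current)
                # Border entities join the cluster but are not expanded.
                if self.is_core(current):
                    stack.extend(self.neighbors[current])
            result.append(members)
        return result
```

In this sketch an insertion or deletion only modifies the neighbourhoods of entities close to the changed one, while clusters() recomputes the grouping from the maintained graph. A fully incremental algorithm would also update the cluster labels locally rather than recomputing them.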

Notes

  1. https://github.com/BOUHAMOUM/incremental_sc_dbscan.git

  2. https://github.com/BOUHAMOUM/SC-DBSCAN

  3. http://downloads.dbpedia.org/3.9/

Author information

Corresponding author

Correspondence to Redouane Bouhamoum.

Copyright information

© 2022 Springer-Verlag GmbH Germany, part of Springer Nature

About this chapter

Cite this chapter

Bouhamoum, R., Kedad, Z., Lopes, S. (2022). Incremental Schema Generation for Large and Evolving RDF Sources. In: Hameurlain, A., Tjoa, A.M., Pacitti, E., Miklos, Z. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LI. Lecture Notes in Computer Science, vol 13410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66111-6_2

  • DOI: https://doi.org/10.1007/978-3-662-66111-6_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-66110-9

  • Online ISBN: 978-3-662-66111-6

  • eBook Packages: Computer Science, Computer Science (R0)
