Incremental Schema Generation for Large and Evolving RDF Sources

Bouhamoum, Redouane; Kedad, Zoubida; Lopes, Stéphane

doi:10.1007/978-3-662-66111-6_2

Part of the book series: Lecture Notes in Computer Science (TLDKS, volume 13410)

Abstract

The lack of a descriptive schema for an RDF dataset has motivated several research works addressing the problem of automatic schema discovery. The goal of these approaches is to provide the underlying structural schema of a given RDF dataset, either from the existing instances, or using some schema-related declarations if provided. However, as the instances in the RDF dataset evolve, the generated schema may become inconsistent with the dataset. It is therefore necessary to incrementally update the existing schema according to the changes occurring in the dataset over time.

In this paper, we propose a schema discovery approach for massive RDF datasets that incrementally handles both the insertion and the deletion of entities. It is based on a scalable, incremental density-based clustering algorithm that propagates the changes occurring in the dataset to the clusters corresponding to the classes of the schema. Our approach is implemented using big data technologies to scale up to massive data while providing a high-quality clustering result. We present experiments that demonstrate the efficiency of our proposal on both synthetic and real datasets.
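
The general idea of density-based clustering over entities that are inserted and deleted can be illustrated with a small sketch. It is not the chapter's algorithm: it assumes each entity is described by the set of its property names, that two entities are neighbours when their Jaccard similarity reaches a threshold eps, and that an entity is a core entity when it has at least min_pts entities in its neighbourhood (itself included). The class name and all parameter names are hypothetical. Only the neighbourhood graph is maintained incrementally here; cluster labels are recomputed from that graph on demand rather than propagated locally, and nothing is distributed.

```python
def jaccard(a, b):
    """Jaccard similarity between two property sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class IncrementalDensityClusters:
    """Illustrative sketch only; names and thresholds are assumptions."""

    def __init__(self, eps=0.5, min_pts=2):
        self.eps = eps          # minimum similarity for two entities to be neighbours
        self.min_pts = min_pts  # neighbourhood size (self included) of a core entity
        self.props = {}         # entity id -> frozenset of property names
        self.neigh = {}         # entity id -> set of neighbour ids (self excluded)

    def insert(self, eid, properties):
        """Add an entity and update only the neighbourhoods it touches."""
        if eid in self.props:
            self.delete(eid)
        p = frozenset(properties)
        self.neigh[eid] = set()
        for other, q in self.props.items():
            if jaccard(p, q) >= self.eps:
                self.neigh[eid].add(other)
                self.neigh[other].add(eid)
        self.props[eid] = p

    def delete(self, eid):
        """Remove an entity and detach it from its former neighbours."""
        for other in self.neigh.pop(eid):
            self.neigh[other].discard(eid)
        del self.props[eid]

    def is_core(self, eid):
        return len(self.neigh[eid]) + 1 >= self.min_pts

    def clusters(self):
        """Derive clusters (candidate classes) from the current neighbourhoods."""
        label = {}
        next_id = 0
        for eid in self.props:
            if eid in label or not self.is_core(eid):
                continue
            label[eid] = next_id
            stack = [eid]
            while stack:
                cur = stack.pop()
                for nb in self.neigh[cur]:
                    if nb not in label:
                        label[nb] = next_id
                        # Only core entities extend the cluster further.
                        if self.is_core(nb):
                            stack.append(nb)
            next_id += 1
        groups = {}
        for eid, c in label.items():
            groups.setdefault(c, set()).add(eid)
        return list(groups.values())


model = IncrementalDensityClusters(eps=0.5, min_pts=2)
model.insert("e1", {"name", "age"})
model.insert("e2", {"name", "age", "email"})
model.insert("e3", {"title", "year"})
model.insert("e4", {"title", "year", "isbn"})
print(model.clusters())
model.delete("e4")  # e3 loses its only neighbour and is no longer clustered
print(model.clusters())
```

Entities that belong to no cluster (noise) are simply absent from the result. Each cluster could then be summarised as a class, for example by the union of its members' properties.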

Author information

Authors and Affiliations

DAVID Laboratory, University of Versailles Saint-Quentin-en-Yvelines, Versailles, France
Redouane Bouhamoum, Zoubida Kedad & Stéphane Lopes

Corresponding author

Correspondence to Redouane Bouhamoum.

Editor information

Editors and Affiliations

IRIT, Paul Sabatier University, Toulouse, France
Abdelkader Hameurlain
IFS, Technical University of Vienna, Vienna, Austria
A Min Tjoa
University of Montpellier, Montpellier, France
Esther Pacitti
University of Rennes 1, Rennes, France
Zoltan Miklos

About this chapter

Cite this chapter

Bouhamoum, R., Kedad, Z., Lopes, S. (2022). Incremental Schema Generation for Large and Evolving RDF Sources. In: Hameurlain, A., Tjoa, A.M., Pacitti, E., Miklos, Z. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems LI. Lecture Notes in Computer Science, vol 13410. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-66111-6_2

DOI: https://doi.org/10.1007/978-3-662-66111-6_2
Published: 08 October 2022
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-66110-9
Online ISBN: 978-3-662-66111-6
eBook Packages: Computer Science, Computer Science (R0)
