Incremental Schema Discovery at Scale for RDF Data

Bouhamoum, Redouane; Kedad, Zoubida; Lopes, Stéphane

doi:10.1007/978-3-030-77385-4_12

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12731)

Included in the following conference series:

European Semantic Web Conference

Abstract

The lack of a descriptive schema for an RDF dataset has motivated several research works addressing the problem of automatic schema discovery. The goal of these approaches is to generate a structural schema of a given RDF dataset from its instances. However, as new instances are added, the generated schema may become inconsistent with the dataset.

In this paper, we propose an incremental schema discovery approach for massive RDF datasets. It is based on a scalable and incremental density-based clustering algorithm which propagates the changes occurring in the dataset into the clusters corresponding to the classes of the schema. Our approach is implemented using big data technology to scale up schema discovery while providing a high quality clustering result. We present some experiments which demonstrate the efficiency of our proposal on both synthetic and real datasets.
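To make the idea of incremental density-based clustering concrete, here is a minimal single-machine sketch. It is not the authors' algorithm or their distributed implementation: the class name `IncrementalDensityClustering`, the parameters `min_similarity` and `min_pts`, the choice of Jaccard similarity over entity property sets, and the handling of insertions only are all assumptions made for illustration. Each new entity updates only its own neighborhood, and clusters (candidate schema classes) are created, extended, or merged accordingly.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity between two property sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


class IncrementalDensityClustering:
    """Hypothetical insert-only, DBSCAN-style clustering of entities."""

    def __init__(self, min_similarity: float, min_pts: int) -> None:
        self.min_similarity = min_similarity
        self.min_pts = min_pts  # neighborhood size (self included) to be core
        self.props: Dict[str, FrozenSet[str]] = {}
        self.neighbors: Dict[str, Set[str]] = {}
        self.parent: Dict[str, str] = {}  # union-find over core entities

    def _find(self, x: str) -> str:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, a: str, b: str) -> None:
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra

    def _is_core(self, e: str) -> bool:
        return len(self.neighbors[e]) >= self.min_pts

    def add(self, entity: str, properties: Iterable[str]) -> None:
        """Insert one entity and propagate the change to the clusters."""
        props = frozenset(properties)
        near = {o for o, p in self.props.items()
                if jaccard(props, p) >= self.min_similarity}
        self.props[entity] = props
        self.neighbors[entity] = near | {entity}
        for o in near:
            self.neighbors[o].add(entity)
        # Only the new entity and its neighbors can gain core status or
        # new core neighbors, so only they need to be re-examined.
        for q in self.neighbors[entity]:
            if not self._is_core(q):
                continue
            self.parent.setdefault(q, q)
            for r in self.neighbors[q]:
                if r != q and self._is_core(r):
                    self.parent.setdefault(r, r)
                    self._union(q, r)

    def cluster_of(self, entity: str) -> Optional[str]:
        """Cluster identifier of an entity, or None if it is noise."""
        if self._is_core(entity):
            return self._find(entity)
        for q in sorted(self.neighbors[entity]):  # border entity
            if self._is_core(q):
                return self._find(q)
        return None

    def schema(self) -> List[FrozenSet[str]]:
        """One class per cluster: the union of its members' properties."""
        classes: Dict[str, Set[str]] = {}
        for e, p in self.props.items():
            c = self.cluster_of(e)
            if c is not None:
                classes.setdefault(c, set()).update(p)
        return [frozenset(s) for s in classes.values()]
```

Using a union-find structure keeps insertions cheap because inserting entities can only create or merge clusters, never split them. Deletions, which may split a cluster, would need additional bookkeeping that this sketch leaves out.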

Notes

1. RDF: https://www.w3.org/RDF/
2. https://www.dbpedia.org/
3. https://github.com/BOUHAMOUM/incremental_sc_dbscan.git
4. https://github.com/BOUHAMOUM/SC-DBSCAN
5. IBM QSDG: https://sourceforge.net/projects/ibmquestdatagen/
6. http://downloads.dbpedia.org/3.9/

Author information

Authors and Affiliations

DAVID Lab, University of Versailles Saint-Quentin-en-Yvelines, Versailles, France
Redouane Bouhamoum, Zoubida Kedad & Stéphane Lopes

Corresponding author

Correspondence to Redouane Bouhamoum.

Editor information

Editors and Affiliations

Ghent University, Ghent, Belgium
Ruben Verborgh
Aalborg University, Aalborg, Denmark
Katja Hose
University of Mannheim, Mannheim, Germany
Heiko Paulheim
ERCIM, Sophia Antipolis, France
Pierre-Antoine Champin
University of Siegen, Siegen, Germany
Maria Maleshkova
Universidad Politécnica de Madrid, Boadilla del Monte, Spain
Oscar Corcho
eBay Inc., San Jose, CA, USA
Petar Ristoski
FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Eggenstein-Leopoldshafen, Germany
Mehwish Alam

Cite this paper

Bouhamoum, R., Kedad, Z., Lopes, S. (2021). Incremental Schema Discovery at Scale for RDF Data. In: Verborgh, R., et al. (eds.) The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_12

DOI: https://doi.org/10.1007/978-3-030-77385-4_12
Published: 31 May 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77384-7
Online ISBN: 978-3-030-77385-4
eBook Packages: Computer Science (R0)
