Abstract
The lack of a descriptive schema for an RDF dataset has motivated several research works addressing the problem of automatic schema discovery. The goal of these approaches is to generate a structural schema of a given RDF dataset from its instances. However, as new instances are added, the generated schema may become inconsistent with the dataset.
In this paper, we propose an incremental schema discovery approach for massive RDF datasets. It is based on a scalable, incremental density-based clustering algorithm that propagates the changes occurring in the dataset into the clusters corresponding to the classes of the schema. Our approach is implemented using big data technology to scale up schema discovery while providing a high-quality clustering result. We present experiments that demonstrate the efficiency of our proposal on both synthetic and real datasets.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Bouhamoum, R., Kedad, Z., Lopes, S. (2021). Incremental Schema Discovery at Scale for RDF Data. In: Verborgh, R., et al. The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_12
DOI: https://doi.org/10.1007/978-3-030-77385-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-77384-7
Online ISBN: 978-3-030-77385-4
eBook Packages: Computer Science, Computer Science (R0)