Incremental Schema Discovery at Scale for RDF Data

  • Conference paper
The Semantic Web (ESWC 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12731)

Abstract

The lack of a descriptive schema for an RDF dataset has motivated several research works addressing the problem of automatic schema discovery. The goal of these approaches is to generate a structural schema of a given RDF dataset from its instances. However, as new instances are added, the generated schema may become inconsistent with the dataset.

In this paper, we propose an incremental schema discovery approach for massive RDF datasets. It is based on a scalable and incremental density-based clustering algorithm which propagates the changes occurring in the dataset into the clusters corresponding to the classes of the schema. Our approach is implemented using big data technology to scale up schema discovery while providing a high quality clustering result. We present some experiments which demonstrate the efficiency of our proposal on both synthetic and real datasets.
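The idea of propagating insertions into density-based clusters can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the class name `IncrementalDensityClusterer`, the parameters `eps` and `min_pts`, and the choice of Jaccard distance over each entity's property set are assumptions made for illustration, and the sketch handles insertions only, with no distribution across a cluster of machines.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Set


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


class IncrementalDensityClusterer:
    """Hypothetical insertion-only, DBSCAN-style clusterer (illustrative)."""

    def __init__(self, eps: float, min_pts: int) -> None:
        self.eps = eps            # assumed neighborhood radius (Jaccard distance)
        self.min_pts = min_pts    # assumed density threshold, counting the entity itself
        self.props: Dict[str, FrozenSet[str]] = {}
        self.neighbors: Dict[str, Set[str]] = {}
        self.parent: Dict[str, str] = {}  # union-find over core entities only

    def _find(self, x: str) -> str:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, a: str, b: str) -> None:
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra

    def insert(self, entity: str, properties: Iterable[str]) -> None:
        if entity in self.props:
            raise ValueError(f"entity already inserted: {entity}")
        props = frozenset(properties)
        near = {o for o, p in self.props.items()
                if jaccard_distance(props, p) <= self.eps}
        self.props[entity] = props
        self.neighbors[entity] = near
        for o in near:
            self.neighbors[o].add(entity)
        # Only the new entity and its neighbors can change status.
        affected = near | {entity}
        for e in affected:
            if e not in self.parent and len(self.neighbors[e]) + 1 >= self.min_pts:
                self.parent[e] = e  # e becomes a core entity
        # Merge clusters that are now density-connected through affected cores.
        for e in affected:
            if e in self.parent:
                for o in self.neighbors[e]:
                    if o in self.parent:
                        self._union(e, o)

    def label(self, entity: str) -> Optional[str]:
        """Cluster representative for the entity, or None if it is noise."""
        if entity in self.parent:
            return self._find(entity)
        for o in sorted(self.neighbors[entity]):
            if o in self.parent:
                return self._find(o)  # border entity attached to a core neighbor
        return None


clusterer = IncrementalDensityClusterer(eps=0.6, min_pts=3)
clusterer.insert("p1", {"name", "age", "email"})
clusterer.insert("p2", {"name", "age"})
clusterer.insert("p3", {"name", "age", "email", "phone"})
print(clusterer.label("p1") == clusterer.label("p2") == clusterer.label("p3"))
```

Each insertion touches only the new entity and its neighbors, so existing clusters are extended or merged without re-clustering the whole dataset. Each resulting cluster would then stand for one candidate class of the schema.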

Notes

  1. RDF: https://www.w3.org/RDF/

  2. https://www.dbpedia.org/

  3. https://github.com/BOUHAMOUM/incremental_sc_dbscan.git

  4. https://github.com/BOUHAMOUM/SC-DBSCAN

  5. IBM QSDG: https://sourceforge.net/projects/ibmquestdatagen/

  6. http://downloads.dbpedia.org/3.9/

Author information

Corresponding author

Correspondence to Redouane Bouhamoum.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bouhamoum, R., Kedad, Z., Lopes, S. (2021). Incremental Schema Discovery at Scale for RDF Data. In: Verborgh, R., et al. The Semantic Web. ESWC 2021. Lecture Notes in Computer Science, vol 12731. Springer, Cham. https://doi.org/10.1007/978-3-030-77385-4_12

  • DOI: https://doi.org/10.1007/978-3-030-77385-4_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-77384-7

  • Online ISBN: 978-3-030-77385-4

  • eBook Packages: Computer Science, Computer Science (R0)
