Two-stage Detection of Semantic Redundancies in RDF Data

Authors

  • Yiming Chen, Nanjing University of Aeronautics and Astronautics, China
  • Daiyi Li, Nanjing University of Aeronautics and Astronautics, China
  • Li Yan, Nanjing University of Aeronautics and Astronautics, China
  • Zongmin Ma, Nanjing University of Aeronautics and Astronautics, China

DOI:

https://doi.org/10.13052/jwe1540-9589.2184

Keywords:

RDF redundancy, Duplicate data, RDF similarity, Candidate selection

Abstract

As RDF (Resource Description Framework) data grows richer, integrating diverse data sources may produce duplicate RDF data. Failing to detect these duplicates introduces redundancy into the integrated RDF datasets, which needlessly increases their size and lowers their quality. Traditionally, a similarity calculation is applied to decide whether a candidate pair is a duplicate. For massive RDF data, applying a simple similarity calculation directly may be extremely inefficient. To detect duplicates in massive RDF data, in this paper we propose a detection approach based on RDF data clustering and similarity measurement. We first propose a clustering method based on locality-sensitive hashing (LSH) that efficiently selects candidate pairs in RDF data. A similarity calculation is then performed on the selected candidate pairs, which yields the duplicate candidates. Experiments show that our approach can quickly extract the duplicate candidates in RDF datasets. Our approach achieved the highest F-score and the best time performance in the OAEI (Ontology Alignment Evaluation Initiative) 2019 competition.
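The two stages described in the abstract (LSH-based candidate selection, then a similarity check on the selected pairs) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the hash family or the similarity measure, so the sketch assumes MinHash with banding for the LSH stage and Jaccard similarity over each subject's (predicate, object) pairs for the second stage. All function names, the `SAMPLE` data, and the parameters `bands`, `rows`, and `threshold` are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

# Hypothetical sample: ex:a and ex:b carry identical descriptions, ex:c differs.
SAMPLE = [
    ("ex:a", "name", "Alice"), ("ex:a", "city", "Nanjing"), ("ex:a", "type", "Person"),
    ("ex:b", "name", "Alice"), ("ex:b", "city", "Nanjing"), ("ex:b", "type", "Person"),
    ("ex:c", "title", "Report"), ("ex:c", "year", "2019"),
]


def features(triples, subject):
    """Describe a subject by the set of its (predicate, object) pairs (assumed encoding)."""
    return {(p, o) for s, p, o in triples if s == subject}


def _hash(seed, item):
    """Deterministic 64-bit hash of an item under a given seed."""
    data = f"{seed}|{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(feature_set, num_hashes):
    """MinHash signature: for each seed, the minimum hash over the feature set."""
    return tuple(min(_hash(seed, f) for f in feature_set) for seed in range(num_hashes))


def candidate_pairs(signatures, bands, rows):
    """Stage 1: subjects that agree on at least one whole band become a candidate pair."""
    buckets = defaultdict(set)
    for subject, sig in signatures.items():
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(subject)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed stand-in for the paper's measure)."""
    return len(a & b) / len(a | b) if a or b else 0.0


def detect_duplicates(triples, bands=16, rows=2, threshold=0.6):
    """Stage 2: keep only candidate pairs whose exact similarity reaches the threshold."""
    subjects = sorted({s for s, _, _ in triples})
    feats = {s: features(triples, s) for s in subjects}
    sigs = {s: minhash(feats[s], bands * rows) for s in subjects}
    return [
        (a, b)
        for a, b in sorted(candidate_pairs(sigs, bands, rows))
        if jaccard(feats[a], feats[b]) >= threshold
    ]
```

Calling `detect_duplicates(SAMPLE)` returns `[("ex:a", "ex:b")]`: identical descriptions always produce identical signatures and therefore share every band, while `ex:c` has no feature in common with the others and fails the similarity check even if it happens to land in a shared bucket. The point of the first stage is that the exact similarity is computed only for pairs that share a bucket rather than for all pairs of subjects.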

Author Biographies

Yiming Chen, Nanjing University of Aeronautics and Astronautics, China

Yiming Chen received his M.Sc. degree in computer science and technology from Nanjing University of Aeronautics and Astronautics, China. His research interests include RDF data management and knowledge graphs.

Daiyi Li, Nanjing University of Aeronautics and Astronautics, China

Daiyi Li is currently working toward a Ph.D. degree in the School of Computer Science and Technology at Nanjing University of Aeronautics and Astronautics, China. His research fields are knowledge graphs and natural language processing.

Li Yan, Nanjing University of Aeronautics and Astronautics, China

Li Yan is currently a full professor at Nanjing University of Aeronautics and Astronautics, China. Her research interests mainly include big data processing, knowledge graphs, spatiotemporal data management, and fuzzy data modelling. She has published more than fifty papers on these topics.

Zongmin Ma, Nanjing University of Aeronautics and Astronautics, China

Zongmin Ma is currently a full professor at Nanjing University of Aeronautics and Astronautics, China. His research interests include big data and knowledge engineering, the Semantic Web, temporal/spatial information modeling and processing, deep learning, and knowledge representation and reasoning with a special focus on information uncertainty. He has published more than two hundred papers in international journals and conferences on these topics. He is a Fellow of the IFSA.

Published

2023-03-19

How to Cite

Chen, Y., Li, D., Yan, L., & Ma, Z. (2023). Two-stage Detection of Semantic Redundancies in RDF Data. Journal of Web Engineering, 21(08), 2313–2338. https://doi.org/10.13052/jwe1540-9589.2184
