Two-stage Detection of Semantic Redundancies in RDF Data
DOI:
https://doi.org/10.13052/jwe1540-9589.2184
Keywords:
RDF redundancy, Duplicate data, RDF similarity, Candidate selection
Abstract
As Resource Description Framework (RDF) data grows richer, integrating diverse data sources can produce duplicate RDF data. Duplicates that go undetected introduce redundancies into the integrated RDF datasets, which both inflates the size of the datasets unnecessarily and lowers their quality. Traditionally, a similarity calculation is applied to a pair of candidates to decide whether they are duplicates. For massive RDF data, a simple similarity calculation may be extremely inefficient. To detect duplicates in massive RDF data, in this paper we propose a detection approach based on RDF data clustering and similarity measurements. We first propose a clustering method based on locality sensitive hashing (LSH), which efficiently selects candidate pairs in RDF data. A similarity calculation is then performed on the selected candidate pairs only, which yields the duplicate candidates. Experiments show that our approach can quickly extract the duplicate candidates in RDF datasets. Our approach achieved the highest F-score and the best time performance in the OAEI (Ontology Alignment Evaluation Initiative) 2019 competition.
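The two stages named in the abstract (LSH-based candidate selection, then similarity on the selected pairs) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which LSH family, features, or similarity measure the paper uses, so this sketch assumes MinHash signatures with banding and Jaccard similarity, and it represents each RDF subject as a plain set of "predicate=object" tokens. All function names, parameters, and the sample data are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def _hash(seed: int, token: str) -> int:
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_hashes=32):
    """MinHash signature: the minimum hash value of the token set under each seed."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def candidate_pairs(descriptions, num_hashes=32, bands=16):
    """Stage 1 (assumed scheme): subjects sharing at least one signature band become candidates."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for subject, tokens in descriptions.items():
        if not tokens:
            continue  # an empty description has no signature
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(subject)
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_duplicates(descriptions, threshold=0.6, **lsh_params):
    """Stage 2: compute similarity only for the candidate pairs and keep those above the threshold."""
    return sorted(
        (a, b)
        for a, b in candidate_pairs(descriptions, **lsh_params)
        if jaccard(descriptions[a], descriptions[b]) >= threshold
    )


if __name__ == "__main__":
    # Hypothetical subjects described by "predicate=object" tokens.
    data = {
        "ex:p1": {"name=Ada Lovelace", "born=1815", "field=mathematics", "city=London"},
        "ex:p2": {"name=Ada Lovelace", "born=1815", "field=mathematics", "city=Marylebone"},
        "ex:p3": {"name=Alan Turing", "born=1912", "field=computing", "city=Wilmslow"},
    }
    print(detect_duplicates(data))
```

The point of the design is that the similarity function runs only on pairs that share a bucket, rather than on all pairs. The number of bands and the rows per band trade recall against the number of candidates, and the threshold of 0.6 is an arbitrary illustrative value.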