Abstract
The growth of the Web of Data has led to the development of dataset recommendation methodologies, which automate the discovery of datasets that may contain the same or related instances (i.e., objects), so that they can be used as input for several tasks, including Link Discovery. The recommendation process takes one dataset (or any tripleset) as input and proposes other datasets that are the most likely to contain related instances. Existing recommenders determine the relevance between datasets by comparing their textual and structural similarity or by examining existing links among them. In this paper, we determine relevance by comparing the geospatial relatedness of triplesets containing instances that belong to spatial classes (that is, classes containing instances whose locations are georeferenced by point geometries), based on the hypothesis that pairs of classes whose instances present similar spatial distributions are likely to contain semantically related instances. The proposed methodology builds summaries that capture the spatial distribution of classes. It uses the summaries, first, to rule out classes that are irrelevant to an input class by applying spatial filters and, then, to rank the remaining classes by applying a geospatial relatedness measure, so that the top-ranked classes are more likely to contain related instances. The evaluation of the methodology includes an exploration of the characteristics of spatial classes in the Web of Data and a discussion of the experimental results that validate our hypothesis. We show that spatial filtering effectively and efficiently reduces the search space for relevant classes in the Web of Data by up to 99% and that the proposed geospatial relatedness measures generate ranked lists of recommended classes with 62% mean average precision, approximately 35% higher than simple baselines.
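The summarize, filter and rank pipeline described in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the summary is a bounding box plus the set of occupied cells of a regular grid, the spatial filter is a bounding-box intersection test, and the relatedness score is the Jaccard overlap of occupied cells, which stands in for the paper's geospatial relatedness measures. The class names and coordinates are hypothetical.

```python
from math import floor

def summarize(points, cell_deg=1.0):
    """Summarize a class as (bounding box, set of occupied grid cells).

    points are (lon, lat) pairs; cell_deg is an assumed grid resolution.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    bbox = (min(xs), min(ys), max(xs), max(ys))
    cells = {(floor(x / cell_deg), floor(y / cell_deg)) for x, y in points}
    return bbox, cells

def bboxes_intersect(a, b):
    """Spatial filter: True if two (minx, miny, maxx, maxy) boxes overlap."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cell_jaccard(a, b):
    """Stand-in relatedness score: Jaccard overlap of occupied cells."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(source, candidates, cell_deg=1.0):
    """Filter out non-overlapping classes, then rank the rest by score."""
    src_bbox, src_cells = summarize(source, cell_deg)
    ranked = []
    for name, points in candidates.items():
        bbox, cells = summarize(points, cell_deg)
        if not bboxes_intersect(src_bbox, bbox):
            continue  # ruled out by the spatial filter
        ranked.append((name, cell_jaccard(src_cells, cells)))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Hypothetical input class and candidate classes (lon, lat).
museums = [(23.7, 37.9), (22.9, 40.6), (25.1, 35.3)]
candidates = {
    "ex:Monument": [(23.7, 37.95), (22.95, 40.62), (21.7, 38.2)],
    "ex:Harbour": [(25.05, 35.33), (26.5, 39.1)],
    "ex:Volcano": [(-155.6, 19.4), (-122.2, 46.2)],
}
ranked = recommend(museums, candidates)
print(ranked)
```

In this toy run the spatial filter discards ex:Volcano, whose bounding box does not intersect that of the input class, and the two remaining classes are ranked by how many grid cells they share with it.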
Notes
They refer to term co-occurrence in text corpora.
University Ontology (https://www.cs.umd.edu/projects/plus/SHOE/onts/univ1.0.html).
Dublin Core Metadata Initiative (http://www.dublincore.org/specifications/dublin-core/dcmi-terms/).
The implementation can be extended so that the source class can be any point spatial dataset specified by the user (such as a personal shapefile, a GeoJSON file, a Web Feature Service or a spatial class from a non-identified SPARQL endpoint).
The respective SPARQL queries for the remaining ontologies are available at https://github.com/vkopsachilis/WoDSpatialClassRecommender.
The full list of identified spatial classes is available at https://github.com/vkopsachilis/WoDSpatialClassRecommender.
The respective SPARQL queries for the remaining ontologies are available at https://github.com/vkopsachilis/WoDSpatialClassRecommender.
The full ground truth list is available at https://github.com/vkopsachilis/WoDSpatialClassRecommender.
Acknowledgements
This research is supported by the funding program “YPATIA” of the University of the Aegean.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Kopsachilis, V., Vaitis, M., Mamoulis, N. et al. Recommending Geo-semantically Related Classes for Link Discovery. J Data Semant 9, 151–177 (2020). https://doi.org/10.1007/s13740-020-00117-4
DOI: https://doi.org/10.1007/s13740-020-00117-4