Permanent URL: https://w3id.org/scholarlydata

Resource type: Ontology and dataset.

1 Introduction

A good practice in the Semantic Web community is to encourage the publication of Linked Data about scientific conferences in the field, as a way of “eating our own dog food” [8]. The main example is the Semantic Web Dog Food (SWDF), a corpus that collects Linked Data about papers, people, organisations, and events related to academic conferences. Currently, all main Semantic Web conferences and related events publish their data as Linked Data on SWDF, but for many other conferences, events and publication venues, information is still not available in a structured and linked form. On the other hand, the growth of available content since the early days of SWDF poses data management issues and reveals design problems that were not foreseen when the dataset was in its initial stage. Several challenges must be addressed to maintain a healthy and sustainable SWDF for the future: (i) the availability of appropriate vocabularies to express the current state of the data; (ii) the shared knowledge of such vocabularies; (iii) the availability of tools to ease the tasks of data acquisition, conversion, integration, augmentation, verification and, finally, publication; (iv) the ongoing maintenance of the dataset.

In this work we address these issues and propose a refactoring of the Semantic Web Conference (SWC) ontology. The new ontology, named conference-ontology [12], adopts best ontology design practices (e.g. Ontology Design Patterns, ontology reuse and interlinking) and guarantees interoperability with the SWC ontology and all other pertinent vocabularies. We use cLODg (conference Linked Open Data generator) [6] to regenerate the SWDF dataset according to the conference-ontology and to provide a sustainable solution for the future growth of the dataset.

The main advantage of the proposed approach is the availability of a shared procedure and open source tools for conference data generation, with the primary goal of ensuring the sustainability and usability of our own Semantic Web Dog Food and easing data contributions from beyond our community. We make the new resource available at https://w3id.org/scholarlydata as a data dump and a SPARQL endpoint, and we offer facilities to generate data about new conferences using cLODg and submit them for addition to scholarlydata. Newly submitted data is manually checked before inclusion to avoid corruption of the dataset and spam.

2 State of the Art

The first considerable effort to offer comprehensive semantic descriptions of conference events is represented by the metadata projects at the ESWC 2006 and ISWC 2006 conferences [11], with the Semantic Web Conference Ontology being the vocabulary of choice to represent such data. An increasing number of initiatives pursue the publication of conference data as Linked Data, mainly promoted by publishers such as Springer and Elsevier, amongst many others. The knowledge management of scholarly products is an emerging research area in the Semantic Web field, known as Semantic Publishing [14]. Semantic Publishing aims at providing access to semantically enhanced scholarly products in order to enable a variety of semantically oriented tasks, such as knowledge discovery, knowledge exploration and data integration. The Semantic Publishing challenge [9] is a breakthrough in this direction. Its objective is to assess the quality of systems that extract meaningful metadata from scholarly articles and represent them as RDF. Similarly, the Jailbreaking the PDF initiative [5] is aimed at creating a formal and flexible infrastructure to extract semantic information from PDF documents as domain-specific annotations. Despite these continuous efforts, it has been argued that much information about academic conferences is still missing or spread across several sources in a largely chaotic and non-structured way [1]. Besides the problem of missing content, one of the other major challenges with scholarly data is to ensure data quality, which means dealing with data-entry errors, disparate citation formats, lack of (enforcement of) standards, imperfect citation-gathering software, ambiguous author names and abbreviations of publication venue titles [10]. Currently, the generation of data for the SWDF corpus relies on little or no strategy to deal with duplicates, inconsistencies, misspellings and name variations. In this work we aim to close these gaps by making available a solid data model and a shared, open-source workflow as a long-term solution for the population and maintenance of an enhanced version of the SWDF dataset.

3 The SWDF and Its Current Issues

The SWDF uses the Semantic Web Conference (SWC) ontology as the reference ontology for modelling data about academic conferences. The SWC ontology combines existing, widely accepted vocabularies (i.e. FOAF, SIOC and Dublin Core) and relies on the SWRC (Semantic Web for Research Communities) ontology for modelling entities such as accepted papers, authors, their affiliations, talks and other events, the organising committee and all other roles involved. The core types of the SWC ontology are foaf:Person for describing people, foaf:Organization for organisations (e.g. universities, research institutions, etc.), swc:Artefact for documents (e.g. papers, proceedings, etc.), swc:OrganisedEvent for events and swc:Role for the roles people hold at a conference. Unfortunately, the lack of clear guidelines for data generation and maintenance, together with some modelling choices of the SWC ontology, affects the current quality of SWDF. Data generation follows a collaborative model that delegates to the metadata chairs of each conference the task of independently generating conference Linked Data. Linked Data is generated from a variety of formats, typically provided by a conference management system (e.g. EasyChair). While the collaborative process is beneficial to the growth of the dataset and its adoption in the community, the lack of clear guidelines and of standard tools supporting the generation process affects the quality of the generated data. Examples are: (i) a portion of the included conference/workshop data uses vocabularies or ontologies that are not aligned to the SWC ontology and, in some cases, no longer maintained or existing (e.g. swrc-ext or xmllondon); (ii) the usage of classes and properties not defined in the SWC ontology and introduced without providing an extension of the ontology (e.g. swc:room, swc:editorList, swc:completeGraph, swc:IW3C2Liaison, swc:SemanticWebTechnologiesCo-ordinator, etc.); (iii) the misuse of properties (either defined in the SWC ontology or in other vocabularies/ontologies) with respect to their domain and range; (iv) typos (e.g. the materialisation of triples having the predicate swc:partOf instead of swc:isPartOf). In addition, we argue that the SWC ontology itself has modelling issues, mainly concerning affiliations, roles and lists. Affiliations (of people to organisations) are represented via the object property swrc:affiliation from the SWRC ontology, while the membership relation (organisations to people) is represented via the property foaf:member. Although intuitive, this representation ignores the temporal dimension (i.e. the time when a given affiliation is held by an actor), which is needed to interpret affiliations correctly. For example, with this model it is not possible to correctly answer a simple competency question such as “What was the affiliation of a person when participating in a certain conference?”. Roles such as programme chair, track chair, etc. are currently modelled using an ontology pattern based on the reification of an n-ary relation. The n-ary relation is identified by individuals of the class swc:Role, which are used to associate people with events. The SWC ontology contains a very basic set of role classes (i.e. swc:Chair, swc:Delegate, swc:Presenter and swc:ProgrammeCommitteeMember), represented as sub-classes of swc:Role. This choice makes it possible to instantiate this small set of role classes and cover the roles at specific events.
For example, instead of sub-classing the swc:Chair class with MainChair, WorkshopChair, TutorialChair, etc., the different types of chairs should simply be instances of the generic swc:Chair and be labelled appropriately (e.g. iswc2015:general-chair). The problem with this solution is that the individuals representing roles are defined locally to each conference, e.g. there is a different individual representing the role “general chair” for each conference in the dataset. This results in 1,717 distinct individuals in the current dataset that actually represent a set of only 34 unique roles (cf. Sect. 4). Hence, it is difficult to answer simple queries like “Who was the general chair at each edition of ISWC?” without using regular expressions on the roles’ labels (such labels are heterogeneous and not always provided); a sketch of such a workaround query is given at the end of this section. Finally, lists of authors are represented via the property bibo:authorList, which accepts rdf:List or rdf:Seq as range. Lists of authors in the SWDF are therefore expressed via the properties rdf:_1, rdf:_2, rdf:_3, etc., based on rdfs:ContainerMembershipProperty. This solution makes querying and reasoning on ordered lists of authors very hard [2].
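To make this concrete, the question “Who was the general chair at each edition of ISWC?” can currently only be approximated over SWDF by matching role labels, along the following lines (a sketch; the SWC properties linking a role to its event and holder are given here as indicative names and should be checked against the vocabulary, and the regular expression depends on how labels were entered):

PREFIX swc:  <http://data.semanticweb.org/ns/swc/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Approximate "general chair" roles by a regular expression over labels,
# because each conference mints its own role individuals.
# swc:isRoleAt and swc:heldBy are indicative property names.
SELECT ?event ?person
WHERE {
  ?role a swc:Chair ;
        rdfs:label ?label ;
        swc:isRoleAt ?event ;
        swc:heldBy ?person .
  FILTER regex(str(?label), "general.?chair", "i")
}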

4 A Sustainable SWDF

Our solution for enhancing the SWDF and solving the issues described in Sect. 3 is based on (i) the refactoring of the SWC ontology, (ii) the refactoring of the current SWDF dataset and (iii) a fully implemented open source workflow to generate, verify and add data to SWDF. The proposed refactoring of the SWC ontology, the conference-ontology, is a new self-contained ontology that exploits Ontology Design Patterns (ODPs) [3]. We model affiliations by reusing the time indexed situation ODP, and the roles held by people at a conference with the time indexed person role ODP. Both patterns provide commonly accepted solutions for modelling complex situations as n-ary relations, amongst the many other available ODPs [4].

The classes conf:AffiliationDuringEvent and conf:AffiliationAtTimeOfSubmission model situations where a person (an individual of the class conf:Person) is affiliated to an organisation (an individual of the class conf:Organisation) at a specific time, which can be either an interval (coinciding with the conference dates) or the instant when the paper was submitted. This allows the representation of cases where a person changes affiliation in the time interval between paper submission and the conference event. Similarly, the class conf:RoleDuringEvent associates a person with a role (an individual of the class conf:Role) at a conference. Additionally, conf:AffiliationDuringEvent and conf:AffiliationAtTimeOfSubmission can be associated with conf:AffiliationRole, a subclass of conf:Role, to represent the role held by a person within an organisation. We reused the Sequence ODP to represent ordered lists of authors. We represent a list with conf:List, whose items are individuals of the class conf:ListItem. The association between conf:ListItem and conf:List is expressed via the property conf:isItemOf. A conf:List has pointers to its first (conf:hasFirstItem) and last item (conf:hasLastItem), and each conf:ListItem is linked to its predecessor (conf:hasPreviousItem) and successor (conf:hasNextItem). This new modelling overcomes the limitations of the current SWDF and enables a new range of services for scholarly monitoring, such as statistics on career development, changes of affiliation over time, and roles held at conferences, allowing involvement and impact to be monitored at different granularity levels, ranging from a broad scientific area to specific communities or conferences. An example query to obtain all roles held over time by a specific researcher is the following:

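A minimal sketch of such a query follows. The person IRI is a placeholder, and conf:isHeldBy and conf:during are illustrative names for the properties linking a conf:RoleDuringEvent situation to the person and the event (conf:withRole is introduced in Sect. 5); the exact property IRIs are those defined in the published conference-ontology.

PREFIX conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# All roles held over time by a given researcher, via the n-ary
# conf:RoleDuringEvent situations (property names are illustrative).
SELECT ?eventLabel ?roleLabel
WHERE {
  ?situation a conf:RoleDuringEvent ;
             conf:isHeldBy <https://w3id.org/scholarlydata/person/jane-doe> ;
             conf:withRole ?role ;
             conf:during ?event .
  ?role rdfs:label ?roleLabel .
  ?event rdfs:label ?eventLabel .
}
ORDER BY ?eventLabel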

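A second sketch, under the same assumptions, answers the competency question from Sect. 3 (“What was the affiliation of a person when participating in a certain conference?”) via the time-indexed affiliation situations; conf:isAffiliationOf and conf:withOrganisation are again illustrative property names and the IRIs are placeholders:

PREFIX conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Affiliation of a person during a given conference, via the
# conf:AffiliationDuringEvent situations (property names are illustrative).
SELECT ?organisationLabel
WHERE {
  ?affiliation a conf:AffiliationDuringEvent ;
               conf:isAffiliationOf <https://w3id.org/scholarlydata/person/jane-doe> ;
               conf:withOrganisation ?organisation ;
               conf:during <https://w3id.org/scholarlydata/conference/example-2016> .
  ?organisation rdfs:label ?organisationLabel .
}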
To guarantee interoperability with SWC and all other vocabularies already used in the SWDF dataset, we produced extensive alignments, which allow the materialisation of triples via reasoning (a small query sketch illustrating the resulting backward interoperability follows the list below). We include alignments to:

  • the SWC ontology itself to guarantee backward interoperability with SWDF;

  • the top level classes of Dolce D0 for interoperability with a series of linked datasets aligned to it (e.g. DBpedia);

  • all relevant SPAR ontologies, such as: FaBiO [13] for compliance with FRBR; DoCO for modelling the part relations (conf:hasPart and its inverse conf:isPartOf, holding between abstracts conf:Abstract, articles conf:InProceedings and the books of proceedings conf:Proceedings); and PRO and SCORO for modelling roles as defined in SPAR;

  • the Organization Ontology for modelling organisations, roles and affiliations;

  • FOAF for modelling people;

  • SKOS for modelling broader/narrower relations;

  • ICATZD for events;

  • the Collection Ontology [2] for modelling the sequences represented by the lists of authors.
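As a small illustration of the backward interoperability provided by these alignments, a query written against the legacy SWC vocabulary should still return results over the materialised dumps (a sketch; the swc: namespace is the one used by SWDF):

PREFIX swc: <http://data.semanticweb.org/ns/swc/ontology#>

# Count events typed with the legacy SWC class: the materialised alignments
# make these triples available even though the data is modelled with the
# conference-ontology.
SELECT (COUNT(DISTINCT ?event) AS ?events)
WHERE {
  ?event a swc:OrganisedEvent .
}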

5 Scholarlydata.org

Using cLODg and our new conference-ontology, we performed a batch cleaning of the whole SWDF dataset, which consists of 48 conferences and 235 workshops. The new dataset contains 93,519 individuals. The distribution of classes is reported in Table 1.

Table 1. Number of unique individuals for each class of conference-ontology generated with cLODg.
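The same distribution can be recomputed from the public SPARQL endpoint with a query along these lines (a sketch, assuming the conference-ontology namespace https://w3id.org/scholarlydata/ontology/conference-ontology.owl#; the exact counts depend on the content of the endpoint at query time):

PREFIX conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>

# Number of distinct individuals per conference-ontology class.
SELECT ?class (COUNT(DISTINCT ?individual) AS ?individuals)
WHERE {
  ?individual a ?class .
  FILTER STRSTARTS(STR(?class), STR(conf:))
}
GROUP BY ?class
ORDER BY DESC(?individuals)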

For the role definitions, we corrected the current 1,717 roles in SWDF, defined at conference level, by generating 34 roles at global level and reusing them at conference level. For example, the role role:general-chair is a single individual that can be reused across all conferences via the property conf:withRole. These 34 roles are organised in a hierarchy by using SKOS to express broader and narrower relations between them; e.g. role:general-chair is linked to the more generic role role:chair via skos:broader. The current list of roles can be obtained using the query:

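A sketch of such a query is the following (assuming the conference-ontology namespace for the conf: prefix and that the global role individuals are typed as conf:Role):

PREFIX conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

# All global role individuals, together with their broader role when declared.
SELECT DISTINCT ?role ?broaderRole
WHERE {
  ?role a conf:Role .
  OPTIONAL { ?role skos:broader ?broaderRole }
}
ORDER BY ?role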

Using cLODg to produce metadata about a new conference guarantees that pertinent roles are reused if they already exist in the dataset.

We produced instance-level alignments of (i) individuals of conf:Person to ORCID (Open Researcher and Contributor ID) and (ii) individuals of conf:InProceedings to DOIs (Digital Object Identifiers), whenever possible. ORCID provides persistent digital identifiers for scientific researchers and academic authors. A DOI is a serial code used to uniquely identify digital objects, in particular electronic documents. The alignments to ORCID were produced by using the public API provided by ORCID. The references to DOIs were produced by using the API provided by Crossref, performing a search on each article title.
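A minimal sketch of how such links can be retrieved from the dataset follows; it assumes the alignments are materialised as owl:sameAs links towards orcid.org IRIs, which should be checked against the published dumps:

PREFIX conf: <https://w3id.org/scholarlydata/ontology/conference-ontology.owl#>
PREFIX owl:  <http://www.w3.org/2002/07/owl#>

# People aligned to ORCID, assuming owl:sameAs is the linking property.
SELECT ?person ?orcid
WHERE {
  ?person a conf:Person ;
          owl:sameAs ?orcid .
  FILTER CONTAINS(STR(?orcid), "orcid.org")
}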

All data is uploaded to https://w3id.org/scholarlydata, where it can be accessed in different formats (i.e. HTML+RDFa, RDF/XML, Turtle, N-Triples and JSON-LD) via URI dereferencing, queried via SPARQL or downloaded as individual RDF dumps for each conference and workshop. Each dump is provided in two versions: a simple one, where data is represented with the conference-ontology only, and one containing all the alignments (and therefore also compliant with SWDF), which have been materialised using a reasoner. These dumps are released under the Creative Commons Attribution 3.0 licence and are described using the VoID vocabulary. Additionally, we explicitly state that the primary source of our data is the SWDF, by using the property prov:hadPrimarySource from PROV-O. The dumps are also publicly available on Datahub. It is worth remarking that cLODg is released as open source software under the MIT License and can be used by metadata curators to add data about a new conference. cLODg provides a nearly one-click process to produce conference Linked Data and includes all the components for data transformation, deduplication, URI reuse, alignment of individuals to external resources, etc., and ensures that data is produced according to the conference-ontology and is compliant with the SWDF. An early description of cLODg can be found in [7] and in the GitHub repository of its newer version. By providing a user-friendly data generation tool we aim at encouraging the growth of the dataset beyond the Semantic Web community.
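As a small illustration, and assuming the VoID descriptions are loaded alongside the data, the declared provenance link back to the SWDF can be retrieved with:

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX prov: <http://www.w3.org/ns/prov#>

# Dataset descriptions and their declared primary source (the SWDF).
SELECT ?dataset ?primarySource
WHERE {
  ?dataset a void:Dataset .
  OPTIONAL { ?dataset prov:hadPrimarySource ?primarySource }
}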

6 Conclusions and Future Work

This paper analyses the Semantic Web Dog Food dataset and discusses its quality and sustainability issues. As it is the main scholarly dataset for the Semantic Web community, we believe it is important that the dataset is kept in good health. We therefore perform a refactoring of the dataset that addresses its current issues, and we make the cLODg workflow publicly available as a potential solution for its future maintenance. The new resource https://w3id.org/scholarlydata is publicly available both as a dump download and a SPARQL endpoint, with facilities to upload new data. With the availability of cLODg as a standard Linked Data publication workflow, we believe that scholarlydata has the potential to grow well beyond the Semantic Web conferences. As future work we plan a systematic evaluation of the resource and the introduction of more sophisticated components for instance matching in the cLODg workflow. Moreover, we will work on fostering collaboration with conference management system providers, to offer cLODg as a built-in facility in their systems.