Keywords

1 Introduction

People with different backgrounds ranging from school students to high profile professionals around the world are engaged in several initiatives such as political movements, environmental protection and fund-raising with the goal to achieve individual, community and society well-being [21]. One example is the Fridays For Future movement which was initiated by the young student Greta Thunberg to demonstrate against Swedens climate policy [40]. One of her main demands has been that the actions of the government of Sweden should become sufficient in order to comply with the essence of the Paris Agreement. Her initiative has quickly gained a lot of attention and initiated demonstrations all over Europe, and later in different countries around the world. Greta Thunberg is an illustrative example of a young school student who recognized some of the scientific findings with regards to the climate change and understood its importance for the social good: “We want politicians to listen to the scientists”Footnote 1, “Why should I be studying for a future that soon may be no more, when no one is doing anything to save that future?”. While scientists increasingly have been called to share research findings about climate change [47], many other topics that are relevant to social good do not have a comparable media presence. For this reason, the information needs of activists that are non-experts may remain unsatisfied with regards to these topics. However, information technology and the digitization of scientific artifacts haves increased the amount of available scientific resources and offer a great potential to fulfill the information needs of activists that are concerned about social good. An overwhelming amount of scientific artifacts such as publications and their metadata have been made available independent of any geographical or temporal constraints on the web [6, 21, 52]. However, for non-experts the effective access to these artifacts is limited. While there are already existing services such as Google Scholar (GS) to explore and retrieve scientific publications, they alone are not sufficient to effectively fulfill the information needs of non-expert activists. One of the main reasons is the discrepancy between their search behaviors and the functionality of these services: GS expects specific search queries in order to provide relevant content on the first result page whereas non-experts (for instance undergraduate students) tend to use simple keyword or phrase queries, do not refine their search queries (e.g. by analyzing metadata), and usually ignore retrieved results beyond the first result page [10, 20]. In addition, search engines and services such as Google Scholar are not developed with the specific goal of providing access to content related to social good. Therefore, there is a need for a domain-specific system that can be effectively used by non-experts to access scientific content related to social good.

An approach to structure related knowledge that can be used to perform concept-based retrieval instead of string matching are knowledge graphs (KGs) [6, 25] which represent information as a set of triples of the form \((h, r, t) \in KG\) where \(h\) and \(t\) represent entities and \(r\) their relation. Recently, knowledge graph embeddings (KGEs) that encode the entities and relations of a KG into vector spaces while maintaining structural characteristics of the KG became popular. These embeddings can be used for several downstream machine learning tasks including recommendation systems.

In this paper, we present a recommender system that suggests for an entity of interest (i.e., publication, author, domain and venue) a set of related entities which helps users to effectively find relevant content related to the topic of social good from the large amount of available information. Our contributions are: (i.) a KG that contains information about publications, domains, authors and venues. We focused on publications that are related to real-world problems such as climate change, marine litter, right movement and cyber security, (ii.) a baseline recommender system that exploits KGEs to provide recommendations. We trained four different KGE models i.e., TransE, TransR, TransD and ComplEx, and selected TransE to provide recommendations that have been manually evaluated. While the general approach can be transferred to different domains, the proposed recommender system is domain-specific.

In the following, we give an overview of the related work (Sect. 2), explain the KGE models that are relevant in the context of this work (Sect. 3), describe the process of creating our KG (Sect. 4), present our recommendation system (Sect. 5), explain our performed experiments (Sect. 6), discuss the limitations of our system and point out future work (Sect. 7), and finally, we give a short summary of this work (Sect. 8).

2 Related Work

Through the development of specialized search engines, digital libraries, databases and social networks for the scholarly domain, the availability of scientific artifacts and their metadata has been facilitated. Google ScholarFootnote 2 is an online search engine that has been realized in 2004 and enables users to search for both, the printed and digital version of articles. AminerFootnote 3 provides a faceted browser on top of its mining service for researchers. ResearchGateFootnote 4 is a social network for researchers in which they can present their scientific profiles, their publications and interact with each other. MendeleyFootnote 5 is a desktop service with a web program created by Elsevier for managing and sharing research papers. There are several efforts to provide enhanced services by representing metadata of scholarly artifacts in a structured form. A crowd-sourcing platform for metadata management of scholarly artifacts is introduced in [43], and the representation of metadata in a semantic format is proposed in [3]. In Chi et al. [6] a knowledge graph and a metadata management system for smart education is presented. However, most of these services either lack a systematic recommendation service or provide specialized suggestions based on user profiles. To the best of our knowledge (apart from dedicated journals and university libraries [48]) the domain of social science lacks a comprehensive and specialized knowledge graph with analytical and recommendation services on top. In a recent work, an embedding based recommendation system for books has been proposed. However, the recommendations are limited to a single entity type, i.e. books. In this study and the follow up work, we aim to provide a comprehensive and domain-specific system in order to assist users in finding relevant artifacts of different types. Through the use of machine learning approaches, the system proposes recommendations that are beyond simple keyword-matching based recommendations.

3 Knowledge Graph Embeddings

Knowledge graph embedding models can be roughly divided into translational distance models and semantic matching models. Translational distance models compute the plausibility of a triple based on distance function (e.g. based on the Euclidean distance) and semantic matching models determine the plausibility of a triple by comparing the similarity of the latent features of the entities and relations [45]. In the following, we describe KGE models that are relevant in the context of this work, however many others have been proposed.

TransE. An established translational distance model is TransE [5] that models a relation r as the translation from head entity h to the tail entity t:

$$\begin{aligned} h + r \approx t \end{aligned}$$
(1)

To measure the plausibility of a triple, following scoring function is defined:

$$\begin{aligned} f_r(h,t) = - \Vert h + r - t\Vert \end{aligned}$$
(2)

Besides its simplicity, TransE is computational efficient, and can therefore be applied to large scale KGs. However, TransE is limited in modeling 1-N, N-1 and N-M relations. For this reason, several extensions have been proposed [45].

TransH. TransH [45] is an extension of TransE that addresses the limitations of TransE in modeling N-M relations. Each relation is represented by an additional hyperplane, and the translation from the head to the tail entity is performed in the relation specific hyperplane. First, the head and tail entities are projected into the relation specific hyperplane:

$$\begin{aligned} h_{\perp } = h -w_{r}^\top hw_r \end{aligned}$$
(3)
$$\begin{aligned} t_{\perp } = t -w_{r}^\top tw_r \end{aligned}$$
(4)

where \(w_r\) is the normal vector of the hyperplane. After projecting the head and tail entity, the plausibility of the triple (h,r,t) is computed:

$$\begin{aligned} f_r(h,t) = -\Vert h_{\perp } + d_r - t_{\perp }\Vert _{2}^2 \end{aligned}$$
(5)

where \(d_r\) is the relation specific translation vector lying in the relation specific hyperplane.

TransR. TransR [45] is an extension of TransH that encodes entities and relations in contrast to TransE and TransH, in different vector spaces. Each relation is represented by a matrix \(M_r\) that is used to project the entities into the relational specific space:

$$\begin{aligned} h_{r} = h M_{r} \end{aligned}$$
(6)
$$\begin{aligned} t_{r} = t M_{r} \end{aligned}$$
(7)

Consequently, the scoring function is defined as:

$$\begin{aligned} f_r(h,t) = -\Vert h_{r} + r - t_{r}\Vert _{2}^2 \end{aligned}$$
(8)

TransD. TransD [15] is an extension of TransR that uses fewer parameters than TransR. It also eliminates the involvement of matrix-vector multiplications. Entities and relations are represented by two vectors, of which \(h,r,t\) encode the semantics of the entities/relations, and \(h_{p}, r{_p}, t{_p}\) are used to construct projection matrices which project the entities in relation specific spaces:

$$\begin{aligned} M_{rh} = r_{p}h_{p}^{T} + I^{m \times n} \end{aligned}$$
(9)
$$\begin{aligned} M_{rt} = r_{p}t_{p}^{T} + I^{m \times n}, \end{aligned}$$
(10)

where \(I\) is a matrix where the values of the diagonal elements are 1 and 0 elsewhere (in case of \(m=n\), I is the identity matrix). These matrices are used to compute the projections of the head and tail entity:

$$\begin{aligned} h_{\perp } = M_{rh}h \end{aligned}$$
(11)
$$\begin{aligned} t_{\perp } = M_{rt}t \end{aligned}$$
(12)

Based on the projected entities, the score of the triple (h, r, t) is computed:

$$\begin{aligned} f_r(h,t) = -\Vert h_{\perp } + r - t_{\perp }\Vert _{2}^2 \end{aligned}$$
(13)

RESCAL. RESCAL [24] is a semantic matching model that represents each entity as a vector and each relation as a matrix, \( M_r\). It uses the following scoring function:

$$\begin{aligned} f_r(h,t) = h^{T} M_{r} t \end{aligned}$$
(14)

The relation matrix, \( M_r\), encodes pairwise interactions between the features of the head and tail entities.

DistMult. DistMult [50] simplifies RESCAL by allowing only diagonal matrices:

$$\begin{aligned} f_r(h,t) = h^{T} diag(r) t \end{aligned}$$
(15)

where \( r \in R^d\) and \(M_r = diag(r)\). The restriction to diagonal matrices makes DistMult more computationally efficient than RESCAL, but also less expressive compared to RESCAL.

ComplEx. ComplEx [42] is an extension of DistMult into the complex space. Considering the scoring function of DistMult (Eq. 15), it can be observed that it has a limitation in representing anti-symmetric relations since \(h^{T} diag(r) t\) is equivalent to \(t^{T} diag(r) h\). Equation 15 can be written in terms of the Hadamard product of \(h,r, t\): \(<h,r,t> \ = \ \sum _{i=1}^{d} h_i * r_i * t_i\), where \(h,r, t \in R^d\). The scoring function of ComplEx uses the Hadamard product in the complex space, i.e. \(h,r, t \in C^d\):

$$\begin{aligned} f_r(h,t) = \mathfrak {R}(\sum _{i=1}^{d} h_i * r_i * \overline{t_i}) \end{aligned}$$
(16)

where \(\mathfrak {R}(x)\) represents the real part of a complex number and \(\overline{x}\) its conjugate. It is straightforward to show that \(f_r(h,t) \not = f_r(t,h)\), i.e. ComplEx is capable of modeling anti-symmetric relations.

4 Knowledge Graph Creation

As a first step, we created a scientific KG that gathers information relevant for social good which we used as a basis for providing recommendations. We defined the following requirements for the KG:

Fig. 1.
figure 1

KG schema for social good. The graph shows a snapshot of schema for our KG with core entities and relations between them

  1. R1

    The KG should contain publications with a focus on topics related to social good (e.g. climate change, social initiatives, political movements etc.)

  2. R2

    The KG should contain sufficient metadata to provide qualitative recommendations.

  3. R3

    The KG should be sufficiently large to allow conclusive insights about the applicability of modern machine learning methods.

To create the KG, the following steps were performed: (i.) Domain Conceptualization, (ii.) Topic Conceptualization, (iii.) Data collection and Data Curation.

Domain Conceptualization. Reusing already existing ontologies, we modeled possible characteristics of KGs related to social good (see Fig. 1). The model is defined generic enough to support researchers gaining an overview of the domain, to define new KGs. The model can be exploited by machines as an additional source of knowledge. Due to the lack of FAIR data [49] in scientific artifacts of social good topics, currently our KG does not contain all types and relations described in the schema. Overall, seven core classes have been identified, namely Papers, Venues, Authors, Organizations, Funders, Domains and Projects. Furthermore, eight relationship types have been defined between these classes (see Fig. 1) (Table 1).

Table 1. KG Statistics

Topic Conceptualization. A list of topics have been collected from focal resources active in social good such as development program of United NationFootnote 6 and sustainable development goals for 2030 [16] in addition to a systematic exploration on the Web. The topics have been short listed into four distinct categories as climate change, political movements, marine/sea litter and cyber security. This list have been used in the follow up steps of this work.

figure a

Data Collection. The data was collected using web crawlers of the RAxFootnote 7 platform which has reached to index metadata of 160+ million research paper. Based on the keywords, an exemplary dataset of 4004 matched papers has been extracted. The data was initially stored in JSON format (Listing 1.1), which we converted into a set of triples (Listing 1.2) representing our KG. KGs not only enable to represent data in form of triples, but also the metadata. For instance, for an entity representing a paper, we created triples of the form (Paper1, belongsToDomain, Environmental Studies) or (Author1, authorOf, Paper1).

figure b

5 System Description

The input to our workflow is a KG based on which a set of recommendations are computed in two major steps (Fig. 2): (i.) learning the KGEs and (ii.) generating the recommendations based on the KGEs.

Learning the KGEs. In order to learn KGEs on the constructed scholarly KG for social good, we utilize the software package PyKEEN [1] which integrates several KGE models. The learned embeddings encode structured knowledge represented in the KG. In the context of this work, we focused on the models TransE, TransD, TransR and ComplEx. The learned embeddings have been used as a basis for computing the recommendations.

Fig. 2.
figure 2

A pipeline of recommendation services. (i) embedded KG into latent feature space, (ii) filter publications based on KGEs, (iii) filter publications based on embeddings of their abstracts.

Generating Recommendations. For each seed entity that can be any entity in the KG (a publication, an author or a venue), the n nearest neighbors have been computed using the Euclidean norm (however, any similarity measure can be applied) and provided as recommendations. The recommendations for a seed publication can be for example the list of researchers who co-authored other publications with the authors in the seed publication, related publications or venues. The system is able to provide recommendations that represent n-hop dependencies in the KG. In turn, the provided recommendations can be used as seed entities to access information that represents long-term dependencies in the KG. The described steps don’t require any complex traversing of the graph, instead, simple arithmetic operations are applied on the learned embeddings (Fig. 3).

Fig. 3.
figure 3

Recommendations per seed entities. For every seed entity type, a number of different recommendations are given.

6 Experiments

We evaluated four different KGE models (i.e., TransE, TransD, TransR and ComplEx) on the created KG. Afterward, we took one of the best performing models to provide the top n recommendations for a set of seed papers that have been manually evaluated. However, our approach be can applied on any type of seed entities.

6.1 Experimental Setup

We randomly split the initial KG into a training and test set where we took for each relation 10% of the triples which contain this relation as test triples. For each model, we performed a hyper-parameter optimization based on random search [11] and used mean rank and hits@k as evaluation metrics.

6.2 Evaluation of the KGE Models

All the models have been trained based on the open world assumption, i.e., triples that are not part of the KG are not considered as non-existing, but as unknowns [25]. Therefore, we created artificial negative samples based on the negative sampling approach described by Bordes et al. [5]. For TransE, TransR and TransD the margin ranking loss that maximizes the distance between a positive and a corresponding negative triple [25] was applied, and for ComplEx and ComplEx* the softplus loss [42] was used. Furthermore, for all models except for ComplEx* one negative per each positive example, and for ComplEx* 10 negatives per each positive example were created in every forward step (Table 2).

Table 2. HPO results.

It can be observed that TransE and ComplEx* performed very well (Fig. 1). The high performance of TransE can be explained due to the fact that for most of the unique (subject, relation) and (relation, object)-pairs, there exists exactly one corresponding entity (subject/object) (77,72%/70.05% of the unique pairs). TransE even outperformed ComplEx when using only one negative for each positive example. However, for ComplEx only a few iterations of hyper-parameter optimization have been performed, and therefore, it is worthy to extend the hyper-parameter search. Similarly, the results of TransR and TransD might improve when applying a more extensive hyper-parameter search.

6.3 Recommendation of Related Information

Based on the results of the experiments, the TransE model has been selected to be used to compute and evaluate recommendations for a set of seed publications (Table 3 shows the titles, domains and venues of our seed papers). We chose TransE instead of ComplEx*, because it performed similarly and ComplEx(*) provides for each entity two vector representations those efficient combination should be investigated in more depth in a future work.

Table 3. Selected seed publications.

Table 4 includes the validated recommendations for two of our seed papers from which one belongs to the domain of Environmental Studies and the second to the domain of Social Science. The recommendations are sorted according to their scores in descending order, i.e. the first recommendation received the highest score. For each recommended artifact, we performed a manual evaluation by looking them up in Google Scholar (the most used search engine for scholarly artifacts), and analysing their metadata. Among the recommendations there were obvious recommendations such as the authors of the seed papers (which we removed from the list of recommendations), irrelevant recommendations such as recommendation (1) for the first seed paper, related publications (e.g. recommendation (2) and (3) for the first seed paper), co-authors of the authors of a seed paper (such as “Manfred Hauswirth”) for the fist seed paper). Similar patterns can be detected in the recommendations for the second seed paper. The results highlight that relevant artifacts are recommended by the system. The recommendations indicate that the KGEs preserved the structure of the KG, for instance: (i) “Manfred Hauswirth” is a co-author of the authors of first seed paper, (ii) recommendation (1) that represents a publication, cites two of the authors of the seed paper (“Cory Andrew Henson”, and “Vivek Kumar Singh”). Furthermore, it seems that the model has been able to distinguish entity types since the top recommendations usually represented publications for seed publications. While our evaluation approach indicates that the system is capable of providing relevant recommendations, involving external participants in the evaluation procedure will provide important insights regarding the effectiveness of our proposed system. In particular, we aim to perform a user study with expert and non-expert participants in order to analyse whether their information needs to topics related to social good can be fulfilled more effectively by using the proposed system. Because our work represents a preliminary work and such an evaluation requires an extensive preparation, we plan to target the described evaluation in a future work.

Table 4. Recommendation for selected seed publications.

7 Limitations and Future Work

The approach presented in this paper represents a preliminary work that will be extended in future. Although the created KG contains already valuable information that we exploited to provide recommendations, it can benefit from several extensions. Currently, it contains only four entity and eight relationship types. We aim to augment this KG with additional information. In particular, we want to add entities that represent NGOs and other organizations, public speakers and events that are related to the topic of social good. Moreover, we want to provide major supporters/sponsors behind these organizations and events in order to provide more insights. Furthermore, we want to include relationship types that represent connections between (public/private) organizations to events and venues. The extended KG would contain more complex information that could be used to find non-obvious dependencies and to provide more diverse recommendations. For this work, we made use of KGE models that only consider triples of the form (h, r, t) where both h and t represent entities of the KG. However, there is a trend to develop multimodal KGE models that incorporate different types of information such as textual, numerical and visual information. In our future work, we plan to develop a multimodal KGE model in order to exploit textual information (e.g. abstracts of the papers) and numerical information (e.g. publication date, number of citations) which are available for our KG and might help to provide better recommendations.

Here, we provided recommendations by computing the nearest neighbors of a seed entity in the embedding space. Although this approach is easy to realize and provides interesting recommendations, it should serve as a baseline system for more sophisticated systems. As a next step, we aim to explore reinforcement learning based approaches in which feedback of the recommendations are taken into account during the training.

8 Conclusion

In this paper, we presented a socio-scholarly knowledge graph which contains information about scientific artifacts that are related to the topic of social good. A specific knowledge graph embedding-based recommendation system has been developed for this KG. The system provides recommendations for any given seed entity (publication, author, venue, domain) by returning related entities. Our results show a great potential to leverage the system in broader scale of scholarly recommendations for active members of social good movements.