
1 Introduction

Knowledge graphs such as Freebase [1], WordNet [14] and the Google Knowledge Graph play highly practical roles in numerous AI applications, such as question answering [6] and information extraction [8]. A typical knowledge graph (KG) is a multi-relational directed graph in which nodes represent entities and edges represent different types of relations. A basic triplet fact (h, r, t) in a KG states that the relationship r links the head entity h and the tail entity t, e.g., (Barack_Obama, Place_of_Birth, Hawaii). Although they contain huge amounts of structured data, knowledge graphs are still far from complete. Knowledge graph completion aims to predict new relational facts under the supervision of the existing knowledge graph.

In the past decade, many traditional approaches based on logic and symbolic reasoning [15, 16] have been applied to knowledge graph completion, but they are intractable and do not scale well to large knowledge graphs. Recently, an emerging approach called knowledge graph embedding, which embeds all objects (entities and relations) of a KG into a low-dimensional space, has attracted considerable attention. Following this approach, many models described in Section “Related Work” have been presented. Among these models, Trans(E, H, R and D) [4, 10, 11, 18] are fundamental and efficient while achieving state-of-the-art predictive performance. TransE [4] simply and directly builds entity and relation embeddings by regarding a relation as a translation from the head entity to the tail entity, but it has flaws in dealing with complex relations, such as reflexive, one-to-many, many-to-one, and many-to-many relations. To address these issues, TransH [18] considers the mapping properties of complex relations and projects entity embeddings onto relation-specific hyperplanes. However, TransH uses only one normal vector to model each relation-specific hyperplane, so entities and relations still lie in the same space and the representation of mapping properties remains limited. TransR [11] maps entity embeddings into the r-relation space with a transfer matrix, and TransD [10] constructs the transfer matrix from the product of two projection vectors of an entity-relation pair. Such transfer matrices build entity and relation embeddings in separate spaces and represent mapping properties more generally; however, they cost considerably more computation and memory for the mappings.

Fig. 1. Simple visualization of TransE, TransH and TransGH.

In this paper, we propose an expressive model named translation on generalized hyperplanes (TransGH) to improve TransH. Instead of a single normal vector, TransGH uses a set of basis vectors to determine a generalized hyperplane. Figure 1 illustrates the differences among TransE, TransH and TransGH.

  • TransE builds the translation from the head embedding to the tail embedding as \(\mathbf{h} + \mathbf{r} \approx \mathbf{t} \) when the triplet (h, r, t) holds.

  • TransH projects entity embeddings onto relation-specific hyperplanes characterized by one normal vector \(\mathbf{w}_r\), and builds the translation between the projected entities on the hyperplane as \(\mathbf{h}_\perp + \mathbf{r} \approx \mathbf{t}_\perp \), where \(\mathbf{h}_\perp = \mathbf{h} - \mathbf{w}_r^T \mathbf{h} \mathbf{w}_r\) and \(\mathbf{t}_\perp = \mathbf{t} - \mathbf{w}_r^T \mathbf{t} \mathbf{w}_r\).

  • Different from TransH, TransGH uses a set of basis vectors \(\{ \mathbf{w}_{r}^1, \mathbf{w}_{r}^2, \ldots, \mathbf{w}_{r}^v \}\) (\(v \ll |\mathbf{h}|\)) to determine a generalized relation-specific hyperplane, and the projections of the entity embeddings onto the hyperplane are \(\mathbf{h}_\perp = \mathbf{h} - \sum _{i}{ \mathbf{w}_{r}^i }^T \mathbf{h} \mathbf{w}_{r}^i\) and \(\mathbf{t}_\perp = \mathbf{t} - \sum _{i}{ \mathbf{w}_{r}^i }^T \mathbf{t} \mathbf{w}_{r}^i\) (\(i \in [1, v]\)).

The basic idea of TransGH, illustrated in Fig. 1(c), is that for a given triplet (h, r, t), the entity embeddings \(\mathbf{h}\) and \(\mathbf{t}\) are first projected onto the generalized hyperplane as \(\mathbf{h}_\perp \) and \(\mathbf{t}_\perp \) with a set of basis vectors, and the projection \(\mathbf{h}_\perp \) translated by the relation embedding \(\mathbf{r}\) is expected to be close to the projection \(\mathbf{t}_\perp \).

Our contributions in this paper are: (1) We propose a novel model, TransGH, which models each relation as a vector on a generalized hyperplane determined by a set of basis vectors. (2) TransGH has a similar number of parameters to TransH, as it only extends the single normal vector to a set of basis vectors, indicating that TransGH is applicable to large-scale KGs. (3) On the two tasks of link prediction and triplet classification, TransGH achieves significant improvements over previous Trans(E, H, R and D) models.

2 Related Work

2.1 Translation-Based Models

Translation-based models usually embed entities and relations into a low-dimensional vector space and enforce the compatibility of the embeddings under a score function f(h, r, t). Different models define the score function differently. Below we briefly summarize some baseline translation-based models and give the corresponding score functions.

TransE [4] embeds entities and relations into the same space \(R^m\) and interprets each relation as a translation vector from the head entity embedding to the tail entity embedding. Hence the score function is defined as \(f(h,r,t) = \parallel \mathbf{{h}} + \mathbf{{r}} - \mathbf{{t}} \parallel ^2_2\) for a triplet (h, r, t). TransE is effective for one-to-one relations but has flaws in dealing with one-to-many, many-to-one and many-to-many relations.

To overcome the issues of TransE, TransH [18] projects entity embeddings onto relation-specific hyperplanes so that an entity has distinct representations when involved in different relations. It models each relation r as a vector \(\mathbf{{r}}\) on the hyperplane with normal vector \(\mathbf{{w}}_r\); the score function is therefore defined as \(f(h,r,t) = \parallel \mathbf{{h}}_\perp + \mathbf{{r}} - \mathbf{{t}}_\perp \parallel ^2_2\). With \(\parallel \mathbf{{w}}_r \parallel _2 = 1\), it is easy to obtain \(\mathbf{{h}}_\perp = \mathbf{{h}} - \mathbf{{w}}_{r}^T \mathbf{{h}} \mathbf{{w}}_{r}\), \( \mathbf{{t}}_\perp = \mathbf{{t}} - {\mathbf{{w}}_{r}}^T \mathbf{{t}} \mathbf{{w}}_{r}\), where \({\mathbf{{h}}, \mathbf{{t}}, \mathbf{{r}}, \mathbf{{w}}_r \in } R^m\).
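For concreteness, the two score functions above can be sketched in a few lines of NumPy; the function names and vector arguments below are illustrative, not part of the original models' released code.

```python
import numpy as np

def transe_score(h, r, t):
    # TransE: f(h, r, t) = ||h + r - t||_2^2
    return np.sum((h + r - t) ** 2)

def transh_score(h, r, t, w_r):
    # TransH: project h and t onto the hyperplane with unit normal w_r,
    # then measure the translation by r between the projections.
    w_r = w_r / np.linalg.norm(w_r)          # enforce ||w_r||_2 = 1
    h_perp = h - np.dot(w_r, h) * w_r
    t_perp = t - np.dot(w_r, t) * w_r
    return np.sum((h_perp + r - t_perp) ** 2)
```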

Both TransE and TransH embed entities and relations into the same vector space without considering that entities and relations are different types of objects. TransR/CTransR [11] treats entities and relations as completely different objects by embedding entities into an entity space \(R^m\) and relations into relation spaces \(R^n\). It maps entity embeddings from the entity space to the r-relation space with a mapping matrix \(\mathbf{{M}}_r\). The score function is then defined as \(f(h,r,t) = \parallel \mathbf{{h}}_{r} + \mathbf{{r}} - \mathbf{{t}}_{r} \parallel ^2_2\), where \(\mathbf{{h}}_{r} = \mathbf{{h}} \mathbf{{M}}_{r}\), \(\mathbf{{t}}_{r} = \mathbf{{t}} \mathbf{{M}}_{r}\) and \(\mathbf{{h}},\mathbf{{t}} \in R^m, \mathbf{{r}} \in R^n, \mathbf{{M}}_r \in R^{m \times n}\). CTransR is an extension of TransR that divides all entity pairs (h, t) in the training data into multiple groups (clusters) and learns an independent relation vector for each group.

TransD [10] is an improvement of TransR/CTransR that considers the multiple types of entities and relations simultaneously. It replaces the transfer matrix with the product of two projection vectors of an entity-relation pair. The score function is therefore defined as \(f(h,r,t) = \parallel \mathbf{{M}}_{rh} \mathbf{{h}} + \mathbf{{r}} - \mathbf{{M}}_{rt} \mathbf{{t}} \parallel ^2_2\), where \(\mathbf{{M}}_{rh} = \mathbf{{r}}_p {\mathbf{{h}}_p}^T + \mathbf{{I}}^{n \times m}\), \(\mathbf{{M}}_{rt} = \mathbf{{r}}_p {\mathbf{{t}}_p}^T + \mathbf{{I}}^{n \times m}\), and \(\mathbf{{h}}, \mathbf{{h}}_p, \mathbf{{t}}, \mathbf{{t}}_p \in R^m, \mathbf{{r}}, \mathbf{{r}}_p \in R^n\).
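The projection mechanisms of TransR and TransD can be sketched in the same way; the snippet below is a minimal illustration of the formulas above (row-vector convention for TransR, as in the text), with hypothetical function names.

```python
import numpy as np

def transr_score(h, r, t, M_r):
    # TransR: map entities from R^m into the r-relation space R^n
    # with a relation-specific matrix M_r in R^{m x n}.
    return np.sum((h @ M_r + r - t @ M_r) ** 2)

def transd_score(h, r, t, h_p, t_p, r_p):
    # TransD: transfer matrices built from projection vectors,
    # M_rh = r_p h_p^T + I, M_rt = r_p t_p^T + I  (I is n x m).
    I = np.eye(r_p.shape[0], h_p.shape[0])
    M_rh = np.outer(r_p, h_p) + I
    M_rt = np.outer(r_p, t_p) + I
    return np.sum((M_rh @ h + r - M_rt @ t) ** 2)
```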

Recently, TransE-RS and TransH-RS [19] adopt a limit-based scoring loss for learning knowledge embeddings and show significant improvements over state-of-the-art baselines.

2.2 Other Models

Unstructured Model (UM) [3] is a simplified version of TransE that treats the knowledge graph as relation-free by setting all relation vectors to \(\mathbf{{r}}=0\), which leads to the score function \(f_r(h,r,t) = \parallel \mathbf{{h}} - \mathbf{{t}} \parallel \). Obviously, this model cannot distinguish between different relations.

Structured Embedding (SE) [5] represents entities as vectors and each relation as two independent matrices \(\mathbf{M}_r^h\) and \(\mathbf{M}_r^t\) for projecting the head and tail entity embeddings. The score function is \( f_r(h,r,t) = - \Vert \mathbf{M}_r^h \mathbf{{h}} - \mathbf{M}_r^t \mathbf{{t}}\Vert \). SE cannot capture correlations between entities and relations since it uses two separate matrices.

Latent Factor Model (LFM) [9, 17] encodes entities as vectors and sets each relation as a matrix. Each r-specific matrix is asymmetric and directly operates between two entity embeddings. The score function is \(f(h,r,t) = \mathbf{{h}}^T \mathbf{{M}}_r \mathbf{{t}}\).

Semantic Matching Energy (SME) [2, 3] introduces two forms of semantic matching energy function for optimization: a linear form \(f(h,r,t)=(\mathbf{{M}}_1 \mathbf{{h}} + \mathbf{{M}}_2 \mathbf{{r}} + \mathbf{{b}}_1)^T(\mathbf{{M}}_3 \mathbf{{t}} + \mathbf{{M}}_4 \mathbf{{r}} + \mathbf{{b}}_2)\) and a bilinear form \(f(h,r,t)=(\mathbf{{M}}_1 \mathbf{{h}} \otimes \mathbf{{M}}_2 \mathbf{{r}} + \mathbf{{b}}_1)^T(\mathbf{{M}}_3 \mathbf{{t}} \otimes \mathbf{{M}}_4 \mathbf{{r}} + \mathbf{{b}}_2)\), where \(\mathbf{{M}}_1, \mathbf{{M}}_2, \mathbf{{M}}_3, \mathbf{{M}}_4\) are weight matrices, \(\mathbf{{b}}_1\) and \(\mathbf{{b}}_2\) are bias vectors and \(\otimes \) is the Hadamard product.

Single Layer Model (SLM) [16] is designed as a plain baseline of NTN. It introduces nonlinear transformations by neural networks. The score function is \(f(h,r,t) = {\mathbf{{u}}_r}^Tg(\mathbf{{M}}_{rh} \mathbf{{h}} + \mathbf{{M}}_{rt} \mathbf{{t}} + \mathbf{{b}}_r)\), where \(\mathbf{{M}}_{rh}\) and \(\mathbf{{M}}_{rt}\) are weight matrices, and \(g(\cdot )\) is the function \(\tanh (\cdot )\).

The Neural Tensor Network (NTN) [16] uses a bilinear tensor layer that relates two entity vectors, replacing a standard linear neural network layer. It computes a score measuring the plausibility of a triplet (h, r, t) by the function \( f(h,r,t) = {\mathbf{{u}}_{r}}^T g(\mathbf{{h}}^T \mathbf{M}_{r} \mathbf{{t}} + \mathbf{{V}}_{r}[\mathbf{{h}}; \mathbf{{t}}] + \mathbf{{b}}_{r} )\), where \(g(\cdot ) = \tanh (\cdot )\), \([\mathbf{{h}}; \mathbf{{t}}]\) denotes the vertical stacking of vectors \(\mathbf{{h}}\) and \(\mathbf{{t}}\), \(\mathbf{{V}}_{r}\) is a weight matrix and \(\mathbf{M}_{r}\) is a 3-way tensor.
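As a rough sketch of how the NTN score is evaluated (a slice-wise bilinear term plus a linear layer), assuming \(\mathbf{h}, \mathbf{t} \in R^m\), an \(m \times m \times s\) tensor \(\mathbf{M}_r\), an \(s \times 2m\) matrix \(\mathbf{V}_r\) and \(\mathbf{u}_r, \mathbf{b}_r \in R^s\):

```python
import numpy as np

def ntn_score(h, t, u_r, M_r, V_r, b_r):
    # Bilinear term: one value h^T M_r[:, :, k] t per tensor slice k.
    bilinear = np.einsum('i,ijk,j->k', h, M_r, t)
    # Standard linear layer over the stacked entity vectors [h; t].
    linear = V_r @ np.concatenate([h, t]) + b_r
    return u_r @ np.tanh(bilinear + linear)
```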

3 Our Model

TransGH performs the translation operation on a generalized hyperplane determined by a set of basis vectors, which gives it a more general ability to preserve the mapping properties of complex relation facts while avoiding heavy computation for the entity mappings.

Fig. 2. The two phases of TransGH. The red bold arrows represent \({\mathbf{w}_r^i }^T \mathbf{{h}} \mathbf{w}_r^i\). (Color figure online)

3.1 Generalized Hyperplane

We extend the hyperplane of TransH to a generalized hyperplane determined by a set of basis vectors \(\{ \mathbf{w}_r^1, \mathbf{w}_r^2, \ldots, \mathbf{w}_r^v \}\) \((\mathbf{w}_r^i \in R^m, i \in [1, v])\), where the basis vectors are mutually orthogonal. As in TransH, we also restrict \(\Vert \mathbf{w}_r^i \Vert _2 = 1\) for each set of r-relation vectors. For an entity embedding \(\mathbf{e}\), the transfer vector \(\mathbf{e}_r\) on the set of basis vectors can be written as:

$$ \mathbf{e }_r = {\mathbf{w}_r^1}^T \mathbf{e } \mathbf{w}_r^1 + \ldots +{ \mathbf{w}_r^v}^T \mathbf{e } \mathbf{w}_r^v=\sum _i { \mathbf{w}_r^i}^T \mathbf{e } \mathbf{w}_r^i$$

where v is the number of basis vectors and m is the dimension of the entity (relation) vector space. Based on the transfer vector \(\mathbf{e}_r\), we obtain the projection \(\mathbf{e}_\perp \) of the entity embedding \(\mathbf{e}\) onto the generalized hyperplane as \(\mathbf{e}_\perp = \mathbf{e} - \mathbf{e}_r\). Thus the generalized hyperplane determined by the set of basis vectors \( \{ \mathbf{w}_r^1, \mathbf{w}_r^2, \ldots, \mathbf{w}_r^v \}\) can be described as

$$\{ \mathbf{e }_\perp |\mathbf{e }_\perp =\mathbf{e } - \sum _i { \mathbf{w}_r^i }^T \mathbf{e } \mathbf{w}_r^i\}$$

where \(\mathbf{w}_r^i \in R^m\) and \(\Vert \mathbf{w}_r^i\Vert _2 = 1\). The proposed hyperplane is a generalization of that in TransH.
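A minimal NumPy sketch of this projection follows; stacking the v basis vectors as the rows of a matrix W_r is our own convenience notation, not the paper's.

```python
import numpy as np

def project_onto_generalized_hyperplane(e, W_r):
    # W_r: v x m matrix whose rows are the unit, mutually orthogonal
    # basis vectors {w_r^1, ..., w_r^v} of relation r.
    # Transfer vector e_r = sum_i (w_r^i . e) w_r^i; projection e_perp = e - e_r.
    e_r = W_r.T @ (W_r @ e)
    return e - e_r

# Illustrative example with v = 2 basis vectors in a 4-dimensional space:
e = np.array([1.0, 2.0, 3.0, 4.0])
W_r = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]])
print(project_onto_generalized_hyperplane(e, W_r))   # [0. 0. 3. 4.]
```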

3.2 TransGH

As shown in Fig. 2, the basic idea of TransGH can be summed up in two steps: (1) projection: projecting entity embeddings onto the generalized hyperplane; (2) translation: connecting the projected entities with the relation-specific translation vector. Specifically, for a triplet (h, r, t):

  • In the projection phase, with the restriction \(\Vert \mathbf{w}_r^i \Vert _2 = 1\), it is easy to obtain the projections of the head and tail embeddings onto the generalized hyperplane, that is

    $$\mathbf{{h}}_\perp = \mathbf{{h}} - \sum _i {\mathbf{w}_r^i}^T \mathbf{{h}} \mathbf{w}_r^i, \; \; \; \; \mathbf{{t}}_\perp = \mathbf{{t}} - \sum _{i} { \mathbf{w}_r^i }^T \mathbf{{t}} \mathbf{w}_r^i$$
  • In the translation phase, the relation r is interpreted as the translation vector \(\mathbf{{r}}\) from the head projection \(\mathbf{{h}}_\perp \) to the tail projection \(\mathbf{{t}}_\perp \). Therefore, the score function is denoted as:

    $$\begin{aligned} f(h, r, t) = \Vert (\mathbf{{h}} - \sum _{i} { \mathbf{w}_r^i }^T \mathbf{{h}} \mathbf{w}_r^i) + \mathbf{{r}} - (\mathbf{{t}} - \sum _{i} { \mathbf{w}_r^i }^T \mathbf{{t}} \mathbf{w}_r^i) \Vert _2^2 \end{aligned}$$

The score function measures the plausibility of a triplet; it is expected to be low for a positive triplet and high for a negative one.
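Putting the two phases together, the TransGH score can be sketched as follows, again using a v×m matrix W_r to hold the basis vectors of relation r (the function name is illustrative):

```python
import numpy as np

def transgh_score(h, r, t, W_r):
    # Projection phase: remove the components of h and t along the basis vectors.
    h_perp = h - W_r.T @ (W_r @ h)
    t_perp = t - W_r.T @ (W_r @ t)
    # Translation phase: f(h, r, t) = ||h_perp + r - t_perp||_2^2.
    return np.sum((h_perp + r - t_perp) ** 2)
```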

3.3 Training Method and Implementation Details

We use the following margin-based loss function to encourage discrimination between positive triplets and negative triplets:

$$ \mathcal{L} = \sum _{(h,r,t) \in P} \sum _{(h^{\prime }, r, t^{\prime }) \in N} [f(h,r,t) + \gamma - f(h^{\prime }, r, t^{\prime })]_+ $$

Here, \({[x]_+} = \max(0,x)\) denotes the maximum of 0 and x, P is the set of positive triplets, and N is the set of negative triplets, that is, \(N = \{(h^{\prime }, r, t) \mid h^{\prime } \in \texttt {E} \wedge h^{\prime } \ne h\} \cup \{(h, r, t^{\prime }) \mid t^{\prime } \in \texttt {E} \wedge t^{\prime } \ne t\}\), where \(\texttt {E}\) is the set of entities. \(\gamma > 0\) is the margin hyper-parameter, which is expected to separate positive triplets from negative triplets. We then minimize the loss function under the following constraints:

$$\begin{aligned}&\forall e \in \texttt {E}, \Vert \mathbf{{e}}\Vert _2 \le 1, \forall r \in \texttt {R}, \Vert \mathbf{{r}}\Vert _2 \le 1 \end{aligned}$$
(1)
$$\begin{aligned}&\forall r \in \texttt {R}, i \in [1,v], \Vert \mathbf{w}_r^i \Vert _2 = 1 \end{aligned}$$
(2)
$$\begin{aligned}&\forall r \in \texttt {R}, i \in [1,v], \frac{| \sum _i {\mathbf{w}_r^i}^T \mathbf{{r}} |}{\Vert \mathbf{{r}} \Vert _2} \le \epsilon \end{aligned}$$
(3)
$$\begin{aligned}&\forall r \in \texttt {R}, i, j \in [1,v](i \ne j), \frac{| \sum _{(i,j)} {\mathbf{w}_r^i}^T {\mathbf{w}_r^j} |}{\Vert {\mathbf{w}_r^j} \Vert _2} \le \epsilon \end{aligned}$$
(4)

where \(\epsilon \) is a small scalar and R is the set of relations. Constraint (3) ensures the translation vector \(\mathbf {r}\) lies on the generalized hyperplane, and constraint (4) guarantees that any two basis vectors are orthogonal. We then directly optimize the following loss function with soft constraints:

$$\begin{aligned} \begin{aligned} \mathcal{L} =&\sum _{(h,r,t) \in P} \sum _{(h^{\prime }, r, t^{\prime }) \in N} [f(h,r,t) + \gamma - f(h^{\prime }, r, t^{\prime })]_+ \\&+ C ( A_1 + A_2) \end{aligned} \end{aligned}$$
(5)

where we set

$$\begin{aligned} \begin{aligned} A_1&= \sum _{e \in \texttt {E}}[\Vert \mathbf{{e}} \Vert _2^2 - 1]_+ + \sum _{r \in \texttt {R}}[\Vert \mathbf{{r}} \Vert _2^2 - 1]_+\\ A_2&= \sum _{r \in \texttt {R}}\{ [({\frac{ \sum _{i} {\mathbf{w}_r^i}^T \mathbf{{r}}}{\Vert \mathbf{{r}} \Vert _2}})^2 - \epsilon ^2]_+ + [ ({\frac{ \sum _{(i,j)} {\mathbf{w}_r^i}^T \mathbf{w}_r^j}{\Vert {\mathbf{{w}}_r^j} \Vert _2}})^2 - \epsilon ^2]_+\} \end{aligned} \end{aligned}$$
(6)

and C is a hyper-parameter that weighs the importance of the soft constraints.
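The following sketch assembles the soft-constrained objective of Eqs. (5)-(6), reusing the transgh_score sketch above. It samples one negative per positive (a common simplification of the double sum) and stores the basis vectors as a |R| × v × m array; these conventions are ours, not prescribed by the paper.

```python
import numpy as np

def hinge(x):
    # [x]_+ = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def transgh_loss(pos, neg, ent_emb, rel_emb, basis, gamma, C, eps):
    # pos/neg: aligned lists of (h, r, t) index triplets, one negative per positive
    # ent_emb: |E| x m, rel_emb: |R| x m, basis: |R| x v x m
    margin_term = 0.0
    for (h, r, t), (h2, _, t2) in zip(pos, neg):
        W_r = basis[r]
        f_pos = transgh_score(ent_emb[h], rel_emb[r], ent_emb[t], W_r)
        f_neg = transgh_score(ent_emb[h2], rel_emb[r], ent_emb[t2], W_r)
        margin_term += hinge(f_pos + gamma - f_neg)

    # A1: soft norm constraints on entity and relation embeddings (Eq. 1)
    A1 = hinge(np.sum(ent_emb ** 2, axis=1) - 1.0).sum() \
       + hinge(np.sum(rel_emb ** 2, axis=1) - 1.0).sum()

    # A2: r should (approximately) lie on the hyperplane (Eq. 3) and the basis
    # vectors should be (approximately) pairwise orthogonal (Eq. 4)
    A2 = 0.0
    for r in range(rel_emb.shape[0]):
        W_r, r_vec = basis[r], rel_emb[r]
        on_plane = (W_r @ r_vec).sum() / np.linalg.norm(r_vec)
        gram = W_r @ W_r.T
        off_diag = gram[np.triu_indices_from(gram, k=1)].sum()
        A2 += hinge(on_plane ** 2 - eps ** 2) + hinge(off_diag ** 2 - eps ** 2)

    return margin_term + C * (A1 + A2)
```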

Table 1. Complexity (the number of parameters and the number of multiplication operations).

The loss function favors lower scores for positive triplets than for negative triplets. We adopt stochastic gradient descent (SGD) [7] to minimize the above loss function. Notice that constraint (2) is not included in Eq. (5). To satisfy it, we project each vector \(\mathbf{{w}}_r^i\) onto the unit \(l_2\)-ball before traversing each mini-batch. Moreover, negative triplets are generated by replacing either the head or the tail of an original triplet in the KG with a random entity, but never both at the same time. To reduce false negative triplets, we follow [18] and set different probabilities for the replacement. In the experiments, the traditional sampling method is denoted as “unif” and the new method of [18] as “bern”.
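A sketch of the corruption step under both sampling strategies is given below; tph and hpt are the per-relation statistics used by [18] (average number of tails per head and heads per tail), and the function name is illustrative.

```python
import random

def corrupt_triplet(h, r, t, entities, tph, hpt, mode="bern"):
    # Replace either the head or the tail with a random entity, never both.
    if mode == "unif":
        p_head = 0.5                              # "unif": equal probability
    else:
        p_head = tph[r] / (tph[r] + hpt[r])       # "bern": biased choice from [18]
    if random.random() < p_head:
        return random.choice(entities), r, t      # corrupt the head
    return h, r, random.choice(entities)          # corrupt the tail
```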

Generally, all embeddings of entities \( \{{\mathbf{{e}}_i}\}_{i=1}^{\mid \texttt {E} \mid }\), relations \( \{{\mathbf{{r}}_k}\}_{k=1}^{\mid \texttt {R} \mid }\) and relation-specific basis vectors \(\{ \mathbf{w}_r^1, \mathbf{w}_r^2, \ldots , \mathbf{w}_r^v \}_{r=1}^{\mid \texttt {R} \mid }\) are learned by TransGH. Hence the number of parameters of this model is \(N_e m + N_r(1 + v)n\) and the time complexity is \(2vmN_t\), which is similar to TransH since we usually set \(v \ll m\), e.g., \(v = 2, 3, 4\). We compare the parameters and time complexities of several baselines in Table 1. We denote by \(N_e\) the number of entities, by \(N_r\) the number of relations and by \(N_t\) the number of triplets in a knowledge graph. m and n respectively represent the dimensions of the entity space and the relation space, d denotes the average number of clusters of a relation, k is the number of hidden nodes of a neural network, s is the number of slices of a tensor, and v is the number of basis vectors for a relation.
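As an illustration of this parameter count (assuming, for TransGH, that entities, relations and basis vectors share the same dimension, i.e. n = m):

```python
def transgh_parameter_count(N_e, N_r, m, v):
    # N_e entity vectors of size m, plus one translation vector and
    # v basis vectors of size m per relation: N_e*m + N_r*(1 + v)*m.
    return N_e * m + N_r * (1 + v) * m

# e.g. FB15k (14,951 entities, 1,345 relations) with m = 100, v = 4:
# transgh_parameter_count(14951, 1345, 100, 4) -> 2,167,600 parameters
```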

4 Experiments and Analysis

We study and evaluate our model on two tasks: link prediction [4, 18] and triplet classification [16]. Our experiments use datasets from two knowledge graphs, FreeBase [1] and WordNet [14]. We then present the experimental results and our analysis.

4.1 Datasets

WordNet is designed to provide a usable dictionary and support automatic text analysis. In WordNet, each entity represents a synset containing several words that correspond to a distinct word sense. Relationships indicate the lexical relations between synsets, such as hypernym, hyponym, meronym and holonym. An example triplet is (__warship_NN_1, _hyponym, __torpedo_boat_NN_1). Two datasets from WordNet, WN18 and WN11, are used in our experiments. WN18 contains 18 relations and WN11 contains 11 relations; the numbers of entities in the two datasets are close.

FreeBase is a large and growing knowledge graph of general facts. An example from FreeBase is (nietzchka_keene, place_of_death, madison), which builds the relation place_of_death between the name entity nietzchka_keene and the place entity madison. We use two datasets derived from FreeBase in this paper, FB15k and FB13. FB15k consists of 592,213 triplets with 14,951 entities and 1,345 relations. FB13 is a denser subgraph with 75,043 entities and 13 relations. The statistics of these datasets are listed in Table 2.

Table 2. Data sets used in the experiments.

4.2 Link Prediction

Link prediction, as used in [4, 10, 11, 18], is to predict the missing h or t of a positive triplet (h, r, t). The task focuses on ranking a set of candidate entities from the knowledge graph rather than finding the single best entity for each missing position. The datasets used in this task are WN18 and FB15k, with the same settings as [4, 10, 11, 18].

Evaluation Rules. We adopt the same protocols used in [4, 10, 11, 18] to evaluate this task. Specifically, in the testing phase, for each test triplet (h, r, t), we replace the head (tail) entity with every entity e from the entity set of the KG and compute the scores of these corrupted triplets using the score function f(h, r, t); we then obtain the rank of the original triplet after sorting these scores in ascending order. Following [4, 10, 11, 18], two metrics are used for evaluation: the average rank (Mean Rank) and the proportion of ranks not larger than 10 (Hits@10). This is the “raw” setting. Notice that some corrupted triplets may already exist in the KG; they can be regarded as correct triplets, hence it is not wrong to rank them before the original triplet. To eliminate this case, we filter out the corrupted triplets existing in the KG before ranking. This is the “filt” setting. In both settings, a lower Mean Rank and a higher Hits@10 are expected.
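A simplified sketch of the tail-side ranking protocol follows (the head side is symmetric); score_fn is assumed to map an (h, r, t) triplet of identifiers to its dissimilarity score, and all_true is the set of all triplets known to be correct.

```python
import numpy as np

def link_prediction_ranks(test_triplets, all_true, entities, score_fn, filtered=True):
    ranks = []
    for (h, r, t) in test_triplets:
        scores = {e: score_fn(h, r, e) for e in entities}
        if filtered:
            # "filt" setting: drop corrupted triplets that are themselves correct
            scores = {e: s for e, s in scores.items()
                      if e == t or (h, r, e) not in all_true}
        ranked = sorted(scores, key=scores.get)          # ascending scores
        ranks.append(ranked.index(t) + 1)
    mean_rank = float(np.mean(ranks))
    hits_at_10 = float(np.mean([rank <= 10 for rank in ranks]))
    return mean_rank, hits_at_10
```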

Table 3. Link prediction results.

Implementation. In the training phase, we select the learning rate \(\eta \) for SGD from {0.001, 0.01, 0.1}, the margin \(\gamma \) from {1, 2, 3, 4, 5, 6, 7, 8}, the entity (relation) embedding dimension m from {50, 100, 150}, the number of basis vectors v from {0.25, 0.5, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, the batch size b from {480, 960, 1200, 4800}, and the hyper-parameter C from {0.005, 0.0625, 0.25, 0.5}. The best parameters are determined on the validation set. Under the unif setting, the optimal configurations are \(\eta \) = 0.01, \(\gamma \) = 7, m = 100, v = 2, b = 1200, C = 0.0625 on WN18 and \(\eta \) = 0.01, \(\gamma \) = 2, m = 100, v = 6, b = 1200, C = 0.0625 on FB15k. Under the bern setting, the optimal configurations are \(\eta \) = 0.01, \(\gamma \) = 7, m = 100, v = 2, b = 1200, C = 0.005 on WN18 and \(\eta \) = 0.01, \(\gamma \) = 1, m = 100, v = 4, b = 480, C = 0.0625 on FB15k. We traverse all training triplets for 5000 rounds and take L1 as the dissimilarity measure on both datasets.

Table 4. Results on FB15k by relation category.

Results. The results on both WN18 and FB15k are shown in Table 3. The results of previous studies are taken from their reports, since the same datasets are used. Our model consistently and significantly outperforms previous models on both metrics on WN18 and FB15k: Mean Rank (raw) is 191, Mean Rank (filt) is 179, Hits@10 (raw) is 94.8% and Hits@10 (filt) is 95.0% on WN18, while Mean Rank (raw) is 186, Mean Rank (filt) is 64, Hits@10 (raw) is 54.1% and Hits@10 (filt) is 80.1% on FB15k. Moreover, compared with TransH, our model improves Mean Rank (raw), Mean Rank (filt), Hits@10 (raw) and Hits@10 (filt) by 172, 124, 6.2% and 8.3% on WN18, and by 25, 23, 8.4% and 15.7% on FB15k, respectively. We believe the improved performance of our model is due to its use of the set of basis vectors.

Table 4 analyzes the Hits@10 results on FB15k with respect to the relation categories. Following the same rules as [4] on FB15k, we separate the 1345 relations into four categories: one-to-one, one-to-many, many-to-one and many-to-many. From Table 4 we observe that TransGH performs significantly better than all baselines under both the unif and bern settings. Our method has the highest accuracies for predicting heads (one-to-one 87.0%, one-to-many 95.8%, many-to-one 47.9% and many-to-many 80.8%) and for predicting tails (one-to-one 86.8%, one-to-many 55.8%, many-to-one 94.8% and many-to-many 84.3%). Additionally, we compare TransGH with TransH on the Hits@10 metric for some typical complex relations in Table 5, where the TransH results are copied directly from [18]. Table 5 shows that TransGH has remarkable improvements over TransH on these relations, indicating that TransGH can capture richer information between entities and relations and achieves a better ability to model the mapping properties of complex relation facts. As Tables 6 and 7 show, TransGH reasonably makes objects (entities and relations) of the same category have similar vector embeddings.

Table 5. Hits@10 (filt) under the bern setting for TransGH and TransH on some examples of one-to-many\(^*\), many-to-one\(^\dag \), many-to-many\(^\ddag \) and symmetric\(^{\S }\) relations.
Table 6. The top-3 similarity entities with regard to some examples on WN18. The similarity scores are computed with cosine function.
Table 7. The top-3 similarity relations with regard to some examples on FB15k. The similarity scores are computed with cosine function.

4.3 Triplet Classification

Triplet classification is to decide whether a given triplet (h, r, t) is correct or not. This is a binary classification task, first presented by [16]. In this task, three datasets, WN11, FB13 and FB15k, are used, and negative triplets are needed for the evaluation of binary classification. The first two datasets released by [16] already contain negative triplets, but no version of FB15k with negative triplets has been published. For FB15k, we construct negative triplets following the same principles used for FB13 in [16].

Evaluation Rules. There is a simple decision rule for triplet classification: we first obtain a relation-specific threshold \(\delta _r\) by maximizing the classification accuracy on the validation set. For a triplet (h, r, t), if the dissimilarity score given by the score function f(h, r, t) is below \(\delta _r\), we predict positive; otherwise we predict negative.
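A sketch of this decision rule and of tuning the per-relation thresholds on the validation set is shown below; the candidate threshold grid and the helper names are illustrative.

```python
def classify_triplet(h, r, t, score_fn, delta_r):
    # Predict positive iff the dissimilarity score is below delta_r.
    return score_fn(h, r, t) < delta_r[r]

def tune_thresholds(valid_triplets, labels, score_fn, candidates):
    # Choose, per relation, the threshold that maximizes validation accuracy.
    scored = [((h, r, t), score_fn(h, r, t), y)
              for (h, r, t), y in zip(valid_triplets, labels)]
    relations = {r for (_, r, _), _, _ in scored}
    delta_r = {}
    for r in relations:
        examples = [(s, y) for (_, rr, _), s, y in scored if rr == r]
        best_acc, best_thr = -1.0, None
        for thr in candidates:
            acc = sum((s < thr) == y for s, y in examples) / len(examples)
            if acc > best_acc:
                best_acc, best_thr = acc, thr
        delta_r[r] = best_thr
    return delta_r
```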

Implementation. We compare our model with the baseline methods mentioned in [10]. For fairness, word embeddings [13] are not used in our experiments. In the training stage, we select from the same configurations as in link prediction, and the best parameters are again determined on the validation set. Under the unif setting, the optimal configurations are \(\eta \) = 0.01, \(\gamma \) = 11, m = 100, v = 3, b = 480, C = 0.25 on WN11; \(\eta \) = 0.01, \(\gamma \) = 0.25, m = 100, v = 2, b = 1200, C = 0.0625 on FB13; and \(\eta \) = 0.01, \(\gamma \) = 1, m = 100, v = 6, b = 480, C = 0.0625 on FB15k. Under the bern setting, the optimal configurations are \(\eta \) = 0.01, \(\gamma \) = 11, m = 100, v = 3, b = 480, C = 0.0625 on WN11; \(\eta \) = 0.01, \(\gamma \) = 0.25, m = 100, v = 2, b = 1200, C = 0.005 on FB13; and \(\eta \) = 0.01, \(\gamma \) = 1, m = 100, v = 10, b = 480, C = 0.0625 on FB15k. We set the number of epochs to 5000 for all three datasets, and take L1 as the dissimilarity measure on WN11 and FB15k, and L2 on FB13.

Table 8. Triplet classification accuracies.
Fig. 3. Classification accuracies on WN11.

Results. Evaluation results of triplet classification are shown in Table 8. TransGH consistently achieves better accuracy than the current state-of-the-art models on WN11 and FB15k, with accuracies of 87.3% and 91.2%, respectively. TransGH has slightly worse accuracy on FB13, mainly because FB13 has the most entities, so good representations of rarely occurring entities are difficult to learn. Additionally, TransGH achieves accuracies at least 8.5%, 1.9% and 11.4% higher than TransH on the three datasets. We therefore believe that the set of basis vectors is beneficial for modeling complex relations and learning the embeddings of entities and relations of a knowledge graph. We also compare the classification accuracies of different relations for TransH and TransGH on WN11. In this experiment, we rerun TransH with the parameters reported in [18] and obtain accuracies of 76.5% (unif) and 77.6% (bern), slightly different from the results reported in Table 8; we ignore these differences, which derive from experimental randomness. The accuracies of the eleven relations on WN11 are given separately in Fig. 3. From the results in Fig. 3, TransGH significantly improves over TransH on every relation except _similar_to. As reported in [10], predicting this relation accurately requires more information, while the entity pairs linked by _similar_to account for only 1.5% of the training data; hence the inadequate number of entity pairs linked by _similar_to is the main cause.

5 Conclusion and Future Work

In this paper, we have proposed a new knowledge graph embedding method, TransGH. The key idea of TransGH is to learn embeddings by modeling each relation as a translation vector between projected entities on a generalized hyperplane, which is characterized by a set of basis vectors. In addition, TransGH efficiently preserves the mapping properties of complex relation facts while keeping a low parameter complexity. We empirically conduct experiments on triplet classification and link prediction with two knowledge graphs, FreeBase and WordNet. The experimental results show that TransGH significantly and consistently improves over the baselines and achieves state-of-the-art performance, which demonstrates the superiority and generality of our model.

In the future, we will explore the following directions: (1) We will utilize the word embeddings obtained from word2vec [12] in our experiments to improve the performance of TransGH. (2) We will train TransGH using the promising limit-based scoring loss function introduced by [19]. (3) We will devise and exploit a question answering system based on TransGH.