1 Introduction

Imagine a physician trying to pinpoint a specific diagnosis or a journalist investigating abuses of governmental power. In both scenarios, a domain expert may try to find answers based on previously known, relevant entities—either a list of diagnoses with symptoms similar to those the patient is experiencing or a list of known conspirators. Instead of manually looking for connections between potential answers and prior knowledge, a searcher would like to rely on an automatic recommender to find the connections and answers for them, i.e. related entities.

In the information retrieval (IR) community, Entity set expansion (ESE) is the established task of recommending entities that are similar to a provided seed set of entities.Footnote 1 ESE has been applied in Question Answering (Wang et al. 2008), Relation Extraction (Lang and Henderson 2013) and Information Extraction (He and Grishman 2015) settings. The physician and journalist in our example cannot fully take advantage of IR advances in ESE for two main reasons: recent advances (1) often assume access to a clean, large Knowledge Graph and (2) are uninterpretable.

Many advanced ESE algorithms rely on manually curated, clean Knowledge Graphs (KGs), e.g. DBpedia (Auer et al. 2007) and Freebase (Bollacker et al. 2008). In real-world settings, users rarely have access to clean KGs, and instead may rely on automatically generated KGs. Such KGs are often noisy because they are created from complicated and error-prone NLP processes—illustrated in Fig. 1. For example, automatic KGs may include duplicate entities, associations (relations) between entities may be missing, and entities with similar names may be incorrectly disambiguated. These imperfections prevent machine learning approaches from performing well on automatically generated KGs. Furthermore, many ESE algorithms degrade as the sparsity and unreliability of KGs increase (Pujara et al. 2017; Rastogi et al. 2017). Advanced ESE methods, especially those that rely on neural networks, are uninterpretable (Mitra and Craswell 2017). If a physician cannot explain her decisions, patients may not trust her, and if a journalist cannot demonstrate how a certain individual is acting unethically or above the law, the resulting article may lack credibility. Furthermore, uninterpretability may limit the applications of advancements in IR, and more broadly artificial intelligence, as humans “won’t trust an A.I. unless it can explain itself.”Footnote 2

Fig. 1  Our Entity set expansion (ESE) system assumes a corpus that has been labeled with entity mentions which are clustered via cross-document co-reference and linking to a knowledge base; together known as entity discovery and linking (EDL). Given a query containing Obama, Bush, and Clinton, the ESE system returns other U.S. presidents found in the KG

We introduce Neural variational set expansion (NVSE) to advance the applicability of ESE research. NVSE is an unsupervised model based on Variational Autoencoders (VAEs) that receives a query, uses a Bayesian approach to determine a latent concept that unifies the entities in the query, and returns a ranked list of similar entities based on that latent concept. NVSE does not require supervised examples of queries and responses, nor pre-built clusters of entities. Instead, our method only requires sentences with linked entity mentions, i.e. spans of tokens associated with a KG entity, which are typically included in automatically generated KGs.

NVSE is robust to noisy automatically generated KGs, thus removing the need to rely on manually curated, clean KGs. We evaluate NVSE on the ESE task using TinkerBell (Al-Badrashiny et al. 2017), an automatically generated KG that placed first in the TAC KBP shared task. Unlike prior work that used ESE to improve entity linking for KG construction (Gottipati and Jiang 2011), our goal is the opposite: we leverage noisy automatically generated KGs to perform ESE. NVSE is interpretable; it outputs query rationales—a summary of the features our model associates with the query—and result justifications—an ordered list of sentences from the underlying corpus that justify why our method returned each entity. Query rationales and result justifications are reminiscent of annotator rationales (Zaidan et al. 2007).

To our knowledge, this is the first unsupervised neural approach to ESE, in contrast to neural methods for supervised collaborative filtering (Lee et al. 2017). All code and data are available at https://github.com/se4u/nvse and a video demonstration of the system is available at https://youtu.be/sGO_wvuPIzM.

2 Related work

2.1 Methods dependent on external information

Since automatically generated KGs can be noisy, some methods utilize information beyond entity links and mentions to aid ESE. Paşca and Van Durme (2007) use search engine query logs to extract attributes related to entities, and Paşca and Van Durme (2008) extract sets of instances associated with class labels based on web documents and queries. Pantel et al. (2009) use a large amount of web data, applying a learned word similarity matrix extracted from a 200 billion word Internet crawl to the ESE task. Both He and Xin (2011)’s SEISA system and Tong and Dean (2008)’s Google Sets use lists of items from the Internet and try to determine which elements in the lists are most relevant to a query. Sadamitsu et al. (2011) rely on given topic information about the queried entities to train a discriminative system. More recent approaches also use external information. Zaheer et al. (2017) use LDA (Blei et al. 2003) to create word clusters for supervision, and Vartak et al. (2017) use manual annotations by Twitter users. Zheng et al. (2017) use inter-entity links in knowledge graphs, which are very sparse in automatically generated KGs (Pujara et al. 2017; Rastogi et al. 2017). All of these approaches use more information than just entity links and mentions.

2.2 Methods for comparing entities

Set Expander for Any Language (SEAL) (Wang and Cohen 2007) and its variants (Wang and Cohen 2008, 2009) learn similarities between new words and example words using methods like Random Walks and Random Walks With Restart. Similar to Lin (1998), who used cosine and Jaccard similarity to find similar words, SEISA uses these metrics to expand sets. These methods are limited to extracting only words that co-occur. Because they are applied to web-scale data, SEAL and SEISA assume entities will eventually co-occur. This assumption might not hold in the underlying corpus used to automatically generate a KG. In contrast to those approaches, NVSE finds similar entities based on a kernel between distributions.

2.3 Queries as natural language

In the INEX-XER shared task, queries were represented as natural language questions (Demartini et al. 2010). Metzger et al. (2014) and Zhang et al. (2017) propose methods to extract related entities in a KG based on a natural language query. This scenario is similar to a person interacting with a system like Amazon Alexa. However, our setup better reflects users searching for similar entities in a KG as it is more efficient for users to type entities of interest instead of natural language text.

2.4 Neural collaborative filtering

We are not the first to incorporate neural methods in a recommendation system. Recently, He et al. (2017) and Lee et al. (2017) presented deep auto-encoders for collaborative filtering. Collaborative filtering assumes a large dataset of previous user interactions with the search engine. For many domains it is not possible to create such a dataset, since new data is added every day and queries change rapidly based on different users and domains. Therefore, we propose the first neural method that does not use supervision for entity set expansion.

3 Notation

Let \({\mathcal{D}}\) be the corpus of documents and \({\mathcal{V}}\) be the vocabulary of tokens that appear in \({\mathcal{D}}\). We define a document as a sequence of sentences and a sentence as a sequence of tokens. Let \({\mathcal{X}}\) be the set of entities discovered in \({\mathcal{D}}\); we refer to its size as \({\mathrm{X}}\). Each entity \(x \in {\mathcal{X}}\) is linked to the tokens that mention x.Footnote 3 Let \({\mathcal{V}}'\) be the set of tokens linked to any \(x \in {\mathcal{X}}\), and let \({\mathcal{M}}_x\) be the multiset of sentences that mention x in the corpus. For example, consider an entity named “Batman” and a document containing three sentences {Batman is good., He is smart., Life is good.}. “Batman” is linked to tokens Batman and He, \({\mathcal{V}}'= \{{\text{Batman, He}}\}\), and \({\mathcal{M}}_{\text{Batman}} =\) {Batman is good., He is smart.}.

In ESE, a system receives query \({\mathcal{Q}}\)—a subset of \({\mathcal{X}}\)—and has to sort the elements remaining in \({\mathcal{R}}= {\mathcal{X}}{\setminus} {\mathcal{Q}}\). The elements that are most similar to \({\mathcal{Q}}\) should appear higher in the sorted order and elements dissimilar to \({\mathcal{Q}}\) should be ranked lower.
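
To make the notation concrete, the following minimal Python sketch (with hypothetical toy data and names) builds \({\mathcal{M}}_x\) from linked sentences and shows the ranking interface an ESE system must implement.

```python
from collections import defaultdict

# Hypothetical toy corpus: each sentence is paired with the entities it mentions.
# In practice these links come from an entity discovery and linking (EDL) system.
linked_sentences = [
    ("Batman is good .", ["Batman"]),
    ("He is smart .", ["Batman"]),   # "He" is linked to Batman by co-reference
    ("Life is good .", []),
]

# M_x: the multiset of sentences that mention entity x.
mentions = defaultdict(list)
for sentence, entities in linked_sentences:
    for x in entities:
        mentions[x].append(sentence)

entity_set = set(mentions)  # the discovered entities X

def expand(query, score):
    """Sort the remaining entities R (all entities not in Q) by a scoring function score(Q, x)."""
    remaining = entity_set - set(query)
    return sorted(remaining, key=lambda x: score(query, x), reverse=True)
```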

4 Baseline methods

Before introducing NVSE, we describe the four baseline systems: BM25, Bayesian Sets, Word2Vecf and SetExpan. We do not compare to DeepSets (Zaheer et al. 2017), as it is a supervised method that requires entity clusters.

For each x, we create a feature vector \(f_x \in {\mathbb{Z}}^{{\mathrm{F}}}\) from \({\mathcal{M}}_x\) by concatenating three vectors that count how many times (1) a token in \({\mathcal{V}}\) appeared in \({\mathcal{M}}_x\), (2) a document in \({\mathcal{D}}\) mentioned x, and (3) a token in \({\mathcal{V}}'\) appeared in \({\mathcal{M}}_x\). Thus, \({\mathrm{F}}= {\mathrm{V}}+ {\mathrm{D}}+ {\mathrm{V}}'\).
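
A hedged sketch of this featurization, assuming the mention multisets from the previous snippet plus hypothetical inputs for document ids; the three count blocks are concatenated into a single vector of length F = V + D + V'.

```python
import numpy as np
from collections import Counter

def feature_vector(x, vocab, docs, linked_vocab, mentions, mention_docs):
    """vocab: list of tokens V; docs: list of document ids D;
    linked_vocab: list of tokens V' linked to some entity;
    mentions[x]: sentences mentioning x; mention_docs[x]: document ids mentioning x."""
    linked_set = set(linked_vocab)
    tok_counts = Counter(t for sent in mentions[x] for t in sent.split())
    doc_counts = Counter(mention_docs[x])
    link_counts = Counter(t for sent in mentions[x] for t in sent.split()
                          if t in linked_set)
    return np.concatenate([
        np.array([tok_counts[t] for t in vocab], dtype=np.int64),         # V block
        np.array([doc_counts[d] for d in docs], dtype=np.int64),          # D block
        np.array([link_counts[t] for t in linked_vocab], dtype=np.int64), # V' block
    ])
```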

4.1 BM25

Best Match 25 (BM25) is “one of the most successful text-retrieval algorithms” (Robertson and Zaragoza 2009).Footnote 4 BM25 ranks remaining entities in \({\mathcal{R}}\) according to the score function

$$\begin{aligned} \underset{BM}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = {\sum _{i=1}^{\mathrm{F}}} {\frac{{{\mathrm{IDF}}}[i] f_x[i] \bar{f}_{\mathcal{Q}}[i] (k_1 + 1) }{f_x[i] {+} k_{1} (1{-}b {+} b {\sum _{j} f_x[j]}/{\bar{L}})}}, \end{aligned}$$

where \(f_{x}[j]\) denotes the j-th feature value in \(f_{x}\), \(\bar{f}_{\mathcal{Q}}\) is the sum of \(f_x\) over all \(x \in {\mathcal{Q}}\), and \(\mathbb {I}\) is the indicator function. \(k_1\) and b are hyperparameters that are commonly set to 1.5 and 0.75 (Manning et al. 2008). \(\bar{L}\) is the average total count of a feature in the entire corpus and \({{\mathrm{IDF}}}[i]\) is the inverse document frequency of the \(i{\text{th}}\) feature (Appendix 1).
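
As a sanity check on the formula, a minimal sketch of the scorer; the summed query features, per-feature IDF vector and \(\bar{L}\) are assumed to be precomputed as described above.

```python
import numpy as np

def bm25_score(f_Q_bar, f_x, idf, L_bar, k1=1.5, b=0.75):
    """f_Q_bar: summed query features; f_x: candidate entity features; idf: per-feature IDF."""
    length_norm = k1 * (1.0 - b + b * f_x.sum() / L_bar)
    return float(np.sum(idf * f_x * f_Q_bar * (k1 + 1.0) / (f_x + length_norm)))

# Ranking sketch: f_Q_bar = sum of f_x over x in Q, then sort R by bm25_score(...).
```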

4.2 Bayesian sets

Ghahramani and Heller (2006) introduced the Bayesian Sets (BS) method, which converts ESE into a Bayesian model selection problem. BS compares the probability that the query entities were generated from a single sample of a latent variable \(z \in \Delta ^{{\mathrm{F}}}\) with the probability that they were generated from independent samples. \(\Delta ^{{\mathrm{F}}}\) is the \({\mathrm{F}}-1\) dimensional probability simplex. Note that z has the same dimensionality as the observed features. Given \({\mathcal{Q}}\) and \(\pi\), the prior distribution of z, BS infers the posterior distribution of z, \(p(z | {\mathcal{Q}})\), and computes the following score

$$\begin{aligned} \underset{BS}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = \log \frac{E_{p(z | {\mathcal{Q}})}[p(x|z)]}{E_{\pi (z)}[p(x|z)]}. \end{aligned}$$
(1)

Ghahramani and Heller (2006) computed \({{\mathrm{{score}}}}_{BS}\) in closed form by selecting the conditional probability, p(x|z), from an exponential family distribution and setting \(\pi\) to be its conjugate prior. They showed that if p(x|z) is multivariate Bernoulli then BS requires a single matrix multiplication (Appendix 3), and we use this setting for our experiments.
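
A sketch of that closed form, assuming binarized features and per-feature Beta(\(\alpha\), \(\beta\)) priors as in Ghahramani and Heller (2006); the scores for all entities reduce to one matrix multiplication.

```python
import numpy as np

def bayesian_sets_scores(X_bin, query_rows, alpha, beta):
    """X_bin: (num_entities x F) binary feature matrix; query_rows: row indices of Q;
    alpha, beta: per-feature Beta prior parameters (length-F arrays)."""
    N = len(query_rows)
    s = X_bin[query_rows].sum(axis=0)          # per-feature counts within the query
    alpha_t, beta_t = alpha + s, beta + N - s  # posterior pseudo-counts
    q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    return c + X_bin @ q                       # score for every entity at once
```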

4.3 Word2Vecf

Levy and Goldberg (2014) generalize Mikolov et al. (2013)’s Skip-Gram model as Word2Vecf to include arbitrary contexts. We embed entities with Word2Vecf by using the entity IDs as wordsFootnote 5 and the tokens in the sentences mentioning those entities as contexts. Note that all tokens in the sentence, except for some stop words, are used as contexts, not just co-occurring entities. We rank the entities in increasing order of their total squared distance to the entities in the query set:

$$\begin{aligned} \underset{W2V}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = - \sum _{\tilde{x} \in {\mathcal{Q}}} \Vert v_x - v_{\tilde{x}}\Vert ^2. \end{aligned}$$
(2)

Here, \(v_{x}\) represents the L2-normalized embedding for x.
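
Eq. (2) as a minimal sketch; the matrix of L2-normalized entity embeddings is an assumed input.

```python
import numpy as np

def w2v_score(E, query_ids, x_id):
    """E: (num_entities x d) L2-normalized entity embeddings; higher score = closer to the query."""
    return -sum(float(np.sum((E[x_id] - E[q]) ** 2)) for q in query_ids)
```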

4.4 SetExpan

Shen et al. (2017) introduce SetExpan, a state-of-the-art framework for set expansion that combines context feature selection with ranking ensembles. SetExpan outperformed other set expansion methods such as SEISA in their evaluation. SetExpan represents entities by the contexts in which they are mentioned. For example, the context features for Batman from Sect. 3 would be {__ is good, __ is smart}. The contexts are used to create a large feature vector which can be used to compute inter-entity similarity. The authors argue that using all possible features for computing entity similarity can lead to overfitting and semantic drift. To combat these problems, SetExpan builds the entity set iteratively by cycling between a context feature selection step and an entity selection step. In context feature selection, each context feature is assigned a score based on the set of currently expanded entities. Based on these scores, the context features are re-ranked and the top few are selected. Entity selection proceeds by bootstrap sampling the chosen context features and using those samples to create multiple ranked lists of entities. These ranked lists are finally combined via a heuristic ensembling method to create a new set of expanded entities. The process is repeated until convergence to get the final list of expanded entities.

5 Neural variational set expansion

Like BS, Neural variational set expansion first determines the concept, or topic, underlying the query and then ranks entities based on that concept. Our method differs from BS in that we use a deep generative model with a low-dimensional concept representation to simulate how a concept may generate a query. Also, we use a “distance” (Sect. 5.2) between posterior distributions for ranking entities in lieu of Bayesian model comparison.

5.1 Inference step 1: concept discovery

Our model (Fig. 2) is as follows: \(z \in {\mathbb {R}}^d\) is a low-dimensional latent Gaussian random variable representing the concept of a query. z is sampled from a fixed prior distribution \(\pi = \mathcal {N}({\mathbf{0}}, \sigma ^2{\mathbf{I}})\), i.e. \(z \sim \pi\). The members of \({\mathcal{Q}}\) are sampled conditionally independently given z. z is mapped via a multi-layer perceptron (MLP), called \({{\mathrm{NN}}}^{(g)}_\theta\), to g, the p.m.f. of a multinomial distribution that generates \(f_x\), the features of x. \({{\mathrm{NN}}}^{(g)}_\theta\) is a neural network with a softmax output layer and parameters \(\theta\). \(f_x \in {\mathbb{Z}}^{\mathrm{F}}\) are sampled i.i.d. from \(p(f|z,\theta ) = {{\mathrm{NN}}}^{(g)}_\theta (z)\).Footnote 6

In other words, the vector \(f_x\) contains the counts of observed features for x that were sampled from g, and g was itself sampled by passing a Gaussian random variable through a neural network.

Fig. 2  The generative model of query generation is on the left and the variational inference network is on the right. Small nodes denote probability distributions, gray nodes are observations and the black node \(\pi\) is the known prior. \({{\mathrm{NN}}}^{(g)}_\theta\) transforms z to g and \({{\mathrm{NN}}}^{(i)}_\phi\) transforms \(f_x\) to \(q_\phi (z|x)\). a Generative network. b Inference network

Under this deep-generative model a concept vector can simultaneously trigger multiple observed features. This allows us to capture the correlations amongst features triggered by a concept. For example, the concept of president can simultaneously trigger features such as white house, executive order, or airforce one.

To infer the latent variable z, we would ideally compute \(p_\theta (z|{\mathcal{Q}})\), the posterior distribution of z given the observations \({\mathcal{Q}}\). Unfortunately, this computation is intractable because the prior is not conjugate to the likelihood, which contains a neural network. Another problem is that it is unrealistic to assume access to a large set of ESE queries at training time, because users’ information needs keep changing; therefore the approach used by Zaheer et al. (2017) in DeepSets to discriminatively learn a neural scoring function is impractical for set expansion. For the same reason it is also not possible to learn a single neural network whose input is \({\mathcal{Q}}\) and which directly approximates \(p_\theta (z|{\mathcal{Q}})\). Therefore it is non-trivial to apply the VAE framework to ESE. To overcome these problems we assume that a query \({\mathcal{Q}}\) is conjunctive in nature, i.e. if entities \(x_1\) and \(x_2\) are present in \({\mathcal{Q}}\) then results that are relevant to both \(x_1\) and \(x_2\) simultaneously should be ranked higher than results that are related to \(x_1\) but not \(x_2\), or vice-versa. We implement the conjunction of entities in a query by combining the Product of Experts (Hinton 1999) approach with the Variational Autoencoder (VAE) (Kingma and Welling 2013) method to approximate \(p_\theta (z|{\mathcal{Q}})\).

We first map each x to an approximate posterior \(q_\phi (z|x)\) via a neural network \({{\mathrm{NN}}}^{(i)}_\phi\) and then we take their product to approximate \(p_\theta (z|{\mathcal{Q}})\).

$$\begin{aligned} p_{\theta }(z | {\mathcal{Q}}) \approx q_\phi (z | {\mathcal{Q}}) \propto \prod _{x \in {\mathcal{Q}}} q_\phi (z | x). \end{aligned}$$

The \(\phi\) parameters are estimated by minimizing \(KL(q(z|x)\mid \mid p(z|x))\) as shown in Sect. 5.3.Footnote 7 The benefit of the POE approximation is that the posterior approximation \(q_\phi (.|x)\) for each entity x in \({\mathcal{Q}}\) acts as an expert and the product of these experts will assign a high value to only that region where all the posteriors assign a high value. Therefore the POE approximation is a way of implementing conjunctive semantics for a query. Another benefit is that if \(q_\phi (.|x)\) is an exponential family distribution with a constant base measure whose natural parameters are the output of \({{\mathrm{NN}}}^{(i)}_\phi\), then the product of the distributions \(\prod _x q_\phi (\cdot |x)\) lies in the same exponential family whose natural parameters are simply the sum of individual neural network outputs.Footnote 8,Footnote 9 We use \({{\mathrm{NN}}}^{(i)}_\phi\) to compute the mean and log-variance of the gaussian distribution \(q_\phi (z | x)\) (3) that we convert to the natural parameters of a Gaussian (4). Next, we add the natural parameters of the individual variational approximations \(\xi _x, \Gamma _x\) to compute the parameters \(\xi _{\mathcal{Q}}, \Gamma _{\mathcal{Q}}\) for \(q_\phi (z | {\mathcal{Q}})\) (5). Finally, we compute \(q_\phi (z|{\mathcal{Q}})\) (6).

$$\begin{aligned} \mu _x, \Sigma _x&= {{\mathrm{NN}}}^{(i)}_\phi (f_x) \end{aligned}$$
(3)
$$\begin{aligned} \xi _x,\ \ \Gamma _x&= \mu _x \Sigma _x^{-1},\ \ \Sigma _x^{-1}. \end{aligned}$$
(4)
$$\begin{aligned} \xi _{\mathcal{Q}},\ \ \Gamma _{\mathcal{Q}}&= \sum \nolimits _{x \in {\mathcal{Q}}} \xi _x,\ \ \sum \nolimits _{x \in {\mathcal{Q}}} \Gamma _x. \end{aligned}$$
(5)
$$\begin{aligned} {q_\phi (z|{\mathcal{Q}})}&= \mathcal {N}_c(z | \xi _{\mathcal{Q}}, \Gamma _{\mathcal{Q}}) \end{aligned}$$
(6)

\(\mathcal {N}_c(z | \xi , \Gamma )\) is the multi-variate Gaussian distribution in terms of its natural parameters:

$$\begin{aligned} \frac{|\Gamma |^{1/2}}{(2\pi )^{D/2}}\exp \left( -\frac{(z^T \Gamma z - 2\xi ^T z + \xi ^T\Gamma ^{-1}\xi )}{2} \right) . \end{aligned}$$
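
A minimal sketch of Eqs. (3)–(6) with diagonal Gaussians; `inference_net` is a stand-in for \({{\mathrm{NN}}}^{(i)}_\phi\) and is assumed to return the mean and log-variance of \(q_\phi (z|x)\).

```python
import numpy as np

def natural_params(f_x, inference_net):
    """Eqs. (3)-(4): map features to the natural parameters (xi_x, Gamma_x) of q(z|x)."""
    mu, log_var = inference_net(f_x)
    precision = np.exp(-log_var)   # diagonal of Gamma_x
    return mu * precision, precision

def query_posterior(query_features, inference_net):
    """Eqs. (5)-(6): product of experts = sum of the per-entity natural parameters."""
    params = [natural_params(f, inference_net) for f in query_features]
    xi_Q = sum(xi for xi, _ in params)
    gamma_Q = sum(g for _, g in params)
    return xi_Q, gamma_Q           # natural parameters of q(z | Q)
```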

5.2 Inference step 2: entity ranking

In order to rank the entities \(x \in {\mathcal{R}}\), we design a similarity score between the probability distributions \(q_\phi (z|{\mathcal{Q}})\) and \(q_\phi (z|x)\) as an efficient substitute for Bayesian model comparison. We use the distance between the precision-weighted means \(\xi _{{\mathcal{Q}}}\) and \(\xi _{x}\) to define our “distance” function as

$$\begin{aligned} \underset{NVSE}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = -||\xi _{{\mathcal{Q}}} - \xi _{x}||^2 . \end{aligned}$$
(7)
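
A one-line sketch of Eq. (7), reusing the precision-weighted means computed in the previous snippet.

```python
import numpy as np

def nvse_score(xi_Q, xi_x):
    """Rank candidates by the negative squared distance between precision-weighted means."""
    return -float(np.sum((xi_Q - xi_x) ** 2))
```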

Our inter-distribution “distance” is not a proper distance because it changes when the locations of both input distributions are shifted by the same amount. We experimented with more standard, reparameterization-invariant divergences and kernels such as the KL-divergence and the Probability Product Kernel (Jebara et al. 2004), see (Appendix 4), but we found our approach to be faster and more accurate. We believe this is because the regularization from the prior, which encourages the posteriors to be close to the origin, makes shift invariance unnecessary.

5.3 Unsupervised training

NVSE is trained in an unsupervised fashion to learn its parameters \(\theta\) and \(\phi\). Kingma and Welling (2013) and Rezende et al. (2014) proposed the VAE framework for learning richly parameterized conditional distributions \(p_\theta (x | z)\) from unlabeled data. We follow Kingma and Welling (2013)’s reparameterization trick to train a VAE and maximize the Evidence Lower Bound:

$$\begin{aligned} E_{q_\phi (z | x)}[\log p_\theta (x | z)] - KL(q_\phi (z|x) || p(z)). \end{aligned}$$
(8)

During training, we do not have access to any clustering information or side information that tells us which entities can be grouped together. Therefore we assume that the entities \(x \in {\mathcal{X}}\) were generated i.i.d. The model during training looks the same as Fig. 2 with one difference: Q is a singleton set of just one entity.Footnote 10 Note that our learning method requires no supervision, in contrast to methods like DeepSets, which require cluster information, or neural collaborative filtering methods, which require a large dataset of user interactions.
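
A hedged sketch of one training step under these assumptions (PyTorch is used here only for the reparameterization trick; `enc` and `dec` stand in for \({{\mathrm{NN}}}^{(i)}_\phi\) and \({{\mathrm{NN}}}^{(g)}_\theta\), and the prior is taken as \(\mathcal {N}({\mathbf{0}}, {\mathbf{I}})\), i.e. \(\sigma = 1\)).

```python
import torch

def negative_elbo(f_x, enc, dec):
    """f_x: one entity's feature counts; enc returns (mu, log_var); dec returns a p.m.f. over features."""
    mu, log_var = enc(f_x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)                  # reparameterization trick
    log_g = torch.log(dec(z) + 1e-10)
    recon = torch.sum(f_x * log_g)                        # multinomial log-likelihood (up to a constant)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # KL(q(z|x) || N(0, I))
    return -(recon - kl)                                  # minimize with any stochastic optimizer
```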

6 Interpretability

We introduce a general approach for interpreting ESE models based on query rationales to explain the latent concept the model discovered and result justifications to provide evidence for why the system ranked an entity highly. Based on query rationales and result justifications, users can add weights to entities in a query to tell the system what aspects of the query to focus on or ignore.

6.1 Query rationale

A Query Rationale is a visualization of the latent beliefs of the ESE system given the query \({\mathcal{Q}}\). Given \({\mathcal{Q}}\), we construct a feature-importance-map \(\gamma _{{\mathcal{Q}}}\) that measures the relative importance of the features in \(f_x\) and we show the top features according to \(\gamma _{\mathcal{Q}}\) as “Query Rationales”. Recall that the \(j{\text{th}}\) component of \(f_x\), associated with entity x, measures how often the \(j{\text{th}}\) feature co-occurred with x. We now present how we construct \(\gamma _{\mathcal{Q}}\) for NVSE and the baselines.

For BM25, \(\gamma _{{\mathcal{Q}}}\) is simply \(\bar{f}_{{\mathcal{Q}}}\). In BS, \(\gamma _{{\mathcal{Q}}}\) consists of the weights from (11b): for the \(j{\text{th}}\) component of \(f_x\),

$$\begin{aligned} \gamma _{{\mathcal{Q}}}[j] = \log \frac{\tilde{\alpha }_{\mathcal{Q}}[j] \beta [j]}{\alpha [j]\tilde{\beta }_{\mathcal{Q}}[j]}. \end{aligned}$$

The benefit of generative methods such as BS and NVSE is that query rationales can be computed for them as a natural by-product of the generative process rather than as an ad-hoc post-processing step. For NVSE, ideally \(\gamma _{{\mathcal{Q}}}\) should be the posterior distribution \(p_\theta (f | {\mathcal{Q}})\). Since this is intractable, we approximate it using the inference network:

$$\begin{aligned} p_\theta (f | {\mathcal{Q}}) = E_{p_\theta (z | {\mathcal{Q}})} [p_{\theta }(f | z, {\mathcal{Q}})] \approx E_{q_\phi (z | {\mathcal{Q}})} [p_{\theta }(f|z)] . \end{aligned}$$

We further approximate the expectation by evaluating at the mean of \(q_\phi (z | {\mathcal{Q}})\). Finally, the feature importance map for NVSE is:

$$\begin{aligned} \gamma _{{\mathcal{Q}}} = p_\theta (f | E[q_\phi (z | {\mathcal{Q}})]). \end{aligned}$$
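
A sketch of this computation, reusing `query_posterior` from Sect. 5.1 and the generative network as a stand-in for \(p_\theta (f|z)\); `feature_names` is a hypothetical list mapping feature indices back to readable strings.

```python
import numpy as np

def query_rationale(query_features, inference_net, generative_net, feature_names, k=10):
    xi_Q, gamma_Q = query_posterior(query_features, inference_net)
    z_mean = xi_Q / gamma_Q               # mean of q(z|Q) recovered from its natural parameters
    gamma_feat = generative_net(z_mean)   # approximate p(f | Q), a p.m.f. over the F features
    top = np.argsort(-gamma_feat)[:k]
    return [feature_names[i] for i in top]
```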

Because Word2Vecf finds the nearest neighbors between entity embeddings, which are produced through a complicated learning process operating on the whole text corpus, it does not provide a natural way to determine the importance of a single sentence, and therefore it is not possible to say what effect a particular sentence had on the query results. Similarly, since the SetExpan method works by extracting context features and iteratively expanding this feature set, it is not possible to determine the effect of a single sentence on the final search results.

6.2 Result justifications

We define result justifications as sentences in \({\mathcal{M}}_{x}\) that justify why an entity was ranked highly for a given query. Ranking the sentences that mention an entity is similar to ranking entities in \({\mathcal{R}}\). Just as we create a feature vector for each x, we create a feature vector for each sentence in \({\mathcal{M}}_{x}\) and use the same scoring function to rank the sentences given the query. While computing a score for entity x based on a query, we also score each sentence in \({\mathcal{M}}_{x}\). Our approach to generating interpretable result justifications is agnostic to the ESE method, with the caveat that for methods like Word2Vecf and SetExpan it would require retraining or re-indexing over the corpus for each query, which makes it infeasible for such methods.
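
A sketch of the justification step for NVSE under the same assumptions as the earlier snippets: each sentence in \({\mathcal{M}}_x\) is featurized like an entity and scored against the query posterior.

```python
def justifications(query_features, x, sentence_features, inference_net, top_k=3):
    """sentence_features[x]: list of (sentence, feature_vector) pairs for sentences in M_x."""
    xi_Q, _ = query_posterior(query_features, inference_net)
    scored = []
    for sentence, f_s in sentence_features[x]:
        xi_s, _ = natural_params(f_s, inference_net)
        scored.append((nvse_score(xi_Q, xi_s), sentence))
    return [s for _, s in sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]]
```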

6.3 Weighted queries

Any recommendation system can occasionally fail to provide good results for a query. To improve a system’s responses in such cases we enable users to guide NVSE’s results by using entity weights to influence the posterior distribution over topics.

If a user provides weights \({\varvec{\tau }}= \{ \tau _x \mid x \in {\mathcal{Q}}\}\), we compute the query features as

$$\begin{aligned} \xi _{{\mathcal{Q}},{\varvec{\tau }}},\ \ \Gamma _{{\mathcal{Q}},{\varvec{\tau }}} = \sum \nolimits _{x \in {\mathcal{Q}}} \tau _x \xi _x,\ \ \sum \nolimits _{x \in {\mathcal{Q}}} |\tau _x|\Gamma _x. \end{aligned}$$
(9)

The above formulae have an intuitive explanation: when an entity has a higher weight, the precision over the concepts activated by that entity is increased according to the magnitude of the weight, and the precision-weighted mean is also scaled by the user-supplied weight. In turn, an entity with zero weight has no effect on the final search result, and entities with a large negative weight return entities diametrically opposite to that entity with higher confidence.
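
Eq. (9) as a small extension of the earlier `query_posterior` sketch; `weights` is the user-supplied \({\varvec{\tau }}\), aligned with the query features.

```python
def weighted_query_posterior(query_features, weights, inference_net):
    params = [natural_params(f, inference_net) for f in query_features]
    xi_Q = sum(tau * xi for tau, (xi, _) in zip(weights, params))
    gamma_Q = sum(abs(tau) * g for tau, (_, g) in zip(weights, params))
    return xi_Q, gamma_Q
```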

Weights can be applied to other methods as well. BM25 can multiply each \(f_x\) by x’s weight when computing \(\bar{f}_{\mathcal{Q}}\), and Word2Vecf can use a weighted average. It is not straightforward to incorporate weights in the BS and SetExpan systems. One possible way is to use bootstrap resampling of the query entities according to a softmax distribution over entity weights, but bootstrapping makes the system non-deterministic and therefore even more opaque to the user. Bootstrap resampling also requires multiple query executions, and it is not straightforward to combine the outputs of different search queries; therefore we do not advocate bootstrapping.

7 Comparative experiments

We test the hypothesis that NVSE can help bridge the gap between advances in IR and real-world use cases. We use human annotators on Amazon Mechanical Turk (AMT) to determine whether NVSE finds more relevant entities than our baseline methods in a real-world, automatically generated KG.

7.1 Dataset

TinkerBell (Al-Badrashiny et al. 2017) is a KG construction system that achieved top performance in the TAC KBP 2017 evaluation.Footnote 11 We used it as our automatic KG. For each entity e in TinkerBell we create \({\mathcal{M}}_e\) by concatenating all sentences that mention e, and we remove the top 100 most frequent features in the corpus from \({\mathcal{M}}_e\) to clean stop words. TinkerBell was constructed from the TAC KBP 2017 evaluation source corpus, LDC2017E25, which contains 30K English documents and 60K Spanish and Chinese documents.Footnote 12 Half of the English documents come from online discussion forums and the other half from news sources, e.g. Reuters or the New York Times. Our experiments only use the 77,845 EDL entities within TinkerBell that are assigned the type Person. We use the entity links to create a map from DBPedia categories to entities in TinkerBell, say M. Each entity in TinkerBell is associated with spans of characters that mention that entity. We tokenize and sentence-segment the documents in LDC2017E25 and associate sentences to each entity corresponding to its mentions. In the end we get 344,735 sentences associated with the 77K entities. The median number of sentences associated with an entity is 1 and the maximum is 4638, for the Barack Obama entity.Footnote 13 This is a good example of how automatic KGs differ from manually curated KGs. In TinkerBell most of the entities appear in only a single sentence, so only a single fact may be known about them. In contrast, KGs like FreeBase and DBPedia have a more uniform coverage of facts for the entities present in them. Another difference is that relational information, such as ancestry relations between entities, is much noisier in an automatically generated KG than in DBPedia, which relies on manually curated information present in Wikipedia.

7.2 Implementation details

We prune the vocabulary by removing any tokens that occur fewer than 5 times across all entities. We end up with \({\mathrm{F}}\,{=}\, 105448\), \({\mathrm{V}}= 61311\), \({\mathrm{D}}= 24661\), and \({\mathrm{V}}'= 19476\). We used the BM25 implementation in Gensim (Řehůřek and Sojka 2010) and we implemented BS ourselves. We chose \(\lambda \, =\, 0.5\), out of 0, 0.5, or 1, after visual inspection. We used the Word2Vecf and SetExpan codebases released by the authors.Footnote 14 For NVSE, we set \(d \, {=} \, 50\), \(\sigma \,{=} \, 1\). The generative network \({{\mathrm{NN}}}^{(g)}_\theta\) does not have hidden layers and the inference network \({{\mathrm{NN}}}^{(i)}_\phi\) has 1 hidden layer of size 500 with a \(\tanh\) non-linearity and two output layers for the mean \(\mu _x\) and the log of the diagonal of the variance \(\Sigma _x\). We use a diagonal \(\Sigma _x\).Footnote 15 For Word2Vecf, we used \(d=100\) to use the same number of parameters per entity as in NVSE. We trained with default hyperparameters for 100 iterations. We used SetExpan with the default hyperparameters as well, except that we limited the maximum number of iterations to 3 since we only needed the top 4 entities for our experiments.
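
A sketch of the two networks under the stated sizes (PyTorch is an assumption; the paper does not prescribe a framework): the inference network has one tanh hidden layer of 500 units with separate heads for the mean and log-variance, and the generative network is a single linear-plus-softmax layer.

```python
import torch
import torch.nn as nn

F_DIM, D_LATENT, HIDDEN = 105448, 50, 500

class InferenceNet(nn.Module):           # NN^(i)_phi
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(F_DIM, HIDDEN), nn.Tanh())
        self.mu = nn.Linear(HIDDEN, D_LATENT)
        self.log_var = nn.Linear(HIDDEN, D_LATENT)   # log of the diagonal of Sigma_x

    def forward(self, f_x):
        h = self.hidden(f_x)
        return self.mu(h), self.log_var(h)

class GenerativeNet(nn.Module):          # NN^(g)_theta, no hidden layers
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(D_LATENT, F_DIM)

    def forward(self, z):
        return torch.softmax(self.out(z), dim=-1)
```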

7.3 Experimental design

Prior work typically evaluates ESE on a small number of queries consisting of the most frequent entities, e.g. Ghahramani and Heller (2006) reported results for 10 queries with highly cited authors and Shen et al. (2017) used 20 test queries created from the 2000 most frequent entities in Wikipedia. However, in automatic KGs most entities are mentioned only a few times. For example, 60% of the entities in TinkerBell are mentioned once. We are primarily interested in unbiased evaluation over such entities; therefore we stratified the evaluation queries into three types.

The 1st type contains entities mentioned in only 1 sentence, the 2nd contains entities appearing in 2–10 sentences, and the 3rd contains entities mentioned in 11–100 sentences. We also stratified queries based on whether they had 3 or 5 entities. For each query type we randomly generated 80 queries by first sampling 80 Wikipedia categories and then sampling entities from those categories that were also part of the TinkerBell KG. This results in 480 queries. See Table 1 for examples.

For each query, we showed the names and the first paragraphs of the Wikipedia abstracts of the query’s entities, to help the AMT workers disambiguate entities unfamiliar to them. Then we showed the workers the top 4 entities returned by each system. Each resulting entity was shown with up to 3 justification sentences.Footnote 16 Since SetExpan and Word2Vecf do not return justifications, we used NVSE to extract justifications for their results. We asked workers to rank the systems from 1, the best system, to 3, the worst, and we allowed ties. The annotators found it difficult to compare results from 5 systems at a time, so we split our evaluation into two groups. Group 1 compared NVSE to BS and BM25, and group 2 compared NVSE to SetExpan and Word2Vecf. We randomized the placement of the lists so that the workers could not figure out which system created which list.

Table 1 Examples of randomly created queries
Table 2 The number of times a system was ranked 1st over 80 queries compared to other systems in the same group

7.4 Results

Table 2 shows the number of times the annotators ranked each system as the best out of the 80 queries. Over all queries, NVSE returned better results than the 4 baseline systems. It performed best with 5 entities in the query where each entity was mentioned only up to 10 times in the corpus. This shows that NVSE is able to discern better quality topics from multiple entities with sparse data. Extended results showing second- and third-place rankings of the systems are given in Table 5 in the appendix; they show that when NVSE does not rank first it is typically chosen as the second-ranked system.

The IR method BM25 was the strongest baseline, outperforming BS and SetExpan, and even NVSE in two settings. We believe that this is because of the low-resource conditions of our evaluation, where ad-hoc IR methods can have an advantage. Another reason BM25 worked well in our evaluation is the lack of auxiliary signals such as entity inter-relations and entity links, and the fact that all entities were of the person type. This makes our task different from the entity list completion (ELC) task (Balog 2009) and a bit simpler for methods that focus heavily on lexical overlap. Another difference between the ESE task and the ELC task is that in the ELC task a descriptive prompt describing the query was given to users while they evaluated the relevance of the returned results, whereas no such prompt was given in the ESE task. We also found that sometimes BM25 was rated highly because it returned results that were highly relevant to a single query entity instead of being topically similar to all entities. For example, on the query associated with “The Apprentice Contestants”, BM25’s results focused solely on Dennis Rodman, but NVSE tried to infer a common topic amongst the entities and returned generic celebrities, which annotators did not prefer.

On entities with little data, Word2Vecf and SetExpan perform poorly. Word2Vecf requires large amounts of data to learn useful representations (Altszyler et al. 2016), which explains why it performs poorly in our evaluation. The SetExpan algorithm directly uses context features extracted from the mentions of an entity, and returns entities with the same context features. This approach can overfit with little data. Even though SetExpan uses an ensembling method to reduce the variance of the algorithm, we believe using context features causes overfitting when an entity appears in only a few sentences. Lastly, we believe that BS suffers because its impoverished generative model has neither non-linearities nor low-dimensional topics for modeling correlations amongst tokens.

Table 3 The first row contains top 10 features most similar to \(z_{j}\)
Table 4 The top row represents a query with weights in parentheses and the bottom row lists corresponding query rationales

8 Analyzing interpretability

We now examine the similarity relations encoded in NVSE’s internal concept representations to understand what it is learning. We also provide examples of how query rationales and query weights can help users fine-tune their queries.

8.1 Understanding the concept space

To gain some insight into the distribution over concepts inferred by NVSE, we determined the top 10 words activated by each dimension of z by computing \({{\mathrm{NN}}}^{(g)}_\theta (e_j)\), where \(e_j\) is a one-hot vector in \({\mathbb {R}}^{50}\). Table 3 shows the top 10 words for selected components of z. We can easily recognize that dimensions 3, 33 and 37 of z represent finance, sports, and entertainment. Even though we did not constrain z to be component-wise interpretable, this structure naturally emerged after training.
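
A sketch of this probe, assuming a NumPy-compatible stand-in for \({{\mathrm{NN}}}^{(g)}_\theta\) and a hypothetical `feature_names` list mapping feature indices to readable strings.

```python
import numpy as np

def top_words_for_dimension(j, generative_net, feature_names, d=50, k=10):
    e_j = np.zeros(d, dtype=np.float32)
    e_j[j] = 1.0                          # one-hot concept vector
    g = generative_net(e_j)               # p.m.f. over the F features
    return [feature_names[i] for i in np.argsort(-g)[:k]]
```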

8.2 Weights and query rationale

Table 4 depicts how the query rationale returned by NVSE changes in response to entity weights. In the first column the query is {Abu Bakr Baghdadi} and the query rationale tells us that NVSE focuses on iraq, baghdadi etc. The second column shows a different query, {Osama Bin Laden}, and the query rationale changes accordingly to pakistani and osama. The third and fourth columns show rationales when the weights on “Laden” and “Baghdadi” are varied. When more weight is put on “Laden”, the query rationale contains more features associated with him, and when more weight is put on “Baghdadi”, features such as “islamic”, a token from the name of his organization, are returned. The last column shows an interesting configuration where a user is effectively asking for results that are similar to “Baghdadi” but dissimilar to “Laden”, and the feature for kurdish gets activated. Since the system returns results in under 100 ms, the user can fine-tune her query in real time with the help of these query rationales.

We give one more example of the utility of negative weights: when \({\mathcal{Q}}= \{\text{Brady}\}\), NVSE’s rationale is [brady, game, patriots, left, knee, field, tackle], indicating that NVSE associated the “Brady” entity with Tom Brady, who is a member of the Patriots football team. When we added “Wes Welker” to \({\mathcal{Q}}\) with a negative weight, the query rationale changed to [brady, game, left, tackle, knee, back, field]. Since Welker is a Patriots receiver who received a negative weight in the query, NVSE deactivated the patriots feature and activated the tackle feature, a position opposite to a receiver.

9 Conclusion

We introduced NVSE as a step towards making advances in entity set expansion useful in real-world settings. NVSE is a novel unsupervised approach based on the VAE framework that discovers related entities from noisy knowledge graphs. NVSE ranks entities in a KG using an efficient scoring function (7), ranking 80K entities on a commodity laptop in 100 ms.

Our experiments demonstrated that NVSE can be applied in real-world settings where automatically generated KGs are noisy. NVSE outperformed state of the art ESE systems and other strong baselines on a real world KG. We also presented a flexible approach to interpret ESE methods and justify their recommendations.

In future work, we will first improve our model using more powerful auto-encoders such as the Ladder VAE (Sønderby et al. 2016). Second, we will experiment with the use of side information, such as links from a KG, through Graph Convolutional Networks (Kipf and Welling 2017). Third, we would like to quantitatively measure how query rationales and justifications help users accomplish their search tasks. Finally, we will incorporate confidence scores from the KG in our model. Although there is room to improve our ESE method, we believe that NVSE is a significant step towards utilizing KGs and semantics for information retrieval and understanding in real-world settings.