1 Introduction

Imagine a physician trying to pinpoint a specific diagnosis or a journalist investigating abuses of governmental power. In both scenarios, a domain expert may try to find answers based on previously known, relevant entities—either a list of diagnoses with symptoms similar to those the patient is experiencing or a list of known conspirators. Instead of manually looking for connections between potential answers and prior knowledge, a searcher would like to rely on an automatic recommender to find the connections and answers for them, i.e. related entities.

In the information retrieval (IR) community, Entity set expansion (ESE) is the established task of recommending entities that are similar to a provided seed set of entities.Footnote 1 ESE has been applied in Question Answering (Wang et al. 2008), Relation Extraction (Lang and Henderson 2013) and Information Extraction (He and Grishman 2015) settings. The physician and journalist in our example cannot fully take advantage of IR advances in ESE for two main reasons: recent advances (1) often assume access to a clean, large Knowledge Graph and (2) are uninterpretable.

Many advanced ESE algorithms rely on manually curated, clean Knowledge Graphs (KGs), e.g. DBpedia (Auer et al. 2007) and Freebase (Bollacker et al. 2008). In real-world settings, users rarely have access to clean KGs, and instead may rely on automatically generated KGs. Such KGs are often noisy because they are created from complicated and error-prone NLP processes—illustrated in Fig. 1. For example, automatic KGs may include duplicate entities, associations (relations) between entities may be missing, and entities with similar names may be incorrectly disambiguated. These imperfections prevent machine learning approaches from performing well on automatically generated KGs. Furthermore, many ESE algorithms degrade as the sparsity and unreliability of KGs increase (Pujara et al. 2017; Rastogi et al. 2017). Advanced ESE methods, especially those that rely on neural networks, are uninterpretable (Mitra and Craswell 2017). If a physician cannot explain her decisions, patients may not trust her, and if a journalist cannot demonstrate how a certain individual is acting unethically or above the law, the resulting article may lack credibility. Furthermore, uninterpretability may limit the applications of advancements in IR, and more broadly artificial intelligence, as humans “won’t trust an A.I. unless it can explain itself.”Footnote 2

Fig. 1  Our Entity set expansion (ESE) system assumes a corpus that has been labeled with entity mentions which are clustered via cross-document co-reference and linking to a knowledge base; together known as entity discovery and linking (EDL). Given a query containing Obama, Bush, and Clinton, the ESE system returns other U.S. presidents found in the KG

We introduce Neural variational set expansion (NVSE) to advance the applicability of ESE research. NVSE is an unsupervised model based on Variational Autoencoders (VAEs) that receives a query, uses a Bayesian approach to determine a latent concept that unifies the entities in the query, and returns a ranked list of similar entities based on that latent concept. NVSE does not require supervised examples of queries and responses, nor pre-built clusters of entities. Instead, our method only requires sentences with linked entity mentions, i.e. spans of tokens associated with a KG entity, which are typically included in automatically generated KGs.

NVSE is robust to noisy automatically generated KGs, thus removing the need to rely on manually curated, clean KGs. We evaluate NVSE on the ESE task using TinkerBell (Al-Badrashiny et al. 2017), an automatically generated KG that placed first in the TAC KBP shared task. Unlike prior work that used ESE to improve entity linking for KG construction (Gottipati and Jiang 2011), our goal is the opposite: we leverage noisy automatically generated KGs to perform ESE. NVSE is interpretable; it outputs query rationales—a summary of the features our model associates with the query—and result justifications—an ordered list of sentences from the underlying corpus that justify why our method returned each entity. Query rationales and result justifications are reminiscent of annotator rationales (Zaidan et al. 2007).

To our knowledge, this is the first unsupervised neural approach to ESE, in contrast to neural methods for supervised collaborative filtering (Lee et al. 2017). All code and data are available at https://github.com/se4u/nvse and a video demonstration of the system is available at https://youtu.be/sGO_wvuPIzM.

2 Related work

2.1 Methods dependent on external information

Since automatically generated KGs can be noisy, some methods utilize information beyond entity links and mentions to aid ESE. Paşca and Van Durme (2007) use search engine query logs to extract attributes related to entities, and Paşca and Van Durme (2008) extract sets of instances associated with class labels based on web documents and queries. Pantel et al. (2009) use a large amount of web data, applying a learned word similarity matrix extracted from a 200 billion word Internet crawl to the ESE task. Both He and Xin (2011)’s SEISA system and Tong and Dean (2008)’s Google Sets use lists of items from the Internet and try to determine which elements in the lists are most relevant to a query. Sadamitsu et al. (2011) rely on given topic information about the queried entities to train a discriminative system. More recent approaches also use external information. Zaheer et al. (2017) use LDA (Blei et al. 2003) to create word clusters for supervision, and Vartak et al. (2017) use manual annotations by Twitter users. Zheng et al. (2017) use inter-entity links in knowledge graphs, which are very sparse in automatically generated KGs (Pujara et al. 2017; Rastogi et al. 2017). All of these approaches use more information than just entity links and mentions.

2.2 Methods for comparing entities

Set Expander for Any Language (SEAL) (Wang and Cohen 2007) and its variants (Wang and Cohen 2008, 2009) learn similarities between new words and example words using methods like Random Walks and Random Walks With Restart. Similar to Lin (1998), who used cosine and Jaccard similarity to find similar words, SEISA uses these metrics to expand sets. These methods are limited to extracting only words that co-occur. Because they are applied to web-scale data, SEAL and SEISA assume entities will eventually co-occur. This assumption might not hold in the underlying corpus used to automatically generate a KG. In contrast to those approaches, NVSE finds similar entities based on a kernel between distributions.

2.3 Queries as natural language

In the INEX-XER shared task, queries were represented as natural language questions (Demartini et al. 2010). Metzger et al. (2014) and Zhang et al. (2017) propose methods to extract related entities in a KG based on a natural language query. This scenario is similar to a person interacting with a system like Amazon Alexa. However, our setup better reflects users searching for similar entities in a KG as it is more efficient for users to type entities of interest instead of natural language text.

2.4 Neural collaborative filtering

We are not the first to incorporate neural methods in a recommendation system. Recently, He et al. (2017) and Lee et al. (2017) presented deep auto-encoders for collaborative filtering. Collaborative filtering assumes a large dataset of previous user interactions with the search engine. For many domains it is not possible to create such a dataset, since new data is added every day and queries change rapidly based on different users and domains. Therefore, we propose the first neural method that does not use supervision for entity set expansion.

3 Notation

Let \({\mathcal{D}}\) be the corpus of documents and \({\mathcal{V}}\) be the vocabulary of tokens that appear in \({\mathcal{D}}\). We define a document as a sequence of sentences and a sentence as a sequence of tokens. Let \({\mathcal{X}}\) be the set of entities discovered in \({\mathcal{D}}\); we refer to its size as \({\mathrm{X}}\). Each entity \(x \in {\mathcal{X}}\) is linked to the tokens that mention x.Footnote 3 Let \({\mathcal{V}}'\) be the set of tokens linked to any \(x \in {\mathcal{X}}\), and let \({\mathcal{M}}_x\) be the multiset of sentences that mention x in the corpus. For example, consider an entity named “Batman” and a document containing three sentences {Batman is good., He is smart., Life is good.}. “Batman” is linked to tokens Batman and He, \({\mathcal{V}}'= \{{\text{Batman, He}}\}\), and \({\mathcal{M}}_{\text{Batman}} =\) {Batman is good., He is smart.}.

In ESE, a system receives query \({\mathcal{Q}}\)—a subset of \({\mathcal{X}}\)—and has to sort the elements remaining in \({\mathcal{R}}= {\mathcal{X}}{\setminus} {\mathcal{Q}}\). The elements that are most similar to \({\mathcal{Q}}\) should appear higher in the sorted order and elements dissimilar to \({\mathcal{Q}}\) should be ranked lower.
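
To make the notation concrete, the following minimal Python sketch (with hypothetical toy data and names) builds \({\mathcal{M}}_x\) from linked sentences and shows the ranking interface an ESE system must implement.

```python
from collections import defaultdict

# Hypothetical toy corpus: each sentence is paired with the entities it mentions.
# In practice these links come from an entity discovery and linking (EDL) system.
linked_sentences = [
    ("Batman is good .", ["Batman"]),
    ("He is smart .", ["Batman"]),   # "He" is linked to Batman by co-reference
    ("Life is good .", []),
]

# M_x: the multiset of sentences that mention entity x.
mentions = defaultdict(list)
for sentence, entities in linked_sentences:
    for x in entities:
        mentions[x].append(sentence)

entity_set = set(mentions)  # the discovered entities X

def expand(query, score):
    """Sort the remaining entities R (all entities not in Q) by a scoring function score(Q, x)."""
    remaining = entity_set - set(query)
    return sorted(remaining, key=lambda x: score(query, x), reverse=True)
```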

4 Baseline methods

Before introducing NVSE, we describe the four baseline systems: BM25, Bayesian Sets, Word2Vecf and SetExpan. We do not compare to DeepSets (Zaheer et al. 2017), as it is a supervised method that requires entity clusters.

For each x, we create a feature vector \(f_x \in {\mathbb{Z}}^{{\mathrm{F}}}\) from \({\mathcal{M}}_x\) by concatenating three vectors that count how many times (1) a token in \({\mathcal{V}}\) appeared in \({\mathcal{M}}_x\), (2) a document in \({\mathcal{D}}\) mentioned x, and (3) a token in \({\mathcal{V}}'\) appeared in \({\mathcal{M}}_x\). Thus, \({\mathrm{F}}= {\mathrm{V}}+ {\mathrm{D}}+ {\mathrm{V}}'\).
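
A hedged sketch of this featurization, assuming the mention multisets from the previous snippet plus hypothetical inputs for document ids; the three count blocks are concatenated into a single vector of length F = V + D + V'.

```python
import numpy as np
from collections import Counter

def feature_vector(x, vocab, docs, linked_vocab, mentions, mention_docs):
    """vocab: list of tokens V; docs: list of document ids D;
    linked_vocab: list of tokens V' linked to some entity;
    mentions[x]: sentences mentioning x; mention_docs[x]: document ids mentioning x."""
    linked_set = set(linked_vocab)
    tok_counts = Counter(t for sent in mentions[x] for t in sent.split())
    doc_counts = Counter(mention_docs[x])
    link_counts = Counter(t for sent in mentions[x] for t in sent.split()
                          if t in linked_set)
    return np.concatenate([
        np.array([tok_counts[t] for t in vocab], dtype=np.int64),         # V block
        np.array([doc_counts[d] for d in docs], dtype=np.int64),          # D block
        np.array([link_counts[t] for t in linked_vocab], dtype=np.int64), # V' block
    ])
```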

4.1 BM25

Best Match 25 (BM25) is “one of the most successful text-retrieval algorithms” (Robertson and Zaragoza 2009).Footnote 4 BM25 ranks remaining entities in \({\mathcal{R}}\) according to the score function

$$\begin{aligned} \underset{BM}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = {\sum _{i=1}^{\mathrm{F}}} {\frac{{{\mathrm{IDF}}}[i] f_x[i] \bar{f}_{\mathcal{Q}}[i] (k_1 + 1) }{f_x[i] {+} k_{1} (1{-}b {+} b {\sum _{j} f_x[j]}/{\bar{L}})}}, \end{aligned}$$

where \(f_{x}[j]\) denotes the j-th feature value in \(f_{x}\), \(\bar{f}_{\mathcal{Q}}\) is the sum of \(f_x\) over all \(x \in {\mathcal{Q}}\), and \(\mathbb {I}\) is the indicator function. \(k_1\) and b are hyperparameters that are commonly set to 1.5 and 0.75 (Manning et al. 2008). \(\bar{L}\) is the average total count of a feature in the entire corpus and \({{\mathrm{IDF}}}[i]\) is the inverse document frequency of the \(i{\text{th}}\) feature (Appendix 1).
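
As a sanity check on the formula, a minimal sketch of the scorer; the summed query features, per-feature IDF vector and \(\bar{L}\) are assumed to be precomputed as described above.

```python
import numpy as np

def bm25_score(f_Q_bar, f_x, idf, L_bar, k1=1.5, b=0.75):
    """f_Q_bar: summed query features; f_x: candidate entity features; idf: per-feature IDF."""
    length_norm = k1 * (1.0 - b + b * f_x.sum() / L_bar)
    return float(np.sum(idf * f_x * f_Q_bar * (k1 + 1.0) / (f_x + length_norm)))

# Ranking sketch: f_Q_bar = sum of f_x over x in Q, then sort R by bm25_score(...).
```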

4.2 Bayesian sets

Ghahramani and Heller (2006) introduced the Bayesian Sets (BS) method, which converts ESE into a Bayesian model selection problem. BS compares the probability that the query entities were generated from a single sample of a latent variable \(z \in \Delta ^{{\mathrm{F}}}\) with the probability that they were generated from independent samples. \(\Delta ^{{\mathrm{F}}}\) is the \({\mathrm{F}}-1\) dimensional probability simplex. Note that z has the same dimensionality as the observed features. Given \({\mathcal{Q}}\) and \(\pi\), the prior distribution of z, BS infers the posterior distribution of z, \(p(z | {\mathcal{Q}})\), and computes the following score

$$\begin{aligned} \underset{BS}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = \log \frac{E_{p(z | {\mathcal{Q}})}[p(x|z)]}{E_{\pi (z)}[p(x|z)]}. \end{aligned}$$
(1)

Ghahramani and Heller (2006) computed \({{\mathrm{{score}}}}_{BS}\) in closed form by selecting the conditional probability, p(x|z), from an exponential family distribution and setting \(\pi\) to be its conjugate prior. They showed that if p(x|z) is multivariate Bernoulli then BS requires a single matrix multiplication (Appendix 3), and we use this setting for our experiments.
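
A sketch of that closed form, assuming binarized features and per-feature Beta(\(\alpha\), \(\beta\)) priors as in Ghahramani and Heller (2006); the scores for all entities reduce to one matrix multiplication.

```python
import numpy as np

def bayesian_sets_scores(X_bin, query_rows, alpha, beta):
    """X_bin: (num_entities x F) binary feature matrix; query_rows: row indices of Q;
    alpha, beta: per-feature Beta prior parameters (length-F arrays)."""
    N = len(query_rows)
    s = X_bin[query_rows].sum(axis=0)          # per-feature counts within the query
    alpha_t, beta_t = alpha + s, beta + N - s  # posterior pseudo-counts
    q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
    c = np.sum(np.log(alpha + beta) - np.log(alpha + beta + N)
               + np.log(beta_t) - np.log(beta))
    return c + X_bin @ q                       # score for every entity at once
```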

4.3 Word2Vecf

Levy and Goldberg (2014) generalize Mikolov et al. (2013)’s Skip-Gram model as Word2Vecf to include arbitrary contexts. We embed entities with Word2Vecf by using the entity IDs as wordsFootnote 5 and the tokens in the sentences mentioning those entities as contexts. Note that all tokens in the sentence, except for some stop words, are used as contexts, not just co-occurring entities. We rank the entities in increasing order of their total squared distance to the entities in the query set:

$$\begin{aligned} \underset{W2V}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = - \sum _{\tilde{x} \in {\mathcal{Q}}} \Vert v_x - v_{\tilde{x}}\Vert ^2. \end{aligned}$$
(2)

Here, \(v_{x}\) represents the L2-normalized embedding for x.
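
Eq. (2) as a minimal sketch; the matrix of L2-normalized entity embeddings is an assumed input.

```python
import numpy as np

def w2v_score(E, query_ids, x_id):
    """E: (num_entities x d) L2-normalized entity embeddings; higher score = closer to the query."""
    return -sum(float(np.sum((E[x_id] - E[q]) ** 2)) for q in query_ids)
```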

4.4 SetExpan

Shen et al. (2017) introduce SetExpan, a state-of-the-art framework for set expansion that combines context feature selection with ranking ensembles. SetExpan outperformed other set expansion methods such as SEISA in their evaluation. SetExpan represents entities by the contexts in which they are mentioned. For example, the context features for Batman from Sect. 3 would be {__ is good, __ is smart}. The contexts are used to create a large feature vector which can be used to compute inter-entity similarity. The authors argue that using all possible features for computing entity similarity can lead to overfitting and semantic drift. To combat these problems, SetExpan builds the entity set iteratively by cycling between a context feature selection step and an entity selection step. In context feature selection, each context feature is assigned a score based on the set of currently expanded entities. Based on these scores, the context features are re-ranked and the top few are selected. Entity selection proceeds by bootstrap sampling the chosen context features and using those samples to create multiple ranked lists of entities. These ranked lists are finally combined via a heuristic ensembling method to create a new set of expanded entities. The process is repeated until convergence to get the final list of expanded entities.

5 Neural variational set expansion

Like BS, Neural variational set expansion first determines the concept, or topic, underlying the query and then ranks entities based on that concept. Our method differs from BS in that we use a deep generative model with a low-dimensional concept representation to simulate how a concept may generate a query. Also, we use a “distance” (Sect. 5.2) between posterior distributions for ranking entities in lieu of Bayesian model comparison.

5.1 Inference step 1: concept discovery

Our model (Fig. 2) is as follows: \(z \in {\mathbb {R}}^d\) is a low-dimensional latent Gaussian random variable representing the concept of a query. z is sampled from a fixed prior distribution \(\pi = \mathcal {N}({\mathbf{0}}, \sigma ^2{\mathbf{I}})\), i.e. \(z \sim \pi\). The members of \({\mathcal{Q}}\) are sampled conditionally independently given z. z is mapped via a multi-layer perceptron (MLP), called \({{\mathrm{NN}}}^{(g)}_\theta\), to g, the p.m.f. of a multinomial distribution that generates \(f_x\), the features of x. \({{\mathrm{NN}}}^{(g)}_\theta\) is a neural network with a softmax output layer and parameters \(\theta\). \(f_x \in {\mathbb{Z}}^{\mathrm{F}}\) are sampled i.i.d. from \(p(f|z,\theta ) = {{\mathrm{NN}}}^{(g)}_\theta (z)\).Footnote 6

In other words, the vector \(f_x\) contains the counts of observed features for x that were sampled from g, and g was itself sampled by passing a Gaussian random variable through a neural network.

Fig. 2  The generative model of query generation is on the left and the variational inference network is on the right. Small nodes denote probability distributions, gray nodes are observations and the black node \(\pi\) is the known prior. \({{\mathrm{NN}}}^{(g)}_\theta\) transforms z to g and \({{\mathrm{NN}}}^{(i)}_\phi\) transforms \(f_x\) to \(q_\phi (z|x)\). a Generative network. b Inference network

Under this deep-generative model a concept vector can simultaneously trigger multiple observed features. This allows us to capture the correlations amongst features triggered by a concept. For example, the concept of president can simultaneously trigger features such as white house, executive order, or airforce one.

To infer the latent variable z, we would ideally compute \(p_\theta (z|{\mathcal{Q}})\), the posterior distribution of z given the observations \({\mathcal{Q}}\). Unfortunately, this computation is intractable because the prior is not conjugate to the likelihood, which contains a neural network. Another problem is that it is unrealistic to assume access to a large set of ESE queries at training time, because users’ information needs keep changing; therefore the approach used by Zaheer et al. (2017) in DeepSets to discriminatively learn a neural scoring function is impractical for set expansion. For the same reason it is also not possible to learn a single neural network whose input is \({\mathcal{Q}}\) and which directly approximates \(p_\theta (z|{\mathcal{Q}})\). Therefore it is non-trivial to apply the VAE framework to ESE. To overcome these problems we assume that a query \({\mathcal{Q}}\) is conjunctive in nature, i.e. if entities \(x_1\) and \(x_2\) are present in \({\mathcal{Q}}\) then results that are relevant to both \(x_1\) and \(x_2\) simultaneously should be ranked higher than results that are related to \(x_1\) but not \(x_2\), or vice-versa. We implement the conjunction of entities in a query by combining the Product of Experts (Hinton 1999) approach with the Variational Autoencoder (VAE) (Kingma and Welling 2013) method to approximate \(p_\theta (z|{\mathcal{Q}})\).

We first map each x to an approximate posterior \(q_\phi (z|x)\) via a neural network \({{\mathrm{NN}}}^{(i)}_\phi\) and then we take their product to approximate \(p_\theta (z|{\mathcal{Q}})\).

$$\begin{aligned} p_{\theta }(z | {\mathcal{Q}}) \approx q_\phi (z | {\mathcal{Q}}) \propto \prod _{x \in {\mathcal{Q}}} q_\phi (z | x). \end{aligned}$$

The \(\phi\) parameters are estimated by minimizing \(KL(q(z|x)\mid \mid p(z|x))\) as shown in Sect. 5.3.Footnote 7 The benefit of the POE approximation is that the posterior approximation \(q_\phi (.|x)\) for each entity x in \({\mathcal{Q}}\) acts as an expert and the product of these experts will assign a high value to only that region where all the posteriors assign a high value. Therefore the POE approximation is a way of implementing conjunctive semantics for a query. Another benefit is that if \(q_\phi (.|x)\) is an exponential family distribution with a constant base measure whose natural parameters are the output of \({{\mathrm{NN}}}^{(i)}_\phi\), then the product of the distributions \(\prod _x q_\phi (\cdot |x)\) lies in the same exponential family whose natural parameters are simply the sum of individual neural network outputs.Footnote 8,Footnote 9 We use \({{\mathrm{NN}}}^{(i)}_\phi\) to compute the mean and log-variance of the gaussian distribution \(q_\phi (z | x)\) (3) that we convert to the natural parameters of a Gaussian (4). Next, we add the natural parameters of the individual variational approximations \(\xi _x, \Gamma _x\) to compute the parameters \(\xi _{\mathcal{Q}}, \Gamma _{\mathcal{Q}}\) for \(q_\phi (z | {\mathcal{Q}})\) (5). Finally, we compute \(q_\phi (z|{\mathcal{Q}})\) (6).

$$\begin{aligned} \mu _x, \Sigma _x&= {{\mathrm{NN}}}^{(i)}_\phi (f_x) \end{aligned}$$
(3)
$$\begin{aligned} \xi _x,\ \ \Gamma _x&= \mu _x \Sigma _x^{-1},\ \ \Sigma _x^{-1}. \end{aligned}$$
(4)
$$\begin{aligned} \xi _{\mathcal{Q}},\ \ \Gamma _{\mathcal{Q}}&= \sum \nolimits _{x \in {\mathcal{Q}}} \xi _x,\ \ \sum \nolimits _{x \in {\mathcal{Q}}} \Gamma _x. \end{aligned}$$
(5)
$$\begin{aligned} {q_\phi (z|{\mathcal{Q}})}&= \mathcal {N}_c(z | \xi _{\mathcal{Q}}, \Gamma _{\mathcal{Q}}) \end{aligned}$$
(6)

\(\mathcal {N}_c(z | \xi , \Gamma )\) is the multi-variate Gaussian distribution in terms of its natural parameters:

$$\begin{aligned} \frac{|\Gamma |^{1/2}}{(2\pi )^{D/2}}\exp \left( -\frac{(z^T \Gamma z - 2\xi ^T z + \xi ^T\Gamma ^{-1}\xi )}{2} \right) . \end{aligned}$$
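
A minimal sketch of Eqs. (3)–(6) with diagonal Gaussians; `inference_net` is a stand-in for \({{\mathrm{NN}}}^{(i)}_\phi\) and is assumed to return the mean and log-variance of \(q_\phi (z|x)\).

```python
import numpy as np

def natural_params(f_x, inference_net):
    """Eqs. (3)-(4): map features to the natural parameters (xi_x, Gamma_x) of q(z|x)."""
    mu, log_var = inference_net(f_x)
    precision = np.exp(-log_var)   # diagonal of Gamma_x
    return mu * precision, precision

def query_posterior(query_features, inference_net):
    """Eqs. (5)-(6): product of experts = sum of the per-entity natural parameters."""
    params = [natural_params(f, inference_net) for f in query_features]
    xi_Q = sum(xi for xi, _ in params)
    gamma_Q = sum(g for _, g in params)
    return xi_Q, gamma_Q           # natural parameters of q(z | Q)
```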

5.2 Inference step 2: entity ranking

In order to rank the entities \(x \in {\mathcal{R}}\), we design a similarity score between the probability distributions \(q_\phi (z|{\mathcal{Q}})\) and \(q_\phi (z|x)\) as an efficient substitute for Bayesian model comparison. We use the distance between the precision-weighted means \(\xi _{{\mathcal{Q}}}\) and \(\xi _{x}\) to define our “distance” function as

$$\begin{aligned} \underset{NVSE}{{{\mathrm{{score}}}}}({\mathcal{Q}}, x) = -||\xi _{{\mathcal{Q}}} - \xi _{x}||^2 . \end{aligned}$$
(7)
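
A one-line sketch of Eq. (7), reusing the precision-weighted means computed in the previous snippet.

```python
import numpy as np

def nvse_score(xi_Q, xi_x):
    """Rank candidates by the negative squared distance between precision-weighted means."""
    return -float(np.sum((xi_Q - xi_x) ** 2))
```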

Our inter-distribution “distance” is not a proper distance because it changes when the locations of both input distributions are shifted by the same amount. We experimented with more standard, reparameterization-invariant divergences and kernels such as the KL-divergence and the Probability Product Kernel (Jebara et al. 2004), see (Appendix 4), but we found our approach to be faster and more accurate. We believe this is because the regularization from the prior, which encourages the posteriors to be close to the origin, makes shift invariance unnecessary.

5.3 Unsupervised training

NVSE is trained in an unsupervised fashion to learn its parameters \(\theta\) and \(\phi\). Kingma and Welling (2013) and Rezende et al. (2014) proposed the VAE framework for learning richly parameterized conditional distributions \(p_\theta (x | z)\) from unlabeled data. We follow Kingma and Welling (2013)’s reparameterization trick to train a VAE and maximize the Evidence Lower Bound:

$$\begin{aligned} E_{q_\phi (z | x)}[\log p_\theta (x | z)] - KL(q_\phi (z|x) || p(z)). \end{aligned}$$
(8)

During training, we do not have access to any clustering information or side information that tells us which entities can be grouped together. Therefore we assume that the entities \(x \in {\mathcal{X}}\) were generated i.i.d. The model during training looks the same as Fig. 2 with one difference: Q is a singleton set of just one entity.Footnote 10 Note that our learning method requires no supervision, in contrast to methods like DeepSets, which require cluster information, or neural collaborative filtering methods, which require a large dataset of user interactions.
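
A hedged sketch of one training step under these assumptions (PyTorch is used here only for the reparameterization trick; `enc` and `dec` stand in for \({{\mathrm{NN}}}^{(i)}_\phi\) and \({{\mathrm{NN}}}^{(g)}_\theta\), and the prior is taken as \(\mathcal {N}({\mathbf{0}}, {\mathbf{I}})\), i.e. \(\sigma = 1\)).

```python
import torch

def negative_elbo(f_x, enc, dec):
    """f_x: one entity's feature counts; enc returns (mu, log_var); dec returns a p.m.f. over features."""
    mu, log_var = enc(f_x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn_like(std)                  # reparameterization trick
    log_g = torch.log(dec(z) + 1e-10)
    recon = torch.sum(f_x * log_g)                        # multinomial log-likelihood (up to a constant)
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())  # KL(q(z|x) || N(0, I))
    return -(recon - kl)                                  # minimize with any stochastic optimizer
```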

6 Interpretability

We introduce a general approach for interpreting ESE models based on query rationales to explain the latent concept the model discovered and result justifications to provide evidence for why the system ranked an entity highly. Based on query rationales and result justifications, users can add weights to entities in a query to tell the system what aspects of the query to focus on or ignore.

6.1 Query rationale

A Query Rationale is a visualization of the latent beliefs of the ESE system given the query \({\mathcal{Q}}\). Given \({\mathcal{Q}}\), we construct a feature-importance-map \(\gamma _{{\mathcal{Q}}}\) that measures the relative importance of the features in \(f_x\) and we show the top features according to \(\gamma _{\mathcal{Q}}\) as “Query Rationales”. Recall that the \(j{\text{th}}\) component of \(f_x\), associated with entity x, measures how often the \(j{\text{th}}\) feature co-occurred with x. We now present how we construct \(\gamma _{\mathcal{Q}}\) for NVSE and the baselines.

For BM25, \(\gamma _{{\mathcal{Q}}}\) is simply \(\bar{f}_{{\mathcal{Q}}}\). In BS, \(\gamma _{{\mathcal{Q}}}\) consists of the weights from (11b): for the \(j{\text{th}}\) component of \(f_x\),

$$\begin{aligned} \gamma _{{\mathcal{Q}}}[j] = \log \frac{\tilde{\alpha }_{\mathcal{Q}}[j] \beta [j]}{\alpha [j]\tilde{\beta }_{\mathcal{Q}}[j]}. \end{aligned}$$

The benefit of generative methods such as BS and NVSE is that query rationales can be computed for them as a natural by-product of the generative process rather than as an ad-hoc post-processing step. For NVSE, ideally \(\gamma _{{\mathcal{Q}}}\) should be the posterior distribution \(p_\theta (f | {\mathcal{Q}})\). Since this is intractable, we approximate it using the inference network:

$$\begin{aligned} p_\theta (f | {\mathcal{Q}}) = E_{p_\theta (z | {\mathcal{Q}})} [p_{\theta }(f | z, {\mathcal{Q}})] \approx E_{q_\phi (z | {\mathcal{Q}})} [p_{\theta }(f|z)] . \end{aligned}$$

We further approximate the expectation by evaluating at the mean of \(q_\phi (z | {\mathcal{Q}})\). Finally, the feature importance map for NVSE is:

$$\begin{aligned} \gamma _{{\mathcal{Q}}} = p_\theta (f | E[q_\phi (z | {\mathcal{Q}})]). \end{aligned}$$
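
A sketch of this computation, reusing `query_posterior` from Sect. 5.1 and the generative network as a stand-in for \(p_\theta (f|z)\); `feature_names` is a hypothetical list mapping feature indices back to readable strings.

```python
import numpy as np

def query_rationale(query_features, inference_net, generative_net, feature_names, k=10):
    xi_Q, gamma_Q = query_posterior(query_features, inference_net)
    z_mean = xi_Q / gamma_Q               # mean of q(z|Q) recovered from its natural parameters
    gamma_feat = generative_net(z_mean)   # approximate p(f | Q), a p.m.f. over the F features
    top = np.argsort(-gamma_feat)[:k]
    return [feature_names[i] for i in top]
```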

Because Word2Vecf finds the nearest neighbors between entity embeddings, which are produced through a complicated learning process operating on the whole text corpus, it does not provide a natural way to determine the importance of a single sentence, and therefore it is not possible to say what effect a particular sentence had on the query results. Similarly, since the SetExpan method works by extracting context features and iteratively expanding this feature set, it is not possible to determine the effect of a single sentence on the final search results.

6.2 Result justifications

We define result justifications as sentences in \({\mathcal{M}}_{x}\) that justify why an entity was ranked highly for a given query. Ranking the sentences that mention an entity is similar to ranking entities in \({\mathcal{R}}\). Just as we create a feature vector for each x, we create a feature vector for each sentence in \({\mathcal{M}}_{x}\) and use the same scoring function to rank the sentences given the query. While computing a score for entity x based on a query, we also score each sentence in \({\mathcal{M}}_{x}\). Our approach to generating interpretable result justifications is agnostic to the ESE method, with the caveat that for methods like Word2Vecf and SetExpan it would require retraining or re-indexing over the corpus for each query, which makes it infeasible for such methods.
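
A sketch of the justification step for NVSE under the same assumptions as the earlier snippets: each sentence in \({\mathcal{M}}_x\) is featurized like an entity and scored against the query posterior.

```python
def justifications(query_features, x, sentence_features, inference_net, top_k=3):
    """sentence_features[x]: list of (sentence, feature_vector) pairs for sentences in M_x."""
    xi_Q, _ = query_posterior(query_features, inference_net)
    scored = []
    for sentence, f_s in sentence_features[x]:
        xi_s, _ = natural_params(f_s, inference_net)
        scored.append((nvse_score(xi_Q, xi_s), sentence))
    return [s for _, s in sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]]
```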

6.3 Weighted queries

Any recommendation system can occasionally fail to provide good results for a query. To improve a system’s responses in such cases we enable users to guide NVSE’s results by using entity weights to influence the posterior distribution over topics.

If a user provides weights \({\varvec{\tau }}= \{ \tau _x \mid x \in {\mathcal{Q}}\}\), we compute the query features as

$$\begin{aligned} \xi _{{\mathcal{Q}},{\varvec{\tau }}},\ \ \Gamma _{{\mathcal{Q}},{\varvec{\tau }}} = \sum \nolimits _{x \in {\mathcal{Q}}} \tau _x \xi _x,\ \ \sum \nolimits _{x \in {\mathcal{Q}}} |\tau _x|\Gamma _x. \end{aligned}$$
(9)

The above formulae have an intuitive explanation: when an entity has a higher weight, the precision over the concepts activated by that entity is increased according to the magnitude of the weight, and the precision-weighted mean is also scaled by the user-supplied weight. In turn, an entity with zero weight has no effect on the final search result, and entities with a large negative weight return entities diametrically opposite to that entity with higher confidence.
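
Eq. (9) as a small extension of the earlier `query_posterior` sketch; `weights` is the user-supplied \({\varvec{\tau }}\), aligned with the query features.

```python
def weighted_query_posterior(query_features, weights, inference_net):
    params = [natural_params(f, inference_net) for f in query_features]
    xi_Q = sum(tau * xi for tau, (xi, _) in zip(weights, params))
    gamma_Q = sum(abs(tau) * g for tau, (_, g) in zip(weights, params))
    return xi_Q, gamma_Q
```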

Weights can be applied to other methods as well. BM25 can multiply each \(f_x\) by x’s weight when computing \(\bar{f}_{\mathcal{Q}}\), and Word2Vecf can use a weighted average. It is not straightforward to incorporate weights in the BS and SetExpan systems. One possible way is to use bootstrap resampling of the query entities according to a softmax distribution over entity weights, but bootstrapping makes the system non-deterministic and therefore even more opaque to the user. Bootstrap resampling also requires multiple query executions, and it is not straightforward to combine the outputs of different search queries; therefore we do not advocate bootstrapping.

7 Comparative experiments

We test the hypothesis that NVSE can help bridge the gap between advances in IR and real-world use cases. We use human annotators on Amazon Mechanical Turk (AMT) to determine whether NVSE finds more relevant entities than our baseline methods in a real-world, automatically generated KG.

7.1 Dataset

TinkerBell (Al-Badrashiny et al. 2017) is a KG construction system that achieved top performance in the TAC KBP 2017 evaluation.Footnote 11 We used it as our automatic KG. For each entity e in TinkerBell we create \({\mathcal{M}}_e\) by concatenating all sentences that mention e, and we remove the top 100 most frequent features in the corpus from \({\mathcal{M}}_e\) to clean stop words. TinkerBell was constructed from the TAC KBP 2017 evaluation source corpus, LDC2017E25, which contains 30K English documents and 60K Spanish and Chinese documents.Footnote 12 Half of the English documents come from online discussion forums and the other half from news sources, e.g. Reuters or the New York Times. Our experiments only use the 77,845 EDL entities within TinkerBell that are assigned the type Person. We use the entity links to create a map from DBPedia categories to entities in TinkerBell, say M. Each entity in TinkerBell is associated with spans of characters that mention that entity. We tokenize and sentence-segment the documents in LDC2017E25 and associate sentences to each entity corresponding to its mentions. In the end we get 344,735 sentences associated with the 77K entities. The median number of sentences associated with an entity is 1 and the maximum is 4638, for the Barack Obama entity.Footnote 13 This is a good example of how automatic KGs differ from manually curated KGs. In TinkerBell most of the entities appear in only a single sentence, so only a single fact may be known about them. In contrast, KGs like FreeBase and DBPedia have a more uniform coverage of facts for the entities present in them. Another difference is that relational information, such as ancestry relations between entities, is much noisier in an automatically generated KG than in DBPedia, which relies on manually curated information present in Wikipedia.

7.2 Implementation details

We prune the vocabulary by removing any tokens that occur fewer than 5 times across all entities. We end up with \({\mathrm{F}}\,{=}\, 105448\), \({\mathrm{V}}= 61311\), \({\mathrm{D}}= 24661\), and \({\mathrm{V}}'= 19476\). We used the BM25 implementation in Gensim (Řehůřek and Sojka 2010) and we implemented BS ourselves. We chose \(\lambda \, =\, 0.5\), out of 0, 0.5, or 1, after visual inspection. We used the Word2Vecf and SetExpan codebases released by the authors.Footnote 14 For NVSE, we set \(d \, {=} \, 50\), \(\sigma \,{=} \, 1\). The generative network \({{\mathrm{NN}}}^{(g)}_\theta\) does not have hidden layers and the inference network \({{\mathrm{NN}}}^{(i)}_\phi\) has 1 hidden layer of size 500 with a \(\tanh\) non-linearity and two output layers for the mean \(\mu _x\) and the log of the diagonal of the variance \(\Sigma _x\). We use a diagonal \(\Sigma _x\).Footnote 15 For Word2Vecf, we used \(d=100\) to use the same number of parameters per entity as in NVSE. We trained with default hyperparameters for 100 iterations. We used SetExpan with the default hyperparameters as well, except that we limited the maximum number of iterations to 3 since we only needed the top 4 entities for our experiments.
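
A sketch of the two networks under the stated sizes (PyTorch is an assumption; the paper does not prescribe a framework): the inference network has one tanh hidden layer of 500 units with separate heads for the mean and log-variance, and the generative network is a single linear-plus-softmax layer.

```python
import torch
import torch.nn as nn

F_DIM, D_LATENT, HIDDEN = 105448, 50, 500

class InferenceNet(nn.Module):           # NN^(i)_phi
    def __init__(self):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(F_DIM, HIDDEN), nn.Tanh())
        self.mu = nn.Linear(HIDDEN, D_LATENT)
        self.log_var = nn.Linear(HIDDEN, D_LATENT)   # log of the diagonal of Sigma_x

    def forward(self, f_x):
        h = self.hidden(f_x)
        return self.mu(h), self.log_var(h)

class GenerativeNet(nn.Module):          # NN^(g)_theta, no hidden layers
    def __init__(self):
        super().__init__()
        self.out = nn.Linear(D_LATENT, F_DIM)

    def forward(self, z):
        return torch.softmax(self.out(z), dim=-1)
```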

7.3 Experimental design

Prior work typically evaluates ESE on a small number of queries consisting of the most frequent entities, e.g. Ghahramani and Heller (2006) reported results for 10 queries with highly cited authors and Shen et al. (2017) used 20 test queries created from the 2000 most frequent entities in Wikipedia. However, in automatic KGs most entities are mentioned only a few times. For example, 60% of the entities in TinkerBell are mentioned once. We are primarily interested in unbiased evaluation over such entities; therefore we stratified the evaluation queries into three types.

The 1st type contains entities mentioned in only 1 sentence, the 2nd contains entities appearing in 2–10 sentences, and the 3rd contains entities mentioned in 11–100 sentences. We also stratified queries based on whether they had 3 or 5 entities. For each query type we randomly generated 80 queries by first sampling 80 Wikipedia categories and then sampling entities from those categories that were also part of the TinkerBell KG. This results in 480 queries. See Table 1 for examples.

For each query, we showed the names and the first paragraphs of the Wikipedia abstracts of the query’s entities, to help the AMT workers disambiguate entities unfamiliar to them. Then we showed the workers the top 4 entities returned by each system. Each resulting entity was shown with up to 3 justification sentences.Footnote 16 Since SetExpan and Word2Vecf do not return justifications, we used NVSE to extract justifications for their results. We asked workers to rank the systems from 1, the best system, to 3, the worst, and we allowed ties. The annotators found it difficult to compare results from 5 systems at a time, so we split our evaluation into two groups. Group 1 compared NVSE to BS and BM25, and group 2 compared NVSE to SetExpan and Word2Vecf. We randomized the placement of the lists so that the workers could not figure out which system created which list.

Table 1 Examples of randomly created queries
Table 2 The number of times a system was ranked 1st over 80 queries compared to other systems in the same group

7.4 Results

Table 2 shows the number of times the annotators ranked each system as the best out of the 80 queries. Over all queries, NVSE returned better results than the 4 baseline systems. It performed best with 5 entities in the query where each entity was mentioned only up to 10 times in the corpus. This shows that NVSE is able to discern better quality topics from multiple entities with sparse data. Extended results showing second- and third-place rankings of the systems are given in Table 5 in the appendix; they show that when NVSE does not rank first it is typically chosen as the second-ranked system.

The IR method BM25 was the strongest baseline, outperforming BS and SetExpan, and even NVSE in two settings. We believe that this is because of the low-resource conditions of our evaluation, where ad-hoc IR methods can have an advantage. Another reason BM25 worked well in our evaluation is the lack of auxiliary signals such as entity inter-relations and entity links, and the fact that all entities were of the person type. This makes our task different from the entity list completion (ELC) task (Balog 2009) and a bit simpler for methods that focus heavily on lexical overlap. Another difference between the ESE task and the ELC task is that in the ELC task a descriptive prompt describing the query was given to users while they evaluated the relevance of the returned results, whereas no such prompt was given in the ESE task. We also found that sometimes BM25 was rated highly because it returned results that were highly relevant to a single query entity instead of being topically similar to all entities. For example, on the query associated with “The Apprentice Contestants”, BM25’s results focused solely on Dennis Rodman, but NVSE tried to infer a common topic amongst the entities and returned generic celebrities, which annotators did not prefer.

On entities with little data, Word2Vecf and SetExpan perform poorly. Word2Vecf requires large amounts of data to learn useful representations (Altszyler et al. 2016), which explains why it performs poorly in our evaluation. The SetExpan algorithm directly uses context features extracted from the mentions of an entity, and returns entities with the same context features. This approach can overfit with little data. Even though SetExpan uses an ensembling method to reduce the variance of the algorithm, we believe using context features causes overfitting when an entity appears in only a few sentences. Lastly, we believe that BS suffers because its impoverished generative model has neither non-linearities nor low-dimensional topics for modeling correlations amongst tokens.

Table 3 The first row contains top 10 features most similar to \(z_{j}\)
Table 4 The top row represents a query with weights in parentheses and the bottom row lists corresponding query rationales

8 Analyzing interpretability

We now examine the similarity relations encoded in NVSE’s internal concept representations to understand what it is learning. We also provide examples of how query rationales and query weights can help users fine-tune their queries.

8.1 Understanding the concept space

To gain some insight into the distribution over concepts inferred by NVSE, we determined the top 10 words activated by each dimension of z by computing \({{\mathrm{NN}}}^{(g)}_\theta (e_j)\), where \(e_j\) is a one-hot vector in \({\mathbb {R}}^{50}\). Table 3 shows the top 10 words for selected components of z. We can easily recognize that dimensions 3, 33 and 37 of z represent finance, sports, and entertainment. Even though we did not constrain z to be component-wise interpretable, this structure naturally emerged after training.
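
A sketch of this probe, assuming a NumPy-compatible stand-in for \({{\mathrm{NN}}}^{(g)}_\theta\) and a hypothetical `feature_names` list mapping feature indices to readable strings.

```python
import numpy as np

def top_words_for_dimension(j, generative_net, feature_names, d=50, k=10):
    e_j = np.zeros(d, dtype=np.float32)
    e_j[j] = 1.0                          # one-hot concept vector
    g = generative_net(e_j)               # p.m.f. over the F features
    return [feature_names[i] for i in np.argsort(-g)[:k]]
```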

8.2 Weights and query rationale

Table 4 depicts how the query rationale returned by NVSE changes in response to entity weights. In the first column the query is {Abu Bakr Baghdadi} and the query rationale tells us that NVSE focuses on iraq, baghdadi etc. The second column shows a different query, {Osama Bin Laden}, and the query rationale changes accordingly to pakistani and osama. The third and fourth columns show rationales when the weights on “Laden” and “Baghdadi” are varied. When more weight is put on “Laden”, the query rationale contains more features associated with him, and when more weight is put on “Baghdadi”, features such as “islamic”, a token from the name of his organization, are returned. The last column shows an interesting configuration where a user is effectively asking for results that are similar to “Baghdadi” but dissimilar to “Laden”, and the feature for kurdish gets activated. Since the system returns results in under 100 ms, the user can fine-tune her query in real time with the help of these query rationales.

We give one more example of the utility of negative weights: when \({\mathcal{Q}}= \{\text{Brady}\}\), NVSE’s rationale is [brady, game, patriots, left, knee, field, tackle], indicating that NVSE associated the “Brady” entity with Tom Brady, who is a member of the Patriots football team. When we added “Wes Welker” to \({\mathcal{Q}}\) with a negative weight, the query rationale changed to [brady, game, left, tackle, knee, back, field]. Since Welker is a Patriots receiver who received a negative weight in the query, NVSE deactivated the patriots feature and activated the tackle feature, a position opposite to a receiver.

9 Conclusion

We introduced NVSE as a step towards making advances in entity set expansion useful in real-world settings. NVSE is a novel unsupervised approach based on the VAE framework that discovers related entities from noisy knowledge graphs. NVSE ranks entities in a KG using an efficient scoring function (7), ranking 80K entities on a commodity laptop in 100 ms.

Our experiments demonstrated that NVSE can be applied in real-world settings where automatically generated KGs are noisy. NVSE outperformed state of the art ESE systems and other strong baselines on a real world KG. We also presented a flexible approach to interpret ESE methods and justify their recommendations.

In future work, we will first improve our model using more powerful auto-encoders such as the Ladder VAE (Sønderby et al. 2016). Second, we will experiment with the use of side information, such as links from a KG, through Graph Convolutional Networks (Kipf and Welling 2017). Third, we would like to quantitatively measure how query rationales and justifications help users accomplish their search tasks. Finally, we will incorporate confidence scores from the KG in our model. Although there is room to improve our ESE method, we believe that NVSE is a significant step towards utilizing KGs and semantics for information retrieval and understanding in real-world settings.