1 Introduction

With the increasing popularity of virtual assistants such as Siri and Google Now, users interact with search systems by asking natural language questions that often contain named entity mentions. Moreover, a large fraction of queries contain a named entity, and searchers tend to use more question-queries for complex information needs [2]. Fast and accurate identification of named entities in user queries is therefore crucial for query understanding and for mapping the query to information in a structured knowledge base. Entity linking in search queries typically utilizes information derived from query logs and open knowledge bases such as DBpedia and Freebase. Such techniques, however, are not suited for enterprise and domain-specific search systems (e.g., legal, medical, healthcare), as very small user bases result in small query logs, and rich domain-specific knowledge bases are often absent. Recently, systems have been developed for automatic construction of semantic knowledge bases from domain-specific corpora [3], along with systems that use such domain-specific knowledge bases [8].

We describe the method for entity disambiguation and linking implemented in one such system, Watson Discovery Advisor. It offers users an interface to search the indexed information and uses the underlying knowledge base to enhance search results and provide additional entity-centric data exploration capabilities. The system automatically constructs a structured knowledge base by identifying entities and their relationships in input text corpora using the method described by Castelli et al. [3]. Thus, for each relationship discovered by the system, the corresponding mention text provides additional contextual information about the entities and relationships present in that mention. We posit that the dense graph structure discovered from the corpus, together with the additional context provided by the associated mention text, can be used to link entity name mentions in search queries to the corresponding entities in the graph. Our proposed entity linking algorithm is intuitive, relies on a theoretically sound probabilistic framework, and is fast and scalable, with an average response time of approximately 100 ms.

Figure 1 shows the proposed algorithm in action: the top-ranked suggestions for the named mentions Sergey and Larry are shown. As described in detail in the next section, the algorithm makes these suggestions by utilizing the terms in the question (search, algorithm) as well as the relationships in the graph between all target entities for the mentions "Sergey" and "Larry". The algorithm determines that the entities "Sergey Brin" and "Larry Page" have strong evidence from their textual content and are strongly connected in the graph, and hence suggests them as the most probable relevant entities in the context of the question.

Fig. 1. Entity suggestions produced by the proposed approach using text and entity context in the search query.

2 Proposed Approach

Let \(Q = \{C,T\}\) be the input query, where T is the ambiguous token and \(C = \{E_c,W_c\}\) is the context under which we have to disambiguate T. The context is provided by the words \(W_c=\{w_{c1},w_{c2},\dots ,w_{cl}\}\) in the query and the set of unambiguous entities \(E_c=\{e_{c1},e_{c2},\dots ,e_{cm}\}\). Note that this entity set may initially be empty if there are no unambiguous entity mentions in the query; in such cases, only the textual information is considered. The task is to map the ambiguous token T to one of its possible target entities. Let \(E_T = \{e_{T1},e_{T2},\dots ,e_{Tn}\}\) be the set of target entities for T. A ranked list of target entities can be constructed by computing \(P(e_{Ti}|C)\), i.e., the probability that the user is interested in entity \(e_{Ti}\) given the context C. Using Bayes' theorem, we can write \(P(e_{Ti}|C)\) as follows.
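To make the formulation concrete, the following is a minimal sketch of the input representation in Python; the class and field names are our own illustration, not identifiers from the described system.

```python
from dataclasses import dataclass

@dataclass
class Context:
    """Disambiguation context C = {E_c, W_c}."""
    words: list[str]     # W_c: context terms from the query
    entities: list[str]  # E_c: unambiguous entities (may be empty)

@dataclass
class Query:
    """Input query Q = {C, T}."""
    ambiguous_token: str  # T: the mention to disambiguate
    context: Context

# Example: a question mentioning "Sergey" (ambiguous) where "Larry Page"
# has already been resolved unambiguously.
q = Query(ambiguous_token="Sergey",
          context=Context(words=["search", "algorithm"],
                          entities=["Larry_Page"]))
```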

$$\begin{aligned} P(e_{Ti} | C) = \dfrac{P(e_{Ti})P(C|e_{Ti})}{P(C)} \propto P(C|e_{Ti}) \end{aligned}$$
(1)

Since we are only interested in the relative ordering of the target entities, we can ignore the denominator P(C), as its value is the same for all target entities. Likewise, assuming all entities to be equally probable in the absence of any context, the prior \(P(e_{Ti})\) can be ignored for ranking purposes. Assuming conditional independence of the context terms as well as of the entities in the context, we have:

$$\begin{aligned} P(e_{Ti}|C)&\propto P(W_c|e_{Ti}) \times P(E_c|e_{Ti}) = \underbrace{\prod _{w_c \in W_c} P(w_c|e_{Ti})}_{\text {text context}} \times \underbrace{\prod _{e_c \in E_c} P(e_c|e_{Ti})}_{\text {entity context}} \end{aligned}$$
(2)
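In an implementation, the product in Eq. 2 is best computed in log space to avoid floating-point underflow when many context terms are present. The following is a minimal sketch in Python (our own illustration, not the system's actual code); the estimators `p_word` and `p_entity` correspond to the two factors derived in the remainder of this section:

```python
import math
from typing import Callable

def score(target: str, words: list[str], ctx_entities: list[str],
          p_word: Callable[[str, str], float],
          p_entity: Callable[[str, str], float]) -> float:
    """Log of Eq. 2: sum of log P(w_c|e_Ti) over W_c plus log P(e_c|e_Ti) over E_c."""
    text_part = sum(math.log(p_word(w, target)) for w in words)
    entity_part = sum(math.log(p_entity(e, target)) for e in ctx_entities)
    return text_part + entity_part

def rank(targets: list[str], words: list[str], ctx_entities: list[str],
         p_word: Callable[[str, str], float],
         p_entity: Callable[[str, str], float]) -> list[str]:
    """Rank the target entities E_T by decreasing score, i.e., by P(e_Ti|C)."""
    return sorted(targets,
                  key=lambda t: score(t, words, ctx_entities, p_word, p_entity),
                  reverse=True)
```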

Computing Entity Context Contribution: The entity context factor in Eq. 2 corresponds to the evidence for the target entity given \(E_c\), the set of entities forming the context. For each individual entity \(e_c\) forming the context, we need to compute \(P(e_c|e_{Ti})\), i.e., the probability of observing \(e_c\) after observing the target entity \(e_{Ti}\). Intuitively, there is a higher chance of observing an entity that is involved in multiple relationships with \(e_{Ti}\) than an entity that has only a few relationships with \(e_{Ti}\). Thus, we can estimate \(P(e_c|e_{Ti})\) as follows:

$$\begin{aligned} P(e_c|e_{Ti}) = \dfrac{relCount(e_c,e_{Ti}) + 1}{relCount(e_{Ti}) + |E|} \end{aligned}$$
(3)

Here, \(relCount(e_c,e_{Ti})\) denotes the number of relationships between \(e_c\) and \(e_{Ti}\), and \(relCount(e_{Ti})\) denotes the total number of relationships in which \(e_{Ti}\) participates. Note that the factor of 1 in the numerator and |E| (the size of the entity set E) in the denominator have been added to smooth the probability values, so that entities not involved in any relationship with \(e_{Ti}\) still receive a small non-zero probability.
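A minimal sketch of this estimator, assuming pairwise and per-entity relationship counts have been precomputed from the graph (the dictionaries below are our own illustrative stand-ins for the system's storage, with made-up values):

```python
# Illustrative precomputed counts: relCount(e1, e2) for entity pairs and
# relCount(e) totals per entity; real values would come from the graph.
pair_counts: dict[tuple[str, str], int] = {("Sergey_Brin", "Larry_Page"): 12}
total_counts: dict[str, int] = {"Sergey_Brin": 450, "Larry_Page": 520}
NUM_ENTITIES = 30_000_000  # |E|: entity count, order of magnitude from Sect. 3

def p_entity(e_c: str, e_ti: str) -> float:
    """Eq. 3: smoothed P(e_c | e_Ti) from relationship counts."""
    # Relationships are treated as undirected here, so check both key orderings.
    rel = pair_counts.get((e_c, e_ti), 0) + pair_counts.get((e_ti, e_c), 0)
    return (rel + 1) / (total_counts.get(e_ti, 0) + NUM_ENTITIES)
```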

Computing Text Context Contribution: The text context factor in Eq. 2 corresponds to the evidence for the target entity given \(W_c\), the terms present in the input query. For each individual query term \(w_c\), we need to compute \(P(w_c|e_{Ti})\), i.e., the probability of observing \(w_c\) given \(e_{Ti}\). This probability can be estimated using the mention language model \(M_{Ti}\) of \(e_{Ti}\) as follows.

$$\begin{aligned} P(w_c|e_{Ti})&= P(w_c|M_{Ti})= \dfrac{\text {no. of times }w_c \text { appears in mentions of } e_{Ti} + 1}{|M_{Ti}| + N } \end{aligned}$$
(4)

Here, \(|M_{Ti}|\) is the total number of terms in the mentions of \(e_{Ti}\) and N is the size of the vocabulary. Since entities are discovered automatically from text, these mentions provide important contextual information, as illustrated in Sect. 1.
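A matching sketch of the mention language model estimator (again with illustrative data of our own; a production index would precompute the term totals rather than summing per call):

```python
from collections import Counter

# Term counts over all mention texts of each entity, assumed precomputed.
mention_counts: dict[str, Counter] = {
    "Sergey_Brin": Counter({"google": 90, "search": 40, "algorithm": 25}),
}
VOCAB_SIZE = 1_000_000  # N: vocabulary size (illustrative)

def p_word(w_c: str, e_ti: str) -> float:
    """Eq. 4: smoothed P(w_c | M_Ti) from the mention language model of e_Ti."""
    counts = mention_counts.get(e_ti, Counter())
    total = sum(counts.values())  # |M_Ti|: total terms in mentions of e_Ti
    return (counts[w_c] + 1) / (total + VOCAB_SIZE)
```

Passing `p_word` and `p_entity` to the `rank` function sketched after Eq. 2 then yields the ranked list of target entities.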

3 Evaluation

We use a semantic graph constructed from the text of all articles in Wikipedia by automatically extracting entities and their relations using IBM's Statistical Information and Relation Extraction (SIRE) toolkit. Even though popular knowledge bases such as DBpedia contain high-quality data, we chose to construct the semantic graph by automated means, as such a graph is closer to many practical real-world scenarios in which high-quality curated graphs are often unavailable and one has to resort to automatic methods of knowledge base construction. Our graph contains more than 30 million entities and 192 million distinct relationships, compared to 4.5 million entities and 70 million relationships in DBpedia.

For evaluating the proposed approach, we use the KORE50 [5] dataset, which contains 50 short sentences with highly ambiguous entity mentions. This widely used dataset is considered among the hardest for entity disambiguation. The average sentence length (after stop word removal) is 6.88 words, and each sentence contains 2.96 entity mentions on average. In the YAGO knowledge base [9], every mention has an average of 631 candidates to disambiguate; this number varies across knowledge bases. In our automatically constructed knowledge base, there are 2,261 candidates per mention, illustrating the additional difficulty of entity linking caused by the high noise in automatically constructed knowledge bases compared with manually curated knowledge bases such as DBpedia.

The results of our proposed approach and various other state-of-the-art entity linking methods on the same dataset are tabulated in Table 1. We note that the performance of our approach is comparable to or better than that of the other approaches, despite dealing with much noisier data. Further, the average response time of our approach is about 100 ms, as we utilize signals from mention text and relationship information about entities instead of performing the complex and time-consuming graph operations used by other methods, without sacrificing accuracy.

Table 1. Entity disambiguation accuracy

4 Conclusions

In this paper, we addressed the problem of mapping entity mentions in natural language search queries to the corresponding entities in an automatically constructed knowledge graph. We proposed an approach that utilizes the dense graph structure as well as the additional context provided by the mention text. Comparative evaluation against state-of-the-art approaches on a standard dataset shows the strength of our approach in achieving high accuracy with very fast response times. The proposed approach is currently deployed in an enterprise semantic search system, Watson Discovery Advisor, and our future work will focus on extending the approach to utilize user click feedback for improving the quality of entity suggestions.