1 Introduction

Topic modeling has been widely applied in various domains such as text mining [1, 2], information retrieval [21], and image processing [5, 12]. Latent Dirichlet Allocation (LDA) is one of the most studied topic modeling methods. It is a generative model that extends probabilistic Latent Semantic Indexing (pLSI) [4]. LDA has been widely investigated for modeling text corpora with exchangeable discrete data (i.e., bags of words), where each document is modeled as a mixture over topics drawn from a Dirichlet distribution, and each topic is modeled as a multinomial distribution over words drawn from a finite vocabulary.

With the increasing availability of text resources of various characteristics (temporal, tagged, multi-class, etc.) and applications, researchers have proposed various LDA variants [14, 16, 18, 19] in the literature. Traditional LDA does not perform effectively if the text corpus has skewed topic distributions or highly overlapping keywords between topics. Further, many real-world events are entity driven. For example, news reports of different terror attacks share many important keywords such as blast, bomb, IED, terror, attack, etc., but are differentiated by named entities such as person names, locations, and organizations. Over such text collections, an unsupervised generative approach like LDA fails to separate documents related to different events effectively because of the high keyword overlap. In this paper, we address this problem by assuming prior knowledge about the underlying topics and their representative named entities. With the increasing availability of real-world datasets on natural disasters, terror attacks, and political discussions, such an assumption about entity-driven events is relevant in today's scenario. We refer to such representative named entities as prioritized entities. In the literature, SeededLDA [8] makes a similar assumption, albeit for general keywords: it samples each word of a document as a seed word or non-seed word for each topic and modifies the document-topic and topic-word distributions accordingly. In contrast, the proposed Prioritized Named Entity driven LDA (PNE-LDA) takes the prioritized or non-prioritized status of each word as an external observed input and estimates the topic-word distributions based on it.

From experimental observations over three datasets of different nature (Bomb Blast, Reuters-21578, and 20-Newsgroups), it is evident that the proposed PNE-LDA outperforms its LDA counterparts for entity-driven topics. The remainder of this paper is organized as follows. We discuss relevant related work in Sect. 2, followed by the proposed solution in Sect. 3. Section 4 describes the characteristics of the datasets and gives insight into the results. Finally, we conclude our work and outline future work in Sect. 5.

2 Literature Review

Topic modeling has been used over the past decades to mine important hidden information from large text corpora [1, 2, 21]. Its applications range from clustering text corpora and mining scientific articles to detecting events on Twitter and in news publications. LDA is one of the prominent topic modeling techniques and has been suitably modified for many such needs. Rosen-Zvi et al. [16] proposed the Author-Topic Model, a variant of LDA that mines authors' research interests from collections of research papers by modeling their content. McCallum et al. [14] extended the Author-Topic Model to the Author-Recipient-Topic model for mining topics, the interaction relationship between sender and receiver, and people's roles from Enron and academic email. Other variants of LDA, such as Topics-over-Time LDA (TOT-LDA) [19], Dynamic LDA [3], and Temporal LDA [20], incorporate temporal factors to find granular topics and the evolution of topics over time in text documents such as scientific publications and tweets.

To include supervised information in LDA, Labeled LDA (L-LDA) [15] establishes a one-to-one relationship between LDA topics and user-defined class labels. Thereafter, several variants of LDA have been proposed to include supervised information, e.g., sLDA [13], DiscLDA [11], and MedLDA [23] for single-label document classification, and DP-MRM [10], Dep-LDA [17], and Boost Multi class L-LDA [9] for multi-label document classification. Source-LDA [22] has been proposed to include external information from Wikipedia to guide topic-word formation. SeededLDA [8] is a variant of LDA that incorporates the user's understanding of the corpus and biases the topic formation process with the help of representative words for each topic, which is useful for various extrinsic tasks such as document classification.

Fig. 1. Plate diagrams of LDA, SeededLDA, and the proposed PNE-LDA

3 Methodology

In this section, we discuss the proposed Prioritized Named Entity driven LDA (PNE-LDA). LDA is a generic probabilistic graphical model capturing the latent semantic topic distribution across large document collections. PNE-LDA extends LDA to cluster a text corpus with the help of prior information in the form of prioritized named entities. By providing prior information about the clusters in terms of prioritized named entities, we can improve cluster quality, especially in the case of highly overlapping clusters.

3.1 Prioritized Named Entity Driven LDA (PNE-LDA)

Despite the high overlap of word distributions among clusters, every cluster has distinguishable named entities such as persons, locations, and organizations, which we call prioritized named entities. As such prioritized named entities occur rarely compared with general words, they need to be projected onto topics separately from the general words.

Plate diagrams of LDA, SeededLDA, and the proposed PNE-LDA are shown in Fig. 1 for comparison. In PNE-LDA, we consider one random variable \(x_{nd}\) similar to SeededLDA, acting as a switch that denotes whether a word is a prioritized word or a general word. In contrast to SeededLDA, where the switch is sampled for each topic from a Beta distribution, \(x_{nd}\) is observed in PNE-LDA. We use different Dirichlet distributions for prioritized and non-prioritized words to generate the topic-word distributions, as shown in Fig. 1(c). A list of prioritized named entities is provided to the model before PNE-LDA starts. The generative algorithm of PNE-LDA is given in Algorithm 1.

We first sample the prioritized named entity topic-word distribution (\(\beta _1\)) and the general topic-word distribution (\(\beta _2\)) using uniform Dirichlet parameters \(\eta _1\) and \(\eta _2\), respectively. For each document in the corpus, we first sample the document-topic distribution \(\theta _d\) using the Dirichlet prior \(\alpha \). Then, for each word position, we sample a word from either the topic-prioritized named entity distribution or the topic-general word distribution based on the value of \(x_{nd}\): if \(x_{nd}\) is 0, we sample from the topic-prioritized named entity distribution; otherwise, we sample from the topic-general word distribution.
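To make this generative process concrete, the following minimal NumPy sketch simulates it. All names (K, V1, V2, generate_document) and the toy sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5              # number of topics
V1, V2 = 50, 500   # vocabulary sizes: prioritized entities vs. general words
alpha, eta1, eta2 = 0.1, 0.1, 0.2

# Topic-word distributions, one row per topic for each vocabulary.
beta1 = rng.dirichlet(np.full(V1, eta1), size=K)  # prioritized named entities
beta2 = rng.dirichlet(np.full(V2, eta2), size=K)  # general words

def generate_document(n_words, x):
    """x[n] = 0 -> word n is drawn from the prioritized-entity vocabulary;
    x[n] = 1 -> word n is drawn from the general vocabulary."""
    theta = rng.dirichlet(np.full(K, alpha))  # document-topic distribution
    words = []
    for n in range(n_words):
        z = rng.choice(K, p=theta)            # sample a topic for position n
        if x[n] == 0:
            w = rng.choice(V1, p=beta1[z])    # prioritized named entity
        else:
            w = rng.choice(V2, p=beta2[z])    # general word
        words.append((z, w))
    return words

# Example: a 10-word document whose first two tokens are known entities.
doc = generate_document(10, x=[0, 0] + [1] * 8)
```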

PNE-LDA emphasizes putting more weight on prioritized keywords. By varying the Dirichlet parameters, i.e., the prioritized named entity parameter (\(\eta _1\)) and the general word parameter (\(\eta _2\)), we can put different weights on prioritized named entities and general words, respectively. Like SeededLDA, PNE-LDA divides the topic-word relationship into two parts: (i) the topic-prioritized named entity relationship (governed by \(\eta _1\)) and (ii) the topic-general word relationship (governed by \(\eta _2\)).

Algorithm 1. Generative process of PNE-LDA

The conditional probability of assigning a topic j to a word \(w_{nd}\) of document d by PNE-LDA can be written as:

$$\begin{aligned} P(z_{nd}=j \mid z_{\lnot nd}, w_{\lnot nd}) \propto \left\{ \begin{aligned}&(\alpha + n^{w_{d}}_{\lnot nd, j} ) \frac{\eta _1 + n^{(w_{nd})}_{\lnot nd,j}}{ V_1 \eta _1 + n^{(.)}_{\lnot nd,j} } ,&\text {if } x_{nd}=0 \\&(\alpha + n^{w_{d}}_{\lnot nd, j} ) \frac{\eta _2 + n^{(w_{nd})}_{\lnot nd,j}}{ V_2 \eta _2 + n^{(.)}_{\lnot nd,j} } ,&\text {otherwise} \end{aligned} \right. \end{aligned}$$
(1)

where \(x_{nd} = 0\) indicates a prioritized named entity and \(x_{nd}=1\) indicates a general word. The symbols used in Eq. 1 are as follows: (i) \(w_{nd}\) denotes the word at the \(n^{th}\) index of document d; (ii) \(z_{nd}\) denotes the topic of the word at the \(n^{th}\) index of document d; (iii) \(z_{\lnot nd}\) denotes all topic-word assignments except that of the current word; (iv) \(w_{\lnot nd}\) denotes all words in the corpus except the current word; (v) \(n^{w_{d}}_{\lnot nd, j}\) denotes the number of words of the current document assigned to topic j, excluding the current word \(w_{nd}\); (vi) \(n^{(w_{nd})}_{\lnot nd,j}\) denotes the number of occurrences of the word type \(w_{nd}\) assigned to topic j, excluding the current token; (vii) \(n^{(.)}_{\lnot nd,j} =\sum _{ \forall w_{nd} \in V}{n^{(w_{nd})}_{\lnot nd,j}}\) denotes the total number of words assigned to topic j, excluding the current word \(w_{nd}\); and (viii) \(V_1\) and \(V_2\) denote the vocabulary sizes of prioritized named entities and non-prioritized words, respectively.

In Eq. 1, the left-hand side \(P(z_{nd} =j \mid \cdot )\) is the probability of assigning topic j to the word at the \(n^{th}\) index of document d. The first term on the right-hand side corresponds to choosing topic j from the multinomial distribution of topics in the \(d^{th}\) document. The second term corresponds to choosing the word \(w_{nd}\) from topic j: if the word is a prioritized word, it is drawn from the prioritized word-topic distribution parameterized by \(\eta _1\); otherwise, it is drawn from the general word-topic distribution parameterized by \(\eta _2\).
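As an illustration of how Eq. 1 drives inference, the sketch below implements one collapsed Gibbs sampling step under this model. The count arrays (n_dk, n_kw1, n_kw2, n_k1, n_k2) and their layout are our own assumptions about how such a sampler might organize its state, not the authors' code.

```python
import numpy as np

def sample_topic(d, n, w, x_nd, z, n_dk, n_kw1, n_kw2, n_k1, n_k2,
                 alpha, eta1, eta2, V1, V2, rng):
    """Resample the topic of word w at position n of document d (Eq. 1)."""
    old = z[d][n]
    # Remove the current assignment from all counts (the "not nd" terms).
    n_dk[d, old] -= 1
    if x_nd == 0:
        n_kw1[old, w] -= 1; n_k1[old] -= 1
    else:
        n_kw2[old, w] -= 1; n_k2[old] -= 1

    # Eq. 1: document-topic term times the topic-word term for the
    # vocabulary (prioritized vs. general) selected by the switch x_nd.
    if x_nd == 0:
        p = (alpha + n_dk[d]) * (eta1 + n_kw1[:, w]) / (V1 * eta1 + n_k1)
    else:
        p = (alpha + n_dk[d]) * (eta2 + n_kw2[:, w]) / (V2 * eta2 + n_k2)
    new = rng.choice(len(p), p=p / p.sum())

    # Add the new assignment back into the counts.
    z[d][n] = new
    n_dk[d, new] += 1
    if x_nd == 0:
        n_kw1[new, w] += 1; n_k1[new] += 1
    else:
        n_kw2[new, w] += 1; n_k2[new] += 1
```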

3.2 Prioritizing Named Entities

Ideally, named entities representing the target topics would be given as external inputs. For an arbitrary dataset, however, identifying such entities manually is expensive. In this study, we therefore consider all the named entities present in the documents. For identifying the named entities, we use Stanford NER [6]. For the Bomb Blast dataset, where Indian named entities need to be identified, we adapted Stanford NER by training it to recognize Indian named entities. These named entities could, in principle, be assigned different priorities; for simplicity, this study assigns equal priority to all named entities present in the documents. Once the representative named entities are identified, the proposed PNE-LDA treats them as prioritized keywords and the rest as non-prioritized keywords.
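A minimal sketch of deriving the switch variable \(x_{nd}\) with Stanford NER through NLTK's interface is given below. The model and jar paths are placeholders that depend on a local Stanford NER installation; the adapted Indian-entity model mentioned above would be substituted for the stock English model.

```python
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    'english.all.3class.distsim.crf.ser.gz',  # CRF model (person/location/org)
    'stanford-ner.jar')                        # path to the Stanford NER jar

def switch_variables(tokens):
    """Return x with x[n] = 0 for named entities (prioritized) and
    x[n] = 1 for general words, mirroring the convention in Eq. 1."""
    tagged = tagger.tag(tokens)  # list of (token, tag) pairs; 'O' = no entity
    return [0 if tag != 'O' else 1 for _, tag in tagged]

tokens = ['A', 'blast', 'occurred', 'in', 'Mumbai', 'on', 'Friday']
x = switch_variables(tokens)  # e.g. [1, 1, 1, 1, 0, 1, 1]
```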

4 Experimental Setups

Datasets: We experiment with three datasets, namely Bomb Blast, Reuters-21578, and 20-Newsgroups, as described in Table 1. The Bomb Blast dataset is our locally collected and processed dataset, consisting of 855 news articles reporting 53 different bomb blast events that occurred in different parts of India. The Reuters-21578 dataset consists of 5,485 documents spanning 8 clusters, and the 20-Newsgroups dataset consists of 11,293 documents spanning 20 clusters. Among the three, Bomb Blast has the most occurrences of named entities, while Reuters-21578 has the fewest. The Bomb Blast dataset mostly contains person, location, and organization names, whereas the 20-Newsgroups dataset has only person and organization names. The Reuters-21578 dataset mostly consists of business articles and hence has a limited number of named entities.

Table 1. Characteristics of the experimental datasets. NE denotes named entity

Experimental Setup: We experimented with different values of the document-topic Dirichlet parameter \(\alpha \) and set it to 0.1 for LDA, SeededLDA, and PNE-LDA alike, as \(\alpha =0.1\) gives the best results on all three datasets. For LDA, we set the topic-word Dirichlet parameter \(\eta \) to 0.2 for the Bomb Blast dataset, 0.3 for the Reuters dataset, and 0.2 for 20-Newsgroups. For PNE-LDA and SeededLDA, to assign higher weight to named entities, we set \(\eta _1\) lower than \(\eta _2\) for all three datasets; SeededLDA uses the same values of \(\eta _1\) and \(\eta _2\) as PNE-LDA on each dataset. For the Bomb Blast dataset, we set the topic-prioritized named entity Dirichlet parameter \(\eta _1\) to 0.1 and the topic-general word Dirichlet parameter \(\eta _2\) to 0.2. For the Reuters dataset, we set \(\eta _1\) to 0.2 and \(\eta _2\) to 0.3, and for 20-Newsgroups, \(\eta _1\) to 0.1 and \(\eta _2\) to 0.2. Although we experimented with other combinations of \(\eta _1\) (0.1, 0.2, 0.3, 0.4) and \(\eta _2\) (0.1, 0.2, 0.3, 0.4), we report the parameters with the best performance with respect to both F-measure and RandIndex. We ran 120 iterations of Gibbs sampling for all LDA, SeededLDA, and PNE-LDA experiments.
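For reference, the reported best-performing settings can be collected into a single configuration. The snippet below is our own summary of the values above; the dictionary layout and key names are not from the paper.

```python
# Best-performing Dirichlet parameters per dataset (from the text above).
BEST_PARAMS = {
    # dataset:        alpha, eta (LDA), eta1/eta2 (PNE-LDA and SeededLDA)
    'bomb_blast':    {'alpha': 0.1, 'eta': 0.2, 'eta1': 0.1, 'eta2': 0.2},
    'reuters_21578': {'alpha': 0.1, 'eta': 0.3, 'eta1': 0.2, 'eta2': 0.3},
    '20_newsgroups': {'alpha': 0.1, 'eta': 0.2, 'eta1': 0.1, 'eta2': 0.2},
}
GIBBS_ITERATIONS = 120  # used for all models
```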

To investigate the performance of the different methods, we use a document clustering task to measure the quality of the topics produced by LDA, SeededLDA, and PNE-LDA. In all experimental setups, we set the number of topics to the number of clusters present in the respective dataset, as shown in Table 1. Each model returns the document-topic proportions, and each document is assigned to the cluster defined by the topic with maximum proportion. Once every document is assigned a cluster id (the topic id with maximum proportion), we evaluate the clustering performance using F-measure and RandIndex.
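A sketch of this evaluation pipeline is given below, assuming a document-topic proportion matrix theta and gold cluster labels (both names are ours). Since the exact F-measure variant is not specified in the paper, the common pairwise definition over document pairs is used here.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import rand_score

def pairwise_f_measure(labels, pred):
    """Pairwise clustering F-measure over all document pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_pred, same_true = pred[i] == pred[j], labels[i] == labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def evaluate(theta, labels):
    pred = np.argmax(theta, axis=1)  # cluster id = topic with max proportion
    return {'f_measure': pairwise_f_measure(labels, pred),
            'rand_index': rand_score(labels, pred)}
```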

Table 2. Topic evaluation of LDA, PNE-LDA and SeededLDA over Bomb Blast, Reuters-21578 and 20-Newsgroups datasets.

Observations: Table 2 presents a comparative study of the topic quality given by the three models LDA, PNE-LDA, and SeededLDA. As shown in the table, the proposed PNE-LDA outperforms both LDA and SeededLDA on both Bomb Blast and 20-Newsgroups. As mentioned above, topics in these two datasets are named entity driven. In contrast, on the less entity-driven dataset (Reuters-21578), PNE-LDA under-performs both LDA and SeededLDA. It is evident from these observations that the proposed PNE-LDA is more effective at determining real-world events that can be represented by named entities defining the topics.

5 Conclusion and Future Work

In this paper, we propose a novel entity-driven topic modeling method, PNE-LDA, for clustering documents with highly overlapping words and named-entity-driven clusters. From various experimental observations over three datasets, namely Bomb Blast, Reuters-21578, and 20-Newsgroups, we observe that the proposed method outperforms its state-of-the-art counterparts, LDA and SeededLDA, on datasets with high occurrences of named entities. Thus, the proposed method is more suitable for detecting real-world events such as bomb blasts.