1 Introduction

Topic modeling has been widely applied in various domains such as text mining [1, 2], information retrieval [21], and image processing [5, 12]. Latent Dirichlet Allocation (LDA) is one of the most studied topic modeling methods. It is a generative model that extends probabilistic Latent Semantic Indexing (pLSI) [4]. LDA has been widely investigated for modeling text corpora with exchangeable discrete data (i.e., bags of words), where each document is modeled as a mixture over topics drawn from a Dirichlet distribution, and each topic is modeled as a multinomial distribution over words drawn from a finite vocabulary.

With the increasing availability of text resources of various characteristics (temporal, tagged, multi-class, etc.) and applications, researchers have proposed various LDA variants [14, 16, 18, 19] in the literature. Traditional LDA does not perform effectively if the text corpus has skewed topic distributions or highly overlapping keywords between topics. Further, many real-world events are entity driven. For example, news reports of different terror attacks share many important keywords such as blast, bomb, IED, terror, attack, etc., but are differentiated by named entities such as person names, locations, and organizations. Over such text collections, an unsupervised generative approach like LDA fails to separate documents related to different events effectively because of the high keyword overlap. In this paper, we address this problem by assuming prior knowledge about the underlying topics and their representative named entities. With the increasing availability of real-world datasets on natural disasters, terror attacks, and political discussions, such an assumption about entity-driven events is relevant in today's scenario. We refer to such representative named entities as prioritized entities. In the literature, SeededLDA [8] makes a similar assumption, albeit for general keywords: it samples each word of a document as a seed word or non-seed word for each topic and modifies the document-topic and topic-word distributions accordingly. In contrast, the proposed Prioritized Named Entity driven LDA (PNE-LDA) takes the prioritized or non-prioritized status of each word as an external observed input and estimates the topic-word distributions based on it.

From experimental observations over three datasets of different nature (Bomb Blast, Reuters-21578, and 20-Newsgroups), it is evident that the proposed PNE-LDA outperforms its LDA counterparts for entity-driven topics. The remainder of this paper is organized as follows. We discuss relevant related work in Sect. 2, followed by the proposed solution in Sect. 3. Section 4 describes the characteristics of the datasets and gives insight into the results. Finally, we conclude our work and outline future work in Sect. 5.

2 Literature Review

Topic modeling has been used over the past decades to mine important hidden information from large text corpora [1, 2, 21]. Its applications range from clustering text corpora and mining scientific articles to detecting events on Twitter and in news publications. LDA is one of the prominent topic modeling techniques and has been suitably modified for many such needs. Rosen-Zvi et al. [16] proposed the Author-Topic Model, a variant of LDA that mines authors' research interests from collections of research papers by modeling their content. McCallum et al. [14] extended the Author-Topic Model to the Author-Recipient-Topic model for mining topics, the interaction relationship between sender and receiver, and people's roles from Enron and academic email. Other variants of LDA, such as Topics-over-Time LDA (TOT-LDA) [19], Dynamic LDA [3], and Temporal LDA [20], incorporate temporal factors to find granular topics and the evolution of topics over time in text documents such as scientific publications and tweets.

To include supervised information in LDA, Labeled LDA (L-LDA) [15] establishes a one-to-one relationship between LDA topics and user-defined class labels. Thereafter, several variants of LDA have been proposed to include supervised information, e.g., sLDA [13], DiscLDA [11], and MedLDA [23] for single-label document classification, and DP-MRM [10], Dep-LDA [17], and Boost Multi class L-LDA [9] for multi-label document classification. Source-LDA [22] has been proposed to include external information from Wikipedia to guide topic-word formation. SeededLDA [8] is a variant of LDA that incorporates the user's understanding of the corpus and biases the topic formation process with the help of representative words for each topic, which is useful for various extrinsic tasks such as document classification.

Fig. 1. Plate diagrams of LDA, SeededLDA, and the proposed PNE-LDA

3 Methodology

In this section, we discuss the proposed Prioritized Named Entity driven LDA (PNE-LDA). LDA is a generic probabilistic graphical model capturing the latent semantic topic distribution across large document collections. PNE-LDA extends LDA to cluster a text corpus with the help of prior information in the form of prioritized named entities. By providing prior information about the clusters in terms of prioritized named entities, we can improve cluster quality, especially in the case of highly overlapping clusters.

3.1 Prioritized Named Entity Driven LDA (PNE-LDA)

Despite the high overlap of word distributions among clusters, every cluster has distinguishable named entities such as persons, locations, and organizations, which we call prioritized named entities. As such prioritized named entities occur rarely compared with general words, they need to be projected onto topics separately from the general words.

Plate diagrams of LDA, SeededLDA, and the proposed PNE-LDA are shown in Fig. 1 for comparison. In PNE-LDA, we consider one random variable \(x_{nd}\) similar to SeededLDA, acting as a switch that denotes whether a word is a prioritized word or a general word. In contrast to SeededLDA, where the switch is sampled for each topic from a Beta distribution, \(x_{nd}\) is observed in PNE-LDA. We use different Dirichlet distributions for prioritized and non-prioritized words to generate the topic-word distributions, as shown in Fig. 1(c). A list of prioritized named entities is provided to the model before PNE-LDA starts. The generative algorithm of PNE-LDA is given in Algorithm 1.

We first sample the prioritized named entity topic-word distribution (\(\beta _1\)) and the general topic-word distribution (\(\beta _2\)) using uniform Dirichlet parameters \(\eta _1\) and \(\eta _2\), respectively. For each document in the corpus, we first sample the document-topic distribution \(\theta _d\) using the Dirichlet prior \(\alpha \). Then, for each word position, we sample a word from either the topic-prioritized named entity distribution or the topic-general word distribution based on the value of \(x_{nd}\): if \(x_{nd}\) is 0, we sample from the topic-prioritized named entity distribution; otherwise, we sample from the topic-general word distribution.
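To make this generative process concrete, the following minimal NumPy sketch simulates it. All names (K, V1, V2, generate_document) and the toy sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 5              # number of topics
V1, V2 = 50, 500   # vocabulary sizes: prioritized entities vs. general words
alpha, eta1, eta2 = 0.1, 0.1, 0.2

# Topic-word distributions, one row per topic for each vocabulary.
beta1 = rng.dirichlet(np.full(V1, eta1), size=K)  # prioritized named entities
beta2 = rng.dirichlet(np.full(V2, eta2), size=K)  # general words

def generate_document(n_words, x):
    """x[n] = 0 -> word n is drawn from the prioritized-entity vocabulary;
    x[n] = 1 -> word n is drawn from the general vocabulary."""
    theta = rng.dirichlet(np.full(K, alpha))  # document-topic distribution
    words = []
    for n in range(n_words):
        z = rng.choice(K, p=theta)            # sample a topic for position n
        if x[n] == 0:
            w = rng.choice(V1, p=beta1[z])    # prioritized named entity
        else:
            w = rng.choice(V2, p=beta2[z])    # general word
        words.append((z, w))
    return words

# Example: a 10-word document whose first two tokens are known entities.
doc = generate_document(10, x=[0, 0] + [1] * 8)
```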

PNE-LDA emphasizes putting more weight on prioritized keywords. By varying the Dirichlet parameters, i.e., the prioritized named entity parameter (\(\eta _1\)) and the general word parameter (\(\eta _2\)), we can put different weights on prioritized named entities and general words, respectively. Like SeededLDA, PNE-LDA divides the topic-word relationship into two parts: (i) the topic-prioritized named entity relationship (governed by \(\eta _1\)) and (ii) the topic-general word relationship (governed by \(\eta _2\)).

Algorithm 1. Generative process of PNE-LDA

The conditional probability of assigning a topic j to a word \(w_{nd}\) of document d by PNE-LDA can be written as:

$$\begin{aligned} P(z_{nd}=j \mid z_{\lnot nd}, w_{\lnot nd}) \propto \left\{ \begin{aligned}&(\alpha + n^{w_{d}}_{\lnot nd, j} ) \frac{\eta _1 + n^{(w_{nd})}_{\lnot nd,j}}{ V_1 \eta _1 + n^{(.)}_{\lnot nd,j} } ,&\text {if } x_{nd}=0 \\&(\alpha + n^{w_{d}}_{\lnot nd, j} ) \frac{\eta _2 + n^{(w_{nd})}_{\lnot nd,j}}{ V_2 \eta _2 + n^{(.)}_{\lnot nd,j} } ,&\text {otherwise} \end{aligned} \right. \end{aligned}$$
(1)

where \(x_{nd} = 0\) indicates a prioritized named entity and \(x_{nd}=1\) indicates a general word. The symbols used in Eq. 1 are as follows: (i) \(w_{nd}\) denotes the word at the \(n^{th}\) index of document d; (ii) \(z_{nd}\) denotes the topic of the word at the \(n^{th}\) index of document d; (iii) \(z_{\lnot nd}\) denotes all topic-word assignments except that of the current word; (iv) \(w_{\lnot nd}\) denotes all words in the corpus except the current word; (v) \(n^{w_{d}}_{\lnot nd, j}\) denotes the number of words of the current document assigned to topic j, excluding the current word \(w_{nd}\); (vi) \(n^{(w_{nd})}_{\lnot nd,j}\) denotes the number of occurrences of the word type \(w_{nd}\) assigned to topic j, excluding the current token; (vii) \(n^{(.)}_{\lnot nd,j} =\sum _{ \forall w_{nd} \in V}{n^{(w_{nd})}_{\lnot nd,j}}\) denotes the total number of words assigned to topic j, excluding the current word \(w_{nd}\); and (viii) \(V_1\) and \(V_2\) denote the vocabulary sizes of prioritized named entities and non-prioritized words, respectively.

In Eq. 1, the left-hand side \(P(z_{nd} =j \mid \cdot )\) is the probability of assigning topic j to the word at the \(n^{th}\) index of document d. The first term on the right-hand side corresponds to choosing topic j from the multinomial distribution of topics in the \(d^{th}\) document. The second term corresponds to choosing the word \(w_{nd}\) from topic j: if the word is a prioritized word, it is drawn from the prioritized word-topic distribution parameterized by \(\eta _1\); otherwise, it is drawn from the general word-topic distribution parameterized by \(\eta _2\).
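As an illustration of how Eq. 1 drives inference, the sketch below implements one collapsed Gibbs sampling step under this model. The count arrays (n_dk, n_kw1, n_kw2, n_k1, n_k2) and their layout are our own assumptions about how such a sampler might organize its state, not the authors' code.

```python
import numpy as np

def sample_topic(d, n, w, x_nd, z, n_dk, n_kw1, n_kw2, n_k1, n_k2,
                 alpha, eta1, eta2, V1, V2, rng):
    """Resample the topic of word w at position n of document d (Eq. 1)."""
    old = z[d][n]
    # Remove the current assignment from all counts (the "not nd" terms).
    n_dk[d, old] -= 1
    if x_nd == 0:
        n_kw1[old, w] -= 1; n_k1[old] -= 1
    else:
        n_kw2[old, w] -= 1; n_k2[old] -= 1

    # Eq. 1: document-topic term times the topic-word term for the
    # vocabulary (prioritized vs. general) selected by the switch x_nd.
    if x_nd == 0:
        p = (alpha + n_dk[d]) * (eta1 + n_kw1[:, w]) / (V1 * eta1 + n_k1)
    else:
        p = (alpha + n_dk[d]) * (eta2 + n_kw2[:, w]) / (V2 * eta2 + n_k2)
    new = rng.choice(len(p), p=p / p.sum())

    # Add the new assignment back into the counts.
    z[d][n] = new
    n_dk[d, new] += 1
    if x_nd == 0:
        n_kw1[new, w] += 1; n_k1[new] += 1
    else:
        n_kw2[new, w] += 1; n_k2[new] += 1
```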

3.2 Prioritizing Named Entities

Ideally, named entities representing the target topics would be given as external inputs. For an arbitrary dataset, however, identifying such entities manually is expensive. In this study, we therefore consider all the named entities present in the documents. For identifying the named entities, we use Stanford NER [6]. For the Bomb Blast dataset, where Indian named entities need to be identified, we adapted Stanford NER by training it to recognize Indian named entities. These named entities could, in principle, be assigned different priorities; for simplicity, this study assigns equal priority to all named entities present in the documents. Once the representative named entities are identified, the proposed PNE-LDA treats them as prioritized keywords and the rest as non-prioritized keywords.
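A minimal sketch of deriving the switch variable \(x_{nd}\) with Stanford NER through NLTK's interface is given below. The model and jar paths are placeholders that depend on a local Stanford NER installation; the adapted Indian-entity model mentioned above would be substituted for the stock English model.

```python
from nltk.tag import StanfordNERTagger

tagger = StanfordNERTagger(
    'english.all.3class.distsim.crf.ser.gz',  # CRF model (person/location/org)
    'stanford-ner.jar')                        # path to the Stanford NER jar

def switch_variables(tokens):
    """Return x with x[n] = 0 for named entities (prioritized) and
    x[n] = 1 for general words, mirroring the convention in Eq. 1."""
    tagged = tagger.tag(tokens)  # list of (token, tag) pairs; 'O' = no entity
    return [0 if tag != 'O' else 1 for _, tag in tagged]

tokens = ['A', 'blast', 'occurred', 'in', 'Mumbai', 'on', 'Friday']
x = switch_variables(tokens)  # e.g. [1, 1, 1, 1, 0, 1, 1]
```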

4 Experimental Setups

Datasets: We experiment with three datasets, namely Bomb Blast, Reuters-21578, and 20-Newsgroups, as described in Table 1. The Bomb Blast dataset is our locally collected and processed dataset, consisting of 855 news articles reporting 53 different bomb blast events that occurred in different parts of India. The Reuters-21578 dataset consists of 5,485 documents spanning 8 clusters, and the 20-Newsgroups dataset consists of 11,293 documents spanning 20 clusters. Among the three, Bomb Blast has the most occurrences of named entities, while Reuters-21578 has the fewest. The Bomb Blast dataset mostly contains person, location, and organization names, whereas the 20-Newsgroups dataset has only person and organization names. The Reuters-21578 dataset mostly consists of business articles and hence has a limited number of named entities.

Table 1. Characteristics of the experimental datasets. NE denotes named entity

Experimental Setup: We experimented with different values of the document-topic Dirichlet parameter \(\alpha \) and set it to 0.1 for LDA, SeededLDA, and PNE-LDA alike, as \(\alpha =0.1\) gives the best results on all three datasets. For LDA, we set the topic-word Dirichlet parameter \(\eta \) to 0.2 for the Bomb Blast dataset, 0.3 for the Reuters dataset, and 0.2 for 20-Newsgroups. For PNE-LDA and SeededLDA, to assign higher weight to named entities, we set \(\eta _1\) lower than \(\eta _2\) for all three datasets; SeededLDA uses the same values of \(\eta _1\) and \(\eta _2\) as PNE-LDA on each dataset. For the Bomb Blast dataset, we set the topic-prioritized named entity Dirichlet parameter \(\eta _1\) to 0.1 and the topic-general word Dirichlet parameter \(\eta _2\) to 0.2. For the Reuters dataset, we set \(\eta _1\) to 0.2 and \(\eta _2\) to 0.3, and for 20-Newsgroups, \(\eta _1\) to 0.1 and \(\eta _2\) to 0.2. Although we experimented with other combinations of \(\eta _1\) (0.1, 0.2, 0.3, 0.4) and \(\eta _2\) (0.1, 0.2, 0.3, 0.4), we report the parameters with the best performance with respect to both F-measure and RandIndex. We ran 120 iterations of Gibbs sampling for all LDA, SeededLDA, and PNE-LDA experiments.
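For reference, the reported best-performing settings can be collected into a single configuration. The snippet below is our own summary of the values above; the dictionary layout and key names are not from the paper.

```python
# Best-performing Dirichlet parameters per dataset (from the text above).
BEST_PARAMS = {
    # dataset:        alpha, eta (LDA), eta1/eta2 (PNE-LDA and SeededLDA)
    'bomb_blast':    {'alpha': 0.1, 'eta': 0.2, 'eta1': 0.1, 'eta2': 0.2},
    'reuters_21578': {'alpha': 0.1, 'eta': 0.3, 'eta1': 0.2, 'eta2': 0.3},
    '20_newsgroups': {'alpha': 0.1, 'eta': 0.2, 'eta1': 0.1, 'eta2': 0.2},
}
GIBBS_ITERATIONS = 120  # used for all models
```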

To investigate the performance of the different methods, we use a document clustering task to measure the quality of the topics produced by LDA, SeededLDA, and PNE-LDA. In all experimental setups, we set the number of topics to the number of clusters present in the respective dataset, as shown in Table 1. Each model returns the document-topic proportions, and each document is assigned to the cluster defined by the topic with maximum proportion. Once every document is assigned a cluster id (the topic id with maximum proportion), we evaluate the clustering performance using F-measure and RandIndex.
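A sketch of this evaluation pipeline is given below, assuming a document-topic proportion matrix theta and gold cluster labels (both names are ours). Since the exact F-measure variant is not specified in the paper, the common pairwise definition over document pairs is used here.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import rand_score

def pairwise_f_measure(labels, pred):
    """Pairwise clustering F-measure over all document pairs."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(labels)), 2):
        same_pred, same_true = pred[i] == pred[j], labels[i] == labels[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

def evaluate(theta, labels):
    pred = np.argmax(theta, axis=1)  # cluster id = topic with max proportion
    return {'f_measure': pairwise_f_measure(labels, pred),
            'rand_index': rand_score(labels, pred)}
```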

Table 2. Topic evaluation of LDA, PNE-LDA and SeededLDA over Bomb Blast, Reuters-21578 and 20-Newsgroups datasets.

Observations: Table 2 presents a comparative study of the topic quality given by the three models LDA, PNE-LDA, and SeededLDA. As shown in the table, the proposed PNE-LDA outperforms both LDA and SeededLDA on both Bomb Blast and 20-Newsgroups. As mentioned above, topics in these two datasets are named entity driven. In contrast, on the less entity-driven dataset (Reuters-21578), PNE-LDA under-performs both LDA and SeededLDA. It is evident from these observations that the proposed PNE-LDA is more effective at determining real-world events that can be represented by named entities defining the topics.

5 Conclusion and Future Work

In this paper, we propose a novel entity-driven topic modeling method, PNE-LDA, for clustering documents with highly overlapping words and named-entity-driven clusters. From various experimental observations over three datasets, namely Bomb Blast, Reuters-21578, and 20-Newsgroups, we observe that the proposed method outperforms its state-of-the-art counterparts, LDA and SeededLDA, on datasets with high occurrences of named entities. Thus, the proposed method is more suitable for detecting real-world events such as bomb blasts.