Hierarchical topic modeling with automatic knowledge mining
Introduction
Traditional topic models, such as LDA (Latent Dirichlet Allocation) (Blei, Ng, & Jordan, 2003) and HDP (Hierarchical Dirichlet Process) (Teh, Jordan, Beal, & Blei, 2006), have been widely used to learn latent topics from text corpora in expert systems, information systems and knowledge management applications (Wu et al., 2017). However, these topic models and their extensions can only find topics in a flat structure and fail to discover the hierarchical relationships among topics. This drawback limits the applicability of topic models, since many applications have inherent hierarchical topic structures, such as the category hierarchy in Web pages (Ming, Wang, & Chua, 2010), the aspect hierarchy in reviews (Kim, Zhang, Chen, Oh, & Liu, 2013) and the research topic hierarchy in the academic community (Paisley, Wang, Blei, & Jordan, 2015). In a topic hierarchy, topics near the root have more general semantics, and topics close to the leaves have more specific semantics.
Hierarchical topic modeling is a challenging problem, mainly for two reasons. First, whereas ordinary topic modeling only learns topics with no structure, hierarchical topic modeling must learn both topics that form meaningful word clusters and the hierarchical structure among those topics. Second, the number of topics at each level is unknown and cannot be set to a predefined value. Several hierarchical topic models (HTMs for short) have been proposed (Blei, Griffiths, & Jordan, 2010; Blei, Griffiths, Jordan, & Tenenbaum, 2005; Mimno, Li, & McCallum, 2007; Paisley, Wang, Blei, & Jordan, 2015). However, researchers have shown that the existing HTMs often achieve unsatisfactory results, producing incoherent topics and unreasonable structures (Kang, Ma, & Liu, 2012; Mao, Ming, Chua, Li, Yan, & Li, 2012b). An incoherent topic is one in which some of the top topical words are inconsistent with the semantic meaning of the other words. An unreasonable structure is a topic hierarchy in which some topics at an upper level (parent topics) cannot be regarded as generalizations of their associated topics at the lower level (child topics), or some topics at a lower level are not semantic specializations of their associated parent topics.
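To make the object being learned concrete, a topic hierarchy of the kind discussed here can be represented as a simple tree in which each node holds a topic's top words and each root-to-leaf path runs from general to specific semantics. The following sketch uses invented topics and words purely for illustration; it is not output from any model in this paper.

```python
from dataclasses import dataclass, field

@dataclass
class TopicNode:
    """One topic in a hierarchy: its top words and its child topics."""
    top_words: list                      # top topical words, most probable first
    children: list = field(default_factory=list)

def paths(node, prefix=()):
    """Yield every root-to-leaf path; semantics grow more specific along each path."""
    current = prefix + (node.top_words[0],)
    if not node.children:
        yield current
    else:
        for child in node.children:
            yield from paths(child, current)

# Illustrative three-level hierarchy (general -> specific).
root = TopicNode(["learning", "model", "data"], [
    TopicNode(["supervised", "training", "class"], [
        TopicNode(["classifier", "label", "error"]),
        TopicNode(["regression", "linear", "kernel"]),
    ]),
    TopicNode(["unsupervised", "clustering", "unlabeled"]),
])

for p in paths(root):
    print(" -> ".join(p))
```

Both challenges above show up directly in this representation: the model must learn the word lists at each node (coherent topics) and the parent-child edges (reasonable structure), and the branching factor at every node is not known in advance.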
Fig. 1 shows part of a topic hierarchy to illustrate the problems of incoherent topics and unreasonable structure. The hierarchy shown is learned by a basic hierarchical topic model, hLDA (hierarchical LDA), proposed by Blei et al. (2010), from the abstracts corpus of the AAAI Conference on Artificial Intelligence (AAAI for short). This corpus consists of the abstracts of papers published at AAAI over six years, which we crawled as one corpus of the entire multi-domain dataset. The datasets used in this paper are described in detail in the experiment section (Section 5.1).
The topic hierarchy shown has three levels, and several deficiencies can be observed, including
- 1.
Incoherent topics. For example, in the root topic with general semantics, the topical word large lowers the coherence of the topic, whose semantic meaning should be information processing. The same problem appears in the topic shown at the second level, whose semantic meaning should be knowledge base; its fifth topical word review is inconsistent with that meaning, since review belongs in a topic on sentiment analysis or text mining.
- 2.
Unreasonable structure. Although the topic at the third level, highlighted with a red box, has a coherent semantic meaning on robotics, it should not be a child of the topic at the second level, whose semantic meaning is knowledge base.
- 3.
A third type of deficiency, which can be regarded as a special case of unreasonable structure. An example is the fifth topical word information in the topic at the third level. As a general word, information should appear in topics at upper levels, for example in the root topic. In the current topic hierarchy, however, information appears in a specific topic whose semantic meaning is knowledge graph.
Based on the mechanism of hierarchical topic modeling, we find two main reasons behind these defective topic hierarchies.
- 1.
Like ordinary topic modeling, hierarchical topic modeling relies on how often words co-occur in the corpus, i.e., on "higher-order co-occurrence" (Heinrich, 2008). Even if some words are semantically general and should occur in topics at upper levels, they are unlikely to appear at the right level unless they occur frequently enough. Conversely, words that are semantically specific but occur frequently are likely to be wrongly placed at upper levels in the learned topic hierarchy.
- 2.
Hierarchical topic modeling also adopts the bag-of-words assumption, under which words are generated independently, ignoring the semantic relations among them. In real-world text, however, words are generated according to language rules and often have semantic relations with each other.
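The reliance on co-occurrence statistics described in the first reason can be made concrete with a small stdlib sketch: a topic model effectively sees only how often word pairs co-occur in documents, so a frequent but semantically specific word is, to the model, indistinguishable from a general one. The toy documents below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

docs = [
    ["information", "robot", "sensor"],
    ["information", "robot", "motion"],
    ["information", "knowledge", "base"],
]

# Document-level co-occurrence counts: essentially all a topic model "sees".
cooc = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        cooc[pair] += 1

# "robot" is semantically specific, yet it co-occurs with the general word
# "information" more often than "knowledge" does, so frequency alone would
# pull it toward an upper level of the hierarchy.
print(cooc[("information", "robot")])   # 2
print(cooc[("base", "knowledge")])      # 1
```

Nothing in these counts encodes which words are semantically general, which is exactly the gap that the prior knowledge mined in this paper is meant to fill.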
To fix the defects of existing methods, some researchers tried to acquire better topic hierarchies by introducing labeled data (Kang, Ma, & Liu, 2012; Mao, He, Yan, & Li, 2012a; Petinot, McKeown, & Thadani, 2011) or linking relationships among documents (Wang, Liu, Desai, Danilevsky, & Han, 2014). However, labeled data or linking relationships do not exist in every corpus and typically require substantial manual work. In this paper, we propose an algorithm that mines prior knowledge from corpora of other domains and uses this knowledge to learn a better topic hierarchy. A domain refers to a field; since we focus on hierarchical topic modeling in text mining, in this paper a domain is represented by its associated corpus, whose documents relate to the same theme or to several similar themes. Since labeled corpora are hard to acquire, our algorithm requires no labeled data and does not depend on any linking information. Moreover, unlike HTMs proposed for specific applications, such as HTMs for sentiment analysis (Kim et al., 2013) or phrase mining (Wang et al., 2013), our algorithm is label- and application-independent.
There are several existing knowledge-based topic models (Andrzejewski, Zhu, Craven, & Recht, 2011; Chen, Mukherjee, & Liu, 2014; Xie, Yang, & Xing, 2015; Zhang & Zhong, 2016) that can utilize some types of knowledge, for example the semantic correlation among words. However, their knowledge is provided manually or acquired under flat topics, i.e., with no hierarchical topic structure, so it cannot be used for our problem. Our proposed algorithm automatically discovers knowledge and organizes it into a hierarchy (see the example in Fig. 3, Section 4.1) consisting of knowledge sets (k-sets for short). A k-set contains words that are likely to belong to the same topic at a certain level, and we propose to use the k-set as the knowledge unit. We make two observations here.
- 1.
There exist many overlapping topics among different domains. For example, almost every conference with machine learning as the main area or as a sub-area (one conference is regarded as one domain) has topics on supervised learning and unsupervised learning. The topic on supervised learning is likely to be formed by general topical words such as {training, supervised, class, data}, and the topic on unsupervised learning is likely to be formed by general topical words such as {unlabeled, data, clustering, unsupervised}.
In this paper, a topic is represented by its top 20 words ranked by probability in descending order. Shared topics often contain sets of frequently co-occurring words, which can form potential k-sets.
- 2.
There also exist similar hierarchical structures or sub-structures shared among domains. For example, the conferences AAAI and NIPS share the topic hierarchy supervised learning → classification → regression. The second-level topic on classification is likely to be formed by topical words such as {classifier, pattern, label, error}, and the third-level topic on regression by specific topical words such as {linear, kernel, parameter, regression}.
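The cross-domain overlap described in the two observations can be detected with a simple set-based heuristic: treat each topic as the set of its top words and flag same-level topic pairs from different domains whose overlap is large; the intersection of such a pair is a candidate k-set. The sketch below is our illustration of this idea, not the paper's exact mining procedure; the topics are invented and the `min_overlap` threshold is an assumed parameter.

```python
def candidate_ksets(domain_a, domain_b, min_overlap=3):
    """Return shared word sets between same-level topics of two domains.

    domain_a / domain_b: dicts mapping level -> list of topics,
    where each topic is the set of its top words.
    """
    shared = []
    for level in set(domain_a) & set(domain_b):
        for topic_a in domain_a[level]:
            for topic_b in domain_b[level]:
                common = topic_a & topic_b
                if len(common) >= min_overlap:     # assumed threshold
                    shared.append((level, common))
    return shared

# Invented second-level topics for two conference domains.
aaai = {2: [{"training", "supervised", "class", "data"},
            {"unlabeled", "data", "clustering", "unsupervised"}]}
nips = {2: [{"training", "supervised", "class", "label"},
            {"unlabeled", "clustering", "unsupervised", "kmeans"}]}

for level, kset in candidate_ksets(aaai, nips):
    print(level, sorted(kset))
```

Each printed word set is a candidate k-set: a group of words that, across domains, tends to belong to the same topic at the same level.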
In this paper, we use the two types of shared information above as prior knowledge to correct the defective topic hierarchy of each domain corpus. The main contributions of this paper are summarized as follows:
- 1.
It proposes a novel knowledge-based hierarchical topic model (KHTM) that mines prior knowledge automatically and incorporates the mined knowledge to learn a superior topic hierarchy. We give the detailed generative process of the model and the corresponding parameter estimation method based on Gibbs sampling.
- 2.
It proposes a learning algorithm that embeds KHTM in an iterative way to continuously improve the learning results.
- 3.
It designs a hierarchical structure to maintain knowledge, represented as word pairs satisfying frequent co-occurrence with a level constraint.
- 4.
It contributes two new crawled multi-domain datasets. Each dataset consists of the paper abstracts of 20 conferences and journals over six years. The first dataset contains abstracts of conferences and journals purely from computer science, and the second consists of a mixture of biology, physics and chemistry abstracts. We will release both datasets publicly to facilitate related research.
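The knowledge representation in the third contribution, word pairs that frequently co-occur subject to a level constraint, can be sketched as follows: count, across the domains' topic hierarchies, how often two words appear together in the top words of a topic at the same level, and keep the pairs that meet a minimum support. The function name, example data and support threshold are illustrative assumptions, not the paper's exact algorithm.

```python
from collections import Counter
from itertools import combinations

def mine_level_pairs(domains, min_support=2):
    """Count word pairs co-occurring in same-level topics across domains.

    domains: list of dicts mapping level -> list of topics (sets of top words).
    Returns {(level, word_pair): count} for pairs meeting min_support.
    """
    counts = Counter()
    for hierarchy in domains:
        for level, topics in hierarchy.items():
            for topic in topics:
                for pair in combinations(sorted(topic), 2):
                    counts[(level, pair)] += 1
    return {key: n for key, n in counts.items() if n >= min_support}

# Invented third-level topics from two domains' learned hierarchies.
domains = [
    {3: [{"linear", "kernel", "regression"}]},
    {3: [{"linear", "regression", "parameter"}]},
]
pairs = mine_level_pairs(domains)
print(pairs)   # {(3, ('linear', 'regression')): 2}
```

Attaching the level to each pair is what makes the knowledge hierarchical: a pair supported at level 3 encourages those words to share a topic at that depth, rather than anywhere in the tree.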
The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 states the background model and proposed algorithm. Section 4 elaborates the proposed knowledge-based hierarchical topic model (KHTM). Section 5 presents the experimental results and the analysis. Section 6 concludes the paper.
Related work
Hierarchical topic modeling has attracted much attention in the expert systems and knowledge management communities (Mao, He, Yan, & Li, 2012a; Pavlinek & Podgorelec, 2017). The task of hierarchical topic modeling is to learn a topic hierarchy from a text corpus, in which the topics are organized according to the semantic generality of each topic. A salient characteristic of hierarchical topic modeling is that, for each parent topic, the number of its associated child topics cannot be known or set to a predefined value in advance.
Base model and the proposed learning algorithm
In this paper, we employ the hierarchical LDA model (hLDA for short) (Blei et al., 2005) as the base model. In this section, we first briefly review the hLDA model, and then present our proposed algorithm.
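In hLDA, each document is assigned a root-to-leaf path through the hierarchy via the nested Chinese restaurant process (nCRP): at every node, a document follows an existing child with probability proportional to the number of earlier documents that chose it, or opens a new child with probability proportional to a concentration parameter γ. The sketch below shows only this path-sampling step, with invented parameter values; it is not the full hLDA Gibbs sampler.

```python
import random

def ncrp_path(node_counts, depth, gamma, rng):
    """Sample one root-to-leaf path of the given depth via the nested CRP.

    node_counts: dict mapping a path prefix (tuple) to {child_id: count};
    mutated in place as the document "sits down" along its path.
    """
    path = ()
    for _ in range(depth):
        children = node_counts.setdefault(path, {})
        total = sum(children.values())
        # Existing child c: prob count[c]/(total+gamma); new child: gamma/(total+gamma).
        r = rng.random() * (total + gamma)
        chosen = None
        for child, count in children.items():
            if r < count:
                chosen = child
                break
            r -= count
        if chosen is None:                 # open a new child node
            chosen = len(children)
        children[chosen] = children.get(chosen, 0) + 1
        path = path + (chosen,)
    return path

rng = random.Random(0)
counts = {}
sampled = [ncrp_path(counts, depth=3, gamma=1.0, rng=rng) for _ in range(5)]
print(sampled)
```

Because popular children attract more documents while γ always leaves room for new branches, the nCRP lets the number of child topics at each node grow with the data instead of being fixed beforehand, which is exactly the property the base model needs.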
The proposed KHTM model
This section elaborates the proposed model KHTM (Algorithm 2), which consists of three components: knowledge mining, knowledge utilization and parameter estimation.
Experiment setting
Datasets. We crawled two new datasets for the experiments. The first contains the paper abstracts of 20 conferences and journals from computer science over six years, including AAAI, CIKM, CVPR, SIGIR, etc.; we name it the Computer Science dataset. The second contains the paper abstracts of 20 conferences and journals from biology, chemistry and physics, also over six years; we name it the Natural Science dataset. In both datasets, each conference or journal is regarded as one domain.
Conclusion
In this paper, we proposed a novel hierarchical topic model, KHTM, which can mine knowledge from the topic hierarchies of multiple domain corpora and leverage that knowledge to learn a better topic hierarchy for the target domain. We also proposed an iterative algorithm that improves the topic hierarchies in a continuous manner. The mined knowledge is organized in a hierarchical structure, and can be maintained and improved. We crawled two new multi-domain datasets, and the evaluation results showed the effectiveness of the proposed model.
Acknowledgments
This work was supported by the Fundamental Research Fund for Central Universities (No. JBX171007), the National Natural Science Foundation of China (No. 61702391), the Zhejiang Provincial Natural Science Foundation (No. LY12F02003) and the China Postdoctoral Science Foundation (No. 2013M540492).
References (31)

- Pavlinek & Podgorelec (2017). Text classification method based on self-training and LDA topic models. Expert Systems with Applications (ESWA).
- Wu et al. (2017). A topic modeling based approach to novel document automatic summarization. Expert Systems with Applications (ESWA).
- Zhang & Zhong (2016). Improving short text classification by learning vector representations of both words and hidden topics. Knowledge-Based Systems.
- Agrawal & Srikant (1994). Fast algorithms for mining association rules in large databases. Proceedings of the International Conference on Very Large Data Bases (VLDB).
- Andrzejewski, Zhu, Craven, & Recht (2011). A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI).
- Blei, Griffiths, & Jordan (2010). The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies. Journal of the ACM (JACM).
- Blei, Griffiths, Jordan, & Tenenbaum (2005). Hierarchical topic models and the nested Chinese restaurant process. Advances in Neural Information Processing Systems (NIPS).
- Blei, Ng, & Jordan (2003). Latent Dirichlet allocation. Journal of Machine Learning Research (JMLR).
- Chang, Boyd-Graber, Gerrish, Wang, & Blei (2009). Reading tea leaves: How humans interpret topic models. Advances in Neural Information Processing Systems (NIPS).
- Chen, Mukherjee, & Liu (2014). Aspect extraction with automated prior knowledge learning. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).
- Griffiths & Steyvers (2004). Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America (PNAS).
- Heinrich (2008). Parameter estimation for text analysis. Technical note.
- Jagarlamudi, Daumé, & Udupa (2012). Incorporating lexical priors into topic models. Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL).
- Kang, Ma, & Liu (2012). Transfer topic modeling with ease and scalability. Proceedings of the SIAM International Conference on Data Mining (SDM).
- Modeling topic hierarchies with the recursive Chinese restaurant process. Proceedings of the ACM Conference on Information and Knowledge Management (CIKM).