Elsevier

Expert Systems with Applications

Volume 103, 1 August 2018, Pages 106-117

Hierarchical topic modeling with automatic knowledge mining

https://doi.org/10.1016/j.eswa.2018.03.008

Highlights

  • Propose a novel knowledge-based hierarchical topic model.

  • Propose a learning algorithm that continuously improves the results.

  • Design a hierarchical structure to maintain knowledge.

  • Propose the parameter estimation method based on Gibbs sampling.

Abstract

Traditional topic modeling has been widely studied and popularly employed in expert systems and information systems. However, traditional topic models cannot discover structural relations among topics, thus losing the chance to explore the data more deeply. Hierarchical topic modeling has the capability of learning topics, as well as discovering the hierarchical topic structure, from text data. However, purely unsupervised models tend to generate weak topic hierarchies. To solve this problem, we propose a novel knowledge-based hierarchical topic model (KHTM), which can incorporate prior knowledge into topic hierarchy building. A key novelty of this model is that it can mine prior knowledge automatically from the topic hierarchies of corpora from multiple domains. In this paper, knowledge is represented as word pairs that co-occur frequently, and it is organized in the form of a hierarchical structure. We also propose an iterative learning algorithm. For evaluation, we crawled two new multi-domain datasets and conducted comprehensive experiments. The experimental results show that our algorithm and model can generate more coherent topics and a more reasonable hierarchical structure.

Introduction

Traditional topic models, such as LDA (Latent Dirichlet Allocation) (Blei, Ng, & Jordan, 2003) and HDP (Hierarchical Dirichlet Process) (Teh, Jordan, Beal, & Blei, 2006), have been widely used to learn latent topics from text corpora in expert systems, information systems and knowledge management applications (Wu et al., 2017). However, these topic models and their extensions can only find topics in a flat structure, and fail to discover the hierarchical relationships among topics. This drawback limits the application of topic models, since many applications need and have inherent hierarchical topic structures, such as the category hierarchy of Web pages (Ming, Wang, & Chua, 2010), the aspect hierarchy in reviews (Kim, Zhang, Chen, Oh, & Liu, 2013) and the research-topic hierarchy in the academic community (Paisley, Wang, Blei, & Jordan, 2015). In a topic hierarchy, topics near the root have more general semantics, and topics close to the leaves have more specific semantics.

Hierarchical topic modeling is a challenging problem, mainly for two reasons. First, compared to ordinary topic modeling, which only learns topics with no structure, hierarchical topic modeling must learn topics that form meaningful word clusters, as well as the hierarchical structure among those topics. Second, the number of topics at each level is unknown and cannot be set to a predefined value. Some hierarchical topic models (HTMs for short) have been proposed (Blei, Griffiths, Jordan, 2010, Blei, Griffiths, Jordan, Tenenbaum, 2005, Mimno, Li, McCallum, 2007, Paisley, Wang, Blei, Jordan, 2015). However, researchers have shown that existing HTMs often achieve unsatisfactory results, containing incoherent topics and unreasonable structures (Kang, Ma, Liu, 2012, Mao, Ming, Chua, Li, Yan, Li, 2012b). An incoherent topic is one in which some of the top topical words are not consistent with the semantic meaning of the other words. An unreasonable structure is one in which some topics at an upper level (parent topics) cannot be regarded as generalizations of the topics at the associated lower level (child topics), or some lower-level topics (child topics) are not semantic specializations of their associated upper-level topics (parent topics).

Fig. 1 shows part of a topic hierarchy that illustrates the problems of incoherent topics and unreasonable structure. The hierarchy was learned by a basic hierarchical topic model, hLDA (hierarchical LDA), proposed by Blei et al. (2010), from the abstracts corpus of the AAAI Conference on Artificial Intelligence (AAAI for short). This corpus consists of the abstracts of papers published at AAAI over six years, and was crawled by us as one corpus of the entire multi-domain dataset. A detailed description of the datasets used in this paper is given in the experiment section (Section 5.1).

The topic hierarchy shown has three levels, and several deficiencies can be observed, including

  • 1.

    Incoherent topics. For example, in the root topic with general semantics, the topical word large lowers the coherence of the topic, whose semantic meaning should be information processing. The same happens in the topic shown at the second level, whose semantic meaning should be knowledge base: the fifth topical word review is not consistent with this meaning, since review should instead appear in a topic on sentiment analysis or text mining.

  • 2.

    Unreasonable structure. Although the topic at the third level, highlighted with a red box, has a coherent semantic meaning on robotics, it should not be a child of the topic at the second level, since the semantic meaning of that topic is knowledge base.

  • 3.

    The third type of deficiency can be regarded as a special case of unreasonable structure. An example is the fifth topical word information in the topic at the third level. As a general word, information should appear in topics at upper levels, for example in the root topic. However, in the current topic hierarchy, information appears in a specific topic whose semantic meaning is knowledge graph.

Based on the mechanism of hierarchical topic modeling, we find two main reasons for these defective topic hierarchies.

  • 1.

    Similar to ordinary topic modeling, hierarchical topic modeling relies on how often words co-occur in the corpus, i.e., “higher-order co-occurrence” (Heinrich, 2008). Even if some words are semantically general and should appear in upper-level topics, they are unlikely to land at the right level when they do not occur frequently enough. Conversely, words that are semantically specific but occur frequently are likely to be wrongly placed at upper levels of the learned topic hierarchy.

  • 2.

    Hierarchical topic modeling also adopts the bag-of-words model, in which words are generated independently, ignoring the semantic relations among words. However, in real-world text, words are generated according to language rules and often have semantic relations with one another.
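
The reliance on co-occurrence described above can be made concrete: under the bag-of-words assumption, all the model observes is how often word pairs co-occur within documents. A minimal Python sketch (the toy corpus and words are our own illustration, not from the paper):

```python
from collections import Counter
from itertools import combinations

# Toy corpus: each document reduced to a bag of words, mirroring the
# bag-of-words assumption discussed above (words and docs are invented).
docs = [
    ["information", "retrieval", "query", "large"],
    ["information", "knowledge", "base", "entity"],
    ["robot", "motion", "planning", "large"],
]

# Document-level co-occurrence counts of word pairs: this is essentially
# all the signal a (hierarchical) topic model has to work with.
cooc = Counter()
for doc in docs:
    for pair in combinations(sorted(set(doc)), 2):
        cooc[pair] += 1

# A semantically general word such as "information" only rises to an
# upper level if its co-occurrence counts are high enough.
print(cooc[("information", "large")])  # prints 1
```

The sketch shows why frequency, not semantics, drives level placement: a rare but general word accumulates few co-occurrence counts and is indistinguishable from a specific word.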

To fix the defects of existing methods, some researchers have tried to acquire better topic hierarchies by introducing labeled data (Kang, Ma, Liu, 2012, Mao, He, Yan, Li, 2012a; Petinot, McKeown, & Thadani, 2011) or linking relationships among documents (Wang, Liu, Desai, Danilevsky, & Han, 2014). However, labeled data or linking relationships are not available for every corpus and typically require substantial manual work. In this paper, we propose an algorithm that mines prior knowledge from corpora of other domains and uses such knowledge to learn a better topic hierarchy. A domain refers to a field; since we focus on hierarchical topic modeling in text mining, in this paper a domain is represented by its associated corpus, in which the documents are related to the same theme or several similar themes. Since labeled corpora are hard to acquire, our algorithm does not require any labeled data and does not depend on any linking information. Moreover, unlike some other HTMs proposed for specific applications, such as HTM for sentiment analysis (Kim et al., 2013) and HTM for phrase mining (Wang et al., 2013), our algorithm is label- and application-independent.

There are several existing knowledge-based topic models (Andrzejewski, Zhu, Craven, Recht, 2011, Chen, Mukherjee, Liu, 2014, Xie, Yang, Xing, 2015, Zhang, Zhong, 2016) that can utilize certain types of knowledge, for example, the semantic correlation among words. However, their knowledge is provided manually or acquired under flat topics, i.e., with no hierarchical topic structure, so it cannot be used for our problem. Our proposed algorithm automatically discovers knowledge and organizes it into a hierarchy (see an example in Fig. 3, Section 4.1) consisting of knowledge sets (k-sets for short). A k-set contains words that are likely to belong to the same topic at a certain level. In this paper, we propose to use the k-set as the knowledge unit. We make two observations here.

  • 1.

    There exist many overlapping topics among different domains. For example, almost every conference with machine learning as the main area or as a sub-area (one conference is regarded as one domain) has topics on supervised learning and unsupervised learning. The topic on supervised learning is likely to be formed by general topical words such as {training, supervised, class, data}, and the topic on unsupervised learning is likely to be formed by general topical words such as {unlabeled, data, clustering, unsupervised}.

    In this paper, a topic is represented by its top 20 words, ranked by probability in descending order. The shared topics often contain sets of frequently co-occurring words, which can form potential k-sets.

  • 2.

    There also exist similar hierarchical structures or sub-structures shared among domains. For example, the two conferences AAAI and NIPS share the following topic hierarchy: supervised learning  →  classification  →  regression. The topic at the second level, on classification, is likely to be formed by topical words such as {classifier, pattern, label, error}, and the topic at the third level, on regression, is likely to be formed by specific topical words such as {linear, kernel, parameter, regression}.
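
Mining potential k-sets from topics shared across domains can be sketched as frequent word-pair counting, in the spirit of frequent itemset mining (Agrawal et al., 1994). The top-word lists and support threshold below are hypothetical illustrations, not the paper's actual data or parameter values:

```python
from collections import Counter
from itertools import combinations

# Hypothetical top words of the "supervised learning" topic as learned
# independently in three domains (one conference = one domain).
domain_topics = {
    "AAAI": ["training", "supervised", "class", "data", "label"],
    "NIPS": ["supervised", "data", "training", "kernel", "class"],
    "CIKM": ["data", "training", "class", "query", "supervised"],
}

min_support = 3  # a pair must co-occur in the topics of >= 3 domains

# Count, for each word pair, how many domains' topics contain both words.
pair_counts = Counter()
for words in domain_topics.values():
    for pair in combinations(sorted(set(words)), 2):
        pair_counts[pair] += 1

# Frequent pairs become candidate knowledge; merging pairs that share
# words yields a knowledge set (k-set) for this level of the hierarchy.
kset = sorted({w for pair, c in pair_counts.items()
               if c >= min_support for w in pair})
print(kset)  # prints ['class', 'data', 'supervised', 'training']
```

Words that recur across the shared topic in enough domains survive the support threshold, while domain-specific words (label, kernel, query) are filtered out.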

In this paper, we use the two types of shared information above as prior knowledge to correct the defective topic hierarchy of each domain corpus. The main contributions of this paper are summarized as follows:

  • 1.

    It proposes a novel knowledge-based hierarchical topic model (KHTM), which is capable of mining prior knowledge automatically, and incorporating the mined knowledge to learn a superior topic hierarchy. We give the detailed generative process of the model, and the corresponding parameter estimation method based on Gibbs sampling.

  • 2.

    It proposes a learning algorithm that embeds KHTM in an iterative way, to continuously improve the learning results.

  • 3.

    It designs a hierarchical structure to maintain knowledge, in which knowledge is represented as word pairs that co-occur frequently under a level constraint.

  • 4.

    It crawled two new multi-domain datasets. Each dataset consists of the paper abstracts of 20 conferences and journals over six years. The first dataset contains abstracts of conferences and journals purely from computer science and the second dataset consists of a mixture of biology, physics and chemistry abstracts. We will release the two datasets publicly to facilitate the related research.

The rest of this paper is organized as follows. Section 2 summarizes the related work. Section 3 states the background model and proposed algorithm. Section 4 elaborates the proposed knowledge-based hierarchical topic model (KHTM). Section 5 presents the experimental results and the analysis. Section 6 concludes the paper.


Related work

Hierarchical topic modeling has attracted much attention in the expert systems and knowledge management communities (Mao, He, Yan, Li, 2012a, Pavlinek, Podgorelec, 2017). The task of hierarchical topic modeling is to learn a topic hierarchy from a text corpus, in which the topics are organized according to their semantic generality. A salient characteristic of hierarchical topic modeling is that, for each parent topic, the number of its associated children topics cannot be known or

Base model and the proposed learning algorithm

In this paper, we employ the hierarchical LDA model (hLDA for short) (Blei et al., 2005) as the base model. In this section, we first briefly review the hLDA model and then present our proposed algorithm.
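
In hLDA, each document is assigned a root-to-leaf path in an infinitely branching tree via the nested Chinese restaurant process (nCRP). The sketch below shows only the path-sampling mechanism with illustrative parameter names; the full hLDA sampler also resamples word-level assignments along the path, which is omitted here:

```python
import random

def crp_draw(counts, gamma, rng):
    """Chinese restaurant process draw: pick an existing child of a
    node in proportion to its count, or a new child with weight gamma."""
    total = sum(counts.values()) + gamma
    r = rng.random() * total
    for child, n in counts.items():
        r -= n
        if r < 0:
            return child
    return max(counts, default=-1) + 1  # open a new child ("new table")

def sample_path(tree, depth, gamma, rng):
    """Sample a root-to-leaf path of the given depth through an nCRP
    tree; `tree` maps a path prefix (tuple) to {child: count}."""
    path = []
    for _ in range(depth):
        counts = tree.setdefault(tuple(path), {})
        child = crp_draw(counts, gamma, rng)
        counts[child] = counts.get(child, 0) + 1
        path.append(child)
    return path

rng = random.Random(0)
tree = {}
# Each call assigns one "document" a three-level path; popular branches
# accumulate counts and attract later documents ("rich get richer").
paths = [sample_path(tree, depth=3, gamma=1.0, rng=rng) for _ in range(5)]
```

The rich-get-richer dynamic of the CRP is what lets hLDA infer the branching factor at each node instead of fixing the number of topics per level in advance.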

The proposed KHTM model

This section elaborates the proposed model KHTM (Algorithm 2), which consists of three components: knowledge mining, knowledge utilization and parameter estimation.
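
One common mechanism for utilizing k-set-style knowledge during parameter estimation in knowledge-based topic models is a generalized Pólya urn (GPU) scheme (Chen et al., 2014), in which assigning a word to a topic also adds fractional mass to the word's k-set mates. The sketch below is illustrative only; the k-sets and promotion weight are invented, and the paper's actual Gibbs sampler is more involved:

```python
from collections import defaultdict

# Hypothetical k-sets: words expected to belong to the same topic at
# some level of the hierarchy (invented for illustration).
ksets = [{"knowledge", "base", "entity"}, {"robot", "motion", "planning"}]
promotion = 0.3  # fractional mass given to k-set mates (illustrative)

# Topic-word pseudo-counts under a generalized Polya urn (GPU) scheme:
# assigning a word to a topic also credits the word's k-set mates,
# pulling related words into the same topic during sampling.
counts = defaultdict(float)

def gpu_increment(topic, word):
    counts[(topic, word)] += 1.0
    for ks in ksets:
        if word in ks:
            for mate in ks - {word}:
                counts[(topic, mate)] += promotion

gpu_increment(0, "knowledge")
print(counts[(0, "base")])  # prints 0.3
```

Because the promoted mass enters the topic-word counts, the sampling distribution for later words is biased toward keeping each k-set together in one topic.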

Experiment setting

Datasets. We crawled two new datasets for the experiments. The first dataset contains the paper abstracts of 20 conferences and journals from computer science over six years, including AAAI, CIKM, CVPR, SIGIR, etc. We name this dataset Computer Science dataset. The second dataset contains the paper abstracts of 20 conferences and journals from biology, chemistry and physics, also over six years. We name this dataset Natural Science dataset. In both datasets, each conference or journal is

Conclusion

In this paper, we proposed a novel hierarchical topic model, KHTM, that can mine knowledge from the topic hierarchies of corpora from multiple domains and leverage such knowledge to learn a better topic hierarchy for the target domain. We also proposed an iterative algorithm that improves the topic hierarchies in a continuous manner. The mined knowledge is organized in a hierarchical structure, and can be maintained and improved. We crawled two new multi-domain datasets. The evaluation results showed

Acknowledgments

This work was supported by the Fundamental Research Fund for Central Universities (No. JBX171007), the National Natural Science Foundation of China (No. 61702391), the Zhejiang Provincial Natural Science Foundation (No. LY12F02003) and the China Postdoctoral Science Foundation (No. 2013M540492).

References

  • M. Pavlinek et al.

    Text classification method based on self-training and LDA topic models

    Expert Systems with Applications (ESWA)

    (2017)
  • Z. Wu et al.

    A topic modeling based approach to novel document automatic summarization

    Expert Systems with Applications (ESWA)

    (2017)
  • H. Zhang et al.

    Improving short text classification by learning vector representations of both words and hidden topics

    Knowledge-Based Systems

    (2016)
  • R. Agrawal et al.

    Fast algorithms for mining association rules in large databases

    Proceedings of international conference on very large data bases (VLDB)

    (1994)
  • D. Andrzejewski et al.

    A framework for incorporating general domain knowledge into latent Dirichlet allocation using first-order logic

    Proceedings of international joint conference on artificial intelligence (IJCAI)

    (2011)
  • D.M. Blei et al.

    The nested Chinese restaurant process and Bayesian nonparametric inference of topic hierarchies

    Journal of the ACM (JACM)

    (2010)
  • D.M. Blei et al.

    Hierarchical topic models and the nested Chinese restaurant process

    Advances in neural information processing systems (NIPS)

    (2005)
  • D.M. Blei et al.

    Latent Dirichlet allocation

    Journal of Machine Learning Research (JMLR)

    (2003)
  • J. Chang et al.

    Reading tea leaves: How humans interpret topic models

    Advances in neural information processing systems (NIPS)

    (2009)
  • Z. Chen et al.

    Aspect extraction with automated prior knowledge learning

    Proceedings of annual meeting of the association for computational linguistics (ACL)

    (2014)
  • T.L. Griffiths et al.

    Finding scientific topics

    Proceedings of the national academy of sciences of the United States of America (PNAS)

    (2004)
  • G. Heinrich

    Parameter estimation for text analysis

    Technical note

    (2008)
  • J. Jagarlamudi et al.

    Incorporating lexical priors into topic models

    Proceedings of the conference of the European chapter of the association for computational linguistics (EACL 12)

    (2012)
  • J.-H. Kang et al.

    Transfer topic modeling with ease and scalability

    Proceedings of SIAM international conference on data mining (SDM)

    (2012)
  • J.H. Kim et al.

    Modeling topic hierarchies with the recursive Chinese restaurant process

    Proceedings of ACM conference on information and knowledge management (CIKM)

    (2012)