Semantic multi-grain mixture topic model for text analysis

https://doi.org/10.1016/j.eswa.2010.08.146

Abstract

Granular topic extraction and modeling are fundamental tasks in text analysis. Hierarchical topic clustering algorithms and hierarchical topic models are usually employed for these purposes. However, it is difficult to draw a clear distinction between pairs of hierarchical topics from the point of view of semantic granularity. STG (semantic topic granularity) is proposed to indicate the degree of detail of a topic description and to provide a semantic basis for discriminating between topics. A new model, mgMTM (multi-grain mixture topic model), based on STG is then proposed to model grain topics. DCT (discrete cosine transform) is employed to provide a mechanism for computing STG, extracting grain topics and learning mgMTM. Experiments on real-world datasets show that the proposed model has a lower perplexity score than the LDA model and thus better generalization performance in describing text. Experiments also show that the descriptions of the extracted grain topics can be well explained with respect to a dataset covering topics about the recent global financial crisis.

Research highlights

► The degree of generality of a topic can be quantified by topic granularity. ► DCT provides a mechanism for computing semantic topic granularity. ► A mixture semantic topic model is proposed to describe multi-grain topics.

Introduction

An enormous amount of text is produced every day on the Internet. Hence, automatic analysis of text is strongly required to provide recognizable topics for various topic-based analysis tasks, such as opinion extraction, topic propagation and topic evolution (Mei et al., 2007, Zeng et al., 2009). For these purposes, many topic extraction algorithms and topic models have been devised to help find the topics hidden in a text corpus (Zeng and Zhang, 2007, Zeng and Zhang, 2009, Zhong and Ghosh, 2005). However, topics are inherently hierarchical, or granular; as a result, finding hierarchical topic structures becomes an important task in topic extraction.

A typical method is to utilize hierarchical clustering algorithms (Gil-García, Badía-Contelles, & Pons-Porrata, 2006). Each document is considered as a point in a high-dimensional space composed of words, and a merge operation on the two most similar points is performed. By iterating the merge step, a hierarchical structure can finally be generated (Duan et al., 2005; Gil-García et al., 2006). Several improvements on the basic algorithms have been made concerning the similarity measurement (Wang & Imad, 2007), the merge and split strategies (Rodrigues, Gama, & Pedroso, 2008), etc. Considering a document as a probability distribution is another choice. In this case, document similarity is usually measured by a probability distance, such as the KL-divergence (Zhong, 2003). Zhong applied a model-based k-means clustering algorithm, treating each document as a multinomial model, and a general document hierarchical clustering algorithm was proposed (Zhong, 2003, Zhong and Ghosh, 2005). Several efficient algorithms introduce hashing techniques to produce a hierarchical decomposition of the topic space based on the asymmetric relationships between terms (Gollapudi & Panigraphy, 2006).
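As a concrete illustration of the distribution-based variant described above, the sketch below is our own minimal example, not an algorithm from the cited papers: each document is treated as a multinomial word distribution, and the two closest clusters under a symmetric KL-divergence are merged repeatedly. The function names and the toy data are hypothetical.

```python
# Illustrative sketch: agglomerative clustering of documents treated as
# multinomial word distributions, merged by symmetric KL-divergence.
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two word distributions (smoothed to avoid log 0)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def agglomerate(doc_word_counts, target_clusters=2):
    """Greedily merge the two closest clusters until the target number is reached."""
    clusters = [c.astype(float) for c in doc_word_counts]   # one count vector per cluster
    history = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = kl(clusters[i], clusters[j]) + kl(clusters[j], clusters[i])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]                   # pooled word counts of the merged pair
        history.append((i, j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history

# Toy usage: four documents over a five-word vocabulary.
docs = np.array([[5, 1, 0, 0, 0],
                 [4, 2, 0, 0, 1],
                 [0, 0, 6, 3, 0],
                 [0, 1, 5, 4, 0]])
print(agglomerate(list(docs), target_clusters=2)[1])
```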

As another approach, hierarchical topic models like hLDA (Blei, Griffiths, & Jordan, 2004) and Pachinko allocation (Li & McCallum, 2006) model topics in a hierarchical structure. These topic models describe semantic interactions between topics, typically at the document level. In the hLDA model, each document is assigned to a path through a hierarchical topic tree, and each word in a given document is assigned to a topic at one of the levels of that path. MGLDA was proposed to model opinion reviews at the document or sentence level based on LDA; however, it only produces topics of two granularities, global and local (Titov & McDonald, 2008).

Obviously, the two approaches provide natural ways to generate hierarchical topic structures. However, as far as hierarchical topic clustering algorithms are concerned, the clusters are generated in the sense of topic similarity in a measurement space that consists only of feature words selected from the original documents. Some words valuable for expressing granularity are usually omitted in this process; as a result, general semantic topics have to be induced by users. Hierarchical topic models describe the word distributions between neighboring hierarchical levels, and the granularity, that is, the hierarchical level, lacks a comparable indication. Hence, from the semantic granularity point of view, it is difficult to draw a clear distinction between pairs of hierarchical topics via the two approaches, which prevents improvement of the accuracy of topic modeling.

We first explore topic granularity, which is closely related to semantic characteristics, and STG (semantic topic granularity) is proposed to provide a semantic discrimination indicator for topics. By considering the topics in a corpus as a mixture of STG topics, a new model, mgMTM (multi-grain mixture topic model), is proposed to model multi-grain topics. DCT (discrete cosine transform) is employed to provide a mechanism for computing STG and extracting grain topics in learning mgMTM.
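The exact computation of STG is deferred to Section 2. As a purely illustrative sketch under our own assumptions, the fragment below derives a granularity indicator from how strongly the energy of a topic-word distribution is concentrated in its low-frequency DCT coefficients, by analogy with image compression. The function name, the sorting step and the convention that a higher value means a coarser topic are assumptions, not the paper's definition.

```python
# Illustrative sketch only: an assumed granularity indicator based on the
# low-frequency energy concentration of a topic-word distribution's DCT.
import numpy as np
from scipy.fft import dct

def stg_indicator(topic_word_dist, low_freq_fraction=0.1):
    """Fraction of DCT energy in the lowest-frequency coefficients of the
    (sorted) topic-word distribution; higher values suggest a coarser, more
    general topic, lower values a finer-grained one (assumed convention)."""
    p = np.sort(np.asarray(topic_word_dist, dtype=float))[::-1]  # sort weights, largest first
    coeffs = dct(p, norm='ortho')                                # 1-D DCT-II of the distribution
    k = max(1, int(low_freq_fraction * len(coeffs)))
    return float(np.sum(coeffs[:k] ** 2) / np.sum(coeffs ** 2))

# Toy usage: a flat (general) topic vs. a peaked (specific) topic over 1000 words.
general = np.full(1000, 1.0 / 1000)
specific = np.r_[np.full(10, 0.09), np.full(990, 0.1 / 990)]
print(stg_indicator(general), stg_indicator(specific))
```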

The main contributions of the paper are as follows:

  • (1) Semantic topic granularity is proposed to provide a mechanism for describing the semantic degree of detail of selected topics. Inspired by the expression of granularity in image compression, DCT is employed to compute the value of STG.

  • (2) A mixture STG topic model is proposed to describe multi-grain topics. A quantization method to determine the feature words of a grain topic is proposed, based on which an STG topic extraction algorithm is proposed to construct the mixture model (see the sketch after this list).

  • (3) Extensive experiments are performed on real-world datasets. The validity of the proposed model and algorithms is verified by means of perplexity and by explanation of the generated topic descriptions.
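As a purely illustrative guess at what the quantization step in contribution (2) might look like (the paper's own rule is given later), the sketch below applies a JPEG-style quantization to the DCT coefficients of a topic-word distribution, inverts the transform, and keeps the words whose reconstructed weight exceeds a threshold. The step size and threshold are hypothetical parameters.

```python
# Purely illustrative guess, not the paper's quantization rule: quantize the
# DCT coefficients of a topic-word distribution, invert, and keep the words
# whose reconstructed weight survives as feature words of the grain topic.
import numpy as np
from scipy.fft import dct, idct

def grain_feature_words(topic_word_dist, step=1e-3, keep_above=1e-4):
    p = np.asarray(topic_word_dist, dtype=float)
    coeffs = dct(p, norm='ortho')
    quantized = np.round(coeffs / step) * step   # coarse quantization drops small coefficients
    recon = idct(quantized, norm='ortho')        # reconstructed, smoothed distribution
    return np.flatnonzero(recon > keep_above)    # indices of surviving feature words
```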

The organization of the paper is as follows. In the next section, we present some basic term definitions and derive the definition of STG based on DCT. In Section 3, we describe the multi-grain mixture topic model and the model learning algorithm. In the fourth section, we describe the experiments on real-world datasets and analyze the results, including perplexity scores and topic descriptions. In Section 5, conclusions and future work are given.

Section snippets

Inferring semantic topic granularity

Topic grain has been used to express the degree of detail of topics in a corpus. For example, the hierarchical level generated by hierarchical topic clustering algorithms can be considered as a grain indicator: topics at upper levels of the hierarchical tree are more general than those at lower levels. Although a multi-grain topic model is proposed in Titov and McDonald (2008), topics are roughly classified into only two grains, that is, global topics and local topics. However, the kind of topic…

Multi-grain mixture topic model

For a corpus or document, there exist several granularities that describe different degrees of detail of the topics. To describe such topics, a multi-grain mixture topic model, in which each mixture component corresponds to a semantic granularity, is described in this section.
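The snippet above only states that each mixture component corresponds to one semantic granularity. Below is a minimal sketch of one plausible form of the word likelihood, with per-grain topic-word distributions phi and per-grain document-topic weights theta combined by grain mixing weights pi; this parameterization is our assumption, not necessarily the paper's exact model.

```python
# Minimal sketch (assumed form): word likelihood in a multi-grain mixture,
# where each grain g contributes its own topic-word distributions phi[g] and
# document-topic weights theta[g], combined by grain mixing weights pi.
import numpy as np

def word_prob(w, pi, theta, phi):
    """p(w | d) = sum_g pi[g] * sum_t theta[g][t] * phi[g][t, w]."""
    return sum(pi[g] * float(theta[g] @ phi[g][:, w]) for g in range(len(pi)))

# Toy usage: 2 grains, 2 topics per grain, vocabulary of 4 words.
rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])                                    # grain mixing weights
theta = [rng.dirichlet(np.ones(2)) for _ in range(2)]        # per-grain document-topic weights
phi = [rng.dirichlet(np.ones(4), size=2) for _ in range(2)]  # per-grain topic-word distributions
print(word_prob(1, pi, theta, phi))
```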

Experiments and analysis

In this section, experiments are performed on several real-world datasets, and an empirical evaluation of mgMTM is given by means of quantitative and qualitative analysis. For the quantitative analysis, we show that the perplexity of a held-out test set under mgMTM is smaller than that under LDA. For the qualitative analysis, we show that the grain topics inferred by mgMTM do correspond to different degrees of detail.
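The perplexity used for such comparisons is the standard held-out measure: the exponential of the negative average per-word log-likelihood, so lower values indicate better generalization. A minimal sketch follows; the word_prob_fn argument stands for any document-level word probability, such as the mixture sketched above.

```python
# Standard held-out perplexity: exp of the negative average log-likelihood per word.
import numpy as np

def perplexity(docs, word_prob_fn):
    """docs: list of word-id lists; word_prob_fn(w, doc_index) -> p(w | d)."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            log_lik += np.log(word_prob_fn(w, d))
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```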

Conclusion and future work

To provide a clear distinction between pairs of hierarchical topics, STG is proposed to indicate the degree of detail of a topic description and to provide semantic discrimination for topics. A multi-grain mixture topic model based on STG is then proposed to model grain topics. Experiments on real-world datasets show that the proposed model has a lower perplexity score than the LDA model and thus better generalization performance in describing text. Recognizable grain topics can…

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 61073170) and the Shanghai Leading Academic Discipline Project (Project No. B114). It was also supported by the career development plan for new teachers of Fudan University.

References (20)

  • J.P. Zeng et al.

Variable space hidden Markov model for topic detection and analysis

    Knowledge-Based Systems

    (2007)
  • J.P. Zeng et al.

    Incorporating topic transition in topic detection and tracking algorithms

    Expert Systems With Applications

    (2009)
  • Biem, A. (2003). A model selection criterion for classification: Application to HMM topology optimization. In The...
  • Blei, D., Griffiths, T., & Jordan, M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In...
  • D. Blei et al.

    Latent Dirichlet allocation

    Journal of Machine Learning Research

    (2003)
  • Duan, J. J., Wang, W., Liu, B., Xue, Y., Zhou, H. F., & Shi, B. L. (2005). Incorporating with recursive model training...
  • Gil-García, R. J., Badía-Contelles, J. M., & Pons-Porrata, A. (2006). A general framework for agglomerative...
  • Gollapudi, S., & Panigraphy, R. (2006). Exploiting asymmetry in hierarchical topic extraction. In Proceedings of...
  • G.K. Wallace

    The JPEG still picture compression standard

    Communications of the ACM

    (1991)
  • Gruber, A., Weiss, Y., & Rosen-Zvi, M. (2007). Hidden topic Markov models. In Proceedings of the conference on...
There are more references available in the full text version of this article.

Cited by (10)

  • Analyzing the discriminative attributes of products using text mining focused on cosmetic reviews

    2018, Information Processing and Management
    Citation Excerpt:

    Researchers have studied “Opinion Mining” for more detailed topics and features in comparison to general opinion mining. Zeng, Duan, Wang, and Wu (2011) proposed a method of extracting detailed topics from documents using a multi-grain mixture topic model. Xianghua, Guo, Yanyan, and Zhiqiang (2013) extracted topics using the LDA model and classified the sentiment by using the HowNet lexicon.

  • A step forward for Topic Detection in Twitter: An FCA-based approach

    2016, Expert Systems with Applications
    Citation Excerpt:

    FCA theory defines a partial order relationship of the formal concepts that leads to the automatic construction of a concept lattice: a hierarchical data representation that explores correlations, similarities, anomalies, or even inconsistencies in the data structures. Hierarchical representations are desirable in order to deal with the Topic Detection task because, as stated by Zeng et al. (2011), topics are inherently hierarchical and the finding of such hierarchical topic structures becomes a crucial step. In contrast, many of the aforementioned techniques, no matter the approach followed (classification, matrix factorization, clustering, probabilistic or graph based), propose flat topic representations, which do not capture the topic-inherent hierarchy.

  • Web objectionable text content detection using topic modeling technique

    2013, Expert Systems with Applications
    Citation Excerpt:

    Content detection is achieved by probability calculation in a semantic space, so that sentence with objectionable meanings but without objectionable words can be detected. Semantic space can be modeled by means of several semantic topic modeling techniques, such as Latent Semantic Analysis (LSA) (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990), Probabilistic LSA (PLSA) (Hofmann, 1990), and other similar topic models (Zeng & Zhang, 2007; Zeng, Duan, Wang, & Wu, 2011). LSA takes a matrix of word occurrence in documents as the input, and then singular value decomposition (SVD) is applied to the matrix to deduce three other matrices.

  • Subtopic Detection Algorithm Based on Hierarchical Clustering

    2019, Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science)
  • Content extraction from advertisement display boards utilizing Region growing algorithm

    2016, 2016 IEEE International Conference on Advances in Electronics, Communication and Computer Technology, ICAECCT 2016