Semantic multi-grain mixture topic model for text analysis

https://doi.org/10.1016/j.eswa.2010.08.146

Abstract

Granular topic extraction and modeling are fundamental tasks in text analysis. Hierarchical topic clustering algorithms and hierarchical topic models are usually employed for these purposes. However, it is difficult to draw a clear distinction between pairs of hierarchical topics from the point of view of semantic granularity. STG (semantic topic granularity) is proposed to indicate the degree of detail of a topic description and to provide a semantic basis for discriminating between topics. A new model, mgMTM (multi-grain mixture topic model), based on STG is then proposed to model grain topics. DCT (discrete cosine transform) is employed to provide a mechanism for computing STG, extracting grain topics and learning mgMTM. Experiments on real-world datasets show that the proposed model has a lower perplexity score than the LDA model and thus better generalization performance in describing text. Experiments also show that the descriptions of the extracted grain topics can be well explained with respect to a dataset covering topics about the recent global financial crisis.

Research highlights

► The degree of generality of a topic can be quantified by topic granularity. ► DCT provides a mechanism for computing semantic topic granularity. ► A mixture semantic topic model is proposed to describe multi-grain topics.

Introduction

An enormous amount of text is produced every day on the Internet. Hence, automatic analysis of text is strongly required to provide recognizable topics for various topic-based analysis tasks, such as opinion extraction, topic propagation and topic evolution (Mei et al., 2007, Zeng et al., 2009). For these purposes, many topic extraction algorithms and topic models have been devised to help find the topics hidden in a text corpus (Zeng and Zhang, 2007, Zeng and Zhang, 2009, Zhong and Ghosh, 2005). However, topics are inherently hierarchical, or granular; as a result, finding hierarchical topic structures becomes an important task in topic extraction.

A typical method is to utilize hierarchical clustering algorithms (Gil-García, Badía-Contelles, & Pons-Porrata, 2006). Each document is considered as a point in a high-dimensional space composed of words, and a merge operation on the two most similar points is performed. By iterating the merge step, a hierarchical structure can finally be generated (Duan et al., 2005; Gil-García et al., 2006). Several improvements on the basic algorithms have been made concerning the similarity measurement (Wang & Imad, 2007), the merge and split strategies (Rodrigues, Gama, & Pedroso, 2008), etc. Considering a document as a probability distribution is another choice. In this case, document similarity is usually measured by a probability distance, such as the KL-divergence (Zhong, 2003). Zhong applied a model-based k-means clustering algorithm, treating each document as a multinomial model, and a general document hierarchical clustering algorithm was proposed (Zhong, 2003, Zhong and Ghosh, 2005). Several efficient algorithms introduce hashing techniques to produce a hierarchical decomposition of the topic space based on the asymmetric relationships between terms (Gollapudi & Panigraphy, 2006).
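As a concrete illustration of the distribution-based variant described above, the sketch below is our own minimal example, not an algorithm from the cited papers: each document is treated as a multinomial word distribution, and the two closest clusters under a symmetric KL-divergence are merged repeatedly. The function names and the toy data are hypothetical.

```python
# Illustrative sketch: agglomerative clustering of documents treated as
# multinomial word distributions, merged by symmetric KL-divergence.
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence between two word distributions (smoothed to avoid log 0)."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def agglomerate(doc_word_counts, target_clusters=2):
    """Greedily merge the two closest clusters until the target number is reached."""
    clusters = [c.astype(float) for c in doc_word_counts]   # one count vector per cluster
    history = []
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = kl(clusters[i], clusters[j]) + kl(clusters[j], clusters[i])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]                   # pooled word counts of the merged pair
        history.append((i, j))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, history

# Toy usage: four documents over a five-word vocabulary.
docs = np.array([[5, 1, 0, 0, 0],
                 [4, 2, 0, 0, 1],
                 [0, 0, 6, 3, 0],
                 [0, 1, 5, 4, 0]])
print(agglomerate(list(docs), target_clusters=2)[1])
```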

As another approach, hierarchical topic models like hLDA (Blei, Griffiths, & Jordan, 2004) and Pachinko allocation (Li & McCallum, 2006) model topics in a hierarchical structure. These topic models describe semantic interactions between topics, typically at the document level. In the hLDA model, each document is assigned to a path through a hierarchical topic tree, and each word in a given document is assigned to a topic at one of the levels of that path. MGLDA was proposed to model opinion reviews at the document or sentence level based on LDA; however, it only produces topics of two granularities, global and local (Titov & McDonald, 2008).

Obviously, the two approaches provide natural ways to generate hierarchical topic structures. However, as far as hierarchical topic clustering algorithms are concerned, the clusters are generated in the sense of topic similarity in a measurement space that consists only of feature words selected from the original documents. Some words valuable for expressing granularity are usually omitted in this process; as a result, general semantic topics have to be induced by users. Hierarchical topic models describe the word distributions between neighboring hierarchical levels, and the granularity, that is, the hierarchical level, lacks a comparable indication. Hence, from the semantic granularity point of view, it is difficult to draw a clear distinction between pairs of hierarchical topics via the two approaches, which prevents improvement of the accuracy of topic modeling.

We first explore topic granularity, which is closely related to semantic characteristics, and STG (semantic topic granularity) is proposed to provide a semantic discrimination indicator for topics. By considering the topics in a corpus as a mixture of STG topics, a new model, mgMTM (multi-grain mixture topic model), is proposed to model multi-grain topics. DCT (discrete cosine transform) is employed to provide a mechanism for computing STG and extracting grain topics in learning mgMTM.
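The exact computation of STG is deferred to Section 2. As a purely illustrative sketch under our own assumptions, the fragment below derives a granularity indicator from how strongly the energy of a topic-word distribution is concentrated in its low-frequency DCT coefficients, by analogy with image compression. The function name, the sorting step and the convention that a higher value means a coarser topic are assumptions, not the paper's definition.

```python
# Illustrative sketch only: an assumed granularity indicator based on the
# low-frequency energy concentration of a topic-word distribution's DCT.
import numpy as np
from scipy.fft import dct

def stg_indicator(topic_word_dist, low_freq_fraction=0.1):
    """Fraction of DCT energy in the lowest-frequency coefficients of the
    (sorted) topic-word distribution; higher values suggest a coarser, more
    general topic, lower values a finer-grained one (assumed convention)."""
    p = np.sort(np.asarray(topic_word_dist, dtype=float))[::-1]  # sort weights, largest first
    coeffs = dct(p, norm='ortho')                                # 1-D DCT-II of the distribution
    k = max(1, int(low_freq_fraction * len(coeffs)))
    return float(np.sum(coeffs[:k] ** 2) / np.sum(coeffs ** 2))

# Toy usage: a flat (general) topic vs. a peaked (specific) topic over 1000 words.
general = np.full(1000, 1.0 / 1000)
specific = np.r_[np.full(10, 0.09), np.full(990, 0.1 / 990)]
print(stg_indicator(general), stg_indicator(specific))
```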

The main contributions of the paper are as follows:

  • (1) Semantic topic granularity is proposed to provide a mechanism for describing the semantic degree of detail of selected topics. Inspired by the expression of granularity in image compression, DCT is employed to compute the value of STG.

  • (2) A mixture STG topic model is proposed to describe multi-grain topics. A quantization method to determine the feature words of a grain topic is proposed, based on which an STG topic extraction algorithm is proposed to construct the mixture model (see the sketch after this list).

  • (3) Extensive experiments are performed on real-world datasets. The validity of the proposed model and algorithms is verified by means of perplexity and by explanation of the generated topic descriptions.
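As a purely illustrative guess at what the quantization step in contribution (2) might look like (the paper's own rule is given later), the sketch below applies a JPEG-style quantization to the DCT coefficients of a topic-word distribution, inverts the transform, and keeps the words whose reconstructed weight exceeds a threshold. The step size and threshold are hypothetical parameters.

```python
# Purely illustrative guess, not the paper's quantization rule: quantize the
# DCT coefficients of a topic-word distribution, invert, and keep the words
# whose reconstructed weight survives as feature words of the grain topic.
import numpy as np
from scipy.fft import dct, idct

def grain_feature_words(topic_word_dist, step=1e-3, keep_above=1e-4):
    p = np.asarray(topic_word_dist, dtype=float)
    coeffs = dct(p, norm='ortho')
    quantized = np.round(coeffs / step) * step   # coarse quantization drops small coefficients
    recon = idct(quantized, norm='ortho')        # reconstructed, smoothed distribution
    return np.flatnonzero(recon > keep_above)    # indices of surviving feature words
```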

The organization of the paper is as follows. In the next section, we present some basic term definitions and derive the definition of STG based on DCT. In Section 3, we describe the multi-grain mixture topic model and the model learning algorithm. In the fourth section, we describe the experiments on real-world datasets and analyze the results, including perplexity scores and topic descriptions. In Section 5, conclusions and future work are given.

Section snippets

Inferring semantic topic granularity

Topic grain has been used to express the degree of detail of topics in a corpus. For example, the hierarchical level generated by hierarchical topic clustering algorithms can be considered as a grain indicator: topics at upper levels of the hierarchical tree are more general than those at lower levels. Although a multi-grain topic model is proposed in Titov and McDonald (2008), topics are roughly classified into only two grains, that is, global topics and local topics. However, the kind of topic…

Multi-grain mixture topic model

For a corpus or document, there exist several granularities that describe different degrees of detail of the topics. To describe such topics, a multi-grain mixture topic model, in which each mixture component corresponds to a semantic granularity, is described in this section.
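The snippet above only states that each mixture component corresponds to one semantic granularity. Below is a minimal sketch of one plausible form of the word likelihood, with per-grain topic-word distributions phi and per-grain document-topic weights theta combined by grain mixing weights pi; this parameterization is our assumption, not necessarily the paper's exact model.

```python
# Minimal sketch (assumed form): word likelihood in a multi-grain mixture,
# where each grain g contributes its own topic-word distributions phi[g] and
# document-topic weights theta[g], combined by grain mixing weights pi.
import numpy as np

def word_prob(w, pi, theta, phi):
    """p(w | d) = sum_g pi[g] * sum_t theta[g][t] * phi[g][t, w]."""
    return sum(pi[g] * float(theta[g] @ phi[g][:, w]) for g in range(len(pi)))

# Toy usage: 2 grains, 2 topics per grain, vocabulary of 4 words.
rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4])                                    # grain mixing weights
theta = [rng.dirichlet(np.ones(2)) for _ in range(2)]        # per-grain document-topic weights
phi = [rng.dirichlet(np.ones(4), size=2) for _ in range(2)]  # per-grain topic-word distributions
print(word_prob(1, pi, theta, phi))
```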

Experiments and analysis

In this section, experiments are performed on several real-world datasets, and an empirical evaluation of mgMTM is given by means of quantitative and qualitative analysis. For the quantitative analysis, we show that the perplexity of a held-out test set under mgMTM is smaller than that under LDA. For the qualitative analysis, we show that the grain topics inferred by mgMTM do correspond to different degrees of detail.
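The perplexity used for such comparisons is the standard held-out measure: the exponential of the negative average per-word log-likelihood, so lower values indicate better generalization. A minimal sketch follows; the word_prob_fn argument stands for any document-level word probability, such as the mixture sketched above.

```python
# Standard held-out perplexity: exp of the negative average log-likelihood per word.
import numpy as np

def perplexity(docs, word_prob_fn):
    """docs: list of word-id lists; word_prob_fn(w, doc_index) -> p(w | d)."""
    log_lik, n_words = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            log_lik += np.log(word_prob_fn(w, d))
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```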

Conclusion and future work

To provide a clear distinction between pairs of hierarchical topics, STG is proposed to indicate the degree of detail of a topic description and to provide semantic discrimination for topics. A multi-grain mixture topic model based on STG is then proposed to model grain topics. Experiments on real-world datasets show that the proposed model has a lower perplexity score than the LDA model and thus better generalization performance in describing text. Recognizable grain topics can…

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Grant No. 61073170) and the Shanghai Leading Academic Discipline Project (Project No. B114). It was also supported by the career development plan for new teachers of Fudan University.

References (20)

  • J.P. Zeng et al.

Variable space hidden Markov model for topic detection and analysis

    Knowledge-Based Systems

    (2007)
  • J.P. Zeng et al.

    Incorporating topic transition in topic detection and tracking algorithms

    Expert Systems With Applications

    (2009)
  • Biem, A. (2003). A model selection criterion for classification: Application to HMM topology optimization. In The...
  • Blei, D., Griffiths, T., & Jordan, M. (2004). Hierarchical topic models and the nested Chinese restaurant process. In...
  • D. Blei et al.

    Latent Dirichlet allocation

    Journal of Machine Learning Research

    (2003)
  • Duan, J. J., Wang, W., Liu, B., Xue, Y., Zhou, H. F., & Shi, B. L. (2005). Incorporating with recursive model training...
  • Gil-García, R. J., Badía-Contelles, J. M., & Pons-Porrata, A. (2006). A general framework for agglomerative...
  • Gollapudi, S., & Panigraphy, R. (2006). Exploiting asymmetry in hierarchical topic extraction. In Proceedings of...
  • G.K. Wallace

    The JPEG still picture compression standard

    Communications of the ACM

    (1991)
  • Gruber, A., Weiss, Y., & Rosen-Zvi, M. (2007). Hidden topic Markov models. In Proceedings of the conference on...
There are more references available in the full text version of this article.

Cited by (10)

  • Analyzing the discriminative attributes of products using text mining focused on cosmetic reviews

    2018, Information Processing and Management
    Citation Excerpt:

    Researchers have studied “Opinion Mining” for more detailed topics and features in comparison to general opinion mining. Zeng, Duan, Wang, and Wu (2011) proposed a method of extracting detailed topics from documents using a multi-grain mixture topic model. Xianghua, Guo, Yanyan, and Zhiqiang (2013) extracted topics using the LDA model and classified the sentiment by using the HowNet lexicon.

  • A step forward for Topic Detection in Twitter: An FCA-based approach

    2016, Expert Systems with Applications
    Citation Excerpt:

    FCA theory defines a partial order relationship of the formal concepts that leads to the automatic construction of a concept lattice: a hierarchical data representation that explores correlations, similarities, anomalies, or even inconsistencies in the data structures. Hierarchical representations are desirable in order to deal with the Topic Detection task because, as stated by Zeng et al. (2011), topics are inherently hierarchical and the finding of such hierarchical topic structures becomes a crucial step. In contrast, many of the aforementioned techniques, no matter the approach followed (classification, matrix factorization, clustering, probabilistic or graph based), propose flat topic representations, which do not capture the topic-inherent hierarchy.

  • Web objectionable text content detection using topic modeling technique

    2013, Expert Systems with Applications
    Citation Excerpt:

    Content detection is achieved by probability calculation in a semantic space, so that sentence with objectionable meanings but without objectionable words can be detected. Semantic space can be modeled by means of several semantic topic modeling techniques, such as Latent Semantic Analysis (LSA) (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990), Probabilistic LSA (PLSA) (Hofmann, 1990), and other similar topic models (Zeng & Zhang, 2007; Zeng, Duan, Wang, & Wu, 2011). LSA takes a matrix of word occurrence in documents as the input, and then singular value decomposition (SVD) is applied to the matrix to deduce three other matrices.

  • Subtopic Detection Algorithm Based on Hierarchical Clustering

    2019, Huanan Ligong Daxue Xuebao/Journal of South China University of Technology (Natural Science)
  • Content extraction from advertisement display boards utilizing Region growing algorithm

    2016, 2016 IEEE International Conference on Advances in Electronics, Communication and Computer Technology, ICAECCT 2016