Extracting and tracking hot topics of micro-blogs based on improved Latent Dirichlet Allocation

doi:10.1016/j.engappai.2019.103279

Engineering Applications of Artificial Intelligence

Volume 87, January 2020, 103279

https://doi.org/10.1016/j.engappai.2019.103279

Abstract

Micro-blogs have changed how people live, study, and work. Every day, we want to know what public opinion events occur and how they evolve. Extracting and tracking these topics correctly helps us better understand the latest public opinions and follow their evolution. To extract topics from Micro-blog posts accurately, we adopt five unique features of micro-blogs to drive the joint probability distributions of all words and topics, and improve LDA into our topic extraction model (named MF-LDA). To track the evolution trend of a topic, we propose a hot topic life cycle model (named HTLCM). We divide the HTLCM into five stages, namely, birth, growth, maturity, decline, and disappearance. The HTLCM determines whether a topic is a candidate hot topic and estimates the evolution stage of each hot topic. In addition, we propose a hot topic tracking (HTT) algorithm that integrates MF-LDA and HTLCM. First, the HTT assigns candidate hot topics, which are labeled by HTLCM, to the corresponding time window according to their release time. Second, to obtain the hot topics in each time window, we input the Micro-blog posts of each time window into MF-LDA in order. By analyzing changes in these hot topics, we track the changes in their contents. The experimental results show that MF-LDA has a lower perplexity and a higher coverage rate than LDA under the same conditions. We determine the parameters of the transition regions of our proposed HTLCM. The MR and FR of the HTLCM are lower than 18%. The average P, R, and F of the HTT algorithm are 85.64%, 84.97%, and 85.66%, respectively. A practical application to the topic “Female driver beats male driver in Chengdu” demonstrates the effectiveness and practical significance of the HTLCM and the HTT algorithm in extracting and tracking hot topics.

Introduction

With the rapid development of communication technologies and the popularization of smartphones, more and more people are using the mobile Internet. In December 2017, the number of Internet users in China reached 731 million, of whom 695 million were mobile Internet users. This proportion increased from 90.1% (at the end of 2015) to 95.1% (Anon., 2019). The rapid development of the mobile Internet has driven the rapid growth of social network platforms, such as Sina Micro-blog.

Registered users of Sina Micro-blog share videos, images, and text messages of up to 140 words with other users. Micro-blog platforms carry hundreds of millions of data flows every day. The data cover all aspects of human life and contain abundant valuable information.

Micro-blog hot topics usually refer to sudden public events and important published information that cause resonance and intense discussion among the public. In current Micro-blog posts, some text is embedded between two “#” labels, such as “#9.3 anti-war victory parade#”. We define text in this format as explicit topics. However, when people publish Micro-blog posts, they rarely take the initiative to add “#” labels to mark a widely discussed topic. We refer to such topics, which are hidden in Micro-blog posts, as implicit topics. Thus, we cannot easily extract these topics from Micro-blog posts manually.

Traditional technologies for extracting and tracking topics focus on long text. The text of Micro-blogs is short and messily formatted. Applying these technologies to Micro-blog posts gives poor results because of the high sparsity of the data. Nowadays, a growing body of research is accelerating the development of hot topic extraction technologies for emerging social platforms. To extract hot topics, term frequency–inverse document frequency (TF–IDF) computes statistics of the words in a document (Li et al., 2018). However, such techniques do not take into account the semantic meaning of the documents. Some probabilistic topic models for extracting hot topics from long texts achieve favorable results (Zhou and Chen, 2014). However, these models are not suitable for extracting hot topics from short texts (such as Micro-blog, QQ, etc.).

Moreover, once we find hot topics of interest on social networks, we want to know whether they will evolve into public opinion. At present, many research works pay close attention to hot topic evolution, such as event-based information organization approaches (Allan, 2002) and grey system theory approaches (Wang et al., 2014b). However, these methods have difficulty tracking the evolution of hot topics in short texts (Wan et al., 2019).

In this paper, we focus on extracting hot topics from short Micro-blog posts and tracking their evolution on Micro-blog social networks.
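
The explicit topics described in the Introduction (text enclosed between two topic labels) can be pulled out with a simple pattern match. The following is a minimal Python sketch, not the authors' method; the function name `extract_explicit_topics` and the set of accepted label characters are assumptions made for illustration.

```python
import re
from typing import List

# Text enclosed between two topic labels. The accepted label characters
# (ASCII "#", full-width "＃", and "♯") are an assumption for illustration.
EXPLICIT_TOPIC = re.compile(r"[#＃♯]([^#＃♯\n]+)[#＃♯]")


def extract_explicit_topics(post: str) -> List[str]:
    """Return the explicit topics found in one post, stripped of whitespace."""
    return [m.strip() for m in EXPLICIT_TOPIC.findall(post) if m.strip()]


if __name__ == "__main__":
    posts = [
        "#9.3 anti-war victory parade# watching it live",
        "traffic was terrible this morning",  # no label: at best an implicit topic
    ]
    for post in posts:
        print(extract_explicit_topics(post))
```

Posts without labels return an empty list, which is exactly the case of implicit topics that a topic model has to handle.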

Section snippets

Related works

Extracting and tracking topics (ETT) is an information technology that helps people cope with the growing amount of Internet information. This technology identifies new topics in the news media information flow and keeps track of unknown topics. ETT includes five specific subtasks (Allan, 2002), namely, story segmentation, topic tracking, topic detection, first-story detection, and link detection. Methods for solving these tasks mainly consider the probability distribution of the topic words in the

Micro-blog features

The LDA model shows excellent performance in extracting topics from long texts such as web pages and news. Micro-blog posts, however, are short texts. If the LDA model is used directly to extract topics from Micro-blog posts, it is limited by the sparse data of Micro-blog text and cannot achieve good performance. In addition, some features of Micro-blog text (such as praises, posting users, forwarding numbers, etc.) are not available in traditional long texts. The LDA model

Micro-blog hot topic tracking

In this section, we divide the life cycle of a hot topic into five stages. Our main task is to build a life cycle model for each hot topic. We continuously revise the parameters by integrating the life cycle models of the hot topics, and propose a new algorithm named Hot Topic Tracking (HTT) that incorporates the MF-LDA model. This algorithm not only tracks hot topics but also pre-identifies new topics from new Micro-blog posts and determines whether these topics will become hot topics.
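
As summarized in the abstract, HTT assigns items to time windows according to their release time and then feeds each window to the topic model in order. The sketch below illustrates only that windowing step in Python. The record layout `(release time, text)`, the fixed window width, and the `extract_topics` callable (a stand-in for MF-LDA) are all assumptions, not details taken from the paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Tuple

Post = Tuple[datetime, str]  # (release time, text): a hypothetical record layout


def assign_to_windows(posts: List[Post], start: datetime,
                      width: timedelta) -> Dict[int, List[str]]:
    """Group post texts by the index of the window containing their release time."""
    windows: Dict[int, List[str]] = defaultdict(list)
    for released, text in posts:
        if released < start:
            continue  # skip posts published before the first window
        index = (released - start) // width  # timedelta // timedelta yields an int
        windows[index].append(text)
    return dict(windows)


def track(posts: List[Post], start: datetime, width: timedelta,
          extract_topics: Callable[[List[str]], object]) -> List[Tuple[int, object]]:
    """Apply the topic extractor to each non-empty window in chronological order."""
    windows = assign_to_windows(posts, start, width)
    return [(i, extract_topics(windows[i])) for i in sorted(windows)]


if __name__ == "__main__":
    sample = [
        (datetime(2019, 1, 1, 8), "post a"),
        (datetime(2019, 1, 1, 20), "post b"),
        (datetime(2019, 1, 3, 9), "post c"),
    ]
    # len is used here only as a trivial placeholder for a real topic model.
    print(track(sample, datetime(2019, 1, 1), timedelta(days=1), len))
```

Keeping the extractor as a parameter means the same windowing code can be reused with plain LDA or any other model when comparing tracking results.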

MF-LDA model experiment and analysis

To extract hot topics of Micro-blog posts published in a certain period, we compute the Micro-blog post eigenvectors $\vec{\chi}_{au}$, $\vec{\chi}_{at}$, and $\vec{\chi}_{f}$ and input them into the MF-LDA model.
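
The abstract compares MF-LDA with LDA in terms of perplexity. This excerpt does not give the formula, so the definition below is the standard one from Blei et al. (2003); that the paper uses exactly this form is an assumption. Here $M$ is the number of held-out posts, $\mathbf{w}_d$ the words of post $d$, and $N_d$ its length:

$$
\mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
$$

A lower value means the model assigns higher probability to unseen posts, which is the sense in which a lower perplexity indicates a better topic model.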

Conclusion and future work

In this paper, we focus on extracting and tracking the hot topics of Micro-blog posts. We propose an improved topic extraction model (MF-LDA, Microblog Features Latent Dirichlet Allocation) to extract hot topics from micro-blog posts. The MF-LDA model improves the traditional LDA (Latent Dirichlet Allocation) model by combining five features: the number of praises, the number of comments, the number of forwards, the release time, and user authority. Some new features, such as Attention Value $a t ($

Acknowledgments

Project supported by the National Natural Science Foundation of China (Nos. 61472329, 61532009, and 61872298) and Sichuan Science and Technology Program (2018GZ0096).

References (37)

Bicalho, P., et al., 2017. A general framework to expand short text for topic modeling. Inform. Sci.
Guo, J., et al., 2012. Mining hot topics from Twitter streams. Procedia Comput. Sci.
Han, Y., et al., 2014. Predicting the topic influence trends in social media with multiple models. Neurocomputing.
Wan, J.H., et al., 2019. Information propagation model based on hybrid social factors of opportunity, trust and motivation. Neurocomputing.
Wang, X.Q., et al., 2014. Grey system theory based prediction for topic trend on internet. Eng. Appl. Artif. Intell.
Watanabe, S., et al., 2011. Topic tracking language model for speech recognition. Comput. Speech Lang.
Yeh, J.F., et al., 2016. Topic detection and tracking for conversational content by using conceptual dynamic latent Dirichlet allocation. Neurocomputing.
Allan, J., 2002. Topic Detection and Tracking: Event-Based Information Organization.
Anon., 2019. Feb. 41th Statistical report on China Internet development. [Online]. Available:...
Antsoftware, 2019....
Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res.
Cataldi, M., Caro, L.D., Schifanella, C., 2010. Emerging topic detection on twitter based on temporal and social...
Chen, K.Y., et al., 2007. Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Trans. Knowl. Data Eng.
Cigarrán, J., et al., 2016. A step forward for topic detection in Twitter. Expert Syst. Appl.
Deerwester, S., 1990. Indexing by latent semantic analysis. J. Assoc. Inf. Sci. Technol.
Griffiths, T., 2002. Gibbs Sampling in the Generative Model.
Heinrich, G., 2008. Parameter Estimation for Text Analysis.
Hofmann, T., 1999. Probabilistic latent semantic indexing. In: Proc. Int’l Conf. on Research and Development in...



^☆☆: No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103279.
