Extracting and tracking hot topics of micro-blogs based on improved Latent Dirichlet Allocation

https://doi.org/10.1016/j.engappai.2019.103279

Abstract

Micro-blogs have changed the way people live, study, and work. Every day, we want to know which public opinion events are happening and how they evolve. Extracting and tracking these topics correctly helps us better understand the latest public opinions and follow their evolution. To extract topics from Micro-blog posts accurately, we adopt five unique features of micro-blogs to drive the joint probability distributions of all words and topics, and improve LDA into our topic extraction model (named MF-LDA). To track the evolution trend of a topic, we propose a hot topic life cycle model (named HTLCM). We divide the HTLCM into five stages, namely, birth, growth, maturity, decline, and disappearance. The HTLCM determines whether a topic is a candidate hot topic and estimates the evolution stage of each hot topic. In addition, we propose a hot topic tracking (HTT) algorithm that integrates MF-LDA and HTLCM. First, the HTT assigns candidate hot topics, which are labeled by HTLCM, to the corresponding time window according to their release time. Second, to obtain the hot topics in each time window, we input the Micro-blog posts of each time window into MF-LDA in order. By analyzing changes in these hot topics, we track the changes in their contents. The experimental results show that MF-LDA has a lower perplexity and a higher coverage rate than LDA under the same conditions. We also determine the parameters of the transition regions of the proposed HTLCM. The MR and FR of the HTLCM are lower than 18%. The average P, R, and F of the HTT algorithm are 85.64%, 84.97%, and 85.66%, respectively. A practical application to the topic “Female driver beats male driver in Chengdu” shows that the HTLCM and the HTT algorithm are effective and of practical significance in extracting and tracking hot topics.

Introduction

With the rapid development of communication technologies and the popularization of smartphones, more and more people have begun to use the mobile Internet. In December 2017, the number of Internet users in China reached 731 million, of whom 695 million were mobile Internet users. This proportion increased from 90.1% (at the end of 2015) to 95.1% (Anon., 2019). The rapid development of the mobile Internet has in turn driven the rapid growth of social network platforms such as Sina Micro-blog.

Registered users of Sina Micro-blog share videos, images, and text messages of 140 words with other users. Micro-blog platforms carry hundreds of millions of data flows every day. These data cover all aspects of human life and contain abundant valuable information.

Micro-blog hot topics usually refer to sudden public events and important published information that resonate with the public and provoke intense discussion. In current Micro-blog posts, some texts are embedded between two “#” labels, such as “#9.3 anti-war victory parade#”. We define texts of this format as explicit topics. However, when people publish Micro-blog posts, they rarely take the initiative to add “#” labels to mark a topic that is widely discussed. We refer to such topics, which are hidden in Micro-blog posts, as implicit topics. Thus, it is difficult to extract these topics from Micro-blog posts manually.

Traditional technologies for extracting and tracking topics focus on long texts. Micro-blog texts, by contrast, are short and have messy formats. Applying these technologies to Micro-blog posts produces poor results because of the high sparsity of the data. Nowadays, a growing body of research is accelerating the development of technologies for extracting hot topics from emerging social platforms. To extract hot topics, term frequency–inverse document frequency (TF–IDF) techniques compute statistics of the words in a document (Li et al., 2018). However, these techniques do not take into account the semantic meaning of the documents. Some probabilistic topic models for extracting hot topics from long texts achieve favorable results (Zhou and Chen, 2014). However, these models are not suitable for extracting hot topics from short texts (such as Micro-blog, QQ, etc.).

On the other hand, once we find hot topics of interest on social networks, we always want to know whether they will evolve into public opinions. At present, many research works pay close attention to hot topic evolution, such as event-based information organization approaches (Allan, 2002) and grey system theory approaches (Wang et al., 2014b). However, it is difficult for these methods to track the evolution of hot topics in short texts (Wan et al., 2019).

In this paper, we focus on extracting hot topics from short Micro-blog posts and tracking their evolution on Micro-blog social networks.

Related works

Extracting and tracking topics (ETT) is an information technology that helps people cope with the growing amount of Internet information. This technology identifies new topics in the news media information flow and keeps track of unknown topics. ETT includes five specific subtasks (Allan, 2002), namely, story segmentation, topic tracking, topic detection, first-story detection, and link detection. The methods for solving these tasks mainly consider the probability distribution of the topic words in the

Micro-blog features

The LDA model performs excellently in extracting topics from long texts such as web pages and news. Micro-blog posts, however, are short texts. If the LDA model is directly used to extract topics from Micro-blog posts, it is limited by the sparse data of Micro-blog text and cannot achieve good performance. In addition, some features of Micro-blog text (such as praises, posting users, and forwarding numbers) are not available in traditional long texts. The LDA model

Micro-blog hot topic tracking

In this section, we divide the life cycle of a hot topic into five stages. Our main task is to build a life cycle model for each hot topic. We continuously revise the parameters by integrating the life cycle models of all hot topics, and propose a new algorithm named Hot Topic Tracking (HTT) by combining them with the MF-LDA model. This algorithm not only tracks hot topics but also pre-identifies new topics in new Micro-blog posts and determines whether these topics become hot topics.
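
As a rough illustration of the tracking flow described above, the following Python sketch buckets posts into time windows by release time and labels each window with one of the five life cycle stages. It is only a sketch under stated assumptions: the names `Stage`, `assign_to_windows`, and `estimate_stage`, the use of per-window post counts as the heat signal, and the `tolerance` threshold are all hypothetical stand-ins, not the HTLCM transition parameters or the HTT algorithm from the paper.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from enum import Enum


class Stage(Enum):
    """The five life cycle stages named in the text."""
    BIRTH = 1
    GROWTH = 2
    MATURITY = 3
    DECLINE = 4
    DISAPPEARANCE = 5


def assign_to_windows(posts, start, window):
    """Group (release_time, text) posts into consecutive time windows.

    Returns a dict mapping window index -> list of post texts.
    """
    windows = defaultdict(list)
    for released_at, text in posts:
        if released_at < start:
            continue  # posts released before the tracking period are ignored
        index = (released_at - start) // window  # timedelta // timedelta is an int
        windows[index].append(text)
    return dict(windows)


def estimate_stage(counts, i, tolerance=0.2):
    """Label window i from per-window post counts of one topic.

    Hypothetical rule for illustration only: the stage is decided by the
    relative change in post count against the previous window.
    """
    current = counts[i]
    previous = counts[i - 1] if i > 0 else 0
    if current == 0:
        return Stage.DISAPPEARANCE
    if previous == 0:
        return Stage.BIRTH
    change = (current - previous) / previous
    if change > tolerance:
        return Stage.GROWTH
    if change < -tolerance:
        return Stage.DECLINE
    return Stage.MATURITY
```

In a fuller pipeline, the posts of each window would be passed in order to a topic extractor such as MF-LDA; that step is omitted here because the model itself is not reproduced in this excerpt.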

MF-LDA model experiment and analysis

To extract hot topics of Micro-blog posts published in a certain period, we compute Micro-blog post eigenvectors χau, χat, and χf and input them into the MF-LDA model.
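
The comparison between MF-LDA and LDA is reported in terms of perplexity, where lower values indicate a better fit to the text. For reference, the standard definition for a topic model can be sketched as below. The function name `perplexity` and the matrix layout (`theta` as document-topic probabilities, `phi` as topic-word probabilities) are assumptions made for this illustration; the evaluation protocol actually used for MF-LDA is not shown in this excerpt.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of a topic model over a corpus.

    docs:  list of documents, each a list of integer word ids
    theta: theta[d][k] = p(topic k | document d)
    phi:   phi[k][w]   = p(word w | topic k)
    """
    log_likelihood = 0.0
    n_tokens = 0
    n_topics = len(phi)
    for d, words in enumerate(docs):
        for w in words:
            # Marginal word probability: mix the topics by the document's weights.
            p_w = sum(theta[d][k] * phi[k][w] for k in range(n_topics))
            log_likelihood += math.log(p_w)
            n_tokens += 1
    # exp of the negative average log-likelihood per token
    return math.exp(-log_likelihood / n_tokens)
```

A model that assigns uniform probability over a vocabulary of V words has perplexity V, so any useful topic model should score below the vocabulary size.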

Conclusion and future work

In this paper, we focus on extracting and tracking the hot topics of Micro-blog posts. We propose an improved topic extraction model (MF-LDA, Microblog Features Latent Dirichlet Allocation) to extract hot topics from micro-blog posts. The MF-LDA model improves the traditional LDA (Latent Dirichlet Allocation) model by incorporating five features: the number of praises, the number of comments, the number of forwards, the release time, and user authority. Some new features, such as Attention Value at(

Acknowledgments

Project supported by the National Natural Science Foundation of China (Nos. 61472329, 61532009, and 61872298) and Sichuan Science and Technology Program (2018GZ0096).

References (37)

  • Blei, D.M., et al., 2003. Latent Dirichlet allocation. J. Mach. Learn. Res.
  • Cataldi, M., Caro, L.D., Schifanella, C., 2010. Emerging topic detection on twitter based on temporal and social...
  • Chen, K.Y., et al., 2007. Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Trans. Knowl. Data Eng.
  • Cigarrán, J., et al., 2016. A step forward for topic detection in Twitter. Expert Syst. Appl.
  • Deerwester, S., 1990. Indexing by latent semantic analysis. J. Assoc. Inf. Sci. Technol.
  • Griffiths, T., 2002. Gibbs Sampling in the Generative Model
  • Heinrich, G., 2008. Parameter Estimation for Text Analysis
  • Hofmann, T., 1999. Probabilistic latent semantic indexing. In: Proc. Int’l Conf. on Research and Development in...

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.engappai.2019.103279.