ABSTRACT
As text data continues to grow quickly, it is increasingly important to develop intelligent systems that help people manage and make use of vast amounts of text data ("big text data"). As a new family of effective general approaches to text data retrieval and analysis, probabilistic topic models, notably Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and their many extensions, have been studied actively in the past decade and have found widespread applications. These topic models are powerful tools for extracting and analyzing latent topics contained in text data; they also provide a general and robust latent semantic representation of text data, thus improving many applications in information retrieval and text mining. Because they are general and robust, they can be applied to text data in any natural language and on any topic. This tutorial systematically reviews the major research progress in probabilistic topic models and discusses their applications in text retrieval and text mining. The tutorial provides (1) an in-depth explanation of the basic concepts, underlying principles, and the two basic topic models (i.e., PLSA and LDA) that have widespread applications, (2) an introduction to EM algorithms and Bayesian inference algorithms for topic models, (3) a hands-on exercise that lets tutorial attendees learn how to use the topic models implemented in the MeTA Open Source Toolkit and experiment with provided data sets, (4) a broad overview of the major representative topic models that extend PLSA or LDA, and (5) a discussion of major challenges and future research directions.
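As a rough illustration of the EM-based estimation mentioned in item (2), the following is a minimal sketch of the standard PLSA updates in plain Python. It is not taken from the tutorial or from the MeTA toolkit: the function names `plsa_em` and `normalize`, the random initialization, and the toy count matrix are all assumptions made for this example, and every document is assumed to contain at least one word.

```python
import math
import random


def normalize(values):
    """Scale non-negative values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def plsa_em(counts, num_topics, num_iters=50, seed=0):
    """Fit PLSA by EM on a document-by-word count matrix (list of lists).

    Hypothetical helper for illustration only. Returns
    (doc_topic, topic_word, log_likelihoods), where
    doc_topic[d][z] = p(z | d) and topic_word[z][w] = p(w | z).
    """
    rng = random.Random(seed)
    num_docs, vocab_size = len(counts), len(counts[0])
    # Random positive initialization, normalized into valid distributions.
    doc_topic = [normalize([rng.random() + 0.1 for _ in range(num_topics)])
                 for _ in range(num_docs)]
    topic_word = [normalize([rng.random() + 0.1 for _ in range(vocab_size)])
                  for _ in range(num_topics)]
    log_likelihoods = []

    for _ in range(num_iters):
        new_doc_topic = [[0.0] * num_topics for _ in range(num_docs)]
        new_topic_word = [[0.0] * vocab_size for _ in range(num_topics)]
        log_lik = 0.0
        for d in range(num_docs):
            for w in range(vocab_size):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: p(z | d, w) is proportional to p(z | d) * p(w | z).
                joint = [doc_topic[d][z] * topic_word[z][w]
                         for z in range(num_topics)]
                p_w_given_d = sum(joint)
                # Log-likelihood under the parameters used in this E-step.
                log_lik += n * math.log(p_w_given_d)
                for z in range(num_topics):
                    expected = n * joint[z] / p_w_given_d
                    new_doc_topic[d][z] += expected
                    new_topic_word[z][w] += expected
        # M-step: renormalize the expected counts into distributions.
        doc_topic = [normalize(row) for row in new_doc_topic]
        topic_word = [normalize(row) for row in new_topic_word]
        log_likelihoods.append(log_lik)
    return doc_topic, topic_word, log_likelihoods


# Toy data (made up): 4 documents over a 6-word vocabulary.
toy_counts = [
    [4, 3, 2, 0, 0, 0],
    [3, 4, 1, 0, 0, 0],
    [0, 0, 0, 3, 4, 2],
    [0, 0, 1, 2, 3, 4],
]
doc_topic, topic_word, log_likelihoods = plsa_em(toy_counts, num_topics=2)
```

Each row of `doc_topic` and `topic_word` is a probability distribution, and the recorded log-likelihood should not decrease from one iteration to the next, which is the usual sanity check for an EM implementation. Bayesian inference for LDA (for example, variational inference or Gibbs sampling) follows a different procedure and is not covered by this sketch.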
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (March 2003), 993–1022.
- Thomas Hofmann. 1999. Probabilistic Latent Semantic Indexing. In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99). ACM, New York, NY, USA, 50–57.
- Sean Massung, Chase Geigle, and ChengXiang Zhai. 2016. MeTA: A Unified Toolkit for Text Retrieval and Analysis. In Proceedings of ACL-2016 System Demonstrations. Association for Computational Linguistics, Berlin, Germany, 91–96. http://anthology.aclweb.org/P16-4016
Index Terms
- A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis