tutorial

A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis

Authors:
ChengXiang Zhai

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Chase Geigle

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information RetrievalJune 2018Pages 1395–1398https://doi.org/10.1145/3209978.3210189

Published:27 June 2018Publication History

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

Pages 1395–1398

ABSTRACT

As text data continues to grow quickly, it is increasingly important to develop intelligent systems to help people manage and make use of vast amounts of text data ("big text data''). As a new family of effective general approaches to text data retrieval and analysis, probabilistic topic models---notably Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocations (LDA), and their many extensions---have been studied actively in the past decade with widespread applications. These topic models are powerful tools for extracting and analyzing latent topics contained in text data; they also provide a general and robust latent semantic representation of text data, thus improving many applications in information retrieval and text mining. Since they are general and robust, they can be applied to text data in any natural language and about any topics. This tutorial systematically reviews the major research progress in probabilistic topic models and discuss their applications in text retrieval and text mining. The tutorial provides (1) an in-depth explanation of the basic concepts, underlying principles, and the two basic topic models (i.e., PLSA and LDA) that have widespread applications, (2) an introduction to EM algorithms and Bayesian inference algorithms for topic models, (3) a hands-on exercise to allow the tutorial attendants to learn how to use the topic models implemented in the MeTA Open Source Toolkit and experiment with provided data sets, (4) a broad overview of all the major representative topic models that extend PLSA or LDA, and (5) a discussion of major challenges and future research directions.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan . 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. Vol. 3 (March . 2003), 993--1022. Google Scholar
Thomas Hofmann . 1999. Probabilistic Latent Semantic Indexing. In Proceedings of the 22Nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99). ACM, New York, NY, USA, 50--57. Google ScholarDigital Library
Sean Massung, Chase Geigle, and ChengXiang Zhai . 2016. MeTA: A Unified Toolkit for Text Retrieval and Analysis Proceedings of ACL-2016 System Demonstrations. Association for Computational Linguistics, Berlin, Germany, 91--96. http://anthology.aclweb.org/P16--4016Google Scholar

Index Terms

A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Topic modeling
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Language models
  2. Information systems applications
    1. Data mining

Recommendations

Probabilistic Topic Models for Text Data Retrieval and Analysis
SIGIR '17: Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval

Text data include all kinds of natural language text such as web pages, news articles, scientific literature, emails, enterprise documents, and social media posts. As text data continues to grow quickly, it is increasingly important to develop ...
Read More
Topic evolution based on the probabilistic topic model: a review

Accurately representing the quantity and characteristics of users' interest in certain topics is an important problem facing topic evolution researchers, particularly as it applies to modern online environments. Search engines can provide information ...
Read More
Tutorial on probabilistic topic models
COMAD '11: Proceedings of the 17th International Conference on Management of Data

Over the last decade, probabilistic topic models have emerged as an extremely powerful and popular tool for analyzing large collections of unstructured data. While originally proposed for textual data, topic models have since been applied for various ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval
June 2018
1509 pages
ISBN:9781450356572
DOI:10.1145/3209978
General Chairs:
Kevyn Collins-Thompson
University of Michigan, United States
,
Qiaozhu Mei
University of Michigan, United States
,
Program Chairs:
Brian Davison
Lehigh University, United States
,
Yiqun Liu
Tsinghua University, China
,
Emine Yilmaz
University College London, United Kingdom
Copyright © 2018 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 June 2018
Check for updates
Author Tags
data mining
information retrieval
language models
probabilistic topic models
text mining
Qualifiers
- tutorial
Conference

Acceptance Rates
SIGIR '18 Paper Acceptance Rate86of409submissions,21%Overall Acceptance Rate792of3,983submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 339
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis

SIGIR '18: The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Probabilistic Topic Models for Text Data Retrieval and Analysis

Topic evolution based on the probabilistic topic model: a review

Tutorial on probabilistic topic models