ABSTRACT
The number of books available online is increasing, but user interfaces may not be taking full advantage of advances in machine learning techniques that could help users navigate, explore, discover and understand interesting and useful content in books. Using a group of ten students and over one thousand crowdsourced judgments, we conducted multiple user studies to evaluate topics and related passages in books, all learned by topic modeling. Using ten books, selected from humanities (e.g. Plato's Republic), social sciences (e.g. Marx's Capital) and sciences (e.g. Einstein's Relativity), and four different evaluation experiments, we show that users agree that the learned topics are coherent and important to the book, and related to the automatically generated passages. We show how crowdsourced evaluations are useful, and can complement more focused evaluations using students who have studied the texts. This work provides a framework for (1) learning topics and related passages in books, and (2) evaluating those learned topics and passages, and moves one step toward automatic annotation to support topic navigation of books.
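As a rough illustration of step (1), learning topics and then finding the passages most related to each topic, the sketch below fits a small topic model and ranks passages by their share of a topic. It rests on several assumptions: the abstract does not name the topic model or the inference method, so an LDA-style model with collapsed Gibbs sampling is used here; the passages are invented toy token lists rather than text from the ten books; and the names `fit_lda`, `top_words`, and `related_passages` are hypothetical helpers, not the authors' code.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit an LDA-style model to tokenized passages by collapsed Gibbs sampling.

    Returns per-passage topic counts and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens per (passage, topic)
    topic_word = [Counter() for _ in range(num_topics)]   # tokens per (topic, word)
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, k, n=5):
    """Most frequent words currently assigned to topic k."""
    return [w for w, c in topic_word[k].most_common(n) if c > 0]


def related_passages(doc_topic, k, alpha=0.1, n=2):
    """Indices of the n passages with the largest smoothed share of topic k."""
    num_topics = len(doc_topic[0])

    def share(d):
        counts = doc_topic[d]
        return (counts[k] + alpha) / (sum(counts) + num_topics * alpha)

    return sorted(range(len(doc_topic)), key=share, reverse=True)[:n]


# Invented toy passages (assumption: the book is already split into passages).
passages = [
    "justice city guardians justice soul".split(),
    "soul justice virtue city guardians".split(),
    "labour value commodity labour capital".split(),
    "capital value labour commodity exchange".split(),
]

doc_topic, topic_word = fit_lda(passages, num_topics=2)
for k in range(2):
    print(k, top_words(topic_word, k), related_passages(doc_topic, k))
```

Each topic is shown as its top words together with the indices of its best-matching passages. That pairing is the kind of output that student or crowdsourced judges could rate for coherence, importance to the book, and relatedness of the passages.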
Index Terms
- Learning topics and related passages in books