DOI: 10.1145/2505515.2505555
Research article

On handling textual errors in latent document modeling

Published: 27 October 2013

Abstract

As large-scale text data become available on the Web, textual errors in a corpus are often inevitable (e.g., when digitizing historical documents). Because statistical models such as the popular Latent Dirichlet Allocation (LDA) model rely on word frequency counts, such textual errors can significantly impair their accuracy. To address this issue, in this paper we propose two novel extensions to LDA, TE-LDA and TDE-LDA: (1) the TE-LDA model incorporates textual errors into the term generation process; and (2) the TDE-LDA model further extends TE-LDA by taking topic dependency into account, so as to exploit the semantic connections among consecutive words even when some of them are typos. On both real and synthetic data sets with varying degrees of errors, our TDE-LDA model outperforms: (1) the traditional LDA model by 16%-39% (real) and 20%-63% (synthetic); and (2) the state-of-the-art N-Grams model by 11%-27% (real) and 16%-54% (synthetic).
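
The following is a minimal, hypothetical Python sketch of one way to read the TE-LDA idea described above: each token is drawn either from a topic-word distribution, as in standard LDA, or, via a Bernoulli "error" switch, from a noise distribution representing textual errors. All names and hyperparameter values (error_rate, noise_dist, V, K, alpha, beta) are illustrative assumptions rather than the paper's notation, and TDE-LDA's topic dependency is not modeled here.

```python
import numpy as np

# Sketch (not the authors' actual model) of a TE-LDA-style generative process:
# standard LDA word generation plus a Bernoulli switch that occasionally emits
# a "textual error" token from a flat noise distribution over the vocabulary.

rng = np.random.default_rng(0)

V = 1000                  # vocabulary size (assumed)
K = 10                    # number of topics (assumed)
alpha, beta = 0.1, 0.01   # symmetric Dirichlet priors, as in standard LDA
error_rate = 0.05         # assumed probability that a token is a textual error

phi = rng.dirichlet([beta] * V, size=K)   # topic-word distributions
noise_dist = np.full(V, 1.0 / V)          # flat distribution for error tokens


def generate_document(n_words: int) -> list[int]:
    """Generate one document of word ids under the sketched process."""
    theta = rng.dirichlet([alpha] * K)        # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)            # topic assignment for this token
        is_error = rng.random() < error_rate  # error switch
        dist = noise_dist if is_error else phi[z]
        words.append(int(rng.choice(V, p=dist)))
    return words


print(generate_document(20))
```

In this reading, the error switch is what separates clean tokens from typos; the TDE-LDA extension described in the abstract would additionally link consecutive topic assignments to exploit local semantic context.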

    Published In

    CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
    October 2013
    2612 pages
    ISBN: 9781450322638
    DOI: 10.1145/2505515

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. textual errors
    2. topic dependency
    3. topic models

    Conference

    CIKM'13: 22nd ACM International Conference on Information and Knowledge Management
    October 27 - November 1, 2013
    San Francisco, California, USA

    Acceptance Rates

    CIKM '13 paper acceptance rate: 143 of 848 submissions, 17%
    Overall acceptance rate: 1,861 of 8,427 submissions, 22%
