DOI: 10.1145/1281192.1281256
Article

Knowledge discovery of multiple-topic document using parametric mixture model with dirichlet prior

Published: 12 August 2007

Abstract

Documents, such as those on Wikipedia and in folksonomies, are increasingly assigned multiple topics as metadata. It is therefore increasingly important to analyze the relationship between a document and the topics assigned to it. In this paper, we propose a novel probabilistic generative model of documents that carry multiple topics as metadata. By modeling the generation process of a multiple-topic document, we can extract properties specific to such documents. The proposed model extends an existing probabilistic generative model, the Parametric Mixture Model (PMM). PMM models multiple-topic documents by mixing the model parameters of each single topic. However, because PMM assigns the same mixture ratio to every topic, it cannot take into account the bias of each topic within a document. To address this problem, we propose a model that uses a Dirichlet distribution as the prior distribution of the mixture ratio. We adopt the Variational Bayes method to infer the bias of each topic within a document. We evaluate the proposed model and PMM on the MEDLINE corpus. The F-measure, precision, and recall results show that the proposed model is more effective than PMM at multiple-topic classification. Moreover, we show the potential of the proposed model to extract topics and document-specific keywords using information about the assigned topics.
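
The contrast the abstract draws between PMM's equal mixture ratio and a Dirichlet-distributed mixture ratio can be illustrated with a small sketch. This is not the authors' model or their Variational Bayes inference: the toy topic-word probabilities, the `alpha` values, and all function names below are assumptions made up for illustration, and only the forward (generative) direction is shown.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def mix_topics(topic_word_probs, weights):
    """Word distribution of a document: weighted average of its topics' word distributions."""
    vocab_size = len(topic_word_probs[0])
    return [
        sum(w * topic[v] for w, topic in zip(weights, topic_word_probs))
        for v in range(vocab_size)
    ]

def pmm_word_dist(topic_word_probs):
    """PMM-style mixing: every assigned topic gets the same ratio 1/K."""
    k = len(topic_word_probs)
    return mix_topics(topic_word_probs, [1.0 / k] * k)

def dirichlet_word_dist(topic_word_probs, alpha, rng):
    """Dirichlet-prior mixing: the mixture ratio is drawn per document."""
    weights = sample_dirichlet(alpha, rng)
    return weights, mix_topics(topic_word_probs, weights)

def generate_document(word_dist, length, rng):
    """Sample word indices independently from the document's word distribution."""
    return rng.choices(range(len(word_dist)), weights=word_dist, k=length)

# Hypothetical toy setup: two assigned topics over a four-word vocabulary.
rng = random.Random(0)
topics = [
    [0.70, 0.20, 0.05, 0.05],  # topic A favors words 0 and 1
    [0.05, 0.05, 0.20, 0.70],  # topic B favors words 2 and 3
]
uniform_dist = pmm_word_dist(topics)
weights, biased_dist = dirichlet_word_dist(topics, alpha=[2.0, 0.5], rng=rng)
doc = generate_document(biased_dist, length=20, rng=rng)
```

With equal ratios, both topics contribute the same share of every document. With a per-document draw from the Dirichlet prior, one topic can dominate, which is the "bias of each topic within a document" that the abstract says the proposed model infers.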

Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN: 9781595936097
DOI: 10.1145/1281192

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dirichlet distribution
  2. multiple topic
  3. probability model
  4. text clustering
  5. variational bayes method

Qualifiers

  • Article

Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

Cited By

  • (2018) BALSON: Bayesian Least Squares Optimization with Nonnegative L1-Norm Constraint. 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1-6. DOI: 10.1109/MLSP.2018.8517036. Online publication date: Sep-2018
  • (2015) Non-Negative Matrix Factorization with Auxiliary Information on Overlapping Groups. IEEE Transactions on Knowledge and Data Engineering 27:6, pp. 1615-1628. DOI: 10.1109/TKDE.2014.2373361. Online publication date: 28-Apr-2015
  • (2011) A statistical model for topically segmented documents. Proceedings of the 14th international conference on Discovery science, pp. 247-261. DOI: 10.5555/2050236.2050257. Online publication date: 5-Oct-2011
  • (2011) A Statistical Model for Topically Segmented Documents. Discovery Science, pp. 247-261. DOI: 10.1007/978-3-642-24477-3_21. Online publication date: 2011
  • (2009) Named entity recognition in query. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, pp. 267-274. DOI: 10.1145/1571941.1571989. Online publication date: 19-Jul-2009
  • (2009) Named entity mining from click-through data using weakly supervised latent dirichlet allocation. Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1365-1374. DOI: 10.1145/1557019.1557165. Online publication date: 28-Jun-2009
