MPTM: A Topic Model for Multi-Part Documents

Xie, Zhipeng; Jiang, Liyang; Ye, Tengju; He, Zhenying

doi:10.1007/978-3-319-18123-3_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9050)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Topic models have been successfully applied to uncover hidden probabilistic structures in collections of documents, where documents are treated as unstructured text. However, some documents, which we call multi-part documents, are composed of multiple named parts. To exploit the information carried by the document-part relationships during topic modeling, this paper adopts two assumptions: first, that all parts of a given document should have similar topic distributions; second, that the multiple versions of a given topic (corresponding to the multiple named parts) should have similar word distributions. Based on these two assumptions, we propose a novel topic model for multi-part documents, called the Multi-Part Topic Model (MPTM for short), and develop its construction and inference method using collapsed Gibbs sampling and maximum likelihood estimation. Experimental results on real datasets demonstrate that our approach not only significantly improves the quality of the discovered topics, but also boosts performance in information retrieval and document classification.
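
For readers unfamiliar with the inference technique named in the abstract, the following is a minimal sketch of collapsed Gibbs sampling for plain latent Dirichlet allocation. It is not the MPTM sampler: it ignores document parts and does not include the maximum likelihood step, neither of which is specified on this page. The function name `lda_collapsed_gibbs`, the hyperparameter defaults, and the toy corpus are all assumptions chosen for illustration.

```python
import random


def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for plain LDA (illustrative, not MPTM).

    docs: list of documents, each a list of integer word ids
          in the range [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # tokens per (doc, topic)
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # tokens per (topic, word)
    n_k = [0] * num_topics                                 # tokens per topic
    z = []                                                 # topic of every token

    # Assign every token to a random topic and record the counts.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1

                # Conditional for topic t given all other assignments is
                # proportional to (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]

                # Add the token back under the newly sampled topic.
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    return n_dk, n_kw


# Toy corpus (hypothetical): three short documents over a 4-word vocabulary.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 1, 0, 2]]
n_dk, n_kw = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=4)
```

The sampler integrates out the topic and word distributions and tracks only count tables, which is what makes it "collapsed". A model along the lines the abstract describes would presumably add part-specific structure on top of these counts, but the details are in the paper itself rather than in this sketch.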

Author information

Authors and Affiliations

School of Computer Science, Fudan University, Shanghai, China
Zhipeng Xie, Liyang Jiang, Tengju Ye & Zhenying He
Shanghai Key Laboratory of Data Science, Fudan University, Shanghai, China
Zhipeng Xie, Liyang Jiang, Tengju Ye & Zhenying He

Corresponding author

Correspondence to Zhipeng Xie.

Editor information

Editors and Affiliations

Universität München, München, Germany
Matthias Renz
University of Southern California, Los Angeles, USA
Cyrus Shahabi
University of Queensland, Brisbane, Australia
Xiaofang Zhou
Monash University, Clayton, Australia
Muhammad Aamir Cheema

About this paper

Cite this paper

Xie, Z., Jiang, L., Ye, T., He, Z. (2015). MPTM: A Topic Model for Multi-Part Documents. In: Renz, M., Shahabi, C., Zhou, X., Cheema, M. (eds) Database Systems for Advanced Applications. DASFAA 2015. Lecture Notes in Computer Science, vol 9050. Springer, Cham. https://doi.org/10.1007/978-3-319-18123-3_10

DOI: https://doi.org/10.1007/978-3-319-18123-3_10
Published: 09 April 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-18122-6
Online ISBN: 978-3-319-18123-3
eBook Packages: Computer Science, Computer Science (R0)
