Refine the Corpora Based on Document Manifold

  • Conference paper
Advanced Data Mining and Applications (ADMA 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8346)

Abstract

Tracking and using the overwhelming volume of news generated on the internet is challenging. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. In many real applications, however, the topics inferred by these models are of limited use, because a proportion of the documents belong to no topic at all. In this paper, we propose a technique to refine the document corpus before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, identify the documents that are only loosely related to the others, and exclude them from the corpus. Experiments show that topic models combined with our algorithm produce topics of significantly higher quality.
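
The refinement idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the graph construction or the rule used to decide which documents are "loosely related", so everything specific here is an assumption: the cosine-weighted k-nearest-neighbour graph, the looseness score (mean distance to the k nearest neighbours in the embedded space), and the names `knn_graph`, `laplacian_eigenmaps`, `refine_corpus`, `k`, `dim`, and `drop_fraction` are all hypothetical. It should be read as one plausible instance of Laplacian-eigenmap-based filtering, not as the authors' algorithm.

```python
import numpy as np

def knn_graph(X, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)              # a document is not its own neighbour
    n = X.shape[0]
    nbrs = np.argsort(-S, axis=1)[:, :k]      # k most similar documents per row
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(S[rows, cols], 0.0, None)
    return np.maximum(W, W.T)                 # keep an edge if either end chose it

def laplacian_eigenmaps(W, dim):
    """Embed graph nodes with the generalized eigenproblem L y = lambda D y."""
    d = np.maximum(W.sum(axis=1), 1e-12)      # degrees, guarded against zero
    s = 1.0 / np.sqrt(d)
    L_sym = np.eye(W.shape[0]) - (s[:, None] * W) * s[None, :]
    _, vecs = np.linalg.eigh(L_sym)           # eigenvalues in ascending order
    # Skip the first (trivial) eigenvector; rescale by D^{-1/2} to recover y.
    return vecs[:, 1:dim + 1] * s[:, None]

def refine_corpus(X, k=5, dim=2, drop_fraction=0.1):
    """Return indices of documents to keep, plus a looseness score per document.

    Assumption (not from the paper): a document is "loose" when its k nearest
    neighbours in the embedded space are far away on average.
    """
    Y = laplacian_eigenmaps(knn_graph(X, k), dim)
    dist = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    n_drop = int(round(drop_fraction * X.shape[0]))
    keep = np.sort(np.argsort(score)[: X.shape[0] - n_drop])
    return keep, score

# Usage on a synthetic document-term matrix (rows = documents).
rng = np.random.default_rng(0)
X_demo = rng.random((20, 30))
keep_demo, score_demo = refine_corpus(X_demo, k=3, dim=2, drop_fraction=0.25)
```

The rows of `X_demo[keep_demo]` would then be passed to whichever topic model is in use. The sketch assumes every document has at least one neighbour with positive similarity and that the graph is connected; a disconnected graph yields several near-zero eigenvalues and the embedding should then be handled per component.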

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yao, C., Wang, Y., Chen, G. (2013). Refine the Corpora Based on Document Manifold. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53914-5_27

  • DOI: https://doi.org/10.1007/978-3-642-53914-5_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-53913-8

  • Online ISBN: 978-3-642-53914-5

  • eBook Packages: Computer Science, Computer Science (R0)
