Abstract
Tracking and using the overwhelming volume of news generated on the internet is challenging. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. In many real applications, however, the topics inferred by these models are of limited use, because a proportion of the documents actually belongs to no topic. In this paper, we propose a technique to refine the document corpora before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, identify documents that are only loosely related to the others, and exclude them from the corpora. Experiments show that topic models combined with our algorithm produce topics of significantly higher quality.
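The abstract does not spell out how loosely related documents are scored, so the following is only a minimal sketch of the general idea, not the paper's algorithm. It assumes documents are rows of a non-negative term matrix, builds a cosine-weighted k-nearest-neighbour graph, embeds it with Laplacian eigenmaps, and drops the documents whose graph neighbours lie farthest away in the embedding. The function names, the looseness score, and the parameters `k`, `dim`, and `drop_frac` are all hypothetical choices made for illustration.

```python
import numpy as np

def laplacian_eigenmap(X, k=5, dim=2):
    """Embed the rows of X (documents as term vectors) with Laplacian eigenmaps."""
    n = X.shape[0]
    # Cosine similarity between L2-normalised document vectors.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # never pick a document as its own neighbour

    # Symmetric kNN graph weighted by cosine similarity (assumed graph choice).
    nbrs = np.argsort(-S, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, nbrs.ravel()] = S[rows, nbrs.ravel()]
    W = np.maximum(np.maximum(W, W.T), 0.0)

    # Solve (D - W) y = lambda * D y through the symmetric normalised Laplacian.
    d = np.maximum(W.sum(axis=1), 1e-12)
    inv_sqrt_d = 1.0 / np.sqrt(d)
    L_sym = np.eye(n) - inv_sqrt_d[:, None] * W * inv_sqrt_d[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Skip the first eigenvector, which is trivial when the graph is connected.
    Y = vecs[:, 1:dim + 1] * inv_sqrt_d[:, None]
    return Y, W

def looseness(Y, W):
    """Mean squared embedded distance from each document to its graph neighbours."""
    # Dense n x n x dim intermediate: fine for a sketch, not for large corpora.
    sq = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    A = W > 0
    return (sq * A).sum(axis=1) / np.maximum(A.sum(axis=1), 1)

def refine_corpus(X, k=5, dim=2, drop_frac=0.1):
    """Return a boolean keep-mask that drops the loosest drop_frac of documents."""
    Y, W = laplacian_eigenmap(X, k=k, dim=dim)
    score = looseness(Y, W)
    n_drop = int(round(drop_frac * len(score)))
    keep = np.ones(len(score), dtype=bool)
    keep[np.argsort(-score)[:n_drop]] = False
    return keep, score
```

In such a pipeline, the returned mask would be applied to the term matrix (for example `X[keep]`) before fitting the topic model; the fraction to drop is a tuning choice that the abstract does not specify.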
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yao, C., Wang, Y., Chen, G. (2013). Refine the Corpora Based on Document Manifold. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53914-5_27
DOI: https://doi.org/10.1007/978-3-642-53914-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53913-8
Online ISBN: 978-3-642-53914-5
eBook Packages: Computer Science, Computer Science (R0)