Refine the Corpora Based on Document Manifold

Yao, Chengwei; Wang, Yilin; Chen, Gencai

doi:10.1007/978-3-642-53914-5_27


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8346)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Tracking and making use of the overwhelming volume of news generated on the internet is challenging. One approach is to use topic models, such as pLSI, LDA, LPI, LapPLSI, and LTM, to discover news topics automatically. However, in many real applications the topics inferred by these models are of limited use, because a proportion of the documents belong to no topic at all. In this paper, we propose a new technique to refine the document corpora before topic modeling. Inspired by manifold theory, we use Laplacian eigenmaps to discover the submanifold structure of the document space, identify the documents that are only loosely related to the others, and exclude them from the corpora. Experiments show that topic models combined with our algorithm produce topics of significantly higher quality.
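
The refinement idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not state how "loose relations" are measured, so the sketch assumes a cosine-weighted k-nearest-neighbour graph and uses each document's weighted degree (the diagonal of the degree matrix D used by Laplacian eigenmaps) as a stand-in looseness score. The function names, the parameters `k`, `min_ratio`, and `dim`, and the toy term-count matrix are all hypothetical.

```python
import numpy as np

def knn_cosine_graph(X, k):
    """Symmetric k-nearest-neighbour graph weighted by cosine similarity (needs k < n)."""
    norms = np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Xn = X / norms
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)              # exclude self-matches
    n = S.shape[0]
    nbrs = np.argsort(-S, axis=1)[:, :k]      # k most similar documents per row
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    W = np.zeros((n, n))
    W[rows, cols] = np.clip(S[rows, cols], 0.0, None)
    return np.maximum(W, W.T)                 # keep an edge if either end chose it

def laplacian_eigenmap(W, dim):
    """Embed graph nodes by solving L y = lambda D y via the normalised Laplacian."""
    d = np.maximum(W.sum(axis=1), 1e-12)      # floor guards isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)        # eigenvalues in ascending order
    # Skip the first (trivial) eigenvector; this assumes a connected graph.
    return d_inv_sqrt[:, None] * eigvecs[:, 1:dim + 1]

def refine_corpus(X, k=2, min_ratio=0.5, dim=2):
    """Flag documents whose weighted degree is far below the median (assumed rule)."""
    W = knn_cosine_graph(X, k)
    degree = W.sum(axis=1)
    keep = degree >= min_ratio * np.median(degree)
    Y = laplacian_eigenmap(W, dim)
    return keep, degree, Y

# Toy term-count matrix: two tight topics plus one document that fits neither.
docs = np.array([
    [3, 2, 0, 0, 0, 0], [2, 3, 0, 0, 0, 0], [3, 3, 0, 0, 0, 0], [4, 3, 0, 0, 0, 0],
    [0, 0, 3, 2, 0, 0], [0, 0, 2, 3, 0, 0], [0, 0, 3, 3, 0, 0], [0, 0, 3, 4, 0, 0],
    [1, 0, 0, 1, 5, 5],
], dtype=float)

keep, degree, Y = refine_corpus(docs, k=2, min_ratio=0.5)
refined = docs[keep]                          # corpus passed on to the topic model
```

In this toy corpus the last document shares almost no vocabulary with either topic, so its edges carry little weight and it falls below the threshold. A topic model such as LDA would then be fitted on `refined` rather than on `docs`; the embedding `Y` is returned only so the low-dimensional structure can be inspected.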

Author information

Authors and Affiliations

College of Computer Science and Technology, Zhejiang University, China
Chengwei Yao & Gencai Chen
University of Nottingham, NG7 2RD, Nottingham, UK
Yilin Wang

Editor information

Editors and Affiliations

US Air Force Office of Scientific Research, 106-0032, Tokyo, Japan
Hiroshi Motoda
School of Computer Science and Technology, Zhejiang University, 310027, Hangzhou, China
Zhaohui Wu
Faculty of Engineering and Information Technology, University of Technology, Chippendale, 2008, Sydney, NSW, Australia
Longbing Cao
Department of Computing Science, University of Alberta, T6G 2E8, Edmonton, Canada
Osmar Zaiane
College of Computer Science and Technology, Zhejiang University, Hangzhou, China
Min Yao
School of Computer Science, Fudan University, 200433, Shanghai, China
Wei Wang

About this paper

Cite this paper

Yao, C., Wang, Y., Chen, G. (2013). Refine the Corpora Based on Document Manifold. In: Motoda, H., Wu, Z., Cao, L., Zaiane, O., Yao, M., Wang, W. (eds) Advanced Data Mining and Applications. ADMA 2013. Lecture Notes in Computer Science, vol 8346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-53914-5_27
DOI: https://doi.org/10.1007/978-3-642-53914-5_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-53913-8
Online ISBN: 978-3-642-53914-5
eBook Packages: Computer Science, Computer Science (R0)