Selecting the Right Features for Bipartite-Based Text Clustering

Qu, Chao; Li, Yong; Zhang, Jie; Hu, Tianming; Chen, Qian

doi:10.1007/978-3-540-88192-6_46

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5139)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

A document collection can be modeled as a bipartite graph with terms as vertices on one side and documents as vertices on the other. Partitioning such a graph yields a co-clustering of words and documents, in the hope that the topic of each cluster is captured by its top terms and documents. However, single terms alone are often not enough to capture the semantics of documents. To address this, we propose to employ hyperclique patterns of terms as additional features for document representation. We then use the F-score to select the top discriminative features from which the bipartite graph is constructed. Extensive experiments indicate that, compared to the standard bipartite formulation, our approach achieves better clustering performance at a smaller graph size.
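
The representation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes documents are plain sets of terms, uses the standard h-confidence definition of a hyperclique pattern (pattern support divided by the largest single-term support), enumerates only two-term patterns by brute force, and fills the bipartite matrix with binary weights. The function names, the toy documents, and both thresholds are hypothetical. The F-score selection step is omitted because the abstract does not say which F-score variant is used.

```python
from itertools import combinations


def support(itemset, docs):
    """Fraction of documents that contain every term in itemset."""
    return sum(1 for d in docs if itemset <= d) / len(docs)


def h_confidence(itemset, docs):
    """Pattern support divided by the largest single-term support."""
    max_single = max(support(frozenset([t]), docs) for t in itemset)
    return support(itemset, docs) / max_single if max_single > 0 else 0.0


def hyperclique_pairs(docs, min_support, min_hconf):
    """Brute-force search for two-term hyperclique patterns (illustration only)."""
    vocab = sorted(set().union(*docs))
    patterns = []
    for pair in combinations(vocab, 2):
        p = frozenset(pair)
        if support(p, docs) >= min_support and h_confidence(p, docs) >= min_hconf:
            patterns.append(p)
    return patterns


def bipartite_adjacency(docs, features):
    """Rows are features, columns are documents; an entry is 1 when the
    document contains every term of the feature (binary weights assumed)."""
    return [[1 if f <= d else 0 for d in docs] for f in features]


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    frozenset({"graph", "cut", "spectral"}),
    frozenset({"graph", "cut", "cluster"}),
    frozenset({"text", "cluster", "term"}),
    frozenset({"text", "term", "graph"}),
]

# Thresholds are arbitrary values chosen for the toy data.
patterns = hyperclique_pairs(docs, min_support=0.5, min_hconf=0.6)

# Feature side of the graph: single terms plus the mined patterns.
single_terms = [frozenset([t]) for t in sorted(set().union(*docs))]
features = single_terms + patterns
B = bipartite_adjacency(docs, features)
```

On this toy corpus the sketch yields two patterns, {cut, graph} and {term, text}, so the feature side of the graph has eight vertices (six single terms plus two patterns). A selection step would then keep only the highest-scoring rows of `B` before partitioning, which is how the graph can end up smaller than the single-term formulation.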

Author information

Authors and Affiliations

Dongguan University of Technology, China
Chao Qu, Yong Li, Jie Zhang, Tianming Hu & Qian Chen
East China Normal University, China
Tianming Hu

Editor information

Editors and Affiliations

School of Computer Science, Sichuan University, 610065, Chengdu, China
Changjie Tang
Department of Computer Science, The University of Western Ontario, Canada
Charles X. Ling
School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
Faculty of Science & Engineering, York University, 355 Lumbers Building, M3J 1P3, Toronto, Ontario, Canada
Nick J. Cercone
School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, 4072, Queensland, Australia
Xue Li

Cite this paper

Qu, C., Li, Y., Zhang, J., Hu, T., Chen, Q. (2008). Selecting the Right Features for Bipartite-Based Text Clustering. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_46

DOI: https://doi.org/10.1007/978-3-540-88192-6_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88191-9
Online ISBN: 978-3-540-88192-6
eBook Packages: Computer Science, Computer Science (R0)
