Selecting the Right Features for Bipartite-Based Text Clustering

  • Conference paper
Advanced Data Mining and Applications (ADMA 2008)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5139)

Abstract

A document collection can be described by a bipartite graph in which terms and documents form the two sets of vertices. Partitioning such a graph yields a co-clustering of words and documents, in the hope that the topic of each cluster is captured by its top terms and documents. However, single terms alone are often not enough to capture the semantics of documents. To address this, we propose using hyperclique patterns of terms as additional features for document representation. We then use the F-score to select the most discriminative features for constructing the bipartite graph. Extensive experiments indicate that, compared to the standard bipartite formulation, our approach achieves better clustering performance with a smaller graph.
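To make the notion of a hyperclique pattern of terms concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the usual h-confidence definition from the hyperclique literature: the support of the term set divided by the largest support of any single term in it. The function names, the toy documents, and the two thresholds are all hypothetical, and only term pairs are enumerated for brevity.

```python
from itertools import combinations


def support(itemset, docs):
    """Fraction of documents that contain every term in itemset."""
    items = set(itemset)
    return sum(1 for d in docs if items <= d) / len(docs)


def h_confidence(itemset, docs):
    """Support of the set divided by the largest single-term support."""
    max_single = max(support([t], docs) for t in itemset)
    if max_single == 0:
        return 0.0
    return support(itemset, docs) / max_single


def hyperclique_pairs(docs, min_support, min_hconf):
    """Term pairs that meet both thresholds (illustrative brute force)."""
    vocab = sorted(set().union(*docs))
    return [
        pair
        for pair in combinations(vocab, 2)
        if support(pair, docs) >= min_support
        and h_confidence(pair, docs) >= min_hconf
    ]


# Toy corpus: each document is a set of terms (hypothetical data).
docs = [
    {"data", "mining", "graph"},
    {"data", "mining"},
    {"graph", "cut"},
    {"data", "mining", "cut"},
]

patterns = hyperclique_pairs(docs, min_support=0.5, min_hconf=0.8)
print(patterns)
```

In this toy corpus only "data" and "mining" co-occur often enough to qualify. Under the approach described in the abstract, such a pattern would become one extra feature vertex on the term side of the bipartite graph, subject to the feature selection step.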

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Qu, C., Li, Y., Zhang, J., Hu, T., Chen, Q. (2008). Selecting the Right Features for Bipartite-Based Text Clustering. In: Tang, C., Ling, C.X., Zhou, X., Cercone, N.J., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2008. Lecture Notes in Computer Science, vol 5139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88192-6_46

  • DOI: https://doi.org/10.1007/978-3-540-88192-6_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88191-9

  • Online ISBN: 978-3-540-88192-6

  • eBook Packages: Computer Science, Computer Science (R0)
