Web Text Clustering with Dynamic Themes

  • Conference paper
Web Information Systems and Mining (WISM 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Abstract

Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. Document clustering follows two approaches: clustering on document metadata and clustering on document content. Most previous content-based approaches focused on document summaries (of single or multiple files) and on word-vector analysis, finding a small number of important keywords on which to cluster. In this study, we categorize hot commodities on the web and name the resulting categories, based on the web text (abstracts) describing these commodities and their access counts. First, we parse the Chinese web text of documents about hot commodities and apply a hierarchical agglomerative clustering algorithm, Ward's method, to group words into themes by their properties and to determine the number of themes. Second, we adopt the Cross-Collection Mixture Model, which has been applied in Temporal Text Mining, together with the access counts (the degree to which users identify with the words) to collect dynamic themes, and then gather stable words by probability distribution to serve as the vectors for document clustering. Third, we estimate the parameters with the Expectation Maximization (EM) algorithm. Finally, we apply K-means with the extracted dynamic themes as the features for document clustering. The study proposes a novel approach to document clustering, and a series of experiments shows that the algorithm is effective and can improve the accuracy of clustering results.
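
The final step, K-means over theme-based document features, can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `kmeans`, the toy matrix `doc_theme_weights`, and the first-k-points initialization are all assumptions chosen for illustration, and the upstream steps (Ward's method, the mixture model, EM) are assumed to have already produced one theme-weight vector per document.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, k, iterations=100):
    """Lloyd's k-means on equal-length numeric vectors.

    Centroids start at the first k points (a deterministic, hypothetical
    choice made for reproducibility; the paper's initialization is not
    stated in the abstract). Returns (labels, centroids).
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and len(points)")
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each document goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical per-document theme weights (each row sums to 1), three themes.
doc_theme_weights = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.85, 0.10, 0.05],
    [0.05, 0.90, 0.05],
    [0.80, 0.15, 0.05],
]
labels, centroids = kmeans(doc_theme_weights, k=2)
print(labels)  # → [0, 1, 0, 1, 0]
```

In this toy input, documents dominated by the first theme fall into one cluster and documents dominated by the second theme into the other. Clustering on a few theme weights rather than on raw word counts keeps the feature vectors short, which appears to be the motivation for extracting themes first.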

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hung, P.J., Hsu, P.Y., Cheng, M.S., Wen, C.H. (2011). Web Text Clustering with Dynamic Themes. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_16

  • DOI: https://doi.org/10.1007/978-3-642-23982-3_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23981-6

  • Online ISBN: 978-3-642-23982-3

  • eBook Packages: Computer Science, Computer Science (R0)