Web Text Clustering with Dynamic Themes

Hung, Ping Ju; Hsu, Ping Yu; Cheng, Ming Shien; Wen, Chih Hao

doi:10.1007/978-3-642-23982-3_16

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6988)

Included in the following conference series:

International Conference on Web Information Systems and Mining

Abstract

Data mining research has produced many techniques for filtering useful information out of large volumes of data, and document clustering is one of the most important. There are two approaches to document clustering: clustering on document metadata and clustering on document content. Most previous content-based approaches have focused on document summaries (of a single file or of multiple files) and on word-vector analysis, identifying a small number of important keywords on which to cluster. In this study, we categorize hot commodities on the web and then name the categories, according to the web text (abstracts) of these commodities and their access counts. First, we parse the Chinese web text of the hot-commodity documents and apply a hierarchical agglomerative clustering algorithm, Ward's method, to group words into themes by their properties and to decide the number of themes. Second, we adopt the Cross Collection Mixture Model, which has been applied in Temporal Text Mining, together with the access counts (the degree of user identification with the words) to collect dynamic themes, and then gather stable words by probability distribution to serve as the vectors for document clustering. Third, we estimate the parameters with the Expectation Maximization (EM) algorithm. Finally, we apply K-means with the extracted dynamic themes as the features for document clustering. This study proposes a novel approach to document clustering, and a series of experiments shows that the algorithm is effective and can improve the accuracy of clustering results.
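The theme-extraction and EM steps described above can be illustrated with a minimal sketch. This is not the authors' Cross Collection Mixture Model: it is a simplified single-collection theme mixture with a fixed background word distribution, it ignores access counts, and it takes the number of themes `k` as given rather than deriving it from Ward's method. The function name `em_theme_mixture` and the parameters `lambda_b`, `iters`, and `seed` are assumptions made for illustration only.

```python
import random
from collections import Counter


def em_theme_mixture(docs, k, lambda_b=0.5, iters=50, seed=0):
    """Fit k themes plus a fixed background model with EM (illustrative only).

    docs: list of token lists.
    Returns (theme_word, doc_theme), where theme_word[j][w] approximates
    p(w | theme j) and doc_theme[d][j] approximates p(theme j | document d).
    """
    rng = random.Random(seed)
    counts = [Counter(d) for d in docs]
    vocab = sorted({w for c in counts for w in c})
    total = sum(sum(c.values()) for c in counts)
    # Background model: corpus-wide relative word frequency, kept fixed.
    background = {w: sum(c[w] for c in counts) / total for w in vocab}

    # Random positive initialisation, normalised to valid distributions.
    theme_word = []
    for _ in range(k):
        raw = {w: rng.random() + 0.1 for w in vocab}
        z = sum(raw.values())
        theme_word.append({w: v / z for w, v in raw.items()})
    doc_theme = [[1.0 / k] * k for _ in docs]

    eps = 1e-9  # tiny smoothing so no probability collapses to exactly zero
    for _ in range(iters):
        new_tw = [dict.fromkeys(vocab, eps) for _ in range(k)]
        new_dt = [[eps] * k for _ in docs]
        for d, c in enumerate(counts):
            for w, n in c.items():
                # E-step: probability that this word was generated by the
                # themes rather than the background, then split by theme.
                mix = [doc_theme[d][j] * theme_word[j][w] for j in range(k)]
                mix_sum = sum(mix)
                p_theme = (1.0 - lambda_b) * mix_sum
                p_not_bg = p_theme / (lambda_b * background[w] + p_theme)
                for j in range(k):
                    r = n * p_not_bg * mix[j] / mix_sum
                    new_tw[j][w] += r
                    new_dt[d][j] += r
        # M-step: renormalise the expected counts into distributions.
        for j in range(k):
            z = sum(new_tw[j].values())
            theme_word[j] = {w: v / z for w, v in new_tw[j].items()}
        for d in range(len(docs)):
            z = sum(new_dt[d])
            doc_theme[d] = [v / z for v in new_dt[d]]
    return theme_word, doc_theme


if __name__ == "__main__":
    toy_docs = [
        ["phone", "screen", "battery", "phone"],
        ["phone", "battery", "camera"],
        ["shoes", "running", "sole"],
        ["shoes", "sole", "running", "shoes"],
    ]
    themes, weights = em_theme_mixture(toy_docs, k=2)
    for row in weights:
        print([round(v, 3) for v in row])
```

Under these assumptions, each row of `doc_theme` is a per-document vector of theme weights, which is the kind of feature vector that could then be passed to K-means in the final step.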

References

Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1 (1977)
Khan, S., Ahmad, A.: Cluster Centre Initialization Algorithm for K-Means Clustering. Pattern Recognition 25, 1293–1302 (2004)
Morinaga, S., Yamanishi, K.: Tracking dynamics of topic trends using a finite mixture model. In: Proceedings of the 2004 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 811–816 (2004)
Roy, S., Gevry, D., Pottenger, W.M.: Methodologies for trend detection in textual data mining. In: The Textmine 2002 Workshop, Second SIAM International Conference on Data Mining (2002)
Zhai, C.X., Mei, Q.Z.: Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 198–207 (2005)
Zhai, C.X., Velivelli, A., Yu, B.: A Cross-Collection Mixture Model for Comparative Text Mining. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, pp. 743–748 (2004)

Author information

Authors and Affiliations

Department of Business Administration, National Central University, No.300, Jhongda Rd., Jhongli City, Taoyuan County, 32001, Taiwan (R.O.C.)
Ping Ju Hung, Ping Yu Hsu, Ming Shien Cheng & Chih Hao Wen
Department of Industrial Engineering and Management, Ming Chi University of Technology, No.84, Gongzhuan Rd., Taishan Dist., New Taipei City, 24301, Taiwan (R.O.C.)
Ping Ju Hung, Ping Yu Hsu, Ming Shien Cheng & Chih Hao Wen

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Zhiguo Gong
School of Computer, Shanghai University, 200444, Shanghai, China
Xiangfeng Luo
College of Computer and Software, Taiyuan University of Technology, 030024, Taiyuan, China
Junjie Chen
School of Computer and Information Engineering, Shanghai University of Electric Power, 200090, Shanghai, China
Jingsheng Lei
Department of Business Administration, Caritas Institute of Higher Education, 18 Chui Ling Road, Tseung Kwan O, Hong Kong, China
Fu Lee Wang

About this paper

Cite this paper

Hung, P.J., Hsu, P.Y., Cheng, M.S., Wen, C.H. (2011). Web Text Clustering with Dynamic Themes. In: Gong, Z., Luo, X., Chen, J., Lei, J., Wang, F.L. (eds) Web Information Systems and Mining. WISM 2011. Lecture Notes in Computer Science, vol 6988. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23982-3_16

DOI: https://doi.org/10.1007/978-3-642-23982-3_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23981-6
Online ISBN: 978-3-642-23982-3
eBook Packages: Computer Science, Computer Science (R0)
