ABSTRACT
In the era of Web 2.0, the number of blogs has increased explosively. With the appearance of social network services, blogs have become places for sharing professional knowledge and for personal branding. Therefore, to understand topic trends or to analyze the content of blogs, time-sensitive topic extraction and topic change analysis are necessary. In previous studies, most topic extraction models extracted topic words independently from each time slice and then tried to combine them. However, these methods did not perform well in analyzing topic trends because the topics extracted from different time slices are independent of one another. To cope with this problem, we propose a term frequency smoothing method that weaves time slices together so that more closely related topics are extracted from each time slice and a better topic trend analysis is produced. To extract topics from the smoothed term frequencies, we adopt LDA, a generative topic model. An evaluation of the proposed method on IT blogs shows that it can effectively discover meaningful topic patterns and topic words.
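The abstract does not state the smoothing formula, so the following is only a minimal sketch of what weaving time slices could look like. It assumes a simple scheme in which each slice's term counts are augmented by a weighted share of the counts from its immediate neighbors. The function name `smooth_term_frequencies`, the `weight` parameter, and the sample terms are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def smooth_term_frequencies(slices, weight=0.5):
    """Blend each time slice's term counts with those of adjacent slices.

    slices: list of Counter objects, one per time slice, in time order.
    weight: assumed mixing factor applied to each neighboring slice
            (a hypothetical parameter, not the paper's).
    """
    smoothed = []
    for i, current in enumerate(slices):
        # Start from the slice's own raw counts.
        combined = {term: float(count) for term, count in current.items()}
        # Add a weighted share of the previous and next slice, if present.
        for j in (i - 1, i + 1):
            if 0 <= j < len(slices):
                for term, count in slices[j].items():
                    combined[term] = combined.get(term, 0.0) + weight * count
        smoothed.append(combined)
    return smoothed


# Illustrative term counts for three consecutive time slices (made-up data).
slices = [
    Counter({"android": 4, "tablet": 1}),
    Counter({"android": 2, "cloud": 3}),
    Counter({"cloud": 5}),
]
smoothed = smooth_term_frequencies(slices, weight=0.5)
```

Under this assumption, a term that is frequent in one slice also receives some weight in the adjacent slices, so the per-slice inputs to a topic model such as LDA share vocabulary mass instead of being fully independent.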
Index Terms
- Blog topic analysis using TF smoothing and LDA