ABSTRACT
Dimension reduction techniques for text documents can be used in the preprocessing phase of blog mining, but they are more effective when they account for the nature of blogs. In this paper we propose PostRank, a novel algorithm that uses a shallow approach to identify the theme of a blog, or its representative words, in order to reduce the blog's dimensionality. PostRank uses a graph-based syntactic representation of the weblog that takes some of its structural features into account. In the first step, it models the blog as a complete graph, treating the theme of the blog as a query submitted to a search engine such as Google and each post as a search result. It then ranks the posts using a Markov chain model, as PageRank does in Google, under the assumption that the top-ranked nodes contain the blog's best representative words. Next, it groups the posts according to their scores. Finally, it analyzes the first group with statistical methods (such as TF-IDF) to identify the blog's representative words. The other groups are candidates for containing the blog's theme after a change of theme occurs. When new posts arrive, we update the blog graph by setting the initial scores of the old nodes in the Markov chain to their final scores from the last run, and we continue the PostRank iterations until convergence. If half of the representative words have changed, we say that the theme of the weblog has changed.
We evaluated our method on the Persianblog dataset and obtained promising results. Each blog was assigned ten representative words by human annotators, and the output of PostRank was compared against these words and against the results of earlier related algorithms in this area.
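The ranking, incremental update, and theme-change test described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not specify edge weights, so the sketch assumes cosine similarity between post term vectors. It also assumes a damping factor of 0.85 (the usual PageRank default) and a mean-score initialization for newly arrived posts. The function names `rank_posts`, `warm_start`, and `theme_changed` are hypothetical.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_posts(posts, init=None, damping=0.85, tol=1e-8, max_iter=200):
    """PageRank-style power iteration over a complete graph of posts.

    posts: list of token lists, one per post.
    init:  optional starting scores (used for the incremental warm start).
    Edge weights are cosine similarities; this is an assumption, since the
    abstract does not say how edges are weighted.
    """
    n = len(posts)
    if n == 0:
        return []
    vecs = [Counter(p) for p in posts]
    w = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    out = [sum(row) for row in w]  # total outgoing weight per node
    scores = list(init) if init is not None else [1.0 / n] * n
    total = sum(scores)
    scores = [s / total for s in scores]  # keep scores a distribution
    for _ in range(max_iter):
        # Nodes with no outgoing weight spread their score uniformly.
        dangling = sum(scores[i] for i in range(n) if out[i] == 0)
        new = []
        for j in range(n):
            inflow = sum(scores[i] * w[i][j] / out[i]
                         for i in range(n) if out[i] > 0)
            new.append((1 - damping) / n + damping * (inflow + dangling / n))
        delta = sum(abs(x - y) for x, y in zip(new, scores))
        scores = new
        if delta < tol:
            break
    return scores


def warm_start(old_scores, n_new):
    """Reuse final scores of old posts; new posts get the mean old score
    (the initialization of new nodes is an assumption)."""
    mean = sum(old_scores) / len(old_scores)
    merged = list(old_scores) + [mean] * n_new
    total = sum(merged)
    return [s / total for s in merged]


def theme_changed(old_words, new_words):
    """True if at least half of the old representative words are gone."""
    dropped = set(old_words) - set(new_words)
    return len(dropped) >= len(set(old_words)) / 2


posts = [["football", "match", "goal"],
         ["football", "goal", "team"],
         ["recipe", "cake", "sugar"]]
scores = rank_posts(posts)
new_posts = posts + [["football", "team"]]
updated = rank_posts(new_posts, init=warm_start(scores, 1))
```

Because the damped iteration has a unique fixed point, the warm start changes only how many iterations are needed, not the final scores. The grouping and TF-IDF steps are omitted here because the abstract gives too little detail to sketch them faithfully.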
- PostRank: a new algorithm for incremental finding of Persian blog representative words