ABSTRACT
The analysis of query logs from blog search engines shows that news-related queries occupy a significant portion of the logs. This raises an interesting research question: can the blogosphere be used to identify important news stories? In this paper, we present novel approaches to identifying important news story headlines from the blogosphere for a given day. The proposed system consists of two components based on the language model framework: the query likelihood and the news headline prior. For the query likelihood, we propose several approaches to estimating the query language model and the news headline language model. We also suggest several criteria for evaluating the news headline prior, that is, the prior belief about the importance or newsworthiness of a news headline for a given day. Experimental results show that our system significantly outperforms a baseline system. Specifically, the proposed approach gives 2.62% and 10.19% further increases in MAP and P@5, respectively, over the best performing result of the TREC'09 Top Stories Identification Task.
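As a rough illustration of the two-component ranking described above, the following sketch scores each headline by combining a query likelihood term with a headline prior in log space. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the use of a day's blog posts as the query, the linear-interpolation smoothing, the weight `lam`, and the toy inputs are all hypothetical, since the abstract does not specify the actual estimators.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def log_query_likelihood(query_model, headline_tokens, background, lam=0.5):
    """Cross-entropy style score: sum over terms of P(w|Q) * log P(w|H).

    P(w|H) is smoothed by linear interpolation with a background model.
    lam is an assumed smoothing weight, not a value taken from the paper.
    """
    headline_model = unigram_model(headline_tokens)
    score = 0.0
    for term, p_q in query_model.items():
        p_h = (1 - lam) * headline_model.get(term, 0.0) \
            + lam * background.get(term, 1e-9)
        score += p_q * math.log(p_h)
    return score


def rank_headlines(day_posts, headlines, priors, background):
    """Rank headline ids by log P(Q|H) + log P(H) for one day.

    day_posts: list of token lists (assumed to stand in for the day's query).
    headlines: dict of headline id -> token list.
    priors:    dict of headline id -> assumed prior probability (> 0).
    """
    query_model = unigram_model([t for post in day_posts for t in post])
    scored = [
        (log_query_likelihood(query_model, tokens, background)
         + math.log(priors[hid]), hid)
        for hid, tokens in headlines.items()
    ]
    return [hid for _, hid in sorted(scored, reverse=True)]


# Toy usage with made-up data.
posts = [["election", "vote", "results"], ["election", "results"]]
heads = {"h1": ["election", "results"], "h2": ["football", "match"]}
all_tokens = [t for p in posts for t in p] + [t for h in heads.values() for t in h]
ranking = rank_headlines(posts, heads, {"h1": 0.5, "h2": 0.5},
                         unigram_model(all_tokens))
```

With a uniform prior, the ordering is driven entirely by the likelihood term; a non-uniform prior would shift headlines up or down according to whatever newsworthiness evidence it encodes.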
Index Terms
- Mining the blogosphere for top news stories identification