skip to main content
10.1145/1835449.1835516acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
research-article

Mining the blogosphere for top news stories identification

Authors Info & Claims
Published:19 July 2010Publication History

ABSTRACT

The analysis of query logs from blog search engines show that news-related queries occupy a significant portion of the logs. This raises a interesting research question on whether the blogosphere can be used to identify important news stories. In this paper, we present novel approaches to identify important news story headlines from the blogosphere for a given day. The proposed system consists of two components based on the language model framework, the query likelihood and the news headline prior. For the query likelihood, we propose several approaches to estimate the query language model and the news headline language model. We also suggest several criteria to evaluate the news headline prior that is the prior belief about the importance or newsworthiness of the news headline for a given day. Experimental results show that our system significantly outperforms a baseline system. Specifically, the proposed approach gives 2.62% and 10.19% further increases in MAP and P@5 over the best performing result of the TREC'09 Top Stories Identification Task.

References

  1. R. Agrawal, S. Gollapudi, A. Halverson, and S. Ieong. Diversifying search results. In Proceedings of WSDM 2009, pages 5--14. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. J. Allan, M. E. Connell, W. B. Croft, F.-F. Feng, D. Fisher, and X. Li. Inquery and trec-9. In Proceedings of TREC-9, pages 551--562, 2000.Google ScholarGoogle Scholar
  3. J. Allan, R. Papka, and V. Lavrenko. On-line new event detection and tracking. In Proceedings of SIGIR 1998, pages 37--45. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. J. Mach. Learn. Res., 3:993--1022, 2003. Google ScholarGoogle ScholarCross RefCross Ref
  5. T. Brants, F. Chen, and A. Farahat. A system for new event detection. In Proceedings of SIGIR 2003, pages 330--337. ACM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. J. Carbonell and J. Goldstein. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of SIGIR 1998, pages 335--336. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. C. C. Chen, Y.-T. Chen, Y. Sun, and M. C. Chen. Life cycle modeling of news events using aging theory. In Proceedings of ECML 2003, pages 47--59, 2003.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. K.-Y. Chen, L. Luesukprasert, and S.-c. T. Chou. Hot topic extraction based on timeline analysis and multidimensional sentence modeling. IEEE Trans. on Knowl. and Data Eng., 19(8):1016--1025, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. H. L. Chieu and Y. K. Lee. Query based event extraction along a timeline. In Proceedings of SIGIR 2004, pages 425--432. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Buttcher, and I. MacKinnon. Novelty and diversity in information retrieval evaluation. In Proceedings of SIGIR 2008, pages 659--666, New York, NY, USA, 2008. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the em algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1--38, 1977.Google ScholarGoogle ScholarCross RefCross Ref
  12. Q. He, K. Chang, and E.-P. Lim. Analyzing feature trajectories for event detection. In Proceedings of SIGIR 2007, pages 207--214. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. T. Hofmann. Probabilistic latent semantic indexing. In Proceedings of SIGIR 1999, pages 50--57. ACM, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Jones and F. Diaz. Temporal profiles of queries. ACM Trans. Inf. Syst., 25(3):14, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Kleinberg. Bursty and hierarchical structure in streams. In Proceedings of SIGKDD 2002, pages 91--101. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. P. Kolari, A. Java, and T. Finin. Characterizing the splogosphere. In Proceedings of 3rd Annl. Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th Word Wide Web Conf., 2006.Google ScholarGoogle Scholar
  17. G. Kumaran and J. Allan. Text classification and named entities for new event detection. In Proceedings of SIGIR 2004, pages 297--304. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR 2001, pages 111--119. ACM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Y. Lee, S.-H. Na, and J.-H. Lee. An improved feedback approach using relevant local posts for blog feed retrieval. In Proceeding of CIKM 2009, pages 1971--1974. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y. Lv and C. Zhai. Positional language models for information retrieval. In Proceedings of SIGIR 2009, pages 299--306. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Macdonald, I. Ounis, and I. Soboroff. Overview of the TREC-2009 Blog Track. In Proceedings of TREC 2009, 2010.Google ScholarGoogle Scholar
  22. G. Mishne and M. de Rijke. A study of blog search. In Proceedings of ECIR 2006, pages 289--301. Springer, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S.-H. Nam, S.-H. Na, Y. Lee, and J.-H. Lee. Diffpost: Filtering non-relevant content based on content difference between two consecutive blog posts. In Proceedings of ECIR 2009, pages 791--795. Springer-Verlag, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. Wang, M. Zhang, L. Ru, and S. Ma. Automatic online news topic ranking using media focus and user attention based on aging theory. In Proceeding of CIKM 2008, pages 1033--1042. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Y. Yang, T. Pierce, and J. Carbonell. A study of retrospective and on-line event detection. In Proceedings of SIGIR 1998, pages 28--36. ACM, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Y. Yang, J. Zhang, J. Carbonell, and C. Jin. Topic-conditioned novelty detection. In Proceedings of SIGKDD 2002, pages 688--693. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. C. Zhai and J. Lafferty. A study of smoothing methods for language models applied to information retrieval. ACM Trans. Inf. Syst., 22(2):179--214, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. K. Zhang, J. Zi, and L. G. Wu. New event detection based on indexing-tree and named entity. In Proceedings of SIGIR 2007, pages 215--222. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Y. Zhang, J. Callan, and T. Minka. Novelty and redundancy detection in adaptive filtering. In Proceedings of SIGIR 2002, pages 81--88. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Mining the blogosphere for top news stories identification

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGIR '10: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval
      July 2010
      944 pages
      ISBN:9781450301534
      DOI:10.1145/1835449

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 19 July 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGIR '10 Paper Acceptance Rate87of520submissions,17%Overall Acceptance Rate792of3,983submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader