Organization and Tagging of Blog and News Entries Based on Content Reuse

Journal of Signal Processing Systems

Abstract

As blogs (Weblogs) grow in popularity as dynamic platforms for disseminating and sharing information, so does the use of blogs that track and comment on real-world (political, news, entertainment) events. The success of the blog as a popular medium for information sharing, however, is also its weakest spot: there is little support for finding blog entries beyond keyword-based search. Consequently, there is a pressing need for navigational support that helps users relate entries across the large, diverse, and inherently distributed blogosphere. In this paper, we first note that the large degree of content overlap in the form of quotation/commentary pairs (as well as content borrowing across media outlets) can be leveraged to track topic development patterns within the blogosphere. Relying on this observation, we propose focus and flow analysis techniques that build on reuse detection to help place blog entries into logical organizations. We then show that these implicit or explicit quotations, together with focus analysis, can be leveraged to identify the contexts in which entries occur, resulting in more effective tagging. Finally, we propose CDIP (a collection-driven, yet individuality-preserving tagging system), which relies on relationships provided by quotation/reuse detection and semantic-focus analysis to tag blogs automatically in such a way that not only do related blogs share tags, but the individuality of each entry is also preserved for discriminating tag-based access.
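
To make the idea of quotation/reuse detection concrete, the sketch below flags pairs of entries that share a large fraction of their word sequences. It is not the reuse-detection method of this paper (the abstract does not specify it); it is a generic word-shingle containment measure, and the function names, the shingle length `k`, and the threshold are all illustrative assumptions.

```python
from itertools import combinations


def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(entry_a, entry_b, k=3):
    """Fraction of the smaller shingle set that also appears in the other.

    A high value suggests that one entry quotes or borrows from the other.
    """
    a, b = shingles(entry_a, k), shingles(entry_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def reuse_pairs(entries, threshold=0.3, k=3):
    """Return (id, id, score) for entry pairs whose overlap meets the threshold.

    `entries` maps a hypothetical entry id to its text; the threshold is an
    arbitrary illustrative value, not one taken from the paper.
    """
    pairs = []
    for (id_a, text_a), (id_b, text_b) in combinations(entries.items(), 2):
        score = containment(text_a, text_b, k)
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return pairs
```

Pairs found this way could serve as the links along which related entries are grouped or share tags; a system operating at the scale of the blogosphere would need an index rather than the all-pairs comparison used here.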

References

  1. All Experts (2008). All Experts homepage. http://www.allexperts.com/central/expert.htm.

  2. dmoz (2009). Open Directory Project. http://www.dmoz.org/.

  3. Sifry, D. (2009). David Sifry’s Blog. http://www.sifry.com/alerts/.

  4. Adar, E., & Adamic, L. A. (2005). Tracking information epidemics in blogspace. In Proceedings of the 2005 IEEE/WIC/ACM international conference on web intelligence.

  5. Allan, J., Wade, C., & Bolivar, A. (2003). Retrieval and novelty detection at the sentence level. In Proceedings of international ACM SIGIR conference.

  6. Brooks, C. H., & Montanez, N. (2006). Improved annotation of the blogosphere via autotagging and hierarchical clustering. In Proceedings of international conference on World Wide Web.

  7. Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of uncertainty in artificial intelligence.

  8. Kim, J. W., Candan, K. S., & Mehmet Dönderler, E. (2005). Topic segmentation of message hierarchies for indexing and navigation support. In Proceedings of international conference on World Wide Web.

  9. Kim, J. W., & Candan, K. S. (2006). CP/CV: Concept similarity mining without frequency information from domain describing taxonomies. In Proceedings of international conference on information and knowledge management.

  10. Kim, J. W., Candan, K. S., & Tatemura, J. (2007). CDIP: Collection-driven, yet individuality-preserving automated blog tagging. In Proceedings of the international conference on semantic computing.

  11. Kim, J. W., Candan, K. S., & Tatemura, J. (2009). Efficient overlap and content reuse detection in blogs and online news articles. In Proceedings of the 18th international conference on World Wide Web.

  12. Mei, Q., Liu, C., Su, H., & Zhai, C. (2006). A probabilistic approach to spatiotemporal theme pattern mining on weblogs. In Proceedings of international conference on World Wide Web.

  13. Metzler, D., Bernstein, Y., Croft, W. B., Moffat, A., & Zobel, J. (2005). Similarity measures for tracking information flow. In Proceedings of international conference on information and knowledge management.

  14. Mishne, G. (2006). AutoTag: A collaborative approach to automated tag assignment for weblog posts. In Proceedings of international conference on World Wide Web.

  15. Nakajima, S., Tatemura, J., Hino, Y., Hara, Y., & Tanaka, K. (2005). Discovering important bloggers based on a blog thread analysis. In Workshop on the Weblogging Ecosystem.

  16. Qi, Y., & Candan, K. S. (2006). CUTS: Curvature-based development pattern analysis and segmentation for blogs and other text streams. In Proceedings of international conference on hypertext and hypermedia series.

  17. Qin, T., Liu, T., Zhang, X., Chen, Z., & Ma, W. (2005). A study of relevance propagation for web search. In Proceedings of international ACM SIGIR conference.

  18. Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1), 17–30.

  19. Resnik, P. (1999). Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research (JAIR), 11, 95–130.

  20. Salton, G., Fox, E. A., & Wu, H. (1983). Extended boolean information retrieval. Communications of the ACM, 26(11), 1022–1036.

  21. Shakery, A., & Zhai, C. (2003). Relevance propagation for topic distillation UIUC TREC-2003 web track experiments. In Text retrieval conference.

  22. Song, R., Wen, J. R., Shi, S., Xin, G., Liu, T. Y., Qin, T., et al. (2004). Microsoft Research Asia at web track and terabyte track of TREC 2004. In Text retrieval conference.

  23. Tseng, B., Tatemura, J., & Wu, Y. (2005). Tomographic clustering to visualize blog communities as mountain views. In WWW’04 workshop on the weblogging ecosystem.

  24. Sanderson, M., & Zobel, J. (2005). Information retrieval system evaluation: Effort, sensitivity, and reliability. In Proceedings of international ACM SIGIR conference.

  25. Yang, Y., Zhang, J., Carbonell, J., & Jin, C. (2002). Topic-conditioned novelty detection. In Proceedings of international conference on knowledge discovery and data mining.

Author information

Corresponding author

Correspondence to K. Selçuk Candan.

Additional information

This work has been partially supported by the NSF Grant “MAISON: Middleware for Accessible Information Spaces on NSDL”. This is an extended version of a work originally published at the IEEE International Conference on Semantic Computing 2007 [10].

About this article

Cite this article

Kim, J.W., Candan, K.S. & Tatemura, J. Organization and Tagging of Blog and News Entries Based on Content Reuse. J Sign Process Syst Sign Image Video Technol 58, 407–421 (2010). https://doi.org/10.1007/s11265-009-0384-x

  • DOI: https://doi.org/10.1007/s11265-009-0384-x
