DOI: 10.1145/2448556.2448631
research-article

Blog topic analysis using TF smoothing and LDA

Published: 17 January 2013

ABSTRACT

In the era of Web 2.0, the number of blogs has increased explosively. With the appearance of social network services, blogs have become places for sharing professional knowledge and for personal branding. To understand topic trends or to analyze blog content, time-sensitive topic extraction and topic change analysis are therefore important and necessary. In previous studies, most topic extraction models extracted topic words independently from each time slice and then tried to combine them. However, these methods did not perform well in analyzing topic trends because the topics extracted from different time slices are independent of one another. To cope with this problem, we propose a term frequency smoothing method that weaves time slices together so that more closely related topics are extracted from each time slice and a better topic trend analysis is generated. To extract topics from the smoothed term frequencies, LDA, a generative topic model, is adopted. An evaluation of the proposed method on IT blogs shows that it can effectively discover meaningful topic patterns and topic words.
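
The abstract does not give the smoothing formula, so the following is only an illustrative sketch of the general idea of letting adjacent time slices share term frequencies. It assumes a simple neighbor-weighted average; the function name `smooth_term_frequencies`, the `weight` parameter, and the toy counts are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def smooth_term_frequencies(slices, weight=0.25):
    """Blend each time slice's term counts with its immediate neighbors.

    `slices` is a list of Counter objects, one per time slice, in time order.
    `weight` is an assumed parameter: the share given to each neighboring slice.
    This is NOT the paper's formula, only one plausible smoothing scheme.
    """
    smoothed = []
    for t, current in enumerate(slices):
        # Collect the previous and next slices when they exist.
        neighbors = [slices[i] for i in (t - 1, t + 1) if 0 <= i < len(slices)]
        own_weight = 1.0 - weight * len(neighbors)

        # The smoothed vocabulary includes terms seen only in a neighbor.
        vocab = set(current)
        for neighbor in neighbors:
            vocab |= set(neighbor)

        smoothed.append({
            term: own_weight * current[term]
            + weight * sum(neighbor[term] for neighbor in neighbors)
            for term in vocab
        })
    return smoothed


# Toy term counts for three consecutive time slices (made-up data).
slices = [
    Counter({"android": 4, "tablet": 2}),
    Counter({"android": 2, "cloud": 4}),
    Counter({"cloud": 6}),
]
smoothed = smooth_term_frequencies(slices)
```

In this sketch, a term such as "cloud" that is absent from the first slice still receives a small smoothed frequency there because it appears in the next slice, which is one way the slices could be linked. The smoothed frequencies for each slice could then be passed to any LDA implementation in place of the raw counts.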


      • Published in

        ICUIMC '13: Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication
        January 2013
        772 pages
        ISBN: 9781450319584
        DOI: 10.1145/2448556

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 251 of 941 submissions, 27%
