Mining the interests of Chinese microbloggers via keyword extraction

  • Research Article
  • Frontiers of Computer Science

Abstract

Microblogging provides a new platform for Web users to communicate and share information. Users express opinions and record their daily lives in microblogs, so the microblogs a user posts indicate that user’s interests to some extent. We aim to mine user interests via keyword extraction from microblogs. Traditional keyword extraction methods are usually designed for formal documents such as news articles or scientific papers. Messages posted by microblogging users, however, are usually noisy and full of new words, which poses a challenge for keyword extraction. In this paper, we combine a translation-based method with a frequency-based method for keyword extraction. In our experiments, we extract keywords for microblog users of Sina Weibo, the largest microblogging website in China. The results show that our method can identify users’ interests accurately and efficiently.
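The abstract does not say how the translation-based and frequency-based scores are combined, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a simple linear interpolation between a relative-frequency score and a translation score of the form sum over words w of P(k | w) * P(w | posts). The function names, the `weight` parameter, and the toy `trans_prob` table are all hypothetical; in practice the translation probabilities would have to be learned from data.

```python
from collections import Counter


def frequency_scores(tokens):
    """Relative frequency of each token in a user's (already segmented) posts."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def translation_scores(tokens, trans_prob):
    """Score each candidate keyword k as sum_w P(k | w) * P(w | posts).

    trans_prob maps a word w to a dict {keyword: P(keyword | w)}.
    This table is a hypothetical stand-in for a learned translation model.
    """
    word_prob = frequency_scores(tokens)
    scores = {}
    for word, p_word in word_prob.items():
        for keyword, p_trans in trans_prob.get(word, {}).items():
            scores[keyword] = scores.get(keyword, 0.0) + p_trans * p_word
    return scores


def combined_keywords(tokens, trans_prob, weight=0.5, top_n=3):
    """Rank candidates by a linear mix of both scores (an assumed combination)."""
    freq = frequency_scores(tokens)
    trans = translation_scores(tokens, trans_prob)
    candidates = set(freq) | set(trans)
    scored = {
        k: weight * trans.get(k, 0.0) + (1 - weight) * freq.get(k, 0.0)
        for k in candidates
    }
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda k: (-scored[k], k))[:top_n]


if __name__ == "__main__":
    # Toy data: the tokens and probabilities are invented for illustration.
    tokens = ["nba", "nba", "lakers", "movie"]
    trans_prob = {
        "nba": {"basketball": 0.8, "nba": 0.2},
        "lakers": {"basketball": 0.9, "lakers": 0.1},
        "movie": {"film": 0.5, "movie": 0.5},
    }
    print(combined_keywords(tokens, trans_prob))
```

In this toy example the keyword "basketball" never appears in the posts, yet it can rank first because two observed words translate to it. That is the kind of vocabulary gap a translation component is meant to bridge, while the frequency component keeps words the user actually wrote in the ranking.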

Author information

Corresponding author

Correspondence to Zhiyuan Liu.

Additional information

Zhiyuan Liu is a postdoctoral researcher at Tsinghua University. He received his bachelor's degree in 2006 and his PhD in 2011 from the Department of Computer Science and Technology, Tsinghua University. His research interests include Chinese language computing and social computing.

Xinxiong Chen is a PhD student in the Department of Computer Science and Technology, Tsinghua University. He received his bachelor's degree in 2011 from the Department of Computer Science and Technology, Tsinghua University. His research interests include Chinese language computing and social computing.

Maosong Sun is a professor in the Department of Computer Science and Technology, Tsinghua University. His research interests are Chinese language computing, information retrieval, and social computing. He has published about 140 papers in academic journals and conferences. He has served as a program committee member for numerous conferences, and many times as conference chair or program committee chair. He is the vice president of the Chinese Information Processing Society, a council member of the China Computer Federation, a member-at-large of the ACM China Council, a member of the Expert Committee of the National Language Resource Surveillance and Research Center, the Director of the Tsinghua University-National University of Singapore Joint Research Center on Next Generation Search Technologies, and the Editor-in-Chief of the Journal of Chinese Information Processing.

About this article

Cite this article

Liu, Z., Chen, X. & Sun, M. Mining the interests of Chinese microbloggers via keyword extraction. Front. Comput. Sci. 6, 76–87 (2012). https://doi.org/10.1007/s11704-011-1174-8

  • DOI: https://doi.org/10.1007/s11704-011-1174-8
