Modelling and predicting news popularity

  • Short Paper
  • Published in Pattern Analysis and Applications

Abstract

We explore the problem of learning to predict the popularity of an article in online news media. By “popular” we mean an article that was among the “most read” articles of a given day in the news outlet that published it. We show that this cannot be modelled simply as a binary classification task that separates popular from unpopular articles, since that would assume popularity is an absolute property. Instead, we propose to view popularity as the outcome of a competitive situation in which the popular articles are those that were the most appealing on that particular day. This leads to the notion of an “appeal” function, which we model as a linear function in the bag-of-words representation. The parameters of this linear function are learnt from a training set formed of pairs of documents: one that was popular, and one that appeared on the same page and date without becoming popular. To learn the appeal function we use Ranking Support Vector Machines, with data collected from six different outlets over a period of 1 year. We show that our method can predict which articles will become popular and can extract the keywords that most affect the appeal function. This also enables us to compare different outlets from the point of view of their readers’ preference patterns. Remarkably, this is achieved using very limited information, namely the textual content of the title and description of each article, the page and date of publication, and whether it became popular.
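
The pairwise setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain stochastic subgradient solver for an L2-regularised pairwise hinge loss in place of a Ranking SVM package, a whitespace tokeniser in place of the paper's preprocessing, and invented toy headlines. The names `bag_of_words`, `diff`, `appeal` and `train_pairwise` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Map text to a sparse term-count vector (lowercased whitespace tokens)."""
    return Counter(text.lower().split())


def diff(a, b):
    """Sparse difference a - b of two term-count vectors."""
    d = dict(a)
    for term, count in b.items():
        d[term] = d.get(term, 0) - count
    return d


def appeal(w, x):
    """Linear appeal score: inner product of weights w and term vector x."""
    return sum(w.get(term, 0.0) * value for term, value in x.items())


def train_pairwise(pairs, epochs=50, lr=0.1, reg=0.01):
    """Learn w so that appeal(popular) exceeds appeal(unpopular) by a margin.

    Takes stochastic subgradient steps on an L2-regularised hinge loss
    over difference vectors x_popular - x_unpopular (an assumed stand-in
    for a Ranking SVM solver).
    """
    w = {}
    diffs = [diff(bag_of_words(p), bag_of_words(u)) for p, u in pairs]
    for _ in range(epochs):
        for d in diffs:
            margin = appeal(w, d)
            # L2 regularisation: shrink every weight slightly.
            for term in w:
                w[term] *= 1.0 - lr * reg
            # Hinge loss is active only when the pair violates the margin.
            if margin < 1.0:
                for term, value in d.items():
                    w[term] = w.get(term, 0.0) + lr * value
    return w


# Hypothetical toy data: (popular, not popular) from the same page and date.
pairs = [
    ("royal wedding date announced", "council budget meeting postponed"),
    ("celebrity wedding photos leaked", "quarterly budget report published"),
    ("royal baby name revealed", "council transport committee report"),
]

w = train_pairwise(pairs)

# Keywords with the largest positive weight contribute most to appeal.
top_keywords = sorted(w, key=w.get, reverse=True)[:3]
print(top_keywords)
```

Because only the difference between the two articles of a pair enters the loss, the learnt weights express relative preference among articles competing on the same page and day, rather than an absolute popularity label.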

Acknowledgments

This research was supported by the PASCAL2 Network of Excellence and by the European FP7 project “Complacs” (FP7/2007-2013 under grant agreement no 270327).

Author information

Corresponding author

Correspondence to Elena Hensinger.

About this article

Cite this article

Hensinger, E., Flaounas, I. & Cristianini, N. Modelling and predicting news popularity. Pattern Anal Applic 16, 623–635 (2013). https://doi.org/10.1007/s10044-012-0314-6

  • DOI: https://doi.org/10.1007/s10044-012-0314-6
