The 10 million follower fallacy: audience size does not prove domain-influence on Twitter

  • Regular Paper
  • Knowledge and Information Systems

Abstract

With the advent of social networks and micro-blogging systems, the way people communicate and spread information has changed substantially. People of different backgrounds, ages, and education levels exchange information and opinions across many domains and topics, and can now interact directly with popular users and authoritative information sources that were usually unreachable before these environments existed. As a result, the mechanism of information propagation has changed deeply, and studying it is indispensable for understanding the evolution of information networks. To this end, in this paper we propose a novel model for examining how information spreads over a social network and how user relationships change with the domain of discussion. Taking Twitter as a case study, we analyze the multiple paths that information follows over the network, with the goal of understanding how the dynamics of information contagion vary as the topic of discussion changes. We then provide a method for estimating the influence among users by evaluating the nature of their relationship with respect to the topic of discussion they share. Finally, using a large sample of the Twitter network, we present experiments that illustrate our proposal and show the efficacy of the approach in modeling this information spread.

Notes

  1. http://www.twitter.com.

  2. http://www.tumblr.com.

  3. http://www.facebook.com.

  4. http://plus.google.com/.

  5. A popular-press article expressing the same high-level idea can be found at http://www.nytimes.com/external/readwriteweb/2010/03/19/19readwriteweb-the-million-follower-fallacy-audience-size-d-3203.html.

  6. www.facebook.com/.

  7. http://secondlife.com/.

  8. www.informatik.uni-trier.de/~ley/db/.

  9. Please note that, as pointed out in the literature, the inverse document frequency factor cannot be usefully applied in our work. It diminishes the weight of terms that occur very frequently in the corpus and increases the weight of terms that occur rarely. We believe this is not a suitable weighting scheme in our case, because terms that appear in most of the documents in the corpus are likely to be highly relevant for the domain. Please also note that, to exclude common function words (such as conjunctions and articles), we removed stop-words with common techniques and considered only nouns in our computation.

  10. An interesting observation: the highest-ranked \(n\)-grams are mostly uni-grams and simply reflect the distribution of the letters of the alphabet in the language of the document. In other words, the most frequent \(n\)-grams are mostly correlated with the language. Because of this, the most frequent \(n\)-grams for the considered topic profiles turned out to be very similar (they start to differ consistently only in the lowest part of the ranked \(n\)-gram list), so we excluded uni-grams from our analysis.

  11. The sampling rate of the Twitter account used is 10% of an average of 200 million per day. More information is available at http://apiwiki.Twitter.com.

  12. http://www.weibo.com.

  13. http://www.facebook.com.
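
Notes 9 and 10 describe two profile-building choices: weighting terms by frequency alone (no inverse document frequency factor) after stop-word removal, and ranking \(n\)-grams while excluding uni-grams. The following is only a minimal sketch of those two ideas, not the authors' implementation. Everything named here is an assumption made for illustration: the tiny `STOP_WORDS` set, the normalization by the most frequent term, the character-level reading of \(n\)-grams (suggested by the remark about letter distributions), the underscore padding, and the `min_n`/`max_n` bounds. The noun-only filter mentioned in note 9 is omitted because it would require a part-of-speech tagger.

```python
from collections import Counter

# Hypothetical stop-word list for illustration; the notes do not specify one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "on", "to", "is"}


def term_frequency_profile(documents):
    """Weight terms by corpus frequency only (no IDF factor), as in note 9.

    Weights are normalized by the most frequent term; that normalization
    is an assumption of this sketch.
    """
    counts = Counter()
    for doc in documents:
        tokens = [
            t for t in doc.lower().split()
            if t.isalpha() and t not in STOP_WORDS
        ]
        counts.update(tokens)
    top = max(counts.values(), default=1)
    return {term: count / top for term, count in counts.items()}


def char_ngram_ranking(text, min_n=2, max_n=5):
    """Rank character n-grams by frequency, skipping uni-grams (note 10).

    Tokens are padded with "_" to mark word boundaries; the padding and
    the n range are assumptions of this sketch.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(min_n, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # most_common() returns n-grams from most to least frequent.
    return [gram for gram, _ in counts.most_common()]
```

With frequency-only weighting, a term that appears in nearly every document of a topic keeps a high weight instead of being discounted, which matches the rationale given in note 9. Setting `min_n=2` is what drops the uni-grams that note 10 found to be language-dependent rather than topic-dependent.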

Author information

Corresponding author

Correspondence to Mario Cataldi.

About this article

Cite this article

Cataldi, M., Aufaure, MA. The 10 million follower fallacy: audience size does not prove domain-influence on Twitter. Knowl Inf Syst 44, 559–580 (2015). https://doi.org/10.1007/s10115-014-0773-8

  • DOI: https://doi.org/10.1007/s10115-014-0773-8
