
Sampling Content from Online Social Networks: Comparing Random vs. Expert Sampling of the Twitter Stream

Published: 04 June 2015

Abstract

Analysis of content streams gathered from social networking sites such as Twitter has several applications, ranging from content search and recommendation to news detection and business analytics. However, processing the large amounts of data generated on these sites in real time poses a difficult challenge. To cope with the data deluge, analytics companies and researchers are increasingly resorting to sampling. In this article, we investigate the crucial question of how to sample content streams generated by users in online social networks. The traditional method is to randomly sample all the data. For example, most studies using Twitter data today rely on the 1% and 10% randomly sampled streams of tweets that are provided by Twitter. Here, we analyze a different sampling methodology, one where content is gathered only from a relatively small sample (<1%) of the user population, namely, the expert users. Over the duration of a month, we gathered tweets from over 500,000 Twitter users who are identified as experts on a diverse set of topics, and compared the resulting expert-sampled tweets with the 1% randomly sampled tweets provided publicly by Twitter. We compared the sampled datasets along several dimensions, including the popularity, topical diversity, trustworthiness, and timeliness of the information contained within them, and the sentiment/opinion expressed on specific topics. Our analysis reveals several important differences in the data obtained through the two sampling methodologies, which have serious implications for applications such as topical search, trustworthy content recommendation, breaking news detection, and opinion mining.
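To make the contrast between the two sampling strategies concrete, here is a minimal Python sketch. It is an illustration only, not the authors' pipeline: the `Tweet` type, the function names, the toy stream, and the expert IDs are all hypothetical, and independent per-tweet coin flips are only an assumed model of how a "1% random sample" might be drawn.

```python
import random
from typing import Iterable, Iterator, NamedTuple, Set


class Tweet(NamedTuple):
    """Hypothetical minimal tweet record used only for this illustration."""
    user_id: int
    text: str


def random_sample(stream: Iterable[Tweet], rate: float, seed: int = 0) -> Iterator[Tweet]:
    """Keep each tweet independently with probability `rate` (0.01 for a 1% sample)."""
    rng = random.Random(seed)  # seeded so the toy run is reproducible
    for tweet in stream:
        if rng.random() < rate:
            yield tweet


def expert_sample(stream: Iterable[Tweet], expert_ids: Set[int]) -> Iterator[Tweet]:
    """Keep every tweet whose author is in the expert set, and nothing else."""
    for tweet in stream:
        if tweet.user_id in expert_ids:
            yield tweet


# Toy stream: 100,000 tweets spread evenly over 1,000 users (assumed data).
stream = [Tweet(user_id=i % 1000, text=f"tweet {i}") for i in range(100_000)]
experts = {3, 14, 159}  # assumed expert accounts (<1% of the user population)

random_tweets = list(random_sample(stream, rate=0.01))
expert_tweets = list(expert_sample(stream, experts))

print(len(random_tweets), "tweets in the random sample")
print(len(expert_tweets), "tweets in the expert sample")
```

The random sample draws a little content from nearly every user, whereas the expert sample contains all the content of a few selected users. That structural difference is what the comparison along popularity, diversity, trustworthiness, and timeliness examines.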


Published In

ACM Transactions on the Web, Volume 9, Issue 3
June 2015
187 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/2788341
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 04 June 2015
Accepted: 01 March 2015
Revised: 01 October 2014
Received: 01 February 2014
Published in TWEB Volume 9, Issue 3

Author Tags

  1. Sampling content streams
  2. Twitter
  3. Twitter Lists
  4. random sampling
  5. sampling from experts

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • Government of India (Ref. No.: ITRA/15 (58) /Mobile/DISARM/05)
  • DeITY
  • Indo-German Max Planck Centre for Computer Science (IMPECS)
  • Information Technology Research Academy (ITRA)
  • Postdoctoral fellowship from the Alexander von Humboldt Foundation
  • Fellowship from Tata Consultancy Services

Cited By

  • (2024) Assessing the Interplay Between Public Attention and Government Responsiveness With Digital Trace Data: Navigating Leadership and Followership in China’s COVID-19 Vaccination Campaign. Social Science Computer Review. DOI: 10.1177/08944393241258217. Online publication date: 3-Jun-2024
  • (2024) Comparing methods for creating a national random sample of twitter users. Social Network Analysis and Mining 14:1. DOI: 10.1007/s13278-024-01327-5. Online publication date: 14-Aug-2024
  • (2023) A Dynamic Drilling Sampling Method and Evaluation Model for Big Streaming Data. International Journal of Software Engineering and Knowledge Engineering 33:11n12 (1725-1748). DOI: 10.1142/S0218194023410036. Online publication date: 18-Oct-2023
  • (2023) Exploring the boundaries of open innovation: Evidence from social media mining. Technovation 119 (102447). DOI: 10.1016/j.technovation.2021.102447. Online publication date: Jan-2023
  • (2022) Hybrid Onion Layered System for the Analysis of Collective Subjectivity in Social Networks. IEEE Access 10 (115435-115468). DOI: 10.1109/ACCESS.2022.3217467. Online publication date: 2022
  • (2020) Mitigating the Impact of Data Sampling on Social Media Analysis and Mining. IEEE Transactions on Computational Social Systems 7:2 (546-555). DOI: 10.1109/TCSS.2020.2970602. Online publication date: Apr-2020
  • (2020) The Unified Framework of Media Diversity: A Systematic Literature Review. Digital Journalism 8:5 (605-642). DOI: 10.1080/21670811.2020.1764374. Online publication date: 28-May-2020
  • (2019) Social Data: Biases, Methodological Pitfalls, and Ethical Boundaries. Frontiers in Big Data 2. DOI: 10.3389/fdata.2019.00013. Online publication date: 11-Jul-2019
  • (2019) Representing the Twittersphere. International Journal of Information Management: The Journal for Information Professionals 48:C (175-184). DOI: 10.1016/j.ijinfomgt.2019.01.019. Online publication date: 1-Oct-2019
  • (2019) Fast crawling methods of exploring content distributed over large graphs. Knowledge and Information Systems 59:1 (67-92). DOI: 10.1007/s10115-018-1178-x. Online publication date: 1-Apr-2019
