DOI: 10.1145/3589334.3645503

Navigating the Post-API Dilemma

Published: 13 May 2024

Abstract

Recent decisions to discontinue access to social media APIs are having detrimental effects on Internet research and the field of computational social science as a whole. This lack of access to data has been dubbed the Post-API era of Internet research. Fortunately, popular search engines have the means to crawl, capture, and surface social media data on their Search Engine Results Pages (SERP) if provided the proper search query, and may provide a solution to this dilemma. In the present work, we ask: does SERP provide a complete and unbiased sample of social media data? Is SERP a viable alternative to direct API access? To answer these questions, we perform a comparative analysis between (Google) SERP results and nonsampled data from Reddit and Twitter/X. We find that SERP results are highly biased in favor of popular posts; against political, pornographic, and vulgar posts; are more positive in their sentiment; and have large topical gaps. Overall, we conclude that SERP is not a viable alternative to social media API access.
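
To make the popularity-bias question concrete, the following is a minimal sketch of one possible check. It is not the authors' method. It assumes each post carries a numeric popularity score, and it measures what share of a search-derived sample scores above the median of the complete data. The function name `popularity_bias` and the toy scores are hypothetical and chosen only for illustration.

```python
from statistics import median


def popularity_bias(population_scores, sample_scores):
    """Return the share of sampled posts scoring above the population median.

    A value near 0.5 suggests little popularity skew; a value near 1.0
    suggests the sample over-represents popular posts.
    (Illustrative helper, not taken from the paper.)
    """
    if not population_scores or not sample_scores:
        raise ValueError("both score lists must be non-empty")
    cutoff = median(population_scores)
    above = sum(1 for score in sample_scores if score > cutoff)
    return above / len(sample_scores)


# Toy data invented for illustration; these are not results from the paper.
all_scores = [1, 2, 2, 3, 5, 8, 40, 120, 900, 4000]
serp_scores = [40, 120, 900, 4000]

print(popularity_bias(all_scores, serp_scores))
```

A single ratio like this is only a coarse signal. A fuller comparison would look at the whole score distribution, and at content properties such as topic and sentiment, which the abstract also reports as skewed.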

Cited By

  • (2024) The Use of Natural Language Processing Methods in Reddit to Investigate Opioid Use: Scoping Review. JMIR Infodemiology 4 (e51156). DOI: 10.2196/51156. Online publication date: 13-Sep-2024
  • (2024) “I’m in the Bluesky Tonight”: Insights from a year worth of social data. PLOS ONE 19:11 (e0310330). DOI: 10.1371/journal.pone.0310330. Online publication date: 5-Nov-2024
  • (2024) Leveraging computational methods for nonprofit social media research: a systematic review and methodological framework. Journal of Chinese Governance 9:3 (303-327). DOI: 10.1080/23812346.2024.2365008. Online publication date: 17-Jun-2024

Published In

WWW '24: Proceedings of the ACM Web Conference 2024
May 2024
4826 pages
ISBN:9798400701719
DOI:10.1145/3589334
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. bias
  2. data access
  3. search
  4. social media

Qualifiers

  • Research-article

Conference

WWW '24
WWW '24: The ACM Web Conference 2024
May 13 - 17, 2024
Singapore, Singapore

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
