API-based social media collecting as a form of web archiving

International Journal on Digital Libraries

Abstract

Social media is increasingly a topic of study across a range of disciplines. Despite this popularity, current practices and open source tools for social media collecting do not adequately support today’s scholars or support building robust collections for future researchers. We are continuing to develop and improve Social Feed Manager (SFM), an open source application assisting scholars collecting data from Twitter’s API for their research. Based on our experience with SFM to date and the viewpoints of archivists and researchers, we are reconsidering assumptions about API-based social media collecting and identifying requirements to guide the application’s further development. We suggest that aligning social media collecting with web archiving practices and tools addresses many of the most pressing needs of current and future scholars conducting quality social media research. In this paper, we consider the basis for these new requirements, describe in depth an alignment between social media collecting and web archiving, outline a technical approach for effecting this alignment, and show how the technical approach has been implemented in SFM.

Notes

  1. Search of NSF awards on the term “social media” on February 4, 2016 returns 455 results. https://www.nsf.gov/awardsearch/simpleSearchResult?queryText=%22social+media%22&ActiveAwards=true.

  2. This development was supported by a grant (#LG-46-13-0257-13) from the Institute of Museum and Library Services to GWU Libraries from 2013 to 2014.

  3. We refer here to “wayback software”, a generic term for software that plays back WARC files, as distinguished from “The Wayback Machine”, an instance and implementation of wayback software hosted by the Internet Archive. Two examples of wayback software are the International Internet Preservation Consortium’s OpenWayback [13] and Ilya Kreymer’s pywb [14].

  4. https://web.archive.org/web/*/http://geocities.com.

  5. https://web.archive.org/web/*/http://blogger.com.

  6. https://web.archive.org/web/*/http://friendster.com.

  7. Although the URL “http://myspace.com” was captured from 1996 forward, MySpace was founded and launched at that URL in 2003. https://web.archive.org/web/20031004101518/http://myspace.com/.

  8. https://web.archive.org/web/*/https://www.flickr.com/.

  9. ArchiveSocial requires social media account owners to log in and grant ArchiveSocial permission to access their social media data. One of the authors of this paper worked on the adoption of ArchiveSocial at the State Archives of North Carolina.

  10. Noting also that, “The research, development, and technical experimentation necessary to advance the archiving tools on these fronts will not come from the majority of web archiving organizations with their fractional staff time commitments” [37].

  11. https://archive.org/.

  12. http://www.loc.gov/webarchiving/.

  13. Many of us remember Friendster, MySpace and other extinct social platforms. Though certainly more popular, even Twitter itself seems to be experiencing a stall in the growth of its user base [54].

  14. The GW Libraries are collaborating with Johns Hopkins University and Georgetown University in this grant work, entitled “Blogging and Microblogging: Preserving Non-Official Voices in China’s Anti-Corruption Campaign”.

  15. Another aspect of tweets is the metadata that accompanies them when harvested from the API. This metadata contains social network information, in that it contains references to (and/or retweets of) other accounts. In addition, tweets contain complete user profile information, which often changes over time. This metadata has research potential, which is why we have also saved it.

  16. https://dev.twitter.com/rest/reference/get/statuses/user_timeline.

  17. For Twitter, this is commonly referred to as “dehydration” and is useful because it allows exchanging datasets within the constraints of Twitter’s terms of service.

  18. https://www.elastic.co/.

  19. http://gwu-libraries.github.io/sfm-ui/about/roadmap.
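
The "dehydration" mentioned in note 17 can be illustrated with a minimal Python sketch. It assumes tweets are stored as line-oriented JSON (one tweet object per line, as returned by Twitter's API) and reduces each one to its identifier. The function name `dehydrate` and the sample records are hypothetical and are not taken from SFM.

```python
import json
from typing import Iterable, Iterator


def dehydrate(lines: Iterable[str]) -> Iterator[str]:
    """Yield only the tweet ID from each line of line-oriented tweet JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        # Twitter's API supplies "id_str", a string form of the numeric ID;
        # fall back to the numeric "id" field when it is absent.
        yield tweet.get("id_str") or str(tweet["id"])


# Hypothetical sample records standing in for harvested tweets.
sample = [
    '{"id": 1, "id_str": "1", "text": "first", "user": {"screen_name": "a"}}',
    "",
    '{"id": 2, "text": "second"}',
]
tweet_ids = list(dehydrate(sample))
print(tweet_ids)
```

Only the list of identifiers is shared. A recipient would then request the full tweets again from the API, so any tweets deleted in the meantime are not recovered.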

References

  1. GW Libraries: gwu-libraries/social-feed-manager (2012). https://github.com/gwu-libraries/social-feed-manager. Accessed 10 Feb 2016

  2. GW Libraries: Welcome to Social Feed Manager! (2015). http://social-feed-manager.readthedocs.org/en/latest/. Accessed 12 Feb 2016

  3. Chudnov, D., Kerchner, D., Sharma, A., Wrubel, L.: Technical challenges in developing software to collect Twitter data. Code4lib J. (2014). http://journal.code4lib.org/articles/10097. Accessed 10 Feb 2016

  4. Hayes, D., Lawless, J.L.: Women on the run: gender, media, and political campaigns in a polarized era. Cambridge University Press, Cambridge (2016). http://books.google.com/books/about/Women_on_the_Run.html?hl=&id=fXNNDAAAQBAJ. Accessed 10 Feb 2016

  5. GW Libraries: gwu-libraries/sfm-ui (2015). https://github.com/gwu-libraries/sfm-ui. Accessed 10 Feb 2016

  6. GW Libraries: Social Feed Manager (SFM) documentation (2015). http://sfm.readthedocs.org/en/latest/. Accessed 12 Feb 2016

  7. International Internet Preservation Consortium: About IIPC (2012). http://netpreserve.org/about-us. Accessed 10 Feb 2016

  8. International Internet Preservation Consortium: About archiving (2012). http://netpreserve.org/web-archiving/about-archiving. Accessed 10 Feb 2016

  9. Jack, P., Levitt, N.: Heritrix (2014). https://webarchive.jira.com/wiki/display/Heritrix. Accessed 10 Feb 2016

  10. Kreymer, I.: Webrecorder/webrecorder (2015). https://github.com/webrecorder/webrecorder. Accessed 11 Feb 2016

  11. Internet Archive: Internetarchive/warcprox (2012). https://github.com/internetarchive/warcprox. Accessed 11 Feb 2016

  12. International Internet Preservation Consortium: The WARC format (2015). http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.0/. Accessed 10 Feb 2016

  13. International Internet Preservation Consortium: iipc/openwayback (2013). https://github.com/iipc/openwayback. Accessed 11 Feb 2016

  14. Kreymer, I.: ikreymer/pywb (2013). https://github.com/ikreymer/pywb. Accessed 11 Feb 2016

  15. Thomson, S.D.: Preserving social media (2016). doi:10.7207/twr16-01. http://www.dpconline.org/component/docman/doc_download/1486-twr16-01. Accessed 10 Feb 2016

  16. Bercovici, J.: Who coined “Social Media”? Web pioneers compete for credit. Forbes. (2010). http://www.forbes.com/sites/jeffbercovici/2010/12/09/who-coined-social-media-web-pioneers-compete-for-credit/. Accessed 10 Feb 2016

  17. Espley, S., Carpentier, F., Pop, R., Medjkoune, L.: Collect, preserve, access: applying the governing principles of the national archives UK government web archive to social media content. Alexandria 25, 31–50 (2014). doi:10.7227/ALX.0019. http://openurl.ingenta.com/content/xref?genre=article&issn=0955-7490&volume=25&issue=1&spage=31. Accessed 10 Feb 2016

  18. Bragg, M., Eubank, K., Ricker, J.: Preserving Web 2.0. Presented at: Best practices exchange (2009) https://webarchive.jira.com/wiki/download/attachments/5734676/BPE_web2_partner+meeting.ppt?version=1&modificationDate=1257454424180. Accessed 10 Feb 2016

  19. Ricker, J.: A flickr of Hope: harvesting social networking sites with archive-it. Presented at: NDIIPP partners meeting (2010). http://digitalpreservation.ncdcr.gov/asgii/presentations/ndiipp2010.pdf. Accessed 10 Feb 2016

  20. Ricker, J.: Archiving social media sites in North Carolina. Presented at: Best practices exchange (2010). http://digitalpreservation.ncdcr.gov/asgii/presentations/bpe2010.pdf. Accessed 10 Feb 2016

  21. Trent, R., Kenney, K.: Social Media Archiving in State Government. Presented at: Tri-State archivists meeting (2013). http://digitalpreservation.ncdcr.gov/asgii/presentations/snca_2013_socialmedia.pdf. Accessed 10 Feb 2016

  22. McNealy, J.E.: The privacy implications of digital preservation: Social media archives and the social networks theory of privacy. Elon Univ. Law Rev. 3, 133–160 (2010). http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2027036. Accessed 10 Feb 2016

  23. Miao, T.A.: Access denied: how social media accounts fall outside the scope of intellectual property law and into the realm of the computer fraud and abuse act. Fordham Intell. Prop. Med. Ent. LJ 23, 1017 (2012). http://heinonlinebackup.com/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/frdipm23&section=32. Accessed 10 Feb 2016

  24. Moyer, M.W.: Twitter opens its cage. Sci. Am. 310, 16 (2014). http://www.ncbi.nlm.nih.gov/pubmed/25004563. Accessed 10 Feb 2016

  25. NDSA Content Working Group: Web Archiving Survey Report. National Digital Stewardship Alliance (2012). http://www.digitalpreservation.gov/ndsa/working_groups/documents/ndsa_web_archiving_survey_report_2012.pdf. Accessed 10 Feb 2016

  26. Bowers, K., Dolan-Mescal, A., Donovan, L., et al.: Occupy archives panel. Presented at: Annual Society of American Archivists Meeting (2013). http://archives2013.sched.org/event/14m52JH/session-303-occupy-archives. Accessed 10 Feb 2016

  27. King, L.: Emory digital scholars archive occupy wall street Tweets. Emory Rep. (2012). http://news.emory.edu/stories/2012/09/er_occupy_wall_street_tweets_archive/campus.html. Accessed 10 Feb 2016

  28. Del Signore, J.: Museums Archiving Occupy Wall Street: Historical Preservation Or “Taxpayer-Funded Hoarding”? Gothamist (2011). http://gothamist.com/2011/12/26/occupy_wall_street_the_museum_exhib.php. Accessed 10 Feb 2016

  29. Chitturi, K., Yang, S.: Real-time archiving of spontaneous events (Use-Case; Hurricane Sandy) and visualizing disaster phases appearing in Tweets. Presented at: Archive-it partner meeting at Best practices exchange. (2012). https://webarchive.jira.com/wiki/download/attachments/40075274/Real-%C2%AD%E2%80%90%26me%20Archiving%20of%20Spontaneous%20Events%20%28Use-%C2%AD%E2%80%90Case%20-%20Hurricane%20Sandy%29.pdf. Accessed 10 Feb 2016

  30. Gueguen, G.: Capturing the Zeitgeist. (2012). http://www.slideshare.net/guegueng/capturing-the-zeitgeist. Accessed 10 Feb 2016

  31. National Archives and Records Administration: Best practices for social media capture. National Archives and Records Administration (2013). http://www.archives.gov/records-mgmt/resources/socialmediacapture.pdf. Accessed 10 Feb 2016

  32. Trent, R.: Social media archive BETA is live! The G.S. 132 Files (2012). https://ncrecords.wordpress.com/2012/12/04/social-media-archive-beta-is-live/. Accessed 10 Feb 2016

  33. Emory Libraries: emory-libraries/Twap (2011). https://github.com/emory-libraries/Twap. Accessed 11 Feb 2016

  34. North Carolina State University Libraries: NCSU-Libraries/lentil (2013). https://github.com/NCSU-Libraries/lentil. Accessed 11 Feb 2016

  35. Thomson, S.D., Kilbride, W.: Preserving social media: the problem of access. New Rev. Inf. Netw. 20, 261–275 (2015). doi:10.1080/13614576.2015.1114842

  36. Pennock, M.: Web-archiving (2013). doi:10.7207/twr13-01

  37. Bailey, J., Grotke, A., Hanna, K., et al.: Web archiving in the United States: a 2013 survey. National Digital Stewardship Alliance (2014). http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_USWebArchivingSurvey_2013.pdf. Accessed 10 Feb 2016

  38. Boyd, D., Crawford, K.: Critical questions for big data. Inf. Commun. Soc. 15, 662–679 (2012). doi:10.1080/1369118X.2012.678878

  39. Bruns, A.: Faster than the speed of print: reconciling “big data” social media analysis and academic scholarship. First Monday (2013). doi:10.5210/fm.v18i10.4879. http://journals.uic.edu/ojs/index.php/fm/article/view/4879. Accessed 20 July 2016

  40. Tufekci, Z.: Big questions for social media big data: representativeness, validity and other methodological pitfalls. arXiv:1403.7400

  41. Hajtnik, T., Uglešić, K., Živkovič, A.: Acquisition and preservation of authentic information in a digital age. Publ. Relat. Rev. 41, 264–271 (2015). doi:10.1016/j.pubrev.2014.12.001. http://www.sciencedirect.com/science/article/pii/S0363811114001945. Accessed 10 Feb 2016

  42. Eltgroth, D.R.: Best evidence and the Wayback Machine: toward a workable authentication standard for archived Internet evidence. Fordham Law Rev. 78, 181 (2009). http://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/flr78&section=8. Accessed 10 Feb 2016

  43. AIIM: AIIM TR31-2004, Legal acceptance of records produced by information technology systems (2004). http://www.aiim.org/Resources/Standards/AIIM_TR_31. Accessed 20 July 2016

  44. State Archives of North Carolina: Guidelines for managing trustworthy digital public records (2000). http://archives.ncdcr.gov/Portals/3/PDF/guidelines/guidelines_for_digital_public_records.pdf. Accessed 10 Feb 2016

  45. Markham, A., Buchanan, E., AoIR Ethics Working Committee: Ethical decision-making and Internet research: Version 2.0. Association of Internet Researchers (2012). http://www.uwstout.edu/ethicscenter/upload/aoirethicsprintablecopy.pdf. Accessed 10 Feb 2016

  46. Leetaru, K.: Are research ethics obsolete in the era of big data? Forbes (2016). http://www.forbes.com/sites/kalevleetaru/2016/06/17/are-research-ethics-obsolete-in-the-era-of-big-data/. Accessed 20 July 2016

  47. Council for Big Data, Ethics, and Society (2016). http://bdes.datasociety.net/. Accessed 20 July 2016

  48. Summers, E.: Introducing documenting the now—documenting DocNow. Medium (2016). https://news.docnow.io/introducing-documenting-the-now-416874c07e0. Accessed 25 July 2016

  49. Townsend, L., Wallace, C.: Social media research: a guide to ethics. The University of Aberdeen. http://www.dotrural.ac.uk/socialmediaresearchethics.pdf. Accessed 10 Feb 2016

  50. Milligan, I., Webster, P.: The Web archive bibliography. Web archives for historians (2014). https://webarchivehistorians.org/the-web-archive-bibliography/. Accessed 22 July 2016

  51. Milligan, I.: Finding community in the Ruins of GeoCities: distantly reading a web archive. Bull. IEEE Tech. Commit. Dig. Lib. (2015). http://www.ieee-tcdl.org/Bulletin/v11n2/papers/milligan.pdf. Accessed 10 Feb 2016

  52. Milligan, I.: Lost in the infinite archive: the promise and pitfalls of web archives. Int. J. Hum. Arts Comput. 10, 78–94 (2016). doi:10.3366/ijhac.2016.0161

  53. Webster, P.: Why historians should care about web archiving. Webstory: Peter Webster’s blog. (2012). https://peterwebster.me/2012/10/08/why-historians-should-care-about-web-archiving/. Accessed 14 July 2016

  54. Statista: Twitter: number of monthly active users 2015. Statista (2016). http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/. Accessed 11 Feb 2016

  55. Summers, E.: URLs in Tweets Mentioning Ferguson: August 10–27, 2014 (2014). https://edsu.github.io/ferguson-urls/index.html. Accessed 10 Feb 2016

  56. Baumann, R.: Archiving Video from #Ferguson: on Archivy. Medium. (2015) https://medium.com/on-archivy/archiving-video-from-ferguson-504e95859756. Accessed 10 Feb 2016

  57. Milligan, I., Ruest, N., Lin, J.: The Gatekeepers vs. the Masses. Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries. (2016). doi:10.1145/2910896.2910913

  58. Consultative Committee for Space Data Systems: Reference model for an Open Archival Information System (OAIS). CCSDS Secretariat, Washington, DC (2012). http://public.ccsds.org/publications/archive/650x0m2.pdf. Accessed 10 Feb 2016

  59. Commission on Preservation and Access, Research Libraries Group, Task Force on Digital Archiving: Preserving digital information: report of the task force on archiving of digital information (1996). https://books.google.com/books?id=T9YmrgEACAAJ. Accessed 10 Feb 2016

  60. Provenance Working Group: PROV-overview (2013). https://www.w3.org/TR/prov-overview/. Accessed 6 June 2016

  61. Kerchner, D., Littman, J., Peterson, C. et al.: The Provenance of a Tweet (2016). https://scholarspace.library.gwu.edu/downloads/h128nd689. Accessed 20 July 2016

  62. Internet Archive: internetarchive/brozzler. GitHub. https://github.com/internetarchive/brozzler. Accessed 14 July 2016

  63. Littman, J.: Social media harvesting techniques. GW Libraries (2015). https://library.gwu.edu/scholarly-technology-group/posts/social-media-harvesting-techniques. Accessed 10 Feb 2016

  64. Foo, C.: chfoo/wpull (2013). https://github.com/chfoo/wpull. Accessed 10 Feb 2016

  65. Van de Sompel, H., Nelson, M., Sanderson, R.: HTTP framework for time-based access to resource states–Memento (2013). https://tools.ietf.org/rfc/rfc7089.txt. Accessed 10 Feb 2016

  66. Wrubel, L.: Announcing SFM Version 1.0. Social Feed Manager (2016). http://gwu-libraries.github.io/sfm-ui/posts/2016-06-20-releasing-1-0. Accessed 15 July 2016

  67. Twitter, Inc.: The Streaming APIs. https://dev.twitter.com/streaming/overview. Accessed 20 July 2016

  68. REST APIs. Twitter Developers. https://dev.twitter.com/rest/public. Accessed 20 July 2016

  69. Flickr: Flickr services (2005). https://www.flickr.com/services/api/. Accessed 20 July 2016

  70. Weibo Corporation: Weibo API (2012). http://open.weibo.com/wiki/API%E6%96%87%E6%A1%A3/en. Accessed 20 July 2016

  71. Summers, E.: edsu/twarc (2013). doi:10.5281/zenodo.17385. https://github.com/edsu/twarc. Accessed 10 Feb 2016

  72. Stüvel, S.A.: sybrenstuvel/flickrapi (2013). https://github.com/sybrenstuvel/flickrapi. Accessed 18 Oct 2016

  73. Internet Archive: internetarchive/warc. GitHub. https://github.com/internetarchive/warc. Accessed 15 July 2016

  74. Dolan, S.: stedolan/jq (2012). https://github.com/stedolan/jq. Accessed 18 Oct 2016

  75. Clarke, N.: JWAT-tools (2012). https://sbforge.org/display/JWAT/JWAT-Tools. Accessed 18 Oct 2016

  76. Internet Archive: internetarchive/warctools (2010). https://github.com/internetarchive/warctools. Accessed 18 Oct 2016

Acknowledgements

This work is supported by Grant #NARDI-14-50017-14 from the National Historical Publications and Records Commission.

Author information

Corresponding author

Correspondence to Justin Littman.

Additional information

Justin Littman is the lead author. All other authors contributed significantly to this work and participated in the writing of the paper. The authors are listed alphabetically. Laura Wrubel is the current principal investigator on the grant supporting this work; Daniel Chudnov held this role previously.

About this article

Cite this article

Littman, J., Chudnov, D., Kerchner, D. et al. API-based social media collecting as a form of web archiving. Int J Digit Libr 19, 21–38 (2018). https://doi.org/10.1007/s00799-016-0201-7

  • DOI: https://doi.org/10.1007/s00799-016-0201-7
