DOI: 10.1145/3589132.3625654
short-paper

Data and Resources Paper: A Multi-granularity Decade-Long Geo-Tagged Twitter Dataset for Spatial Computing

Published: 22 December 2023

Abstract

This paper presents a publicly accessible, large-scale geo-tagged Twitter dataset comprising 95.8 million tweets from 247 countries, spanning Jan. 2012 to Dec. 2021. To systematically extract this dataset from over 57.18 TB of raw tweets, we employed parallel computing on a 40-node cluster with 480 CPU cores. Unlike most existing Twitter datasets, ours includes tweet locations at four levels of granularity, user profile locations at two levels of granularity, and tweet text languages, enabling personalized queries. To make the dataset openly accessible, we designed an interactive online query system (https://sigspatial.yunhefeng.me) and provide free-to-use JSON APIs (https://github.com/ResponsibleAILab/unt-geotweet-api) for customized queries that retrieve tweet IDs in three modes: tweet coordinate, tweet text-based location, and user location. Users can then use https://github.com/ResponsibleAILab/unt-tweet-rehydration to download complete tweet information for the retrieved IDs. We demonstrate the practical utility of the dataset through two applications: human movement modeling and geo-aware Large Language Model (LLM) tuning. Together, the dataset, the query system, and the APIs support the research community and open avenues for multidisciplinary investigation.
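The extraction step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes the raw tweets are stored one JSON object per line in the classic Twitter v1.1 format (fields `coordinates`, `place`, `user.location`, `lang`), and it assumes a tweet counts as geo-tagged when it carries either an exact point or a place. The function names `extract_geo_record` and `iter_geo_records` are hypothetical.

```python
import json
from typing import Iterable, Iterator, Optional


def extract_geo_record(tweet: dict) -> Optional[dict]:
    """Return a compact location record, or None if the tweet is not geo-tagged.

    Assumption: "geo-tagged" means the tweet has a GeoJSON point in
    `coordinates` or a non-empty `place` object (Twitter v1.1 field names).
    """
    coords = tweet.get("coordinates") or {}
    place = tweet.get("place") or {}
    user = tweet.get("user") or {}

    # v1.1 stores exact locations as a GeoJSON Point: [longitude, latitude].
    point = coords.get("coordinates") if coords.get("type") == "Point" else None
    if point is None and not place:
        return None

    return {
        "id": tweet.get("id_str"),
        "lang": tweet.get("lang"),
        "lon": point[0] if point else None,
        "lat": point[1] if point else None,
        "place_type": place.get("place_type"),
        "place_name": place.get("full_name"),
        "country_code": place.get("country_code"),
        # Free-text profile location; kept verbatim, not geocoded here.
        "user_location": (user.get("location") or "").strip() or None,
    }


def iter_geo_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield location records from an iterable of JSON lines, skipping bad input."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines in a raw dump
        if not isinstance(tweet, dict):
            continue
        record = extract_geo_record(tweet)
        if record is not None:
            yield record
```

Because each line is processed independently, a filter of this shape can be applied to separate files by separate workers, which is the kind of workload that suits a multi-node cluster. Only the tweet ID needs to be retained for sharing, since full tweet content can be recovered later by rehydration.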


Cited By

  • (2025) Mapping AI ethics narratives: evidence from Twitter discourse between 2015 and 2022. Humanities and Social Sciences Communications 12:1. DOI: 10.1057/s41599-025-04469-9. Online publication date: 15-Feb-2025
  • (2024) Co-designing a knowledge management tool for educator communities of practice. Proceedings of the 2024 ACM Designing Interactive Systems Conference (1970-1990). DOI: 10.1145/3643834.3660682. Online publication date: 1-Jul-2024


Published In

SIGSPATIAL '23: Proceedings of the 31st ACM International Conference on Advances in Geographic Information Systems
November 2023
686 pages
ISBN: 9798400701689
DOI: 10.1145/3589132
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Twitter
  2. geo-tagged tweet
  3. geo-tagged dataset
  4. large language model
  5. LLM
  6. open dataset
  7. multi-granularity location
  8. Twitter location

Qualifiers

  • Short-paper

Conference

SIGSPATIAL '23

Acceptance Rates

Overall Acceptance Rate 257 of 1,238 submissions, 21%


