extended-abstract

GeoHashViz: interactive analytics for mapping spatiotemporal diffusion of Twitter hashtags

Authors: Kiumars Soltani, Aditya Parameswaran, Shaowen Wang

XSEDE '15: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure

Article No.: 37, Pages 1 - 2

https://doi.org/10.1145/2792745.2792782

Published: 26 July 2015

Abstract

Since its launch in 2006, Twitter has evolved into a multi-purpose social media platform on which hundreds of millions of users share their activities and ideas daily. The ability to capture fine-grained activity logs of users, combined with ever-increasing geographical information derived from GPS-enabled devices, has made Twitter data a valuable source for spatiotemporal analysis of human activities. One of Twitter's early innovations is the hashtag, a tagging mechanism that attaches additional information to a user's post. Since their emergence in late 2007, hashtags have been used extensively to express ideas, group tweets, and report events. Their increasing popularity and their simple, concise structure have inspired multiple recent studies to propose hashtags as a medium for assessing the diffusion of ideas in a virtual world. Studying the collective effort of users in making a hashtag go viral can shed light on the complex process of idea diffusion, which involves psychological, sociological, and geographical elements.

Although most previous research on idea diffusion in virtual worlds focuses purely on the users' social graph, recent studies have confirmed that the spatial relationships among users and regions also play a crucial role in adoption patterns [1]. This echoes the First Law of Geography, formulated by Waldo Tobler more than 40 years ago: "everything is related to everything else, but near things are more related than distant things". However, existing interactive visual analytics tools for hashtag diffusion (http://keyhole.co/, http://hashtracking.com/, https://tagboard.com/) lack in-depth spatial analysis capabilities and are therefore not well suited to studying diffusion patterns. This research aims to fill this gap by providing an interactive framework for visual analytics on the geographical diffusion of hashtags over time. Our framework, called GeoHashViz, provides both textual and visual analytics on the role of location in hashtag adoption and offers insights into how diffusion patterns differ among hashtags. GeoHashViz processes a large stream of incoming tweets using a Hadoop-based approach and calculates multiple measures that are used to generate visual analytics for the user. Furthermore, it integrates online maps with a live animation tool to visualize the spatial and temporal diffusion of hashtags at the same time.

Data Collection: We gather our data using the Twitter Streaming API (details in [3]). Since we are only interested in common hashtags with a certain level of popularity, we keep only the hashtags with more than 1,000 appearances. Our unit of spatial resolution is cities in the United States with a population larger than 60,000, which gives us 645 unique locations. These locations form our reference grid, and every geographical point is assigned to its nearest neighbor in the reference grid.
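The two filtering steps described here can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tweet tuple layout `(lat, lon, hashtags)`, the three-city `GRID`, and the function names are hypothetical, and the brute-force great-circle search stands in for whatever nearest-neighbor lookup the framework actually uses on its 645-city grid.

```python
import math
from collections import Counter

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearest_city(lat, lon, grid):
    """Return the name of the reference-grid city closest to (lat, lon)."""
    return min(grid, key=lambda name: haversine_km(lat, lon, *grid[name]))

def popular_hashtags(tweets, min_count=1000):
    """Keep hashtags that appear more than min_count times.

    Each tweet is assumed to be a (lat, lon, hashtags) tuple.
    """
    counts = Counter(tag for _, _, tags in tweets for tag in tags)
    return {tag for tag, n in counts.items() if n > min_count}

# Hypothetical three-city grid; the paper's grid has 645 cities.
GRID = {
    "Los Angeles": (34.05, -118.24),
    "Chicago": (41.88, -87.63),
    "New York": (40.71, -74.01),
}
```

With this grid, `nearest_city(33.9, -118.0, GRID)` snaps a point in the Los Angeles basin to the `"Los Angeles"` entry. A linear scan is adequate for a few hundred cities per lookup but becomes costly when applied to every tweet in a large stream.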

Analytics: To formulate the problem of spatiotemporal analysis of hashtag diffusion, we distinguish two main categories: hashtag-based and location-based analytics. Hashtag-based analytics focus on specific hashtags and their associated diffusion patterns. Location-based analytics, on the other hand, study the similarity and closeness of locations in terms of their hashtag adoption. To evaluate the usability of the framework, we identify five core analytical features that cover a wide range of research questions; the framework can easily be extended with more. The five visual analytical capabilities are listed in Table 1. Spread and focus points (locations with the highest occurrence of the hashtag [1]) give users a visual estimate of how the hashtag diffuses over time. We also provide four metrics that give the user a more concrete sense of the diffusion patterns: a) Entropy: measures the randomness of the hashtag's distribution [1]; b) KL-divergence: compares the geographical distribution of the hashtag in consecutive time windows; c) Spatial Dispersion: measures how scattered the hashtag is from its geographical midpoint; d) Count: plots the cumulative count of the hashtag over time.
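The four hashtag-based metrics can be made concrete with a short Python sketch. The abstract does not give formulas, so the choices below are assumptions: entropy is Shannon entropy in bits over per-location counts, KL-divergence is computed from the current window to the previous one with a small smoothing constant `eps`, and spatial dispersion is the count-weighted mean great-circle distance to the weighted geographic midpoint. All function names are hypothetical.

```python
import math
from itertools import accumulate

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    a = (math.sin((phi2 - phi1) / 2) ** 2
         + math.cos(phi1) * math.cos(phi2)
         * math.sin(math.radians(lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def entropy(counts):
    """Shannon entropy (bits) of a hashtag's counts per location."""
    total = sum(counts.values())
    probs = [n / total for n in counts.values() if n > 0]
    return -sum(p * math.log2(p) for p in probs)

def kl_divergence(prev_counts, curr_counts, eps=1e-9):
    """D(curr || prev) in bits between two consecutive time windows.

    eps smooths zero counts so the logarithm stays finite (an assumption).
    """
    locs = set(prev_counts) | set(curr_counts)
    p_total = sum(curr_counts.values()) + eps * len(locs)
    q_total = sum(prev_counts.values()) + eps * len(locs)
    result = 0.0
    for loc in locs:
        p = (curr_counts.get(loc, 0) + eps) / p_total
        q = (prev_counts.get(loc, 0) + eps) / q_total
        result += p * math.log2(p / q)
    return result

def midpoint(points):
    """Weighted geographic midpoint of [(lat, lon, weight), ...]."""
    x = y = z = 0.0
    for lat, lon, w in points:
        phi, lmb = math.radians(lat), math.radians(lon)
        x += w * math.cos(phi) * math.cos(lmb)
        y += w * math.cos(phi) * math.sin(lmb)
        z += w * math.sin(phi)
    return (math.degrees(math.atan2(z, math.hypot(x, y))),
            math.degrees(math.atan2(y, x)))

def spatial_dispersion(points):
    """Count-weighted mean distance (km) from the weighted midpoint."""
    mlat, mlon = midpoint(points)
    total_w = sum(w for _, _, w in points)
    return sum(w * haversine_km(lat, lon, mlat, mlon)
               for lat, lon, w in points) / total_w

def cumulative_counts(per_window_counts):
    """Running total of a hashtag's count across time windows."""
    return list(accumulate(per_window_counts))
```

A hashtag confined to one city has entropy 0 and dispersion 0, while one split evenly across two cities has entropy of exactly 1 bit, which matches the intuition that higher values indicate a more widely scattered hashtag.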

For location-based analytics we include two functions. Top-k hashtags calculates the most popular hashtags in a region and visualizes them as a word cloud. However, by looking only at counts, we may miss some locally significant hashtags because of their relatively low counts. To reduce the dominance of globally popular hashtags, we introduce another analytic that visualizes the top-k locally significant hashtags. This analytic uses a TF-IDF-like metric [5] to measure the local popularity of a hashtag in a specific region, thereby assigning a lower rank to hashtags that are popular in other places as well. In addition, we provide two metrics for comparing two regions in terms of hashtag adoption: a) Jaccard Similarity: compares the sets of hashtags used in the two regions, with a higher value assigned to more similar regions; b) Adoption Lag: measures how long it takes for a hashtag to travel between two regions, by averaging the time difference between the first appearances of hashtags in the two regions.
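The location-based analytics can be illustrated as follows. The exact TF-IDF-like formula is not stated in the abstract, so this sketch assumes the textbook form (count in the region times the log of the number of regions over the number of regions using the hashtag). Adoption lag is assumed to be the mean absolute difference in first-appearance times over hashtags seen in both regions. The function names and data layout are hypothetical.

```python
import math
from collections import Counter

def top_k(counts, k=3):
    """Most frequent hashtags in one region, by raw count."""
    return [tag for tag, _ in Counter(counts).most_common(k)]

def locally_significant(region_counts, region, k=3):
    """Rank a region's hashtags by an assumed TF-IDF-like score.

    region_counts maps region name -> {hashtag: count}.
    """
    n_regions = len(region_counts)
    scores = {}
    for tag, count in region_counts[region].items():
        # Number of regions in which this hashtag appears at all.
        df = sum(1 for c in region_counts.values() if c.get(tag, 0) > 0)
        scores[tag] = count * math.log(n_regions / df)
    return sorted(scores, key=scores.get, reverse=True)[:k]

def jaccard(tags_a, tags_b):
    """Jaccard similarity of two hashtag sets (0 = disjoint, 1 = equal)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def adoption_lag(first_seen_a, first_seen_b):
    """Mean absolute gap between first appearances of shared hashtags.

    Inputs map hashtag -> first-appearance timestamp; returns None when
    the two regions share no hashtag.
    """
    shared = set(first_seen_a) & set(first_seen_b)
    if not shared:
        return None
    return sum(abs(first_seen_a[t] - first_seen_b[t]) for t in shared) / len(shared)
```

For example, if `#oscars` appears in every region while `#dodgers` appears only in Los Angeles, `top_k` ranks `#oscars` first there by raw count, whereas `locally_significant` ranks `#dodgers` first because the globally popular tag receives a score of zero.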

Architecture: The GeoHashViz framework follows a two-layer architecture: an offline-processing module and an interactive module. The offline-processing module, implemented entirely in Apache Hadoop and invoked periodically, processes the raw data and pre-computes measures related to the spatiotemporal diffusion of hashtags. The interactive module, on the other hand, is invoked on demand in response to user requests. The two modules communicate through a distributed MongoDB database. This two-layer architecture keeps the final framework fast and interactive by reducing the data processing that the interactive module is required to do.

In the offline-processing module, significant hashtags are extracted and the points are laid on the geographical mesh (the reference grid defined above). Then two MapReduce jobs are executed: one pre-computes the measures for hashtag-based analytics and the other those for location-based analytics. All Hadoop experiments were conducted on the XSEDE Gordon Hadoop cluster. The data-intensive nature of our problem, which requires aggregating a large number of tweets by both hashtag and location, makes Hadoop an ideal choice for the offline-processing module. Using Hadoop, we distribute the tweets across multiple nodes and then take advantage of the MapReduce model to aggregate them by their associated location on the mesh and the hashtags they include. In the reduce step, having access to all the tweets for a given location/hashtag, we can generate the analytics for different timestamps. In addition, since the nodes on the Gordon Hadoop cluster have relatively high memory, we are able to store the geographical mesh in memory and quickly map the location of each user to the closest point on the mesh (using a kd-tree). The same technique is employed in the interactive module to find the set of mesh points that lie within the user-defined bounding box.
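The flow of this step can be mimicked in plain Python without Hadoop. The sketch below is not the authors' job code: it assumes tweets are dictionaries with `lat`, `lon`, `timestamp`, and `hashtags` fields, uses a hand-written two-dimensional kd-tree with squared Euclidean distance on raw latitude and longitude (a simplification of true geographic distance), and replaces Hadoop's shuffle phase with an in-memory dictionary. The one-hour window and all names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 3600  # hypothetical one-hour analysis window

def build_kdtree(points, depth=0):
    """Build a 2-d tree over (lat, lon, name) tuples as nested tuples."""
    if not points:
        return None
    axis = depth % 2
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (ordered[mid],
            build_kdtree(ordered[:mid], depth + 1),
            build_kdtree(ordered[mid + 1:], depth + 1))

def dist2(p, q):
    """Squared Euclidean distance on (lat, lon); an approximation."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def nearest(node, target, depth=0, best=None):
    """Return the mesh point closest to target."""
    if node is None:
        return best
    point, left, right = node
    if best is None or dist2(point, target) < dist2(best, target):
        best = point
    diff = target[depth % 2] - point[depth % 2]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # Search the far side only if the splitting plane is close enough.
    if diff ** 2 < dist2(best, target):
        best = nearest(far, target, depth + 1, best)
    return best

def map_tweet(tweet, tree):
    """Map step: emit ((hashtag, mesh location), time window) pairs."""
    location = nearest(tree, (tweet["lat"], tweet["lon"]))[2]
    window = tweet["timestamp"] // WINDOW_SECONDS
    for tag in tweet["hashtags"]:
        yield (tag, location), window

def reduce_counts(windows):
    """Reduce step: occurrences per time window for one key."""
    per_window = defaultdict(int)
    for w in windows:
        per_window[w] += 1
    return dict(sorted(per_window.items()))

def run_job(tweets, mesh):
    """Run map, an in-memory stand-in for shuffle, then reduce."""
    tree = build_kdtree(mesh)
    groups = defaultdict(list)
    for tweet in tweets:
        for key, window in map_tweet(tweet, tree):
            groups[key].append(window)
    return {key: reduce_counts(ws) for key, ws in groups.items()}
```

Keying on the `(hashtag, location)` pair means each reducer sees every window for one hashtag in one city, which is what allows per-timestamp analytics to be produced in a single pass over the grouped values.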

The interactive module includes a web application and a Java Servlet. The web application is integrated into the CyberGIS Gateway [2] to improve its usability and ease integration with other CyberGIS applications. Figure 1 shows a view of the application visualizing the top 20 hashtags in the southern California region in September 2014.

References

[1] K. Y. Kamath, J. Caverlee, K. Lee, and Z. Cheng. Spatio-temporal dynamics of online memes: A study of geo-tagged tweets. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 667--678, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.


[2] Y. Liu, A. Padmanabhan, and S. Wang. CyberGIS Gateway for enabling data-rich geospatial research and education. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1--3, Sept 2013.


[3] A. Padmanabhan, S. Wang, G. Cao, M. Hwang, Y. Zhao, Z. Zhang, and Y. Gao. FluMapper: An interactive CyberGIS environment for massive location-based social media data analysis. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, XSEDE '13, pages 33:1--33:2, New York, NY, USA, 2013. ACM.


[4] C. Sheng, Y. Zheng, W. Hsu, M. L. Lee, and X. Xie. Answering top-k similar region queries. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications - Volume Part I, DASFAA'10, pages 186--201, Berlin, Heidelberg, 2010. Springer-Verlag.


Cited By

R. Isokpehi, C. Johnson, A. Tucker, A. Gautam, T. Brooks, M. Johnson, T. Cozart, and D. Wathington. Integrating Datasets on Public Health and Clinical Aspects of Sickle Cell Disease for Effective Community-Based Research and Practice. Diseases, 8(4):39, 2020. Online publication date: 26-Oct-2020. https://doi.org/10.3390/diseases8040039

Index Terms

GeoHashViz: interactive analytics for mapping spatiotemporal diffusion of Twitter hashtags
1. Computer systems organization
  1. Architectures
    1. Distributed architectures
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles


Information & Contributors

Information

Published In

XSEDE '15: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure

July 2015

296 pages

ISBN:9781450337205

DOI:10.1145/2792745

General Chair:
Gregory D. Peterson
National Institute of Computational Sciences

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Extended-abstract

Funding Sources

National Science Foundation

Conference

XSEDE '15

Sponsor:

San Diego Super Computing Ctr
HPCWire
Omnibond
Indiana University
CASC
NICS
Intel
DDN
CORSA
ALLINEA
RENCI

XSEDE '15: Extreme Science Engineering Discovery Environment 2015 Conference

July 26 - 30, 2015

St. Louis, Missouri

Acceptance Rates

XSEDE '15 Paper Acceptance Rate 49 of 70 submissions, 70%;

Overall Acceptance Rate 129 of 190 submissions, 68%
