DOI: 10.1145/2792745.2792782
extended-abstract

GeoHashViz: interactive analytics for mapping spatiotemporal diffusion of Twitter hashtags

Published: 26 July 2015

Abstract

Since its birth in 2006, Twitter has evolved into a multi-purpose social media platform that attracts hundreds of millions of users who share their activities and ideas on a daily basis. The potential to capture fine-grained activity logs of users, combined with ever-increasing geographical information derived from GPS-enabled devices, has made Twitter data a valuable source for spatiotemporal analysis of human activities. One of Twitter's early innovations is the hashtag, a tagging mechanism that provides additional information about a user's post. Since their emergence in late 2007, hashtags have been used extensively to express ideas, group tweets, and report events among Twitter users. The increasing popularity of hashtags, together with their simple and concise structure, has inspired multiple recent studies to propose the hashtag as a medium for assessing the diffusion of ideas in a virtual world. Studying the collective effort of users in making a hashtag go viral can shed light on the complex process of idea diffusion, which involves psychological, sociological, and geographical elements.
Although most previous research on idea diffusion in the virtual world focuses purely on the users' social graph, recent studies have confirmed that the spatial relationships among users and regions also play a crucial role in adoption patterns [1]. This echoes the First Law of Geography, formulated by Waldo Tobler more than 40 years ago: "everything is related to everything else, but near things are more related than distant things". However, existing interactive visual analytics tools for hashtag diffusion (http://keyhole.co/, http://hashtracking.com/, https://tagboard.com/) lack in-depth spatial analysis capabilities and are therefore not well suited to studying diffusion patterns. This research aims to fill this gap by providing an interactive framework that offers visual analytics on the geographical diffusion of hashtags over time. Our framework, called GeoHashViz, provides both textual and visual analytics on the role of location in the adoption of hashtags and offers insights into how diffusion patterns differ among hashtags. GeoHashViz processes a large stream of incoming tweets using a Hadoop-based approach and calculates multiple measures that are used to generate visual analytics for the user. Furthermore, it integrates online maps with a live animation tool to visualize the spatial and temporal diffusion of hashtags at the same time.
Data Collection: We gather our data using the Twitter Streaming API (details in [3]). Since we are only interested in common hashtags that have reached a certain level of popularity, we keep only the hashtags with more than 1000 appearances. Our unit of spatial resolution is cities in the United States with a population larger than 60,000, which gives us 645 unique locations. These locations form our reference grid, and every geographical point is assigned to its nearest neighbor in the reference grid.
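The two data-preparation steps described above (popularity filtering and snapping points to the reference grid) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names, the three-city `GRID`, and the use of a brute-force haversine search are all assumptions made for the example, and the real grid has 645 cities.

```python
import math
from collections import Counter

MIN_APPEARANCES = 1000  # threshold stated in the text


def popular_hashtags(hashtags, min_count=MIN_APPEARANCES):
    """Return the set of hashtags that appear more than min_count times."""
    counts = Counter(hashtags)
    return {tag for tag, n in counts.items() if n > min_count}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def nearest_city(lat, lon, grid):
    """Assign a point to its nearest grid city by brute-force search.

    grid maps a city name to its (lat, lon) coordinates.
    """
    return min(grid, key=lambda name: haversine_km(lat, lon, *grid[name]))


# Hypothetical three-city grid, for illustration only.
GRID = {
    "Chicago": (41.88, -87.63),
    "Los Angeles": (34.05, -118.24),
    "New York": (40.71, -74.01),
}
```

A brute-force scan is linear in the grid size per tweet; it is used here only to keep the sketch short.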
Analytics: To formulate the problem of spatiotemporal analysis of hashtag diffusion, we distinguish two main categories: hashtag-based and location-based analytics. In hashtag-based analytics we focus on specific hashtags and their associated diffusion patterns. Location-based analytics, on the other hand, study the similarity and closeness of locations in terms of their hashtag adoption. To evaluate the usability of the framework, we identify five core analytical features that cover a wide range of research questions; the framework can be easily extended to include more. The five visual analytical capabilities are listed in Table 1. Spread and focus points (locations with the highest occurrence of the hashtag [1]) give users a visual estimate of how the hashtag diffuses over time. We also provide four metrics that give the user a more concrete sense of the diffusion patterns: a) Entropy: measures the randomness of the hashtag's distribution [1]; b) KL-divergence: compares the geographical distribution of the hashtag in consecutive time windows; c) Spatial Dispersion: measures how scattered the hashtag is around its geographical midpoint; d) Count: plots the cumulative count of the hashtag over time.
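The four hashtag-based metrics can be sketched in a few lines. The abstract does not give exact formulas, so the definitions below are assumptions: Shannon entropy in bits over per-location counts, KL-divergence with a small additive smoothing term `eps` for locations missing from one window, and spatial dispersion as the count-weighted mean great-circle distance from the weighted geographic midpoint. All function names are hypothetical.

```python
import math
from itertools import accumulate


def entropy(loc_counts):
    """Shannon entropy (bits) of a hashtag's distribution over locations."""
    total = sum(loc_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in loc_counts.values() if c > 0)


def kl_divergence(curr, prev, eps=1e-9):
    """KL(curr || prev) in bits between location distributions of two windows.

    eps is an assumed smoothing constant so unseen locations do not yield log(0).
    """
    locs = set(curr) | set(prev)
    p_total = sum(curr.values()) + eps * len(locs)
    q_total = sum(prev.values()) + eps * len(locs)
    kl = 0.0
    for loc in locs:
        p = (curr.get(loc, 0) + eps) / p_total
        q = (prev.get(loc, 0) + eps) / q_total
        kl += p * math.log2(p / q)
    return kl


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    a = math.sin((phi2 - phi1) / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def midpoint(points):
    """Weighted geographic midpoint of (lat, lon, weight) triples via unit vectors."""
    x = y = z = 0.0
    for lat, lon, w in points:
        la, lo = math.radians(lat), math.radians(lon)
        x += w * math.cos(la) * math.cos(lo)
        y += w * math.cos(la) * math.sin(lo)
        z += w * math.sin(la)
    return math.degrees(math.atan2(z, math.hypot(x, y))), math.degrees(math.atan2(y, x))


def spatial_dispersion(points):
    """Weighted mean distance (km) of occurrences from their geographic midpoint."""
    mlat, mlon = midpoint(points)
    total = sum(w for _, _, w in points)
    return sum(w * haversine_km(lat, lon, mlat, mlon) for lat, lon, w in points) / total


def cumulative_counts(window_counts):
    """Running total of per-window counts, ready to plot over time."""
    return list(accumulate(window_counts))
```

Under these definitions, a hashtag confined to one city has entropy 0 and dispersion 0, while a hashtag spread evenly over many cities scores high on both.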
For location-based analytics we include two functions. Top-k hashtags computes the most popular hashtags in a region and visualizes them as a word cloud. However, by looking only at counts, we may miss some locally significant hashtags because of their relatively low counts. To reduce the dominance of globally popular hashtags, we introduce another analytic that visualizes the top-k locally significant hashtags. This analytic uses a tf-idf-like metric [5] to measure the local popularity of a hashtag in a specific region, and hence assigns a lower rank to hashtags that are also popular elsewhere. In addition, we provide two metrics for comparing two regions in terms of hashtag adoption: a) Jaccard Similarity: compares the sets of hashtags used in two regions, with a higher value assigned to more similar regions; b) Adoption Lag: measures how long it takes a hashtag to travel between two regions, by averaging the time difference between the first appearances of hashtags in the two regions.
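The location-based analytics can be illustrated with a short sketch. The exact tf-idf-like weighting is not specified in the abstract, so the version below assumes the textbook form (term frequency within the region times the log of the inverse fraction of regions using the hashtag). The adoption lag is assumed to average absolute differences over hashtags seen in both regions. Function names and data layouts are hypothetical.

```python
import math
from collections import Counter


def top_k(region_counts, k):
    """The k most frequent hashtags in one region (hashtag -> count mapping)."""
    return [tag for tag, _ in Counter(region_counts).most_common(k)]


def local_significance(all_regions, region):
    """Assumed tf-idf-like score for each hashtag used in `region`.

    all_regions maps region name -> {hashtag: count}.
    tf  = share of the region's hashtag uses that belong to this hashtag
    idf = log(number of regions / number of regions using the hashtag)
    """
    counts = all_regions[region]
    total = sum(counts.values())
    n_regions = len(all_regions)
    scores = {}
    for tag, c in counts.items():
        df = sum(1 for r in all_regions.values() if r.get(tag, 0) > 0)
        scores[tag] = (c / total) * math.log(n_regions / df)
    return scores


def top_k_local(all_regions, region, k):
    """The k hashtags with the highest local significance in `region`."""
    scores = local_significance(all_regions, region)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def jaccard(tags_a, tags_b):
    """Jaccard similarity of two hashtag sets (0.0 by convention if both are empty)."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0


def adoption_lag(first_seen_a, first_seen_b):
    """Mean absolute gap between first appearances of hashtags shared by two regions.

    Inputs map hashtag -> first-seen time (any numeric unit). Returns None
    when the regions share no hashtags.
    """
    shared = first_seen_a.keys() & first_seen_b.keys()
    if not shared:
        return None
    return sum(abs(first_seen_a[t] - first_seen_b[t]) for t in shared) / len(shared)
```

With this weighting, a hashtag used in every region gets an idf of zero and drops to the bottom of the local ranking, which matches the stated goal of demoting globally popular hashtags.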
Architecture: The GeoHashViz framework follows a two-layer architecture: an offline-processing module and an interactive module. The offline-processing module, implemented entirely in Apache Hadoop and run periodically, processes the raw data and pre-computes measures related to the spatiotemporal diffusion of hashtags. The interactive module, on the other hand, is invoked on demand in response to user requests. The two modules communicate through a distributed MongoDB database. This two-layer architecture keeps the final framework fast and interactive by reducing the amount of data processing the interactive module has to do.
In the offline-processing module, significant hashtags are extracted and the points are placed on the geographical mesh (the reference grid defined above). Two MapReduce jobs are then executed: one pre-computes measures for hashtag-based analytics and the other for location-based analytics. All Hadoop experiments were conducted on the XSEDE Gordon Hadoop cluster. The data-intensive nature of our problem, which requires aggregating a large number of tweets by both hashtag and location, makes Hadoop an ideal choice for the offline-processing module. Using Hadoop, we distribute the tweets across multiple nodes and then take advantage of the MapReduce model to aggregate them by their associated location on the mesh and the hashtags they contain. In the reduce step, with access to all the tweets for a given location or hashtag, we can generate the analytics for different timestamps. In addition, since the nodes of the Gordon Hadoop cluster have relatively large memory, we can store the geographical mesh in memory and quickly map each user's location to its closest point on the mesh (using a kd-tree). The same technique is employed in the interactive module to find the set of mesh points that lie within the user-defined bounding box.
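The hashtag-keyed aggregation described above can be mimicked in plain Python to show the data flow. This is only a single-process sketch of the MapReduce pattern, not Hadoop code and not the authors' job definitions: the tweet fields (`hashtags`, `location`, `timestamp`), the one-hour window, and all function names are assumptions, and each tweet's `location` is assumed to be already snapped to a mesh point.

```python
from collections import Counter, defaultdict


def map_tweet(tweet, window_seconds=3600):
    """Map step: emit (hashtag, (mesh location, time window)) for each hashtag."""
    window = tweet["timestamp"] // window_seconds
    for tag in tweet["hashtags"]:
        yield tag, (tweet["location"], window)


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_hashtag(tag, values):
    """Reduce step: build per-window location counts for one hashtag."""
    per_window = defaultdict(Counter)
    for location, window in values:
        per_window[window][location] += 1
    return tag, {w: dict(c) for w, c in per_window.items()}


def run_job(tweets):
    """Run map, shuffle, and reduce in sequence over an iterable of tweets."""
    pairs = (pair for tweet in tweets for pair in map_tweet(tweet))
    return dict(reduce_hashtag(tag, values) for tag, values in shuffle(pairs).items())
```

The reducer's output for each hashtag is a mapping from time window to per-location counts, which is the shape of input that distribution-based measures over time windows would need. The location-based job would follow the same pattern with the mesh location as the key.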
The interactive module consists of a web application and a Java Servlet. The web application is integrated into the CyberGIS Gateway [2] to increase its usability and to ease integration with other CyberGIS applications. Figure 1 shows a view of the application visualizing the top 20 hashtags in the southern California region in September 2014.

References

[1] K. Y. Kamath, J. Caverlee, K. Lee, and Z. Cheng. Spatio-temporal dynamics of online memes: A study of geo-tagged tweets. In Proceedings of the 22nd International Conference on World Wide Web, WWW '13, pages 667-678, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
[2] Y. Liu, A. Padmanabhan, and S. Wang. CyberGIS Gateway for enabling data-rich geospatial research and education. In Cluster Computing (CLUSTER), 2013 IEEE International Conference on, pages 1-3, Sept 2013.
[3] A. Padmanabhan, S. Wang, G. Cao, M. Hwang, Y. Zhao, Z. Zhang, and Y. Gao. FluMapper: An interactive CyberGIS environment for massive location-based social media data analysis. In Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery, XSEDE '13, pages 33:1-33:2, New York, NY, USA, 2013. ACM.
[4] C. Sheng, Y. Zheng, W. Hsu, M. L. Lee, and X. Xie. Answering top-k similar region queries. In Proceedings of the 15th International Conference on Database Systems for Advanced Applications - Volume Part I, DASFAA'10, pages 186-201, Berlin, Heidelberg, 2010. Springer-Verlag.

Published In

XSEDE '15: Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure
July 2015
296 pages
ISBN: 9781450337205
DOI: 10.1145/2792745
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

  • San Diego Super Computing Ctr: San Diego Super Computing Ctr
  • HPCWire: HPCWire
  • Omnibond: Omnibond Systems, LLC
  • SGI
  • Internet2
  • Indiana University: Indiana University
  • CASC: The Coalition for Academic Scientific Computation
  • NICS: National Institute for Computational Sciences
  • Intel: Intel
  • DDN: DataDirect Networks, Inc
  • DELL
  • CORSA: CORSA Technology
  • ALLINEA: Allinea Software
  • Cray
  • RENCI: Renaissance Computing Institute

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. CyberGIS
  2. GeoHashViz
  3. Hadoop
  4. interactive visualization
  5. social media

Qualifiers

  • Extended-abstract

Acceptance Rates

XSEDE '15 Paper Acceptance Rate 49 of 70 submissions, 70%;
Overall Acceptance Rate 129 of 190 submissions, 68%
