Abstract
Since its birth in 2006, Twitter has evolved to a multi-purpose social media that attracts hundreds of millions of users to share their activities and ideas on a daily basis. The potential of capturing fine-grained activity log of users, combined with ever increasing geographical information derived from GPS-enabled devices, has made Twitter data a valuable source for spatiotemporal analysis of human activities. One of the early innovations of Twitter is the use of hashtag as a unique tagging mechanism to provide additional information about a user post. From its emergence in late 2007, hashtags have been used extensively to express ideas, group tweets and report events among Twitter users. The increasing popularity of hashtags, in addition to their simple and concise structure, has inspired multiple recent studies to propose hashtag as a medium to assess diffusion of ideas in a virtual world. Studying collective effort of users in making a hashtag go viral can shed light on the complex process of idea diffusion that involves psychological, sociological and geographical elements.
Although most of the previous research on idea diffusion in virtual world purely focuses on the users social graph, recent studies have confirmed that the spatial relationship among users and regions also play a crucial role in its adoption patterns [1]. This comes back to First Law of Geography that was formulated by Waldo Tobler more than 40 years ago, as "everything is related to everything else, but near things are more related than distant things". However, previous work on designing an interactive visual analytical framework for hashtag diffusion (http://keyhole.co/, http://hashtracking.com/, https://tagboard.com/), lack in-depth spatial analysis capabilities, hence not well-suited to be used for studying diffusion patterns. This research aims to fill this gap by providing an interactive framework to offer visual analytics on geographical diffusion of hashtags over time. Our framework, called GeoHashViz, can provide both textual and visual analytics on the role of location in adoption of hashtags and offer insights on diffusion patterns among different hashtags. GeoHashViz processes large stream of incoming tweets using a Hadoop-based approach and calculates multiple measures that will be used to generate visual analytics for the user. Furthermore, it integrates online maps with a live animation tool to visualize both spatial and temporal diffusion of hashtags at the same time.
Data Collection: we gather our data using the Twitter Streaming API (details in [3]).Since we are only interested in common hashtags, which have a certain level of popularity, we only keep the hashtags with more than 1000 appearances. Our unit of spatial resolution is set to cities in United States with a population larger than 60000 people that give us 645 unique locations. These locations will form our reference grid and every geographical point will be assigned to its nearest neighbor in the reference grid.
Analytics: To formulate the problem of spatiotemporal analysis of hashtag diffusion, we recognized two main categories of hashtag-based and location-based analytics. In hashtag-based analytics we focus on specific hashtags and their associated diffusion patterns. On the other hand, location-based analytics study the similarity and closeness of locations in terms of their hashtag adoption. To evaluate the usability of the framework, we identify five core analytical features that cover wide ranges of research questions. However, our framework can be easily extended to include more analytical features. The five visual analytical capabilities are listed in Table 1. Spread and focus points (locations with highest occurrence of the hashtag [1]) provide users with a visual estimate of how the hashtag is diffused over time. However, we also provided four metrics that gives a user a more concrete sense of the diffusion patterns: a) Entropy: Measures the randomness of hashtag distribution [1] ;b) KL-divergence: Compare the geographical distribution of hashtag in consecutive time windows using KL-divergence method ;c) Spatial Dispersion: Measures how scattered is the hashtag from its geographical midpoint ;d) Count:. Plot the cumulative count of the hashtag over time.
For location-based analytics we included two functions. Top-k hashtags calculate the most popular hashtags in a region and visualize that using a word cloud. However by simply looking at the counts, we may miss some locally significant due to their relative low count. To reduce the dominance of globally popular hashtags, we introduce another analytic that will visualize top-k locally significant hashtags. This analytic uses a Tf-idf like metric [5] to measure the local popularity of a hashtag in a specific region, hence assigning lower rank to the hashtags which are popular in other places as well. In addition, we provide two metrics for comparing two different regions in terms of hashtag adoption: a) Jaccard Similarity Compare the set of hashtag used in two different regions, with higher number assigned to more similar regions ;b) Adoption Lag This measure depicts how long it takes for a hashtag to travel between two region, by averaging the time difference between the first appearance of hashtags in two regions.
Architecture: GeoHashViz framework follows a two-layer architecture: an offline-processing module and an interactive module. The offline-processing module, implemented entirely in Apache Hadoop and called periodically, processes the raw data and pre-computes measures related to spatiotemporal diffusion of hashtags. The interactive module on the other hand is called on demand and based on user requests. The two modules connect with each other through a distributed MongoDB database. The two-layer architecture enables a fast interactive final framework by reducing the data processing that interactive module is required to do.
In the offline-processing module, significant hashtags are extracted and the points are laid on the geographical mesh that we defined above. Then two MapReduce jobs are executed: one for pre-computing measures related to hashtag-based analytics and one for location-based analytics. All the Hadoop experiments were conducted using XSEDE Gordon Hadoop cluster. The data-intensive nature of our problem, requiring aggregation of large number of tweets based on both hashtags and locations, make Hadoop an ideal choice for the offline-processing module. Using Hadoop, we distribute the tweets into multiple nodes, and then take advantage of MapReduce model to aggregate them based on their associated location on the mesh and their included hashtags. In the reduce step, having access to all the tweets for a certain location/hashtag, we can generate the analytics for different timestamps. In addition, since the nodes on Gordon Hadoop cluster have relatively high memory, we are able to store the geographical mesh in memory and quickly map the location of users to their closest point on the mesh (using kd-tree). The same technique is employed in the interactive module to find the set of mesh points which lies into the user-defined bounding box.
The interactive module includes a web application and a Java Servlet. The web application is integrated into Cyber-GIS Gateway [2] to increase usability of the application and easier integration with other CyberGIS applications. Figure 1 shows a view of the application visualizing top 20 hashtags in the southern California region in September 2014.