1 Introduction

Social networking sites, such as Twitter, have become a prevalent communication platform in our daily life, with discussions ranging from mainstream topics like TV and music to specialized topics like politics and climate change. Tracking and understanding these discussions provide valuable insights into the general opinions and sentiments towards specific topics and how they change over time, which are useful to researchers, companies, government organizations alike, e.g., advertising, marketing, crisis detection, disaster management. Despite its usefulness, the large volume and wide variety of tweets makes it challenging to track and understand the discussions on these topics [2, 5]. To address these challenges, we proposed and developed the Real-time Analytics Platform for Interactive Data mining (RAPID) for topic tracking and analysis on Twitter (Fig. 1). RAPID offers a unique topic-tracking capability using query keyword and user expansion to track topics and related discussions, as well as various analytics capabilities to visualize the collected tweets, users and topics, and understand tweeting and interaction behaviours.

Fig. 1.
figure 1

Overview of RAPID

Related Systems and Differences. There has been a number of interesting Twitter-based systems developed for specific application domains such as politics [10], crime and disasters [4], diseases [3], recommendations [11], and they typically utilize a mention/keyword-based retrieval of tweets relating to each domain. Others focus on specific capabilities on Twitter such as a SQL-like query language [6], clustering tweets into broad topics [8], detecting events based on keyword frequency [7]. While these systems provide many interesting capabilities, our RAPID system differs in the following ways: (i) Instead of targetting specific domains, RAPID is designed to be generalizable to any application domain, topic or event; (ii) Many earlier systems retrieve tweets based on user-provided keywords, which may not adequately represent the topic of interest. In contrast, RAPID provides a unique query expansion collection capability that allows for the expansion of seeding keywords and users for a broader collection coverage; (iii) In addition, RAPID allows its users to interact with and control the data stream in real-time, as well as perform a wide and in-depth range of analysis and visualizations techniques, which we further describe in this paper; and (iv) RAPID is highly scalable to the growing volume of tweets generated, by utilizing real-time distributed computing technologies like Apache Storm and Kafka, compared to earlier systems that do not utilize such technologies.

2 System Architecture

RAPID is developed to perform real-time analysis and visualization, as well as post-hoc analysis and visualization on previously collected data. Communication between the client and server are facilitated through Kafka queues, based on the publish-subscribe model where researchers are able to specify their various information requirements. We now describe the main components of RAPID.

Data Retrieval and Analysis Component. This component performs two main tasks, which are:

  • Data Retrieval. For real-time retrieval, RAPID interfaces with the Twitter Streaming API and collects information such as tweets related to a particular topic, posted by specific users or are within a geo location subscribed by the user, Twitter user details such as the list of followers, profile information and timeline information. For post-hoc processing, RAPID retrieves information stored in the data storage unit based on the researcher’s requests. The researcher is able to access all functionalities of the real-time retrieval and in addition, is able to further drill-down on the data by filtering the collected tweets based on specific topics, time periods, locations and set of hashtags. Unlike many earlier systems, RAPID is designed with an integrated data retrieval and analysis capability such that the data retrieval is continuously expanded for better coverage based on real-time analysis of collected tweets, which we discuss next.

  • Data Analysis. This includes the sub-tasks of: (i) tweet pre-processing, i.e., tokenizing, topic labelling, extraction of geo-location and other tweet features; (ii) topic tracking via keywords, usernames or bounding boxes, and an enhanced query expansion capability that automatically track topics and related discussions through dynamic expansion of keywords; (iii) user query processing, such as filtering and drilling down the collected data for further analysis based on topics, time periods and/or locations; and (iv) data statistics and analysis, such as updating data storage with latest collection statistics and performing advanced analytics like analyzing hashtags and inferring relationships between hashtags, analyzing word-to-word pairs and word clusters of tweets, tracking discussions through pro-actively fetching tweets replies related to discussions.

Data Storage. The data storage component uses MongoDB for storing meta-data as well as the processed tweets, which can be used later for further post-hoc processing and visualization. RAPID also allows users the freedom to decide the type of processed data that should be persisted in the storage. Meta-data stored in the database includes the details of the users, details of user activities such as commands given by users to the RAPID system and the topics users are subscribed to. In addition to the meta-data, tweets processed by the system, discussions occurred related to tweets can also be stored in the database. One major advantage of having this useful capability is that users can reprocess and visualize the tweets later if such requirement arises, e.g., further drill-down to filter and analyze crisis-related tweets posted on 20 Nov 2017 in Melbourne CBD.

User Interface Component. This component performs three main tasks, namely:

  • User Input. For topic tracking, researchers can specify a set of keywords, users and/or geo-bounding boxes associated with the topic as the input. The interface also allows users to modify or delete existing tracked topics, with a detailed log of these activities.

  • Real-time Visualization. Key information and statistics of the tracked topics are visualized using a set of predefined charts, which are updated in real-time as new tweets related to the topic are analyzed by the RAPID system. Screenshots and descriptions of selected charts are shown in Fig. 2.

  • Workbench. The workbench allows users to visualize tweets that have been stored in the storage component for further analysis. For more flexibility in post-collection analysis, users are able to define a specific time period the tweets have occurred and then the workbench retrieves the related tweets and visualizes them using the same charts used for real-time visualization. Moreover, the workbench summarizes the key statistics of the retrieved tweets including the number of tweets fetched, number of unique authors, unique hashtags, unique mentions and unique replies.

Fig. 2.
figure 2

Screenshots of selected RAPID functionalities, which include (clockwise from left): word-to-word network graph, discussion tree visualization, profile of tweeting locations, overview of RAPID workbench, workbench view on query expansion, real-time tracking of keyword frequency, word cloud of collected tweets.

3 Target Users and Demonstration

We presented the RAPID system for real-time topic tracking and analysis on Twitter, where RAPID offers a unique and effective collection approach via query expansion, numerous analysis capabilities to understand user interactions, tweeting behaviours and discussion cascades, and various visualizations of these types of information. RAPID has been used by researchers from both the Army Research Laboratory in the USA and Defence Science and Technology in Australia [1, 9], and will also be of interest to any user interested in tracking, analysing and visualizing topics on Twitter. We will demonstrate the various capabilities of RAPID via use cases of political campaign analysis, monitoring of crises and incidents, in-depth analysis of tweets and users. A demonstration video of RAPID is available at https://youtu.be/1APLeLT_t8w.