1 Introduction

A tremendous amount of digital information is now available, including sensor data, social media data, and the public web. Seeing trends, extracting meaningful information, and forming insights from data that accumulates at such a rapid rate requires specialized methods and techniques. As technology evolves, new and effective methods of big data analytics [1] are being developed.

One popular source of big data is Twitter [2], which generates one of the largest collections of human-generated data. Twitter has 330 million monthly active users who create about 500 million tweets a day, or roughly 200 billion tweets a year [3]. One advantage of using Twitter data for trend analysis is that it is generated in real time [4], making it possible to gain insights and trending information almost instantaneously. This paper discusses the methods being employed to make use of the data generated on Twitter, as well as the application areas of that data. The studies are grouped and presented as a systematic review on the basis of research technique, such as Sentiment Analysis [5], Linguistic Analysis, Comparison of Data, and Information System. They can play a game-changing role in areas such as Disaster Management, Impact on People, Cybercrime Detection, Public Health Service, Disease Management, and Medical Complaints. The paper then presents the results of analyzing the selected papers and classifying them into four categories: year of publication, journal, field of research, and relevance to the topic.

2 Systematic Review of the Papers

2.1 Sentiment Analysis

The study “#iamhappybecause: Gross National Happiness through Twitter analysis and big data” [6], published in Technological Forecasting and Social Change, measured gross national happiness in Turkey. Twenty thousand people posted 35 million tweets, which were analyzed with an open-source sentiment analysis tool. The results were compared with data from previous years, and the data were categorized by level of happiness as happy, neutral, or negative [6]. Some insights from this and related studies follow.

  • There is a relationship between happiness and the stock index.

  • The study “Detecting suicidality on Twitter” [7] showed that people do use Twitter to express suicidal feelings, among other sentiments.

  • The 2014 World Cup brought out a variety of public moods on Twitter [8, 9]. As predicted, fear and anger peaked after events that did not favor the U.S. soccer team.

  • English-language tweets with geolocation information were collected from March 2014 to December 2014 using Twitter’s Streaming API [10]. After cleaning and filtering, a keyword search [11] for “cancer” yielded a sample of 146,357 tweets. Hedonometric analysis [12] was used to compute the average happiness of tweets about each type of cancer on a scale from 1 to 9.
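The hedonometric average mentioned in the last item can be illustrated with a minimal sketch. It assumes that each scored word carries a happiness value on the 1 to 9 scale and that the score of a set of tweets is the frequency-weighted mean over the scored words. The word list, its values, and the function name `average_happiness` are hypothetical and chosen only for illustration; they are not taken from the cited study.

```python
from collections import Counter

# Hypothetical word-happiness scores on a 1-9 scale (illustrative values only,
# not taken from any published word list).
HAPPINESS = {
    "hope": 7.5,
    "love": 8.4,
    "survivor": 6.5,
    "pain": 2.1,
    "cancer": 1.5,
    "treatment": 5.0,
}


def average_happiness(texts, scores=HAPPINESS):
    """Frequency-weighted mean happiness of the scored words in `texts`.

    Words without a score are ignored. Returns None if no word is scored.
    """
    counts = Counter()
    for text in texts:
        for word in text.lower().split():
            word = word.strip(".,!?#\"'")  # drop simple punctuation and '#'
            if word in scores:
                counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        return None
    return sum(scores[w] * n for w, n in counts.items()) / total


# Hypothetical tweets, used only to exercise the function.
tweets = [
    "Hope and love for every survivor",
    "The pain of treatment",
]
print(average_happiness(tweets))
```

Because the mean is weighted by word frequency, a few very common low-scoring words can dominate the result, which is one reason such analyses are usually run on large samples.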

Tweets with both negative and positive emotions carry many new words, common in computer language, that can enrich an opinion lexicon [13]. In that research, the words in the tweets were tagged using a part-of-speech tagger and features of the words were calculated, yielding a new approach to opinion lexicon expansion using data from Twitter.

Another study found that the sentiment of a tweet is less useful for prediction than the number of tweets posted by a user [14]. A Klout score [15], which indicates an individual’s level of influence, was also calculated for each tweet. An ordinary least squares regression and a linear probability model were used to examine the relationship between stocks and the sentiment of the tweets.

There also seems to be a connection between the outcome of games and the sentiment of the fan base on Twitter [16]. The researchers built a system, Central Sport, that uses Twitter’s Streaming API to capture tweets containing specific hashtags. The models tested did not prove more accurate than a baseline odds-only approach.

Tweets about a company’s events can also enhance its market scope and stock value [17]. With the growing use of the Internet and social media, microblog data and blog sentiment provide useful material for predicting the stock market and its volatility and for gauging survey sentiment [18].
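The ordinary least squares regression mentioned above for relating tweet sentiment to stocks [14] can be sketched for the simplest case of a single explanatory variable. This is an assumption-laden illustration, not the cited study's model: the function name `ols_fit`, the variable names, and the data are hypothetical.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimising the sum of squared residuals
    for the one-variable model y = intercept + slope * x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical daily data: mean tweet sentiment and stock return in percent.
sentiment = [-1.0, 0.0, 1.0, 2.0]
returns = [-0.5, 0.0, 0.5, 1.0]

intercept, slope = ols_fit(sentiment, returns)
print(intercept, slope)
```

A positive slope in such a fit would indicate that higher sentiment goes together with higher returns in the sample; it does not by itself establish predictive power.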

2.2 Linguistic Analysis

The study “Understanding U.S. regional linguistic variation with Twitter data analysis” [19] examined regional linguistic variation across the continental United States. It collected geotagged tweets within the continental U.S. from October 7, 2013 to October 6, 2014. Lexical alternations were then used to examine differences in language across the U.S. A variant preference and a mean variant preference were calculated for each county and alternation. Further research [20] used linguistic analysis to identify sarcastic sentences and differentiate them from irony [21]. At times, grammatical mistakes are made intentionally to convey a view to the reader [22].

The goal of the study “Looking for the perfect tweet. The use of data mining techniques to find influencers on twitter” [23] was to determine the characteristics of influencers on Twitter. IBM SPSS Statistics version 23.0 [24] was used to analyze the variables of the tweets. Big influencers used more hashtags and had more mentions than average users, but used fewer links. Their tweets were shorter and usually expressed a clear opinion. They also followed many people. The researchers found a clear trend in the ways influencers use Twitter [25]. A shortcoming was that the study used only Spanish-language tweets and two keywords, and data were collected over only a twelve-day span. A further study [26] found that while gender and income had positive associations with real-world opinion leaders, these characteristics had little association with opinion leaders on Twitter.

2.3 Disaster Management

The study “Crowd sourcing disaster management: The complex nature of Twitter usage in Padang Indonesia” [27] was designed to analyze whether Twitter could be used in disaster management situations. Five different methods of data collection were examined. The research indicates that the situation in densely populated areas can be assessed using Twitter. Another study [28] set out to analyze how Twitter is used during disasters, specifically Japan’s tsunami, and to address the problems and concerns users have.

2.4 Information System

Tweets can provide vital information about important situations or events. The Twitter data in “From Twitter to detector: Real-time traffic incident detection using social media data” was acquired by adaptive data acquisition [29]. Features were then extracted from the data based on keywords, and the tweets were classified and geotagged. The tweets provided immediate information that matched information obtained from reliable sources. The data were then compared with HERE [30], which provides time-varying travel times on most roads, and the two matched closely. However, the tweeter’s location is seldom available, which leaves the information without a location. Further, keywords do not pinpoint the subject of the information. In an emergency, tweets provide quick responses from the public [31].
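Keyword-based feature extraction of the kind described above can be sketched as follows. The keyword list, the function names `keyword_features` and `is_candidate`, and the threshold `min_hits` are hypothetical; the cited study's actual dictionary and classifier are not reproduced here.

```python
import re

# Hypothetical keyword list; the study's actual dictionary is not given here.
KEYWORDS = ["accident", "crash", "traffic", "lane", "closed"]


def keyword_features(text, keywords=KEYWORDS):
    """Return a 0/1 vector marking which keywords occur as whole words."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return [1 if kw in tokens else 0 for kw in keywords]


def is_candidate(text, min_hits=2):
    """Flag a tweet as a possible incident report if it contains at least
    `min_hits` distinct keywords (an assumed, illustrative rule)."""
    return sum(keyword_features(text)) >= min_hits


# Hypothetical tweets, used only to exercise the functions.
print(keyword_features("Crash on the bridge, left lane closed"))
print(is_candidate("Crash on the bridge, left lane closed"))
print(is_candidate("Stuck in a traffic jam of emails today"))
```

The third example shows the limitation noted in the text: a single keyword match does not pin down the subject of a tweet, so a classifier is still needed on top of the keyword features.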

2.5 Cybercrime Detection

Cyberbullying has become a serious problem across the vast number of social networks, and was the topic of the paper “Cybercrime detection in online communications: The experimental case of cyberbullying detection in the Twitter network” [32], published in Computers in Human Behavior. Geotagged tweets were collected within the state of California from January to February 2015. Network information, activity information, user information, and tweet content were the key features used in the machine learning algorithms to detect cyberbullying. Different combinations of the features were tested with four classifiers to see which features and classifier gave the greatest accuracy: naive Bayes [33], support vector machine [34], random forest [35], and k-nearest neighbor. The study’s goal was to differentiate cyberbullying tweets from non-cyberbullying tweets using key features and machine learning algorithms. The algorithms correctly labeled a non-cyberbullying tweet 99.4% of the time but correctly labeled a cyberbullying tweet only 71.4% of the time [32].
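The two percentages above are per-class rates, and a short sketch shows why they are reported separately from overall accuracy. It assumes the figures can be read as per-class recall, which is an interpretation and not something the source states; the function name `per_class_recall` and the labels are hypothetical.

```python
def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals, correct = {}, {}
    for t, p in zip(y_true, y_pred):
        totals[t] = totals.get(t, 0) + 1
        if t == p:
            correct[t] = correct.get(t, 0) + 1
    return {label: correct.get(label, 0) / n for label, n in totals.items()}


# Hypothetical labels: 1 = cyberbullying, 0 = non-cyberbullying.
y_true = [0] * 8 + [1] * 4
y_pred = [0] * 8 + [1, 1, 1, 0]

recall = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(recall, accuracy)
```

When one class is much rarer than the other, overall accuracy is dominated by the majority class, so a classifier can look strong overall while missing a sizable share of the minority class.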

2.6 Public Health Services

The study “‘Time for dabs’: Analyzing Twitter data on marijuana concentrates across the U.S.” [36] gathered information on the use of marijuana concentrates in different parts of the United States. The tweets were filtered from drug-related tweets using Twitter’s API. The keywords used were “dabs”; “hash oil”; “butane honey oil”; “smoke/smoking shatter”; “smoke/smoking budder”; and “smoke/smoking concentrates.” The eDrugTrends system provided a Twitter filtering and aggregation framework [37]. A sample of 125,255 tweets was collected over a two-month period, of which 27,018 contained geolocation information. California, Texas, Florida, and New York had the highest raw numbers of dabs-related tweets, but after adjusting for the different activity levels in each state, Oregon, Colorado, and Washington had the highest proportions of dabs-related tweets. The average adjusted proportion of dabs-related tweets for Status 1, Status 2, and Status 3 states was 5.1%, 2.3%, and 1.4% respectively. The study found that dabs-related tweets were more common in states where medical and recreational use of cannabis is legal. Furthermore, the Western region of the United States had greater dabs-related tweeting activity [36].
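The difference between raw counts and adjusted proportions can be made concrete with a small sketch. It assumes the adjustment divides each state's topic-related tweets by that state's overall tweet volume in the sample; the exact adjustment used in the study may differ. The state names, counts, and the function name `adjusted_proportions` are hypothetical.

```python
def adjusted_proportions(topic_counts, total_counts):
    """Topic tweets as a percentage of all sampled tweets, per state.

    States with no recorded overall activity are skipped.
    """
    return {
        state: 100.0 * topic_counts[state] / total_counts[state]
        for state in topic_counts
        if total_counts.get(state, 0) > 0
    }


# Hypothetical counts, used only to show how the ranking can change.
topic = {"State A": 500, "State B": 120, "State C": 90}
total = {"State A": 50000, "State B": 4000, "State C": 2000}

adjusted = adjusted_proportions(topic, total)
by_raw = max(topic, key=topic.get)
by_adjusted = max(adjusted, key=adjusted.get)
print(adjusted, by_raw, by_adjusted)
```

In this toy data the state with the most topic tweets has the lowest adjusted proportion, which mirrors the reversal reported between populous states and states with higher relative activity.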

2.7 Disease Management

The study “Predicting Flu Trends using Twitter Data” [38] aimed to track flu trends using Twitter and to predict influenza-like illnesses (ILI). It reported a Pearson correlation coefficient of 0.9846 between the Twitter data and reported ILI activity. A regression model was built and tested with old CDC data. Using Twitter data improved the model’s accuracy in predicting ILI cases and can provide real-time analysis of influenza activity. Twitter was thus shown to track influenza-like illnesses effectively and to help predict influenza activity accurately.
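For reference, the Pearson correlation coefficient between two series x and y is the covariance of the series divided by the product of their standard deviations. The sketch below computes it directly from that definition. The function name `pearson_r` and the weekly series are hypothetical and are not data from the cited study.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("neither sequence may be constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical weekly series: flu-related tweet counts and ILI rate in percent.
flu_tweets = [10, 20, 30, 40, 50]
ili_rate = [1.0, 2.1, 2.9, 4.2, 4.8]

r = pearson_r(flu_tweets, ili_rate)
print(round(r, 4))
```

A value close to 1 means the two series rise and fall together almost linearly, which is what makes the tweet series useful as an input to a predictive model.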

3 Results

Based on popularity, relevance to the topic, and citations, we selected the following papers (see Table 1) for our review.

Table 1. Selection of papers based on field of research

Our classification reflects the diverse research in the field of Twitter data extraction. It uses four main categories: the year of publication, the journal in which the article or research paper was published, the field of research, and relevance to the topic. After screening, 29 research papers from 22 journals met the criteria for inclusion. The classifications are presented below.

3.1 Year of Publication

Fig. 1 shows that the amount of research in this field has increased rapidly over the past few years.

Fig. 1. Year-wise selected papers

3.2 Journals

The journal “Computers in Human Behavior” published the largest number of articles related to our research field: 20.69% of all research papers analyzed for classification (see Fig. 2).

Fig. 2. Selection of papers based on journals

3.3 Research Area

Figures 3 and 4 show available papers with keywords and selected papers based on relevance.

Fig. 3. Available papers with keywords

Fig. 4. Selected papers based on relevance

4 Conclusion

The aim of this systematic review was to study the developments in the field of Twitter data analytics and to gain insight into their applications and methods. For better understanding, we described the idea of research methods and the steps involved in the data science process. Based on these principles, research papers were analyzed and classified into four categories according to the selected parameters.

The classification helps in understanding the basics of big data methods for collecting and processing data and drawing insights from it. We also described ongoing research in the field of Twitter data extraction that collects data by means of big data techniques, which helps in understanding the current state of Twitter data analytics. This paper does not claim to be comprehensive, but we have tried to present some of the major research under way in Twitter data analysis. Given the rapid advances in the field, we recommend revisiting these methods and periodically revising the review to include new developments.