Big Data Analytics on Twitter

Pradyumn, Mudit; Kapoor, Akshat; Tabrizi, Nasseh

doi:10.1007/978-3-319-94301-5_26

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10968)

Included in the following conference series:

International Conference on Big Data

Abstract

As the amount of digital data grows at an exponential rate, the emphasis is shifting to forming insights from that data. Although new fields of research, including Twitter data analytics, have proven fruitful, there is a lack of literature reviews and classification of this research. Therefore, after screening 1,025 research papers, we reviewed 29 papers from 20 journals on Twitter data analytics published from 2011 to 2017 and classified them by year of publication, journal title, data mining method, and application. The intent of this paper is to understand the research trends in this field.

1 Introduction

A tremendous amount of digital information is now available, including sensor data, social media data, and the public web. Seeing trends, extracting meaningful information, and forming insights from data that accumulates at such a rapid rate requires specialized methods and techniques. As technology evolves, new and more effective methods of big data analytics [1] are being developed.

One popular source of big data is Twitter [2], which produces one of the largest collections of human-generated data. Twitter has 330 million monthly active users who create about 500 million tweets a day, or roughly 200 billion tweets a year [3]. One advantage of using Twitter data for trend analysis is that it is generated in real time [4], making it possible to gain insights and trending information almost instantaneously. This paper discusses the methods being employed to make use of the data generated on Twitter, as well as the application areas of that data. The studies are grouped and presented as a systematic review on the basis of research technique, such as Sentiment Analysis [5], Linguistic Analysis, Comparison of Data, and Information Systems. These techniques can play a game-changing role in areas such as Disaster Management, Impact on People, Cybercrime Detection, Public Health Service, Disease Management, and Medical Complaints. The paper then presents the results of analyzing the selected papers and classifying them into four categories: year of publication, journal, field of research, and relevance to the topic.

2 Systematic Review of the Papers

2.1 Sentiment Analysis

The study “#iamhappybecause: Gross National Happiness through Twitter analysis and big data” [6], published in Technological Forecasting and Social Change, set out to measure gross national happiness in Turkey. It covered 35 million tweets posted by twenty thousand people, which were analyzed with an open source sentiment analysis tool. Data from previous years were compared, and the tweets were categorized by level of happiness as happy, neutral, or negative [6]. Insights from this and related sentiment studies follow.

There is a relationship between happiness and stock index.
The study “Detecting suicidality on Twitter” [7] showed that people do use Twitter to express suicidal feelings, among other sentiments.
The study of World Cup 2014 in the Twitter world brought out a variety of public moods [8, 9]. As predicted, emotions of fear and anger peaked after events that were not in favor of the U.S. soccer team.
English-language tweets with geolocation information were collected from March 2014 to December 2014 using Twitter’s Streaming API [10]. After cleaning and filtering, a keyword search [11] for “cancer” yielded a sample of 146,357 tweets. Hedonometric analysis [12] was used to compute the average happiness of tweets about each type of cancer on a scale from 1 to 9.
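
The hedonometric scoring mentioned above can be sketched as a frequency-weighted average of per-word happiness scores on the 1 to 9 scale. The tiny lexicon, its values, and the helper name `average_happiness` are illustrative assumptions, not the word list or code used in the cited study.

```python
from collections import Counter

# Hypothetical word-happiness scores on a 1-9 scale (illustrative values only).
HAPPINESS = {"love": 8.4, "hope": 7.4, "pain": 2.1, "cancer": 1.5}


def average_happiness(tweets, lexicon=HAPPINESS):
    """Frequency-weighted mean happiness of all scored words in the tweets."""
    counts = Counter(
        word
        for tweet in tweets
        for word in tweet.lower().split()
        if word in lexicon
    )
    total = sum(counts.values())
    if total == 0:
        return None  # no scored words, so no average can be computed
    return sum(lexicon[word] * n for word, n in counts.items()) / total


sample = ["Hope and love", "so much pain"]
score = average_happiness(sample)
```

Words missing from the lexicon are simply skipped here; a real analysis would also need tokenization that handles punctuation, hashtags, and mentions.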

Tweets with both negative and positive emotions carry many new words, common in online language, that can enhance a sentiment lexicon [13]. In that study, the words in the tweets were tagged with a part-of-speech tagger and features were calculated for each word. The research presented a new approach to opinion lexicon expansion using data from Twitter.

Another study found that the sentiment of a tweet is less useful for prediction than the number of tweets posted by a user [14]. A Klout score [15], which indicates the level of influence an individual has, was also calculated for each tweet. An ordinary least squares regression and a linear probability model were used to examine the relationship between stocks and the sentiment of the tweets.

There also seems to be a connection between the outcome of games and the sentiment of the fan base on Twitter [16]. The researchers built the Central Sport system, which uses Twitter’s streaming API to capture tweets containing specific hashtags. The resulting models, however, did not prove more accurate than a baseline odds-only approach.

Tweets about a company’s events can also enhance its market scope and stock value [17]. With the growing use of the Internet and social media, microblog data and blog sentiment provide useful material for predicting stock market returns, volatility, and survey sentiment [18].
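
The ordinary least squares regression used above to relate tweet sentiment to stocks can be illustrated with a single regressor. This is a minimal sketch under stated assumptions: the function name `ols_fit`, the variable names, and the sample numbers are hypothetical and do not come from the cited study.

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing squared error for y ~ a + b*x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must not be constant")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical daily sentiment scores and same-day stock returns (in percent).
sentiment = [-1.0, 0.0, 1.0, 2.0]
returns = [-0.5, 0.0, 0.5, 1.0]
intercept, slope = ols_fit(sentiment, returns)
```

A linear probability model is the same least squares fit applied to a 0/1 outcome, for example whether the stock closed higher, so the fitted value is read as a probability.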

2.2 Linguistic Analysis

The study “Understanding U.S. regional linguistic variation with Twitter data analysis” [19] examined regional linguistic variation across the continental United States. It collected geotagged tweets within the continental U.S. from October 7, 2013 to October 6, 2014. Lexical alternations were then used to examine differences in language across the U.S.: a variant preference and a mean variant preference were calculated for each county and alternation. Further research [20] used linguistic analysis to identify sarcastic sentences and distinguish them from irony [21]. At times, grammatical mistakes are made intentionally to convey a view to the reader [22].

The goal of the study “Looking for the perfect tweet. The use of data mining techniques to find influencers on twitter” [23] was to determine the characteristics of influencers on Twitter. IBM SPSS Statistics version 23.0 [24] was used to analyze the different variables of the tweets. Big influencers on Twitter used more hashtags and had more mentions than average users, but they used fewer links. Their tweets were shorter and usually expressed a clear opinion. They also followed many people. The authors were able to find a clear trend in the ways influencers use Twitter [25]. A shortcoming of the study is that it used only Spanish-language tweets matching two keywords, collected over a twelve-day span. A further study [26] found that while gender and income had positive associations with real-world opinion leaders, these characteristics had little association with opinion leaders on Twitter.
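
The per-county variant preference described for the regional variation study can be sketched as follows. The definitions used here are assumptions made for illustration: variant preference is taken as the share of uses of an alternation that chose its first variant, and mean variant preference as the average of those shares within a county. The cited study may define both differently, and the county names and alternations are hypothetical.

```python
from collections import defaultdict


def variant_preferences(observations):
    """Map (county, alternation) -> share of uses that chose the first variant.

    observations: iterable of (county, alternation, used_first_variant) tuples.
    """
    first = defaultdict(int)
    total = defaultdict(int)
    for county, alternation, used_first in observations:
        key = (county, alternation)
        total[key] += 1
        if used_first:
            first[key] += 1
    return {key: first[key] / total[key] for key in total}


def mean_preference_by_county(preferences):
    """Average the per-alternation shares within each county."""
    grouped = defaultdict(list)
    for (county, _alternation), share in preferences.items():
        grouped[county].append(share)
    return {county: sum(v) / len(v) for county, v in grouped.items()}


# Hypothetical observations: (county, alternation, chose first variant?)
obs = [
    ("County A", "soda/pop", True),
    ("County A", "soda/pop", False),
    ("County A", "y'all/you guys", True),
    ("County B", "soda/pop", False),
]
prefs = variant_preferences(obs)
means = mean_preference_by_county(prefs)
```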

2.3 Disaster Management

The study titled “Crowd sourcing disaster management: The complex nature of Twitter usage in Padang Indonesia” [27] was designed to analyze whether Twitter could be used in disaster management situations. Five different methods of data collection were examined. The research indicates that Twitter can be used for situation assessment in densely populated areas. Another study [28] set out to analyze the way Twitter is used during disasters, specifically Japan’s tsunami, and to address the problems and concerns users currently have.

2.4 Information System

Tweets can provide vital information about important situations or events. In “From Twitter to detector: Real-time traffic incident detection using social media data”, the Twitter data was acquired by adaptive data acquisition [29]. Features are extracted from the acquired tweets based on keywords, and the tweets are then classified and geotagged to obtain the incident information. The tweets provide immediate information that matches information obtained from reliable sources. The data was then compared with HERE [30], which provides time-varying travel times for most roads, and the two were found to match closely. However, the location of the tweeter is seldom available, which leaves the information without a location. Furthermore, the keywords do not pinpoint the subject of the information. In an emergency, tweets offer quick responses from the public [31].
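
The keyword-based feature extraction step described above can be sketched with a simple whole-word match. The keyword set, the threshold `min_hits`, and both function names are hypothetical; the dictionary and classifier used in the cited study are not reproduced here.

```python
import re

# Hypothetical incident keywords, chosen only for illustration.
INCIDENT_KEYWORDS = {"accident", "crash", "congestion", "closed"}


def keyword_features(tweet, keywords=INCIDENT_KEYWORDS):
    """Return the sorted incident keywords that appear as whole words."""
    tokens = set(re.findall(r"[a-z']+", tweet.lower()))
    return sorted(tokens & keywords)


def is_candidate_incident(tweet, min_hits=1):
    """Flag a tweet as a possible traffic incident if enough keywords match."""
    return len(keyword_features(tweet)) >= min_hits


tweets = [
    "Crash on the bridge, left lane closed",
    "Great coffee this morning",
]
flags = [is_candidate_incident(t) for t in tweets]
```

As the text notes, a keyword match alone does not pinpoint the subject of a tweet, so a filter like this would only produce candidates for a later classification step.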

2.5 Cybercrime Detection

Cyberbullying has become a serious problem across social networks and was the topic of the paper “Cybercrime detection in online communications: The experimental case of cyberbullying detection in the Twitter network” [32], published in Computers in Human Behavior. Geotagged tweets were collected within the state of California from January to February 2015. Network information, activity information, user information, and tweet content were the key features used in the machine learning algorithms to detect cyberbullying. Different combinations of the features were tested with four classifiers to see which features and classifier gave the greatest accuracy. The algorithms used were naive Bayes [33], support vector machine [34], random forest [35], and k-nearest neighbor. The study’s goal was to differentiate cyberbullying tweets from non-cyberbullying tweets using key features and machine learning algorithms. The algorithms correctly labeled a non-cyberbullying tweet 99.4% of the time but correctly labeled a cyberbullying tweet only 71.4% of the time [32].
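
The two rates reported above can be read as per-class recall: the share of each class that the classifier labels correctly. That reading is an interpretation, since the text does not name the metric, and the confusion-matrix counts below are hypothetical values chosen only to reproduce the reported percentages.

```python
def per_class_recall(tp, fn, tn, fp):
    """Return (recall on the positive class, recall on the negative class)."""
    positive_recall = tp / (tp + fn)  # share of cyberbullying tweets caught
    negative_recall = tn / (tn + fp)  # share of other tweets labeled correctly
    return positive_recall, negative_recall


# Hypothetical counts, not data from the study.
pos, neg = per_class_recall(tp=714, fn=286, tn=994, fp=6)
```

Reporting the two rates separately matters because a single accuracy figure would be dominated by whichever class is more common.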

2.6 Public Health Services

Information on the use of marijuana concentrates in different parts of the United States was gathered in the study “‘Time for dabs’: Analyzing Twitter data on marijuana concentrates across the U.S.” [36]. The tweets were filtered from drug-related tweets using Twitter’s API. The keywords used were “dabs”; “hash oil”; “butane honey oil”; “smoke/smoking shatter”; “smoke/smoking budder”; and “smoke/smoking concentrates.” The eDrugTrends system provided a Twitter filtering and aggregation framework [37]. A sample of 125,255 tweets was collected over a two-month period, of which 27,018 contained geolocation information. California, Texas, Florida, and New York had the highest raw numbers of dabs-related tweets, but after adjusting for the different activity levels in each state, Oregon, Colorado, and Washington had the highest proportions. The average adjusted proportion of dabs-related tweets for Status 1, Status 2, and Status 3 states was 5.1%, 2.3%, and 1.4% respectively. The study found that dabs-related tweets were more common in states where medical and recreational use of cannabis is legal. Furthermore, the Western region of the United States had greater dabs-related tweeting activity [36].
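
The difference between raw counts and adjusted proportions can be sketched as a per-state normalization. The assumption here is that the adjustment divides topic tweets by each state's overall tweet volume; the cited study's exact denominator is not given in the text, and the state names and counts are hypothetical.

```python
def adjusted_proportions(topic_counts, total_counts):
    """Percent of each state's tweets that match the topic."""
    return {
        state: 100.0 * topic_counts.get(state, 0) / total
        for state, total in total_counts.items()
        if total > 0
    }


# Hypothetical counts: a large state can lead on raw volume yet trail on share.
dabs = {"Large state": 500, "Small state": 120}
all_tweets = {"Large state": 50000, "Small state": 4000}
shares = adjusted_proportions(dabs, all_tweets)
ranked = sorted(shares, key=shares.get, reverse=True)
```

In this toy example the large state has more topic tweets in absolute terms, but the small state ranks first once activity level is taken into account, which mirrors the reordering reported in the study.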

2.7 Disease Management

The study titled “Predicting Flu Trends using Twitter Data” [38] aimed to track flu trends using Twitter and to predict influenza-like illnesses (ILI). It reported a Pearson correlation coefficient of 0.9846 between the Twitter data and reported ILI rates. A regression model was built and tested against earlier CDC data. Adding Twitter data improved the model’s accuracy in predicting ILI cases and can provide real-time analysis of influenza activity. Twitter was thus shown to track influenza-like illnesses effectively and to help predict influenza activity accurately.
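
The Pearson correlation coefficient reported above measures how closely two series move together on a scale from -1 to 1. The sketch below computes it from first principles; the weekly counts are hypothetical and are not data from the cited study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("neither series may be constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly counts of flu-related tweets and reported ILI cases.
flu_tweets = [10, 20, 30, 40]
ili_cases = [12, 22, 29, 41]
r = pearson(flu_tweets, ili_cases)
```

A value close to 1, as in the study, indicates that weeks with more flu-related tweets were also weeks with more reported cases; it does not by itself show that one series predicts the other ahead of time.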

3 Results

Based on popularity, importance to the topic, and citations, we selected the following papers (see Table 1) for our review.

Table 1. Selection of papers based on field of research

Our classification reflects the diversity of research in the field of Twitter data extraction. The papers are classified along four dimensions: the year of publication, the journal in which they were published, the field of research, and relevance to the topic. After screening, 29 research papers from 22 journals met the criteria for inclusion. The classifications are presented below.

3.1 Year of Publication

It is evident from Fig. 1 that the amount of research in this field has increased rapidly over the past few years.

3.2 Journals

The journal “Computers in Human Behavior” published the largest number of articles related to our research field: 20.69% of the research papers analyzed for classification (6 of 29). See Fig. 2.

3.3 Research Area

Figures 3 and 4 show the papers available by keyword and the papers selected on the basis of relevance.

4 Conclusion

The aim of this systematic review was to study developments in the field of Twitter data analytics and to gain insight into their applications and methods. For better understanding, we described the research methods and the steps involved in the data science process. Based on these principles, the research papers were analyzed and classified into four categories according to the selected parameters.

The classification helps in understanding the basics of big data methods for collecting data, processing it, and finding insights in it. We also described ongoing research in Twitter data extraction that relies on big data collection, which helps in understanding current work in Twitter data analytics. This paper does not claim to be comprehensive, but we have tried to cover some of the major research under way in the field. Given the rapid advances in Twitter data analytics, we recommend revisiting these methods and periodically revising the review to include new developments.

References

1. Data analytics. https://www.techopedia.com/definition/26418/data-a. Accessed 2 Oct 2018
2. Twitter. https://twitter.com/. Accessed 2 Oct 2018
3. Sayce, D.: https://www.dsayce.com/social-media/tweets-day/. Accessed 2 Oct 2018
4. Real time data. https://www.techopedia.com/definition/31256/real-t. Accessed 2 Oct 2018
5. Sentiment analysis. https://www.lexalytics.com/technology/sentiment. Accessed 3 Oct 2018
6. Durahim, A.O., Co, M.: #iamhappybecause: gross national happiness through Twitter analysis and big data. Technol. Forecast. Soc. Change 99, 92–105 (2015)
7. O’Dea, B., Wan, S., Batterham, P.J., Calear, A.L., Paris, C., Christensen, H.: Detecting suicidality on Twitter. Internet Interv. 2(2), 183–188 (2015)
8. Yu, Y., Wang, X.: World Cup 2014 in the Twitter world: a big data analysis of sentiments in U.S. sports fans’ tweets. Comput. Human Behav. 48, 392–400 (2015)
9. Natural language processing. https://www.sas.com/en_us/insights/analytics/what-is-natural-language-processing-nlp.html
10. Keyword Search. http://www.columbia.edu/cu/lweb/help/clio/keyword.html
11. Crannell, W.C., Clark, E., Jones, C., James, T.A., Moore, J.: A pattern-matched Twitter analysis of US cancer-patient sentiments. J. Surg. Res. 206(2), 536–542 (2016)
12. Hedonometer. https://hedonometer.org/index.html
13. Bravo-Marquez, F., Frank, E., Pfahringer, B.: Building a Twitter opinion lexicon from automatically-annotated tweets. Knowl. Based Syst. 108, 65–78 (2016)
14. Corea, F.: Can Twitter proxy the investors’ sentiment? The case for the technology sector. Big Data Res. 4, 70–74 (2016)
15. Klout. https://klout.com/corp/score
16. Schumaker, R.P., Jarmoszko, A.T., Labedz Jr., C.S.: Predicting wins and spread in the Premier League using a sentiment analysis of Twitter. Decis. Support Syst. 88, 76–84 (2016)
17. Daniel, M., Neves, R.F., Horta, N.: Company event popularity for financial markets using Twitter and sentiment analysis. Expert Syst. Appl. 71, 111–124 (2016)
18. Pandey, A.C., Rajpoot, D.S., Saraswat, M.: Twitter sentiment analysis using hybrid cuckoo search method. Inf. Process. Manag. 53(4), 764–779 (2017)
19. Oliveira, N., Cortez, P., Areal, N.: The impact of microblogging data for stock market prediction: using Twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Syst. Appl. 73, 125–144 (2017)
20. Huang, Y., Guo, D., Kasakoff, A., Grieve, J.: Understanding U.S. regional linguistic variation with Twitter data. Comput. Environ. Urban Syst. 59, 244–255 (2016)
21. Sulis, E., Irazú, D., Farías, H., Rosso, P., Patti, V.: Figurative messages and affect in Twitter: differences between #irony, #sarcasm and #not. Knowl. Based Syst. 108, 132–143 (2016)
22. Oussalah, M., Escallier, B., Daher, D.: An automated system for grammatical analysis of Twitter messages. A learning task application. Knowl. Based Syst. 101, 31–47 (2015)
23. Lahuerta-Otero, E., Cordero-Gutiérrez, R.: Looking for the perfect tweet. The use of data mining techniques to find influencers on Twitter. Comput. Hum. Behav. 64, 575–583 (2016)
24. IBM. https://www.ibm.com/analytics/data-science/predictive-analytics/spss-statistical-software
25. Wilcoxon signed-rank test. https://statistics.laerd.com/spss-tutorials/wilcoxon-signed-rank-test-using-spss-statistics.php
26. Park, C.S., Kaye, B.K.: The tweet goes on: interconnection of Twitter opinion leadership, network size, and civic eng. Comput. Hum. Behav. 69, 174–180 (2017)
27. Carley, K.M., Malik, M., Landwehr, P.M., Pfeffer, J., Kowalchuck, M.: Crowd sourcing disaster management: the complex nature of Twitter usage in Padang Indonesia. Saf. Sci. 90, 48–61 (2016)
28. Acar, A., Muraki, Y.: Twitter for crisis communication: lessons learned from Japan’s tsunami disaster. Int. J. Web Based Communities 7(3), 392–402 (2011)
29. Gu, Y., Sean, Z., Chen, F.: From Twitter to detector: real-time traffic incident detection using social media data. Transp. Res. Part C 67, 321–342 (2016)
30. HERE. https://www.here.com/en
31. Laylavi, F., Rajabifard, A., Kalantari, M.: Event relatedness assessment of Twitter messages for emergency response. Inf. Process. Manag. 53(1), 266–280 (2015)
32. Lin, X., Lachlan, K.A., Spence, P.R.: Exploring extreme events on social media: a comparison of user reposting/retweeting behaviors on Twitter and Weibo. Comput. Human Behav. 65, 576–581 (2016)
33. Naive Bayesian. http://www.statsoft.com/Textbook/Naive-Bayes-Classifier
34. Support vector machine. http://scikit-learn.org/stable/modules/svm.html
35. Random Forest. http://www.stat.berkeley.edu/~breiman/RandomForest/cc_home.htm
36. Daniulaityte, R., et al.: ‘Time for dabs’: analyzing Twitter data on marijuana concentrates across the U.S. Drug Alcohol Depend. 155, 307–311 (2015)
37. Kayser, V., Bierwisch, A.: Using Twitter for foresight: an opportunity? Futures 84, 50–63 (2016)
38. Achrekar, H., Lazarus, R., Park, W.C.: Predicting Flu Trends using Twitter Data, pp. 702–707 (2011)

Author information

Authors and Affiliations

Department of Computer Science, East Carolina University, Greenville, NC, 27858, USA
Mudit Pradyumn & Nasseh Tabrizi
Health Services and Information Management, East Carolina University, Greenville, NC, 27858, USA
Akshat Kapoor

Corresponding author

Correspondence to Mudit Pradyumn.

Editor information

Editors and Affiliations

The University of Hong Kong, Hong Kong, Hong Kong
Francis Y. L. Chin
University of Macau, Macao, Macao
C. L. Philip Chen
The University of Texas at Dallas, Richardson, Texas, USA
Latifur Khan
Louisiana State University, Baton Rouge, USA
Kisung Lee
Kingdee International Software Group Company Limited, Shenzhen, China
Liang-Jie Zhang

About this paper

Cite this paper

Pradyumn, M., Kapoor, A., Tabrizi, N. (2018). Big Data Analytics on Twitter. In: Chin, F., Chen, C., Khan, L., Lee, K., Zhang, LJ. (eds) Big Data – BigData 2018. BIGDATA 2018. Lecture Notes in Computer Science(), vol 10968. Springer, Cham. https://doi.org/10.1007/978-3-319-94301-5_26

DOI: https://doi.org/10.1007/978-3-319-94301-5_26
Published: 21 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94300-8
Online ISBN: 978-3-319-94301-5
eBook Packages: Computer Science, Computer Science (R0)
