Forecasting with Twitter data

Published: 03 January 2014

Abstract

The dramatic rise in the use of social network platforms such as Facebook and Twitter has made vast and growing repositories of user-contributed data available. Extracting useful information from this data has become a major challenge in data mining and knowledge discovery. A recently popular approach is to build indicators of general public mood, often in the form of a time series, by means of sentiment analysis. Such indicators have been shown to correlate with a wide variety of phenomena.
In this article we follow this line of work and set out to assess, in a rigorous manner, whether a public sentiment indicator extracted from daily Twitter messages can indeed improve the forecasting of social, economic, or commercial indicators. To this end we have collected and processed a large number of Twitter posts from March 2011 to the present date for two very different domains: stock market and movie box office revenue. For each of these domains, we build and evaluate forecasting models for several target time series, both using and ignoring the Twitter-related data. If Twitter does help, then the predictions of models that use Twitter-related data should be better than those of models that do not. By systematically varying the models that we use and their parameters, together with other tuning factors such as lag or the way in which we build our Twitter sentiment index, we obtain a large dataset that allows us to test our hypothesis under different experimental conditions. Using a novel decision-tree-based technique that we call a summary tree, we are able to mine this large dataset and automatically obtain those configurations that lead to an improvement in the predictive power of our forecasting models. As a general result, we have seen that nonlinear models do take advantage of Twitter data when forecasting trends in volatility indices, while linear ones fail systematically when forecasting any kind of financial time series. In the case of predicting box office revenue trends, it is support vector machines that make the best use of Twitter data. In addition, we conduct statistical tests to determine the relation between our Twitter time series and the different target time series.
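
The with/without comparison described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the data is synthetic and deliberately constructed so that lagged sentiment carries signal, the model is a plain linear autoregression fitted by ordinary least squares, and every function name is hypothetical. It shows only the evaluation procedure (fit a baseline model and a sentiment-augmented model on the same training window, then compare out-of-sample error), not the article's findings.

```python
import random


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(rows, targets):
    """Least-squares weights via the normal equations (X'X) w = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def out_of_sample_mse(features, targets, n_train):
    """Fit on the first n_train rows, return mean squared error on the rest."""
    weights = fit_ols(features[:n_train], targets[:n_train])
    errors = [
        (sum(w * v for w, v in zip(weights, f)) - t) ** 2
        for f, t in zip(features[n_train:], targets[n_train:])
    ]
    return sum(errors) / len(errors)


def compare_with_and_without_sentiment(target, sentiment, lag=1, n_train=200):
    """Return (baseline MSE, augmented MSE) for one lag setting."""
    ys = target[lag:]
    steps = range(lag, len(target))
    baseline = [[1.0, target[t - lag]] for t in steps]
    augmented = [[1.0, target[t - lag], sentiment[t - lag]] for t in steps]
    return (
        out_of_sample_mse(baseline, ys, n_train),
        out_of_sample_mse(augmented, ys, n_train),
    )


def make_synthetic_series(n=300, seed=0):
    """Synthetic data (an assumption): the target depends on lagged sentiment."""
    rng = random.Random(seed)
    sentiment = [rng.uniform(-1, 1) for _ in range(n)]
    target = [0.0]
    for t in range(1, n):
        noise = rng.gauss(0, 0.1)
        target.append(0.5 * target[t - 1] + 0.8 * sentiment[t - 1] + noise)
    return target, sentiment


target, sentiment = make_synthetic_series()
mse_baseline, mse_augmented = compare_with_and_without_sentiment(target, sentiment)
print(f"baseline MSE:  {mse_baseline:.4f}")
print(f"augmented MSE: {mse_augmented:.4f}")
```

Repeating the call to `compare_with_and_without_sentiment` over several lags, model families, and sentiment-index constructions would yield the kind of table of experimental configurations that the abstract describes mining for improvements.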

Published In

ACM Transactions on Intelligent Systems and Technology, Volume 5, Issue 1
Special Section on Intelligent Mobile Knowledge Discovery and Management Systems and Special Issue on Social Web Mining
December 2013
520 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/2542182

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 January 2014
Accepted: 01 July 2012
Revised: 01 June 2012
Received: 01 February 2012
Published in TIST Volume 5, Issue 1

Author Tags

  1. Box office
  2. Twitter
  3. forecasting
  4. sentiment index
  5. stock market

Qualifiers

  • Research-article
  • Research
  • Refereed
