
Forests of Oblique Decision Stumps for Classifying Very Large Number of Tweets

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8860)

Abstract

Our investigation aims at constructing forests of oblique decision stumps to classify a very large number of Twitter messages (tweets). Twitter sentiment analysis is not a trivial task, because tweets are short and are generated at a very fast rate. Supervised learning algorithms can therefore be useful for automatically detecting positive or negative sentiment. The pre-processing step performs the cleaning tasks and represents tweets with the bag-of-words (BoW) model. We then propose forests of oblique decision stumps based on linear support vector machines (SVMs), which are suitable for classifying large amounts of high-dimensional data points. Experimental results on the twittersentiment.appspot.com corpora (1,600,000 tweets) show that our forests of oblique decision stumps are efficient compared with baseline algorithms.
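To make the described pipeline concrete, the following Python sketch (not the authors' code) approximates it with scikit-learn, assuming that library is available: tweets are turned into a bag-of-words representation, and a bagged ensemble of linear-SVM "stumps" is trained, each stump fitted on a bootstrap sample and a random subset of terms. Names such as build_stump_forest and train_tweets are hypothetical placeholders for illustration only.

# Hypothetical sketch of the abstract's pipeline: BoW features followed by a
# forest of oblique decision stumps, each stump realised here as a linear SVM
# trained on a bootstrap sample and a random subset of BoW terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.ensemble import BaggingClassifier
from sklearn.pipeline import make_pipeline

def build_stump_forest(n_stumps=200):
    """Bag-of-words + forest of linear-SVM stumps (majority vote at predict time)."""
    bow = CountVectorizer(lowercase=True, binary=True)   # BoW term-presence features
    forest = BaggingClassifier(
        LinearSVC(C=1.0),       # each "stump" is one oblique (linear-SVM) split
        n_estimators=n_stumps,  # number of stumps in the forest
        max_samples=0.6,        # bootstrap a fraction of the tweets per stump
        max_features=0.3,       # random subset of BoW terms per stump
        bootstrap=True,
        n_jobs=-1,
    )
    # LinearSVC has no predict_proba, so BaggingClassifier falls back to voting.
    return make_pipeline(bow, forest)

# Hypothetical usage: train_tweets / train_labels stand in for a cleaned
# training corpus such as the sentiment140 data mentioned in the abstract.
# model = build_stump_forest()
# model.fit(train_tweets, train_labels)
# predictions = model.predict(test_tweets)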




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Le, V.T., Tran-Nguyen, T.M., Pham, K.N., Do, N.T. (2014). Forests of Oblique Decision Stumps for Classifying Very Large Number of Tweets. In: Dang, T.K., Wagner, R., Neuhold, E., Takizawa, M., Küng, J., Thoai, N. (eds) Future Data and Security Engineering. FDSE 2014. Lecture Notes in Computer Science, vol 8860. Springer, Cham. https://doi.org/10.1007/978-3-319-12778-1_2


  • DOI: https://doi.org/10.1007/978-3-319-12778-1_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12777-4

  • Online ISBN: 978-3-319-12778-1
