Abstract
Our investigation aims to construct oblique decision stump forests for classifying a very large number of Twitter messages (tweets). Twitter sentiment analysis is not a trivial task because tweets are short and are generated at a very fast rate. Supervised learning algorithms can thus be useful for automatically detecting positive or negative sentiment. The pre-processing step cleans the tweets and represents them using the bag-of-words (BoW) model. We then propose oblique decision stump forests based on linear support vector machines (SVMs), which are suitable for classifying large amounts of high-dimensional data points. The experimental results on the twittersentiment.appspot.com corpus (1,600,000 tweets) show that our oblique decision stump forests are efficient compared with the baseline algorithms.
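The abstract does not spell out the training procedure, so the following Python code is only a minimal sketch of the general idea under stated assumptions: each stump is a single linear SVM (one oblique split) fitted with Pegasos-style sub-gradient steps on a bootstrap sample of BoW vectors, and the forest predicts by majority vote. The function names, the hyperparameters (`lam`, `epochs`, `n_stumps`), the cleaning rules, and the toy tweets are all hypothetical and are not taken from the paper.

```python
import random
import re
from collections import Counter


def bag_of_words(tweet):
    """Clean a tweet (lowercase, drop URLs and @mentions) and count its tokens."""
    cleaned = re.sub(r"https?://\S+|@\w+", " ", tweet.lower())
    return Counter(re.findall(r"[a-z']+", cleaned))


def score(w, x):
    """Dot product of a sparse weight vector and a sparse BoW vector."""
    return sum(w.get(tok, 0.0) * cnt for tok, cnt in x.items())


def train_stump(samples, rng, lam=0.01, epochs=20):
    """Fit one oblique stump: a linear SVM trained by Pegasos-style updates.

    samples is a list of (bow, label) pairs with label in {-1, +1}.
    lam and epochs are illustrative values, not tuned settings.
    """
    w, t = {}, 0
    for _ in range(epochs):
        order = list(range(len(samples)))
        rng.shuffle(order)
        for i in order:
            x, y = samples[i]
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            violated = y * score(w, x) < 1.0   # is the hinge loss active?
            for tok in w:
                w[tok] *= 1.0 - 1.0 / t        # regularization shrink, 1 - eta * lam
            if violated:
                for tok, cnt in x.items():
                    w[tok] = w.get(tok, 0.0) + eta * y * cnt
    return w


def train_forest(samples, n_stumps=15, seed=0):
    """Bagging: train each stump on its own bootstrap sample.

    Assumes samples contains both classes; one-class bootstraps are redrawn.
    """
    rng = random.Random(seed)
    forest = []
    while len(forest) < n_stumps:
        boot = [rng.choice(samples) for _ in samples]
        if len({y for _, y in boot}) < 2:
            continue  # redraw so that every stump sees both labels
        forest.append(train_stump(boot, rng))
    return forest


def predict(forest, tweet):
    """Majority vote over the stumps; returns +1 (positive) or -1 (negative)."""
    x = bag_of_words(tweet)
    votes = sum(1 if score(w, x) >= 0 else -1 for w in forest)
    return 1 if votes >= 0 else -1


# Toy training set (hypothetical tweets, for illustration only).
train = [(bag_of_words(text), label) for text, label in [
    ("love this phone", 1),
    ("love the sunny weather", 1),
    ("love my new bike", 1),
    ("awful traffic today", -1),
    ("awful service again", -1),
    ("awful headache tonight", -1),
]]
forest = train_forest(train)
print(predict(forest, "love sunny days"), predict(forest, "awful traffic jam"))
```

Because each stump is a single sparse weight vector, the stumps can be trained independently on their bootstrap samples. That independence is one plausible reason an ensemble of this kind would scale to a corpus of this size, although the sketch above makes no attempt to reproduce the paper's actual implementation or results.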
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Le, V.T., Tran-Nguyen, T.M., Pham, K.N., Do, N.T. (2014). Forests of Oblique Decision Stumps for Classifying Very Large Number of Tweets. In: Dang, T.K., Wagner, R., Neuhold, E., Takizawa, M., Küng, J., Thoai, N. (eds) Future Data and Security Engineering. FDSE 2014. Lecture Notes in Computer Science, vol 8860. Springer, Cham. https://doi.org/10.1007/978-3-319-12778-1_2
DOI: https://doi.org/10.1007/978-3-319-12778-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12777-4
Online ISBN: 978-3-319-12778-1
eBook Packages: Computer Science (R0)