Abstract
Twitter is a widely used online social networking site where users post short messages limited to 140 characters. The short length of these messages makes it challenging to classify them into categories. In this paper we propose a system that automatically classifies Twitter messages into a set of predefined categories. The system takes into account not only the tweet text, but also external features such as words from linked URLs, mentioned user profiles, and Wikipedia articles. The system is evaluated using various combinations of feature sets. According to our results, the highest accuracy, 90.8%, is achieved when the original tweet terms are combined with user profile terms and terms extracted from linked URLs. Including terms from Wikipedia pages, found specifically for each tweet, decreases accuracy on the original test set; however, it increases accuracy on a subset of the original test set containing only tweets without URLs.
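The feature-set combinations described above can be illustrated with a minimal sketch. It is not the system's actual implementation: the tokenizer, the source names (`url`, `profile`, `wikipedia`), and the functions `build_features` and `feature_set_combinations` are all hypothetical, and the external texts are assumed to have been fetched already. It only shows how tweet terms might be merged with terms from selected external sources into one bag of words, and how the candidate combinations could be enumerated for evaluation.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical labels for the external feature sources named in the abstract.
EXTERNAL_SOURCES = ("url", "profile", "wikipedia")


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumed, simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_features(tweet_text, external_texts, enabled_sources):
    """Return a bag-of-words Counter from the tweet plus the enabled external sources."""
    features = Counter(tokenize(tweet_text))
    for source in enabled_sources:
        # A source with no text for this tweet contributes nothing.
        features.update(tokenize(external_texts.get(source, "")))
    return features


def feature_set_combinations():
    """Enumerate every subset of external sources; tweet terms are always included."""
    combos = []
    for size in range(len(EXTERNAL_SOURCES) + 1):
        combos.extend(combinations(EXTERNAL_SOURCES, size))
    return combos


# Made-up example data, for illustration only.
example_external = {
    "url": "Election results announced tonight",
    "profile": "Political reporter covering elections",
}
features = build_features("Results are in!", example_external, ("url", "profile"))
```

Each combination returned by `feature_set_combinations()` could then be passed as `enabled_sources` to build the training and test vectors for one evaluation run.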
Notes
1. The NReadability library is used to take only the article text from a specific HTML file ignoring irrelevant text such as sidebar text containing other stories. https://github.com/marek-stoj/NReadability [Online; accessed 6-June-2015].
2. https://tweetinvi.codeplex.com [Online; accessed 6-June-2015].
3. The database contains every Wikipedia page title from the English Wikipedia, originally taken from the April 2015 Wikipedia dump. https://dumps.wikimedia.org/enwiki/20150304/ [Online; accessed 6-June-2015].
4. https://tweetinvi.codeplex.com [Online; accessed 6-June-2015].
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Theodotou, A., Stassopoulou, A. (2015). A System for Automatic Classification of Twitter Messages into Categories. In: Christiansen, H., Stojanovic, I., Papadopoulos, G. (eds) Modeling and Using Context. CONTEXT 2015. Lecture Notes in Computer Science, vol 9405. Springer, Cham. https://doi.org/10.1007/978-3-319-25591-0_44
DOI: https://doi.org/10.1007/978-3-319-25591-0_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25590-3
Online ISBN: 978-3-319-25591-0
eBook Packages: Computer Science (R0)