Abstract
In this chapter we review vector space models and propose a new one, based on the Jensen-Shannon divergence, for classifying ignored short messages on a social network service. We assume that ignored messages are published messages that received no interaction. Our goal is to classify messages about to be published as ignored so that they can be discarded from the set of messages available to a recommender system. To evaluate our model, we conduct experiments comparing different models on a Twitter dataset with more than 13,000 Twitter accounts. Results show that our best model obtained an average accuracy of 0.77, compared to 0.74 for a model from the literature. Similarly, it obtained an average precision of 0.74, compared to 0.58 for the second-best performing model.
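For readers unfamiliar with the Jensen-Shannon divergence named above, the following is a minimal sketch of how it can be computed between the term distributions of two short messages. It is not the model proposed in the chapter, since the abstract does not specify how the divergence enters the vector space model. The function names, the whitespace tokenization, and the two example messages are assumptions made purely for illustration.

```python
import math
from collections import Counter


def term_distribution(tokens, vocabulary):
    """Relative term frequencies of `tokens` over a fixed vocabulary (hypothetical helper)."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary)
    if total == 0:
        raise ValueError("message shares no terms with the vocabulary")
    return [counts[t] / total for t in vocabulary]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence of p and q to their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so the logarithms below are always defined.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Illustrative messages (assumed, not taken from the chapter's dataset).
msg_a = "new phone release today".split()
msg_b = "phone release delayed again".split()
vocabulary = sorted(set(msg_a) | set(msg_b))

p = term_distribution(msg_a, vocabulary)
q = term_distribution(msg_b, vocabulary)
print(js_divergence(p, q))  # a value between 0 (identical) and 1 (disjoint)
```

With base-2 logarithms the divergence is symmetric and bounded between 0 and 1, and it stays finite even when the two messages share no terms. That is convenient for very short texts, where term overlap is often sparse.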
Notes
- 1.
- 2.
These numbers are based on discussions in blog posts such as http://thenextweb.com/twitter/2012/01/07/interesting-fact-most-tweets-posted-are-approximately-30-characters-long/ and http://www.ayman-naaman.net/2010/04/21/how-many-characters-do-you-tweet/, which do not provide an average. In our own dataset, presented in Sect. 4.1, the average number of characters in a tweet is 84.
- 3.
Note that we do not normalize our tf-idf model by message length, since all messages tend to have similar lengths [18].
- 4.
References
Bell, R., Volinsky, C., Koren, Y.: Matrix factorization techniques for recommender systems. IEEE Comput. 42(8), 30–37 (2009)
Chen, K., Chen, T., Zheng, G., Jin, O., Yao, E., Yu, Y.: Collaborative personalized tweet recommendation. In: Proceedings of the 35th international ACM SIGIR conference on Research and Development in Information Retrieval, pp. 661–670. SIGIR ’12, ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348372
Chen, M., Jin, X., Shen, D.: Short text classification improved by learning multi-granularity topics. In: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, IJCAI’11, vol. 3, pp. 1776–1781. AAAI Press (2011). http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-298
Combarro, E., Montanes, E., Diaz, I., Ranilla, J., Mones, R.: Introducing a family of linear measures for feature selection in text categorization. IEEE Trans. Knowl. Data Eng. 17(9), 1223–1232 (2005)
Dagan, I., Lee, L., Pereira, F.C.N.: Similarity-based models of word cooccurrence probabilities. Mach. Learn. 34(1–3), 43–69 (1999). doi:10.1023/A:1007537716579
Díaz, I., Ranilla, J., Montañes, E., Fernández, J., Combarro, E.: Improving performance of text categorization by combining filtering and support vector machines. J. Am. Soc. Inf. Sci. Technol. 55(7), 579–592 (2004)
Halawi, G., Dror, G., Gabrilovich, E., Koren, Y.: Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’12, pp. 1406–1414. ACM, New York (2012). http://doi.acm.org.zorac.aub.aau.dk/10.1145/2339530.2339751
Kwak, H., Lee, C., Park, H., Moon, S.: What is twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, pp. 591–600. ACM, Raleigh (2010). http://portal.acm.org/citation.cfm?id=1772690.1772751
Lage, R., Durao, F., Dolog, P.: Towards effective group recommendations for microblogging users. In: Proceedings of the 27th Annual ACM Symposium on Applied Computing, SAC ’12, pp. 923–928. ACM, New York (2012). http://doi.acm.org/10.1145/2245276.2245456
Lan, M., Tan, C.L., Low, H.B., Sung, S.Y.: A comprehensive comparative study on term weighting schemes for text categorization with support vector machines. In: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, WWW ’05, pp. 1032–1033. ACM, New York (2005). http://doi.acm.org.zorac.aub.aau.dk/10.1145/1062745.1062854
Lewis, D.D.: Naive (Bayes) at forty: the independence assumption in information retrieval. In: Nédellec, C., Rouveirol, C. (eds.) ECML 1998. LNCS, vol. 1398, pp. 4–15. Springer, Heidelberg (1998)
Lin, J., Mishne, G.: A study of “Churn” in tweets and real-time search queries. In: 6th International AAAI Conference on Weblogs and Social Media, May 2012. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4599
Mladenic, D., Grobelnik, M.: Feature selection for unbalanced class distribution and naive Bayes. In: Machine Learning-International Workshop then Conference, pp. 258–267. Morgan Kaufmann Publishers, INC. (1999)
Petrovic, S., Osborne, M., Lavrenko, V.: RT to win! predicting message propagation in twitter. In: 5th International AAAI Conference on Weblogs and Social Media, May 2011. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM11/paper/view/2754
Robertson, S.E., Walker, S., Beaulieu, M., Willett, P.: Okapi at TREC-7: automatic ad hoc, filtering, VLC and interactive track. In: TREC, pp. 199–210 (1998)
Sahami, M., Heilman, T.D.: A web-based kernel function for measuring the similarity of short text snippets. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 377–386. ACM, New York (2006). http://doi.acm.org/10.1145/1135777.1135834
Sun, A.: Short text classification using very few words. In: Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’12, pp. 1145–1146. ACM, New York (2012). http://doi.acm.org/10.1145/2348283.2348511
Turney, P.D., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010). http://arxiv.org/abs/1003.1141. arXiv:1003.1141
Yang, Y., Pedersen, J.: A comparative study on feature selection in text categorization. In: Machine Learning-International Workshop then Conference, pp. 412–420. Morgan Kaufmann Publishers, INC. (1997)
Yih, W.T., Goodman, J., Carvalho, V.R.: Finding advertising keywords on web pages. In: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, pp. 213–222. ACM, New York (2006). http://doi.acm.org/10.1145/1135777.1135813
Yih, W.T., Meek, C.: Improving similarity measures for short segments of text. In: Proceedings of the 22nd National Conference on Artificial Intelligence AAAI’07, vol. 2, pp. 1489–1494. AAAI Press (2007). http://dl.acm.org.zorac.aub.aau.dk/citation.cfm?id=1619797.1619884
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lage, R., Dolog, P., Leginus, M. (2014). Vector Space Models for the Classification of Short Messages on Social Network Services. In: Krempels, KH., Stocker, A. (eds) Web Information Systems and Technologies. WEBIST 2013. Lecture Notes in Business Information Processing, vol 189. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-44300-2_13
DOI: https://doi.org/10.1007/978-3-662-44300-2_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-44299-9
Online ISBN: 978-3-662-44300-2
eBook Packages: Computer Science, Computer Science (R0)