RedTweet: recommendation engine for reddit

Journal of Intelligent Information Systems

Abstract

Twitter and Reddit are two of the most popular social media sites in use today. In this paper, we study the use of machine learning and WordNet-based classifiers to generate an interest profile from a user’s tweets, and we use this profile to recommend loosely related Reddit threads that the user is most likely to be interested in. We introduce a genre classification algorithm that uses a similarity measure derived from the WordNet lexical database for English to assign genre labels to the nouns in tweets. The proposed algorithm generates a user’s interest profile from their tweets based on a reference taxonomy of genres derived from the genre-tagged Brown Corpus, augmented with a technology genre. The top K genres of a user’s interest profile can then be used to recommend subreddit articles in those genres. We ran experiments on real-life test cases collected from Twitter to compare the genre classification performance of the WordNet classifier with that of machine learning classifiers such as SVM, Random Forests, and an ensemble of Bayesian classifiers. Empirically, the two approaches give similar results when a sufficient number of tweets is available. These results suggest that both machine learning algorithms and the WordNet ontology are viable tools for developing a recommendation engine based on genre classification. One advantage of the WordNet approach is its simplicity: no training is required. However, the WordNet classifier tends to have poor precision for users with very few tweets.
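The pipeline described in the abstract (label each noun with a genre, aggregate the labels into an interest profile, take the top K genres) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The genre list, the `TOY_SIMILARITY` table, the `threshold` parameter, and all function names are assumptions made for illustration; in particular, the toy table stands in for a WordNet-derived similarity measure, which the abstract does not specify in detail.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical genre labels. The paper's taxonomy is derived from the
# Brown Corpus plus a technology genre; this short list is an assumption.
GENRES = ["news", "religion", "hobbies", "technology"]

# Toy stand-in for a WordNet-based similarity between a noun and a genre.
# The scores are invented for illustration only.
TOY_SIMILARITY = {
    ("laptop", "technology"): 0.9,
    ("laptop", "hobbies"): 0.3,
    ("software", "technology"): 0.85,
    ("election", "news"): 0.8,
    ("guitar", "hobbies"): 0.7,
}


def toy_similarity(noun: str, genre: str) -> float:
    """Return the toy similarity score, or 0.0 for unknown pairs."""
    return TOY_SIMILARITY.get((noun, genre), 0.0)


def label_genre(
    noun: str,
    genres: List[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> Optional[str]:
    """Pick the most similar genre, or None if no genre is similar enough."""
    best = max(genres, key=lambda g: similarity(noun, g))
    return best if similarity(noun, best) >= threshold else None


def interest_profile(
    nouns: Iterable[str],
    genres: List[str],
    similarity: Callable[[str, str], float],
) -> Dict[str, float]:
    """Return the fraction of labeled nouns that fall into each genre."""
    counts = Counter()
    for noun in nouns:
        genre = label_genre(noun, genres, similarity)
        if genre is not None:
            counts[genre] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def top_k_genres(profile: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-weighted genres, breaking ties alphabetically."""
    ranked = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))
    return [genre for genre, _ in ranked[:k]]


# Nouns that might be extracted from one user's tweets (made-up data).
nouns = ["laptop", "software", "election", "guitar", "cloud"]
profile = interest_profile(nouns, GENRES, toy_similarity)
print(top_k_genres(profile, 2))
```

Passing the similarity function as a parameter keeps the sketch independent of any particular lexical resource, so a WordNet-backed measure could be substituted for `toy_similarity` without changing the rest of the code.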

Notes

  1. https://twitter.com

  2. http://www.reddit.com/

  3. The numbers were extracted from the raw data downloaded from Reddit with the help of Dr. Arvind Srinivasan of ZL Technologies in San Jose, CA.

  4. http://www.scikit-learn.org

Author information

Corresponding author

Correspondence to Chien-Chung Chan.

About this article

Cite this article

Nguyen, H., Richards, R., Chan, CC. et al. RedTweet: recommendation engine for reddit. J Intell Inf Syst 47, 247–265 (2016). https://doi.org/10.1007/s10844-016-0410-y

  • DOI: https://doi.org/10.1007/s10844-016-0410-y
