Short-text feature construction and selection in social media data: a survey

Artificial Intelligence Review

Abstract

Social networking sites such as Facebook and Twitter attract millions of users, who every day post an enormous amount of content in the form of tweets, comments and posts. Because social network texts are usually short, learning tasks have to deal with a very high-dimensional and sparse feature space in which most features have low frequencies. As a result, extracting useful knowledge from such noisy data is challenging, which makes large-scale short-text learning in social environments one of the most relevant problems in machine learning and data mining. Feature selection is one of the best-known and most commonly used techniques for reducing the impact of a high-dimensional feature space in text learning. A wide variety of feature selection techniques applied to traditional long texts and document collections can be found in the literature. However, short texts from the social Web pose new challenges to this well-studied problem: their brevity offers limited context from which to extract statistical evidence about relations between words (e.g. correlation), and instances usually arrive in continuous streams (e.g. the Twitter timeline), so the number of features and instances is unknown, among other problems. This paper surveys feature selection techniques for dealing with short texts in both offline and online settings. It then discusses open issues and research opportunities for performing online feature selection over social media data.
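To make the sparsity problem and the role of feature selection concrete, the following is a minimal sketch rather than a method taken from the survey. The toy corpus, its labels, and the helper names (`tokenize`, `chi_square`) are hypothetical. The chi-square score is only one commonly used filter criterion, chosen here for illustration. The sketch builds a binary bag-of-words representation of a few tweet-like posts, measures how sparse it is, and ranks terms by their association with the class label.

```python
# Hypothetical toy corpus of short, tweet-like posts with binary labels
# (1 = positive, 0 = negative); invented purely for illustration.
posts = [
    ("love this phone", 1),
    ("great battery love it", 1),
    ("great screen", 1),
    ("hate this phone", 0),
    ("terrible battery", 0),
    ("hate the screen terrible", 0),
]


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, very naive tokenizer)."""
    return text.lower().split()


# Binary bag-of-words: each post becomes the set of terms it contains.
docs = [set(tokenize(text)) for text, _ in posts]
labels = [label for _, label in posts]
vocab = sorted(set().union(*docs))

# Sparsity: fraction of zero cells in the (posts x vocabulary) matrix.
nonzero_cells = sum(len(doc) for doc in docs)
sparsity = 1 - nonzero_cells / (len(docs) * len(vocab))


def chi_square(term):
    """Chi-square statistic of the 2x2 table (term present/absent x class)."""
    a = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 1)
    b = sum(1 for doc, y in zip(docs, labels) if term in doc and y == 0)
    c = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == 1)
    d = sum(1 for doc, y in zip(docs, labels) if term not in doc and y == 0)
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denominator if denominator else 0.0


# Filter-style feature selection: keep the k highest-scoring terms.
scores = {term: chi_square(term) for term in vocab}
top_k = sorted(vocab, key=lambda term: (-scores[term], term))[:4]

print(f"vocabulary size: {len(vocab)}, sparsity: {sparsity:.2f}")
print("selected terms:", top_k)
```

Even with six posts the matrix is mostly zeros, and terms that appear equally often in both classes receive a score of zero. A batch computation like this assumes the whole corpus and vocabulary are known in advance, which is exactly what the streaming setting described in the abstract does not allow.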

Author information

Corresponding author

Correspondence to Antonela Tommasel.

About this article

Cite this article

Tommasel, A., Godoy, D. Short-text feature construction and selection in social media data: a survey. Artif Intell Rev 49, 301–338 (2018). https://doi.org/10.1007/s10462-016-9528-0

  • DOI: https://doi.org/10.1007/s10462-016-9528-0
