Abstract
Social networking sites such as Facebook or Twitter attract millions of users, who every day post an enormous amount of content in the form of tweets, comments and posts. Since social network texts are usually short, learning tasks have to deal with a very high-dimensional and sparse feature space, in which most features have low frequencies. As a result, extracting useful knowledge from such noisy data is challenging, which makes large-scale short-text learning in social environments one of the most relevant problems in machine learning and data mining. Feature selection is one of the best-known and most commonly used techniques for reducing the impact of the high-dimensional feature space in text learning. A wide variety of feature selection techniques applied to traditional long texts and document collections can be found in the literature. However, short texts coming from the social Web pose new challenges to this well-studied problem: their shortness offers only a limited context from which to extract statistical evidence about relations between words (e.g. correlation), and instances usually arrive in continuous streams (e.g. the Twitter timeline), so that the number of features and instances is unknown, among other problems. This paper surveys feature selection techniques for dealing with short texts in both offline and online settings. Then, open issues and research opportunities for performing online feature selection over social media data are discussed.
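To make the sparsity problem and the role of feature selection concrete, the following is a minimal sketch under stated assumptions, not a method taken from the survey. The toy posts, the whitespace tokenizer, and the helper names `density` and `chi_square` are all hypothetical, and the chi-square statistic is used only as one familiar example of a filter-style scoring metric. The sketch measures how few cells of a document-term matrix are non-zero for short texts, then ranks terms by their association with one class.

```python
from collections import Counter

# Toy labelled short texts, invented for illustration only.
posts = [
    ("great match tonight", "sports"),
    ("what a goal", "sports"),
    ("team wins the match", "sports"),
    ("new phone released today", "tech"),
    ("phone battery is great", "tech"),
    ("the update broke my phone", "tech"),
]


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; real pipelines do more)."""
    return text.lower().split()


def density(docs):
    """Fraction of non-zero cells in the binary document-term matrix."""
    vocab = {tok for doc in docs for tok in tokenize(doc)}
    nonzero = sum(len(set(tokenize(doc))) for doc in docs)
    return nonzero / (len(docs) * len(vocab))


def chi_square(labelled_docs, target):
    """Chi-square score of every term with respect to the class `target`."""
    n = len(labelled_docs)
    n_pos = sum(1 for _, label in labelled_docs if label == target)
    df = Counter()      # number of documents containing the term
    df_pos = Counter()  # number of `target` documents containing the term
    for text, label in labelled_docs:
        for tok in set(tokenize(text)):
            df[tok] += 1
            if label == target:
                df_pos[tok] += 1
    scores = {}
    for tok in df:
        a = df_pos[tok]    # term present, class is target
        b = df[tok] - a    # term present, other class
        c = n_pos - a      # term absent, class is target
        d = n - n_pos - b  # term absent, other class
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[tok] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


if __name__ == "__main__":
    texts = [text for text, _ in posts]
    print(f"matrix density: {density(texts):.3f}")
    scores = chi_square(posts, "tech")
    for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:3]:
        print(term, round(score, 2))
```

Even in this tiny collection most terms occur in a single post, so most scores rest on very little statistical evidence. Note also that the sketch is an offline computation that needs the full vocabulary and all instances up front, which is exactly what a streaming setting does not provide.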
Cite this article
Tommasel, A., Godoy, D. Short-text feature construction and selection in social media data: a survey. Artif Intell Rev 49, 301–338 (2018). https://doi.org/10.1007/s10462-016-9528-0