ABSTRACT
The prevalent use of social media produces mountains of unlabeled, high-dimensional data. Feature selection has been shown to be effective in dealing with high-dimensional data for efficient data mining. Feature selection for unlabeled data, however, remains a challenging task because there is no label information by which feature relevance can be assessed. The unique characteristics of social media data further complicate the already challenging problem of unsupervised feature selection: for example, part of social media data is linked, which invalidates the independent and identically distributed (i.i.d.) assumption and brings new challenges to traditional unsupervised feature selection algorithms. In this paper, we study the differences between social media data and traditional attribute-value data, investigate whether the relations revealed in linked data can be used to help select relevant features, and propose a novel unsupervised feature selection framework, LUFS, for linked social media data. We perform experiments with real-world social media datasets to evaluate the effectiveness of the proposed framework and probe the workings of its key components.
Index Terms
- Unsupervised feature selection for linked social media data