Abstract
In the era of big data, the dimensionality of data is increasing dramatically in many domains. To cope with high dimensionality, online feature selection has become critical in big data mining. Recently, online selection of dynamic features has received much attention. When features arrive sequentially over time, feature selection must be performed online as each feature arrives. Moreover, when features have a group structure, it is necessary to handle features that arrive in groups. To address these challenges, several state-of-the-art methods for online feature selection have been proposed. In this paper, we first give a brief review of traditional feature selection approaches. We then discuss in detail the specific problems of online feature selection with feature streams. We present a comprehensive review of existing online feature selection methods and compare them with one another. Finally, we discuss several open issues in online feature selection.
Acknowledgements
This work was supported in part by the National Key Research and Development Program of China (2016YFB1000901), the Program for Changjiang Scholars and Innovative Research Team in University (PCSIRT) of the Ministry of Education, China (IRT13059), the National Basic Research Program (973 Program) of China (2013CB329604), the Specialized Research Fund for the Doctoral Program of Higher Education (20130111110011), and the National Natural Science Foundation of China (Grant Nos. 61273292, 61229301, 61503112, 61673152).
Author information
Xuegang Hu received the BS degree from the Department of Mathematics at Shandong University, China and the MS and PhD degrees from Hefei University of Technology (HFUT), China. He is a professor in the School of Computer Science and Information Engineering, HFUT, and the director-general of the Computer Association of Higher Education in Anhui Province. His research interests include data mining and knowledge engineering.
Peng Zhou is currently working toward the PhD degree at Hefei University of Technology, China. His research interests are in data mining and knowledge engineering.
Peipei Li is currently a lecturer at Hefei University of Technology (HFUT), China. She received her BS, MS and PhD degrees from HFUT in 2005, 2008 and 2013 respectively. She was a research fellow at Singapore Management University, Singapore from 2008 to 2009. She was a student intern at Microsoft Research Asia between August 2011 and December 2012. Her research interests are in data mining and knowledge engineering.
Jing Wang received the BE, ME and PhD degrees from the School of Computer Science and Information Engineering, Hefei University of Technology, China in 2009, 2011 and 2015 respectively. She is a visiting research student in the Learning and Vision Research Group of National University of Singapore, Singapore. Her research interests include data mining, computer vision, and machine learning.
Xindong Wu is currently the director of the School of Computing and Informatics and a professor at the University of Louisiana at Lafayette, USA. From 2001 to 2015, he was a professor of Computer Science at the University of Vermont, USA. He is a fellow of the IEEE and the AAAS. He holds a PhD in artificial intelligence from the University of Edinburgh, Britain. He is the founder and current Steering Committee Chair of the IEEE International Conference on Data Mining and the founder and current Editor-in-Chief of Knowledge and Information Systems. He was the Editor-in-Chief of the IEEE Transactions on Knowledge and Data Engineering from 2005 to 2008. His research interests include data mining and big data analytics.
Cite this article
Hu, X., Zhou, P., Li, P. et al. A survey on online feature selection with streaming features. Front. Comput. Sci. 12, 479–493 (2018). https://doi.org/10.1007/s11704-016-5489-3
DOI: https://doi.org/10.1007/s11704-016-5489-3