Abstract
Set covering (SC) is particularly well suited to feature selection in machine learning because it selects support features for the data under analysis based on both the individual and the collective roles of the candidate features. However, SC-based feature selection requires complete pairwise comparison of the members of the different classes in a dataset, which makes the otherwise attractive SC principle impractical for selecting support features from large datasets.
Introducing the notion of implicit SC-based feature selection, this paper presents a feature selection procedure that is equivalent to the standard SC-based feature selection procedure in supervised learning but whose memory requirement is multiple orders of magnitude smaller. With experiments on six large machine learning datasets, we demonstrate the usefulness of the proposed implicit SC-based feature selection scheme in large-scale supervised data analysis.
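The contrast between explicit and implicit SC-based selection can be illustrated with a minimal sketch. It is not the procedure of the paper, which is not reproduced here. It assumes binary (0/1) features, two classes, and a plain greedy covering heuristic, and the function names `explicit_greedy` and `implicit_greedy` are hypothetical. The explicit version stores one row per positive-negative pair; the implicit version keeps only the observations, grouped by their values on the features chosen so far, and counts the pairs each feature would separate from per-group tallies.

```python
from itertools import product


def explicit_greedy(pos, neg):
    """Greedy set covering on the explicitly stored pairwise matrix."""
    n_feat = len(pos[0])
    # One row per (positive, negative) pair; rows[i][j] is True when
    # feature j takes different values on the two observations.
    rows = [[p[j] != q[j] for j in range(n_feat)] for p, q in product(pos, neg)]
    uncovered = set(range(len(rows)))
    selected = []
    while uncovered:
        gains = [sum(rows[i][j] for i in uncovered) for j in range(n_feat)]
        best = max(range(n_feat), key=gains.__getitem__)
        if gains[best] == 0:
            raise ValueError("a positive and a negative observation are identical")
        selected.append(best)
        uncovered = {i for i in uncovered if not rows[i][best]}
    return selected


def implicit_greedy(pos, neg):
    """Same greedy choices, computed without storing any pairwise rows."""
    n_feat = len(pos[0])
    # Observations in one group agree on every selected feature, so the
    # uncovered pairs are exactly the cross-class pairs inside a group.
    groups = [(list(pos), list(neg))]
    selected = []
    while any(P and N for P, N in groups):
        gains = []
        for j in range(n_feat):
            gain = 0
            for P, N in groups:
                p1 = sum(p[j] for p in P)  # positives with feature j = 1
                n1 = sum(q[j] for q in N)  # negatives with feature j = 1
                # Pairs in this group that feature j would separate.
                gain += p1 * (len(N) - n1) + (len(P) - p1) * n1
            gains.append(gain)
        best = max(range(n_feat), key=gains.__getitem__)
        if gains[best] == 0:
            raise ValueError("a positive and a negative observation are identical")
        selected.append(best)
        # Split every group by the value of the newly selected feature.
        groups = [
            ([p for p in P if p[best] == v], [q for q in N if q[best] == v])
            for P, N in groups
            for v in (0, 1)
        ]
    return selected
```

Under these assumptions both functions compute the same gains at every step and therefore select the same features, but the explicit version stores a matrix whose row count is the product of the two class sizes, while the implicit version stores only the observations themselves.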
This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2005-003-D00445).
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Ryoo, H.S., Jang, IY. (2007). A Heuristic Method for Selecting Support Features from Large Datasets. In: Kao, MY., Li, XY. (eds) Algorithmic Aspects in Information and Management. AAIM 2007. Lecture Notes in Computer Science, vol 4508. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-72870-2_39
DOI: https://doi.org/10.1007/978-3-540-72870-2_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-72868-9
Online ISBN: 978-3-540-72870-2
eBook Packages: Computer Science (R0)