Authors:
Kilho Shin (1); Chris Liu (2); Katsuyuki Maeda (1) and Hiroaki Ohshima (3)
Affiliations:
(1) Computer Centre, Gakushuin University, Mejiro, Tokyo, Japan
(2) Deloitte Tohmatsu Cyber LLC., Marunouchi, Tokyo, Japan
(3) Graduate School of Information Science, University of Hyogo, Kobe, Hyogo, Japan
Keyword(s):
Feature Selection, Categorical Data.
Abstract:
In feature selection, we grapple with two primary challenges: devising effective evaluative indices for selected feature subsets and crafting scalable algorithms rooted in these indices. Our study addresses both. Beyond assessing the size and class relevance of selected features, we introduce a groundbreaking index, nuisance. It captures class-uncorrelated information, which can muddy subsequent processes. Our experiments confirm that a harmonious balance between class relevance and nuisance augments classification accuracy. To this end, we present the Balance-Optimized Relevance and Nuisance Feature Selection (BornFS) algorithm. It not only exhibits scalability to handle large datasets but also outperforms traditional methods by achieving better balance among the introduced indices. Notably, when applied to a dataset of 800,000 Windows executables, using LCC as a preprocessing filter, BornFS slashes the feature count from 10 million to under 200, maintaining a high accuracy in malware detection. Our findings shine a light on feature selection’s complexities and pave the way forward.
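The abstract names the indices but does not define them. As a rough illustration only, the sketch below assumes one common information-theoretic reading for categorical data: class relevance as the mutual information I(S;C) between a feature subset S and the class C, and nuisance as the leftover conditional entropy H(S|C) = H(S) - I(S;C). Both definitions, and the helper names `entropy` and `relevance_and_nuisance`, are assumptions made for this example. It is not the BornFS algorithm.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def relevance_and_nuisance(rows, labels, subset):
    """Return (I(S;C), H(S|C)) for the feature columns listed in `subset`.

    ASSUMPTION: relevance = mutual information, nuisance = conditional
    entropy. The abstract does not state the paper's exact definitions.
    """
    # Project every row onto the chosen feature columns.
    projected = [tuple(row[i] for i in subset) for row in rows]
    h_s = entropy(projected)
    h_c = entropy(labels)
    h_sc = entropy(list(zip(projected, labels)))
    relevance = h_s + h_c - h_sc   # I(S;C)
    nuisance = h_s - relevance     # H(S|C): information in S unrelated to C
    return relevance, nuisance


# Toy data: feature 0 determines the class, feature 1 is independent of it.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]

for subset in ([0], [1], [0, 1]):
    rel, nui = relevance_and_nuisance(rows, labels, subset)
    print(subset, round(rel, 3), round(nui, 3))
```

Under these assumed definitions, adding the class-independent feature 1 to the subset leaves relevance unchanged and raises nuisance, which is the kind of imbalance the abstract says can hurt later classification.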