Surface-level malware analysis has significant advantages over deep static analysis and dynamic analysis: it avoids the difficult and time-consuming task of reverse engineering obfuscated code, and it eliminates the risk of infection because the malware is never executed. Moreover, recent studies have demonstrated that surface-level features provide sufficient information to distinguish malware from benign software: machine learning classifiers trained on surface-level features achieve high classification accuracy. However, an inherent challenge remains: effective surface-level datasets often contain an enormous number of features. A notable example is the Ember dataset, which originally contained more than ten million features before they were aggregated into 2,381 features using the feature hashing trick. Malware analysis aims to determine which features contribute to the distinction between malware and benign software, and how they do so; such a vast number of features hinders manual investigation based on domain knowledge. Feature selection, which has been extensively studied in machine learning research, may provide a solution, but it must balance scalability against selection accuracy (the relevance of the selected features to the labels). Recently, the authors of this paper proposed a new feature selection algorithm, BornFS, which significantly improves this trade-off. BornFS selects only 155 features from the more than ten million features in the Ember dataset, with a loss of mutual information smaller than 5%. This paper proposes a method for surface-level malware analysis that leverages scalable and accurate feature selection, and demonstrates its efficacy through experiments.
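To make the feature hashing trick mentioned above concrete, the following is a minimal, generic sketch in Python. It is not the Ember extraction code: the function name `hash_features`, the bucket count, and the example import strings are all hypothetical and chosen only for illustration. The sketch shows how an unbounded set of string features collapses into a fixed-length vector, and why the original features become hard to inspect individually once they share buckets.

```python
import hashlib


def hash_features(tokens, n_buckets=8):
    """Map an arbitrary collection of string features to a fixed-length vector.

    Hypothetical helper for illustration only; not taken from the Ember code.
    """
    vec = [0.0] * n_buckets
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        # The first 8 bytes of the digest choose the bucket.
        index = int.from_bytes(digest[:8], "little") % n_buckets
        # One further bit chooses a sign, so colliding features tend to
        # cancel rather than accumulate.
        sign = 1.0 if digest[8] & 1 == 0 else -1.0
        vec[index] += sign
    return vec


# Made-up surface-level features (imported functions of an executable).
imports = [
    "kernel32.dll:CreateFileA",
    "kernel32.dll:WriteFile",
    "ws2_32.dll:connect",
]
vector = hash_features(imports, n_buckets=8)
```

The output length depends only on `n_buckets`, not on how many distinct features exist, which is what makes the aggregation scalable. The cost is interpretability: several original features can land in the same bucket, which is one reason selecting from the original features is attractive for manual analysis.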
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.