The increasing complexity of malware attacks calls for more sophisticated analytical and defensive techniques. This study investigates the use of reinforcement learning (RL) to model adversarial malware behavior by modifying features of portable executable (PE) files to explore evasion strategies. The goal is not to create malicious software but to strengthen defenses by identifying weaknesses in existing detection systems. To this end, we developed a controlled research environment in which RL agents learn to transform malicious PE features into adversarial representations that evade detection. The environment comprises defined malware states, a Random Forest classifier that determines whether the adversarial features are classified as benign, and a reward mechanism that rewards successful evasion and penalizes actions that preserve malicious characteristics. We implemented and evaluated four RL agents: a Deep Q-Network (DQN), a Soft Actor-Critic (SAC), and two Generative Adversarial Imitation Learning (GAIL) agents trained on expert trajectories from DQN and SAC, respectively. Agent performance was assessed using a reward-based metric. The GAIL agent trained on SAC expert data achieved the highest average reward and the lowest variance across 100 evaluation episodes. This research contributes to a deeper understanding of adversarial attack strategies and offers insights that can inform the development of more robust malware detection and cybersecurity defense systems.
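
The reward-based comparison described above (average reward and variance over 100 evaluation episodes) can be sketched as follows. This is a minimal illustration, not the study's evaluation code: `evaluate` and `run_episode` are hypothetical names, the stand-in episode function returns placeholder random values rather than real agent rewards, and population variance is an assumption, since the abstract does not say which variance estimator was used.

```python
import random
import statistics
from typing import Callable, Tuple


def evaluate(run_episode: Callable[[], float], n_episodes: int = 100) -> Tuple[float, float]:
    """Run n_episodes evaluation episodes and summarise their total rewards.

    run_episode is a hypothetical callable that plays one full episode with
    a trained agent and returns that episode's total reward.
    Returns (mean reward, population variance of the reward).
    """
    returns = [run_episode() for _ in range(n_episodes)]
    # Assumption: population variance; the source does not name the estimator.
    return statistics.mean(returns), statistics.pvariance(returns)


if __name__ == "__main__":
    # Placeholder episode function: random numbers, NOT results from the study.
    rng = random.Random(0)
    mean_reward, reward_variance = evaluate(lambda: rng.gauss(0.0, 1.0))
    print(f"mean reward: {mean_reward:.3f}, variance: {reward_variance:.3f}")
```

Under this metric, an agent is preferred when its mean reward is higher and its variance lower, which is the sense in which the abstract ranks the GAIL agent trained on SAC expert data first.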