Engineers have long searched for ways to bridge the gaps between software and hardware, drawing on a variety of techniques, methods, and algorithms. A well-known example of this phenomenon is the introduction of Deep Learning Super Sampling\footnote{(DLSS -- \href{https://www.nvidia.com/en-us/geforce/news/dlss4-multi-frame-generation-ai-innovations/}{NVIDIA Article})} as a method of offloading part of the rendering burden onto Artificial Intelligence (AI) models. Similar tendencies have emerged in compiler development, with the primary goal of alleviating classical compiler optimization challenges by improving loop unrolling and instruction scheduling, thereby reducing latency and countering the upward trend in code size. The goal of this paper is to analyse the feasibility of contemporary autotuning methods~\cite{ashouri2018survey,lucke2024mlir} as part of the compilation process for CPU applications, focusing mostly on tile size optimization. This study highlights the potential performance improvements, assessed in cache misses per program: our experiments show that optimizing tile sizes for cache efficiency alone can reduce atom-level cache misses by up to $\approx 98.1\%$ in single-threaded scenarios. With the growing complexity of modern software and the slowdown in hardware advancements, traditional heuristics and compiler optimizations are steadily becoming insufficient; these findings suggest that compiler autotuning could become the next important step in compiler optimization.
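For context, the transformation whose parameter is being tuned is classic loop tiling (blocking). The sketch below is a minimal, illustrative C example, not code from our experiments; the tile size \texttt{T} is a hypothetical placeholder that an autotuner would select per target to minimize cache misses.

\begin{verbatim}
#include <stddef.h>

/* Illustrative loop tiling (blocking) for matrix multiplication.
 * T is a hypothetical placeholder tile size; in an autotuning setup
 * it would be chosen per target to minimize cache misses. */
#define T 64

void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += T)
        for (size_t kk = 0; kk < n; kk += T)
            for (size_t jj = 0; jj < n; jj += T)
                /* Work on one T x T block at a time so the data
                 * touched by the inner loops stays cache-resident. */
                for (size_t i = ii; i < ii + T && i < n; i++)
                    for (size_t k = kk; k < kk + T && k < n; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < jj + T && j < n; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
\end{verbatim}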