| The computational cost of Large Language Model (LLM) inference is driven largely by response length. A substantial portion of this cost arises from unnecessary autoregressive decoding, where models generate verbose explanations in response to simple prompts even when short, structured answers are sufficient. We introduce a question-type-aware inference framework that routes prompts to produce compact, token-optimised responses. Our approach employs a lightweight BERT classifier to assign incoming queries to predefined question types (e.g., yes/no/maybe, multiple-choice, or descriptive). Each query is then routed to a specialised prompt template designed for concise response generation. We evaluate the framework on factual tasks (yes/no/maybe and multiple-choice), where answers can be verified rigorously for accuracy. The methodology is validated on three medical benchmarks using four LLMs with 8 to 70 billion parameters. Results demonstrate a substantial decrease in both average token counts and floating-point operations (FLOPs), with peak savings of 99.4\%, while maintaining baseline accuracy. This approach offers a scalable way to lower the operational costs of deploying LLMs in specialised, knowledge-intensive domains. |
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.
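
The routing step described in the abstract (classify the query, then select a type-specific prompt template) might be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the template wording, the label names, and the functions `classify_stub` and `route` are all hypothetical, and the keyword heuristic only stands in for the BERT classifier so that the example runs without a trained model.

```python
from typing import Callable, Dict

# Hypothetical prompt templates keyed by question type.
# The wording is illustrative only; it is not taken from the paper.
TEMPLATES: Dict[str, str] = {
    "yes_no_maybe": (
        "Answer with exactly one word: yes, no, or maybe.\n"
        "Question: {query}\nAnswer:"
    ),
    "multiple_choice": (
        "Answer with only the letter of the correct option.\n"
        "Question: {query}\nAnswer:"
    ),
    "descriptive": "Answer concisely.\nQuestion: {query}\nAnswer:",
}


def classify_stub(query: str) -> str:
    """Stand-in for the BERT classifier (assumption: a crude keyword heuristic)."""
    text = query.strip().lower()
    # Treat queries that list lettered options as multiple-choice.
    if "a)" in text and "b)" in text:
        return "multiple_choice"
    # Treat queries that open with an auxiliary verb as yes/no/maybe.
    first_word = text.split(" ", 1)[0]
    if first_word in {"is", "are", "does", "do", "can", "should", "will"}:
        return "yes_no_maybe"
    return "descriptive"


def route(query: str, classify: Callable[[str], str] = classify_stub) -> str:
    """Build the prompt for a query from its predicted question type."""
    question_type = classify(query)
    # Fall back to the descriptive template for unrecognised labels.
    template = TEMPLATES.get(question_type, TEMPLATES["descriptive"])
    return template.format(query=query)


if __name__ == "__main__":
    print(route("Does aspirin reduce fever?"))
```

In a real deployment, `classify` would wrap a fine-tuned classifier, and the string returned by `route` would be passed to the LLM. The token savings reported in the abstract would then come from the constrained answer format that each template requests.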