Human-AI collaboration has increasingly impacted domains such as education, medicine, creative arts, and complex problem-solving, where differences in how humans and models interact greatly influence outcomes. Effective collaboration hinges on the capability of Large Language Models (LLMs) to accurately interpret subtle contextual hints provided by humans, underscoring the need to understand how contextual information influences LLMs’ responses. To address this, a structured benchmark is introduced for evaluating the effect of contextual hints on LLM performance, together with sample data and code to enable reproducible experimentation. Experimental findings support the effectiveness of contextual hints, demonstrating that they significantly shape the complexity and structural variation of model-generated outputs. Specifically, prompts enhanced by hints yield more detailed, paraphrased responses that diverge lexically and structurally from concise ground-truth answers. However, traditional similarity metrics often underestimate the value of these contextually enriched responses due to their lexical diversity, indicating a discrepancy between metric-based evaluations and human perceptions of quality. These insights highlight the importance of adopting more context-sensitive evaluation methods to better capture the quality and semantic richness of collaborative human-AI outputs.
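To make the metric discrepancy concrete, the sketch below (not the benchmark's released code) contrasts a lexical overlap score with an embedding-based similarity score on a paraphrased, hint-enriched response. The specific metric choices, example sentences, and embedding model name (unigram F1, all-MiniLM-L6-v2 via sentence-transformers) are illustrative assumptions, not the paper's actual evaluation setup.

```python
# Minimal sketch: a paraphrased, hint-enriched response scores low on lexical
# overlap with a concise ground-truth answer, while an embedding-based metric
# typically rates it much higher. Metric and model choices are assumptions.
from collections import Counter
from sentence_transformers import SentenceTransformer, util

def unigram_f1(prediction: str, reference: str) -> float:
    """Token-level F1 over unigrams, in the style of extractive QA scoring."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

ground_truth = "Paris is the capital of France."
hinted_response = (
    "Given that you are planning a trip, note that France's seat of "
    "government and largest city is Paris, located on the Seine."
)

# Lexical view: low overlap, because the response is paraphrased and expanded.
print(f"unigram F1: {unigram_f1(hinted_response, ground_truth):.2f}")

# Semantic view: cosine similarity between sentence embeddings usually stays
# high despite the rewording, better matching human judgments of quality.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode([ground_truth, hinted_response], convert_to_tensor=True)
print(f"embedding cosine similarity: {util.cos_sim(emb[0], emb[1]).item():.2f}")
```

Under these assumptions, the gap between the two scores mirrors the reported discrepancy: lexical metrics penalize hint-driven elaboration, whereas semantic similarity preserves most of the credit.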