With their strong human-level capabilities and rapid advancements in quality and affordability, closed Large Language Models (LLMs) are increasingly being integrated into real-world solutions. However, hallucinations in LLM-generated responses contribute to misinformation, deception, and mistrust, ultimately compromising user safety and the reliability of these solutions, even when external knowledge is incorporated through Retrieval-Augmented Generation (RAG). The limited effectiveness and poor generalization of current LLM-based hallucination detection methods further exacerbate this issue. Focusing on the multilingual customer service domain, we explore LLM-based automatic hallucination detection methods, in which an LLM acts as a judge, for closed LLMs, and assess their effectiveness in practical Question Answering RAG applications. We conduct a systematic evaluation of multiple hallucination detection methods in a controlled setting, leveraging our manually labelled real-world dataset. Ultimately, we find that while existing detection methods perform well on well-structured public datasets, they face significant difficulties when applied to complex real-world scenarios like ours, owing to operational constraints and model limitations.
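To make the LLM-as-a-judge setup concrete, the following is a minimal illustrative sketch of how a closed LLM could be prompted to flag hallucinations in a RAG answer. It is not the paper's actual pipeline: the choice of the OpenAI Python client, the model name, the prompt wording, and the binary verdict format are all assumptions made purely for illustration.

```python
# Minimal sketch of LLM-as-a-judge hallucination detection for RAG answers.
# Assumptions (not from the paper): the OpenAI client as the closed-LLM
# interface, the model name, and the judge prompt are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are a strict judge. Given the retrieved context, a question, and an "
    "answer, reply with exactly one word: SUPPORTED if every claim in the "
    "answer is grounded in the context, otherwise HALLUCINATED."
)


def judge_answer(context: str, question: str, answer: str,
                 model: str = "gpt-4o-mini") -> bool:
    """Return True if the judge model flags the answer as hallucinated."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": (
                    f"Context:\n{context}\n\n"
                    f"Question: {question}\n\n"
                    f"Answer: {answer}"
                ),
            },
        ],
    )
    verdict = response.choices[0].message.content.strip().upper()
    return verdict.startswith("HALLUCINATED")
```

In practice, such a judge would be run over each (context, question, answer) triple from the labelled dataset and its verdicts compared against the human annotations; the specific evaluation protocol here is a sketch, not the paper's reported methodology.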