21st AIAI 2025, 26 - 29 June 2025, Limassol, Cyprus

Enhancing Answer Reliability Through Inter-Model Consensus of Large Language Models

Alireza Amiri-Margavi, Iman Jebellat, Ehsan Jebellat, Seyed Pouyan Mousavi Davoudi

Abstract:

  We propose a collaborative framework in which multiple large language models---including GPT-4-0125-preview, Meta-LLaMA-3-70B-Instruct, Claude-3-Opus, and Gemini-1.5-Flash---generate and answer complex, PhD-level statistical questions when definitive ground truth is unavailable. Our study examines how inter-model consensus both improves response reliability and identifies the quality of the generated questions. Employing chi-square tests and confidence interval analysis, we quantify consensus rates to assess both response precision and question quality. Key results indicate that Claude and GPT-4 produce well-structured, less ambiguous questions with higher inter-rater agreement, as shown by narrower confidence intervals and greater alignment with the question-generating models. In contrast, Gemini and LLaMA exhibit greater variability and lower reliability in question formulation. These findings demonstrate that collaborative interactions among large language models enhance response reliability and provide valuable insights for optimizing AI-driven collaborative reasoning systems.
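The consensus analysis described above can be illustrated with a minimal sketch. The agreement counts below are purely hypothetical placeholders (the paper's actual data is not reproduced here); the sketch shows a chi-square test of independence across models' agreement rates and Wilson score confidence intervals for each model's consensus proportion, one plausible reading of the abstract's "chi-square tests and confidence interval analysis."

```python
import numpy as np
from scipy.stats import chi2_contingency, norm

# Hypothetical (agreements, disagreements) per answering model out of
# 100 questions -- illustrative numbers only, NOT the paper's results.
counts = {
    "GPT-4":  (78, 22),
    "Claude": (81, 19),
    "Gemini": (62, 38),
    "LLaMA":  (65, 35),
}

# Chi-square test of independence: do agreement rates differ across models?
table = np.array(list(counts.values()))
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, dof={dof}, p={p:.4f}")

def wilson_ci(successes, n, alpha=0.05):
    """Wilson score interval for a binomial proportion."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Narrower intervals correspond to more consistent inter-model agreement.
for model, (agree, disagree) in counts.items():
    lo, hi = wilson_ci(agree, agree + disagree)
    rate = agree / (agree + disagree)
    print(f"{model}: rate={rate:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

The Wilson interval is used here instead of the normal approximation because it behaves better for proportions near 0 or 1; whether the authors used this exact interval is an assumption.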

*** Title, author list, and abstract as submitted during camera-ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.