21st AIAI 2025, 26-29 June 2025, Limassol, Cyprus

Toward Safer and Trustworthy Chatbots in E-Commerce: An LLM-as-a-Judge Approach for Ethical Evaluation

Janusko Tamás, Bochnia Ricardo, Hempel Gunnar, Krauß Anna-Magdalena, Richter Daniel, Harnisch Moritz, Tomschke Steffen, Anke Jürgen, Thiele Maik

Abstract:

The integration of chatbots and generative AI in e-commerce enhances engagement and efficiency but introduces ethical risks such as bias, toxicity, and inconsistency. This paper presents an LLM-as-a-Judge framework that evaluates chatbot responses across five key dimensions, derived from interviews with stakeholders of one of the world's largest online retailers. Our analysis of models such as GPT-4o and Prometheus shows that while larger models offer more reliable assessments, optimized prompting makes smaller, more cost-effective alternatives viable. Strong correlations among the metrics suggest that the evaluation can be streamlined, and log-probability-based scoring improves robustness. These findings provide a foundation for deploying ethical AI in e-commerce while balancing accuracy, efficiency, and trustworthiness.
