The integration of chatbots and generative AI in e-commerce enhances engagement and efficiency but introduces ethical risks such as bias, toxicity, and inconsistencies. This paper presents an LLM-as-a-Judge framework to evaluate chatbot responses across five key dimensions, derived from interviews with stakeholders of one of the world's largest online retailers. Our analysis of models like GPT-4o and Prometheus shows that while larger models offer more reliable assessments, optimized prompting enables cost-effective alternatives. Strong metric correlations suggest a streamlined evaluation approach, and log-probability-based scoring improves robustness. These findings provide a foundation for deploying ethical AI in e-commerce while balancing accuracy, efficiency, and trustworthiness.