22nd AIAI 2026, 16 - 19 July 2026, Chania, Crete, Greece

VoiceSense: A Multimodal AI Framework for Emotion-Aware Speech Rehabilitation

Anantapur Rohan, Addanki Sripriya, Etagi Rishi, Singh Ishmeet, Agarwal Pooja, Reddy Niveditha

Abstract:

  Speech and communication disorders pose significant clinical challenges, especially in linguistically diverse and low-resource settings. Prior work is limited by weak multimodality, non-adaptive therapy, and inadequate handling of dysarthric and low-resource multilingual speech. This paper presents VoiceSense, a multimodal speech therapy system that combines artificial intelligence with reinforcement learning (RL) to adapt therapy, generate emotion-aware feedback, and run an automatic speech recognition (ASR) loop fine-tuned for impaired speech. The system uses a continuous feedback loop to capture real-time acoustic and visual cues and generate personalised speech therapy prompts for individuals with speech impairments. By incorporating multiple modalities, the system makes therapy interactions emotion-aware and more comprehensive. VoiceSense is fine-tuned and evaluated on a dysarthric speech corpus, where it significantly improves ASR reliability, with confidence scores distributed across the 94% to 100% range, and the RL agent's policy converges stably within 600 simulation episodes.
