20th AIAI 2024, 27–30 June 2024, Corfu, Greece

Lip Recognition Based on Bi-GRU with Multi-Head Self-Attention

Ran Ni, Haiyang Jiang, Lu Zhou, Yuanyao Lu

Abstract:

  Lip recognition is currently a significant research direction in the field of video understanding in computer vision. It aims to recognize spoken content from the dynamic visual changes of a speaker's lips. With the development of deep learning and improved computing performance, lip recognition techniques can now effectively separate the background from the target foreground across different scenes, and model architectures and technical pipelines have become more unified. However, despite the success of lip recognition on English datasets, the semantic specificity of Chinese words and the scarcity of open-source Chinese lip recognition datasets have generally kept recognition accuracy low. To address these challenges, this paper introduces a novel model that synergizes two-dimensional and three-dimensional networks. Our approach leverages an improved Convolutional 3D (C3D) network to extract spatiotemporal features effectively. Unlike traditional 2D networks, which lack temporal dynamics, and conventional 3D networks, which are prone to overfitting because of their depth, our enhanced C3D network provides a robust foundation for feature extraction. To capture temporal features more efficiently, we feed its outputs into a Bi-directional Gated Recurrent Unit (Bi-GRU) combined with a Multi-Head Self-Attention mechanism. This fusion allows for better extraction of semantic and syntactic features, overcoming the limitations of standard RNNs in handling long sequences and the limited parallelizability of GRUs caused by their sequential dependence. The effectiveness of the proposed model was validated through experiments on our self-compiled Chinese dataset, where we compared it with mainstream networks such as ResNet-18 and ResNet-34, and with the original C3D model without the multi-head self-attention mechanism. Analysis of the loss and accuracy curves shows that our model achieves significant improvements, effectively addressing the challenges of Chinese lip recognition and setting a new performance benchmark in this field.
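As a concrete illustration of the pipeline the abstract describes, the following is a minimal PyTorch sketch: a C3D-style 3D-convolutional front end for spatiotemporal features, a Bi-GRU over the resulting frame sequence, and multi-head self-attention on the GRU outputs. All layer widths, kernel sizes, the number of classes, and the class name LipReadingNet are illustrative assumptions made for this sketch, not the authors' exact configuration.

# Minimal sketch of a C3D -> Bi-GRU -> Multi-Head Self-Attention pipeline.
# Sizes and names are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class LipReadingNet(nn.Module):
    def __init__(self, num_classes: int = 100, hidden: int = 256, heads: int = 8):
        super().__init__()
        # C3D-style block: 3D convolutions pool over space but keep the
        # temporal dimension, so the GRU still sees one vector per frame.
        self.c3d = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # collapse H and W, keep T
        )
        self.bigru = nn.GRU(128, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels=3, frames, height, width)
        feats = self.c3d(x)                     # (B, 128, T, 1, 1)
        feats = feats.squeeze(-1).squeeze(-1)   # (B, 128, T)
        feats = feats.transpose(1, 2)           # (B, T, 128)
        seq, _ = self.bigru(feats)              # (B, T, 2*hidden)
        attended, _ = self.attn(seq, seq, seq)  # self-attention over time
        return self.fc(attended.mean(dim=1))    # pool over time, classify

# Example: a batch of two 16-frame, 64x64 RGB lip clips.
model = LipReadingNet(num_classes=50)
logits = model(torch.randn(2, 3, 16, 64, 64))   # -> shape (2, 50)

Note the division of labor: the 3D convolutions capture short-range spatiotemporal patterns, the Bi-GRU models longer-range temporal context in both directions, and the self-attention layer lets every frame attend to every other frame, which is the abstract's stated remedy for standard RNNs' difficulty with long sequences.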
