Multimodal Emotion Recognition: A Tri-modal Approach Using Speech, Text, and Visual Cues for Enhanced Interaction Analysis
Abstract
In an age dominated by the rapid development of human-computer interaction, understanding user emotions has become a critical building block for creating engaging and responsive systems. This paper presents a tri-modal system for real-time emotion recognition that fuses textual, visual, and audio information. Our method employs strong deep learning models for each of the three modalities: DistilBERT for sentiment analysis of text (fine-tuned on the SST-2 dataset), a Vision Transformer (ViT, vit-base-patch16-224-in21k) for facial emotion detection, and a task-specific Convolutional Neural Network trained on the RAVDESS dataset for emotion recognition from speech. Each modality is processed independently, and the outputs are subsequently fused into a global emotion score, enabling fine-grained behavioral analysis in real time. Tri-modal fusion not only improves accuracy but also provides robustness across varied scenarios, addressing the problems caused by incomplete or ambiguous information in any single modality. We show that this unified framework substantially outperforms unimodal systems at capturing emotional context, paving the way for more intelligent and emotionally responsive applications in mental health monitoring, customer support, and human-robot interaction.
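The abstract describes combining independently computed per-modality outputs into a global emotion score. The sketch below illustrates one plausible weighted late-fusion step under stated assumptions: the shared emotion label set, the fusion weights, and the placeholder per-modality probabilities are all hypothetical and stand in for the outputs of the DistilBERT, ViT, and CNN models; this is not the authors' actual fusion rule.

```python
# Minimal late-fusion sketch. Assumptions: the emotion label set, the fusion
# weights, and the example per-modality probabilities are hypothetical
# placeholders for the outputs of the text, vision, and audio models.
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad"]        # assumed shared label set
WEIGHTS = {"text": 0.3, "vision": 0.4, "audio": 0.3}   # hypothetical modality weights


def fuse(scores: dict[str, dict[str, float]]) -> dict[str, float]:
    """Weighted late fusion of per-modality emotion probabilities.

    Modalities missing from `scores` (e.g. no face detected in the frame)
    are skipped and the remaining weights are renormalized, which is one
    way to stay robust to incomplete inputs.
    """
    fused = np.zeros(len(EMOTIONS))
    total_w = 0.0
    for modality, probs in scores.items():
        w = WEIGHTS.get(modality, 0.0)
        fused += w * np.array([probs.get(e, 0.0) for e in EMOTIONS])
        total_w += w
    fused /= max(total_w, 1e-8)      # renormalize over the modalities present
    fused /= max(fused.sum(), 1e-8)  # ensure a valid probability distribution
    return dict(zip(EMOTIONS, fused))


if __name__ == "__main__":
    # Placeholder outputs standing in for the three per-modality models.
    example = {
        "text":   {"happy": 0.7, "neutral": 0.2, "sad": 0.1},
        "vision": {"happy": 0.5, "neutral": 0.4, "angry": 0.1},
        "audio":  {"happy": 0.6, "neutral": 0.3, "sad": 0.1},
    }
    fused = fuse(example)
    print(max(fused, key=fused.get), fused)
```

In this sketch the global emotion score is simply the weight-averaged probability distribution; the same structure accommodates other fusion rules (e.g. learned weights or confidence-based gating) without changing the per-modality models.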