Dynamic Gated Fusion with Cross-Modal Attention for Multimodal Tourism Sentiment Analysis
Abstract
To address the limitations of unimodal sentiment analysis in Heilongjiang tourism, this paper proposes a dynamic gated multimodal fusion model that integrates textual and visual features through a cross-modal attention mechanism, improving both the accuracy and interpretability of sentiment analysis. Building on previous unimodal studies (BiLSTM with FastText embeddings for text, ResNet50 for images), the model introduces a gating mechanism that dynamically adjusts the contribution of each modality, and a Transformer-based attention layer captures inter-modal dependencies. Experiments on a Heilongjiang tourism dataset (6,580 reviews and 5,976 images) show that the proposed model achieves 98.2% accuracy, a 1.2% improvement over the text-only unimodal baseline. Visualization of the gate values shows that the mechanism assigns greater weight to visual features in extreme sentiment cases (e.g., strongly negative reviews), with visual weights reaching up to 0.72. This study offers a transparent and interpretable framework for multimodal sentiment analysis in tourism.
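The abstract does not give the fusion equations, but the described design (pretrained unimodal encoders, a Transformer-style cross-modal attention layer, and a learned gate weighting the two modalities) can be sketched as below. This is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the feature dimensions, the three-class sentiment output, and the `GatedCrossModalFusion` module name are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    """Fuses one text vector and one image vector per sample:
    cross-modal attention first, then a learned scalar gate that
    weights the text branch against the attended visual branch."""

    def __init__(self, text_dim=256, image_dim=2048, hidden_dim=256, num_heads=4):
        super().__init__()
        # Project both modalities into a shared hidden space
        # (dimensions are assumptions; e.g. 2048 = ResNet50 pooled output).
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # Transformer-style cross-modal attention: text queries attend
        # over image features to capture inter-modal dependencies.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Gate maps the concatenated features to a scalar in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )
        # Assumed three sentiment classes: negative / neutral / positive.
        self.classifier = nn.Linear(hidden_dim, 3)

    def forward(self, text_feat, image_feat):
        # text_feat: (batch, text_dim), e.g. a BiLSTM state over FastText embeddings.
        # image_feat: (batch, image_dim), e.g. ResNet50 global-pooled features.
        t = self.text_proj(text_feat).unsqueeze(1)    # (batch, 1, hidden)
        v = self.image_proj(image_feat).unsqueeze(1)  # (batch, 1, hidden)
        attended_v, _ = self.cross_attn(query=t, key=v, value=v)
        t, attended_v = t.squeeze(1), attended_v.squeeze(1)
        # g near 1 favors text; g near 0 favors the attended visual features.
        g = self.gate(torch.cat([t, attended_v], dim=-1))
        fused = g * t + (1 - g) * attended_v
        return self.classifier(fused), g  # expose the gate for interpretability


# Toy usage with random tensors standing in for encoder outputs.
model = GatedCrossModalFusion()
logits, gate = model(torch.randn(8, 256), torch.randn(8, 2048))
print(logits.shape, gate.squeeze(-1))
```

Returning the gate value alongside the logits is what makes such a fusion inspectable: per-sample modality weights, like the visual weight of up to 0.72 the paper reports for extreme negative reviews, can be read off directly rather than inferred post hoc.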