HyViSE (Hybrid ViT-SE): Approach for Crowd Anomaly Detection and Emotion-Behavior Classification


Jignesh Vaniya, Safvan Vahora, Uttam Chauhan, Sudhir Vegad

Abstract

The ability to gauge emotional states from motion data is a prominent research topic in affective computing and crowd behavior analysis. This paper describes a hybrid model, HyViSE (Hybrid ViT-SE), that improves feature extraction and attention mechanisms. The proposed model combines convolutional neural networks (CNNs) for local feature representation, Vision Transformers (ViTs) for capturing long-range dependencies, and Squeeze-and-Excitation (SE) blocks for adaptive recalibration of feature maps. The fusion of CNNs and ViTs lets the proposed model benefit from the strengths of both architectures, making the system more powerful and better able to generalize: CNNs extract fine local details from the image with high accuracy, while ViTs provide a broader understanding by considering the holistic context of the entire image. The SE blocks dynamically recalibrate the importance of the feature maps so that the most salient features are weighted over the others, further enhancing the model's performance. Applied to the Motion Emotion Dataset (MED), the proposed model achieves an accuracy of 99.21% on the emotion dataset, which contains 7 classes, and 97.08% on the behavior dataset, which contains 6 classes. The models are validated using confusion matrices.
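To make the described CNN + ViT + SE arrangement concrete, the sketch below shows one plausible way to wire such a hybrid classifier in PyTorch. It is a minimal illustration under stated assumptions, not the paper's exact HyViSE implementation: the ResNet-18 backbone, embedding size, Transformer depth, and token pooling are all illustrative choices, and the class names (SEBlock, HyViSESketch) are hypothetical.

```python
# Illustrative sketch only: the abstract does not fix layer choices, so the
# ResNet-18 backbone, dimensions, and pooling below are assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise recalibration of feature maps."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global spatial average
        self.fc = nn.Sequential(                      # excitation: learn channel weights
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                  # reweight channels by importance


class HyViSESketch(nn.Module):
    """CNN backbone -> SE recalibration -> Transformer encoder over spatial tokens."""

    def __init__(self, num_classes: int = 7, embed_dim: int = 512, depth: int = 4):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # local features, 512 channels
        self.se = SEBlock(512)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=8, batch_first=True
        )
        self.vit = nn.TransformerEncoder(encoder_layer, num_layers=depth)  # long-range dependencies
        self.cls_head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.se(self.cnn(x))                  # (B, 512, H', W')
        tokens = feats.flatten(2).transpose(1, 2)     # (B, H'*W', 512) spatial tokens
        tokens = self.vit(tokens)
        return self.cls_head(tokens.mean(dim=1))      # mean-pool tokens, then classify


if __name__ == "__main__":
    model = HyViSESketch(num_classes=7)               # 7 emotion classes per the abstract
    logits = model(torch.randn(2, 3, 224, 224))
    print(logits.shape)                                # torch.Size([2, 7])
```

For the 6-class behavior task described in the abstract, the same sketch would simply be instantiated with num_classes=6.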
