Self-Supervised Learning for Action Recognition: Trends, Models, and Applications


Mouwiya S. A. Al-Qaisieh, Mas Rina Mustaffa

Abstract

Recent advances in self-supervised learning (SSL) have reshaped the landscape of human action recognition by reducing the dependence on large-scale annotated datasets. This survey provides a comprehensive overview of state-of-the-art SSL techniques for understanding human actions in video. We categorize methods into three primary paradigms: contrastive learning, masked video modeling, and multimodal or sensor-based approaches. Within each category, we discuss key innovations, including motion-guided contrastive sampling, transformer-based masked autoencoders, and cross-modal alignment strategies that leverage audio, skeleton, or wearable-sensor signals. Models such as VideoMAE, ST-MAE, XDC, and Actionlet-Contrastive represent significant milestones in capturing both spatial and temporal cues without supervision. Beyond model design, we identify major challenges facing current SSL systems, including generalization across domains, modeling of long-horizon activities, and real-time deployment constraints, and we highlight underexplored areas such as explainability and unified evaluation protocols. To guide future work, we present a structured taxonomy, a comparative table of representative models, and a discussion of promising research directions, including multimodal fusion, modality-agnostic learning, and hardware-aware training. This survey aims to equip researchers with a clear understanding of the evolving trends, persistent gaps, and opportunities ahead in self-supervised action recognition.
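
To make the contrastive paradigm named above concrete, the minimal sketch below implements an InfoNCE-style objective over two augmented views of the same batch of video clips, the core loss underlying most contrastive SSL methods the survey covers. The function name, tensor shapes, and temperature value are illustrative assumptions, not taken from any specific model mentioned in the abstract.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.07):
        # z1, z2: (batch, dim) embeddings of two augmented views of the
        # same batch of clips; matching rows are positive pairs and all
        # other rows act as in-batch negatives. (Illustrative sketch.)
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        logits = z1 @ z2.t() / temperature   # (batch, batch) cosine similarities
        targets = torch.arange(z1.size(0))   # positives lie on the diagonal
        return F.cross_entropy(logits, targets)

    # Toy usage: random tensors stand in for features from a video encoder.
    z_a, z_b = torch.randn(8, 128), torch.randn(8, 128)
    loss = info_nce(z_a, z_b)

Minimizing this loss pulls the two views of each clip together in embedding space while pushing different clips apart; motion-guided sampling variants discussed in the survey change how the positive views are chosen rather than the loss itself.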
