Source-Free Video Unsupervised Domain Adaptation (SFVUDA) is a challenging video action recognition problem in which source domain data are inaccessible during adaptation. To address it, we propose the Temporal Attention-based Vision Transformer for SFVUDA (TAViT-SFVUDA), which combines temporal consistency with confidence-aware learning. Our approach learns domain-invariant representations by enforcing both local temporal consistency within the clips of a single video and global temporal consistency between individual clips and the video as a whole. By prioritizing high-confidence local features, the model suppresses noise in the target domain while remaining aligned with the source data distribution. The design of TAViT-SFVUDA rests on three key components: (1) domain-invariant representation learning, which ensures reliable feature extraction; (2) temporal feature alignment, which captures temporal dynamics across clips; and (3) pseudo-label generation with confidence filtering, which produces high-quality labels to guide adaptation. The model performs self-supervised temporal feature extraction and domain alignment without requiring source domain data or target labels. Extensive experiments on benchmark datasets including ARID, Sports1M, HMDB51, and UCF101 show that TAViT-SFVUDA outperforms state-of-the-art Video Unsupervised Domain Adaptation (VUDA) and SFVUDA methods. Our method thus provides a strong foundation for domain adaptation in video action recognition and highlights its potential for practical scenarios where source data availability is limited.
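To make the two central ideas of the abstract concrete, the following is a minimal PyTorch sketch (not the authors' released code) of (a) a local/global temporal consistency objective between clip-level and video-level features and (b) confidence-filtered pseudo-label generation. The tensor names, the mean-pooled video feature, and the 0.9 confidence threshold are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn.functional as F

def temporal_consistency_loss(clip_feats: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
    # clip_feats: (B, C, D) features of C clips per video; video_feat: (B, D) whole-video feature.
    clip_feats = F.normalize(clip_feats, dim=-1)
    video_feat = F.normalize(video_feat, dim=-1)
    # Local consistency: neighbouring clips of the same video should agree.
    local = (1 - (clip_feats[:, :-1] * clip_feats[:, 1:]).sum(-1)).mean()
    # Global consistency: each clip should agree with the video-level representation.
    glob = (1 - (clip_feats * video_feat.unsqueeze(1)).sum(-1)).mean()
    return local + glob

def confident_pseudo_labels(logits: torch.Tensor, threshold: float = 0.9):
    # Keep only target predictions whose softmax confidence exceeds the threshold.
    probs = logits.softmax(dim=-1)
    conf, labels = probs.max(dim=-1)
    mask = conf > threshold
    return labels, mask

if __name__ == "__main__":
    # Toy usage with random tensors standing in for ViT clip/video features.
    B, C, D, K = 4, 8, 768, 12                # batch, clips per video, feature dim, classes
    clip_feats = torch.randn(B, C, D)
    video_feat = clip_feats.mean(dim=1)       # simple video-level aggregate for this sketch
    logits = torch.randn(B, K)
    loss = temporal_consistency_loss(clip_feats, video_feat)
    labels, mask = confident_pseudo_labels(logits)
    print(loss.item(), labels[mask])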