ViViT (Video Vision Transformer): A Revolutionary Model for Video Processing
The world of video understanding is witnessing a significant shift, with Transformer-based models taking center stage. These models, such as ViViT and hybrid architectures like Vamba, are revolutionizing the way we analyze and interpret videos.
ViViT, in particular, has been making waves, often outperforming or matching the best prior approaches on datasets like Kinetics-400/600, Something-Something V2, Epic Kitchens, and Moments in Time. However, its core shortcoming lies in self-attention's quadratic complexity in sequence length, leading to significant computational and memory demands.
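To make that quadratic cost concrete, here is a small back-of-the-envelope Python sketch. The frame size, patch size, and frame counts below are illustrative assumptions, not fixed ViViT settings.

```python
# Illustrative estimate of how joint spatiotemporal attention cost grows
# with video length. All numbers are assumptions chosen for the example.

def attention_cost(num_frames, frame_size=224, patch_size=16):
    patches_per_frame = (frame_size // patch_size) ** 2  # e.g. 14 * 14 = 196
    tokens = num_frames * patches_per_frame              # tokens grow linearly with frames
    pairwise = tokens ** 2                               # self-attention compares every token pair
    return tokens, pairwise

for frames in (8, 16, 32):
    tokens, pairwise = attention_cost(frames)
    print(f"{frames:>2} frames -> {tokens:>5} tokens -> {pairwise:,} attention pairs")

# Doubling the frame count doubles the token count but quadruples the number
# of attention pairs, which is the quadratic blow-up discussed above.
```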
To address this issue, research has focused on sparse/dynamic attention and other efficiency-oriented Transformer variants. One such solution in ViViT is the Factorized Encoder, which splits the model into two Transformers applied in sequence: a Spatial Transformer that attends over the patches within each frame, followed by a Temporal Transformer that attends over the resulting per-frame representations. This factorization substantially reduces computation while largely preserving performance, albeit with a slight accuracy drop compared to full joint spatiotemporal attention.
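As a rough illustration of this two-stage design, the following PyTorch sketch encodes each frame with a spatial Transformer and then runs a temporal Transformer over pooled per-frame embeddings. The layer counts, dimensions, and the mean-pooling step are simplifying assumptions, not the exact ViViT configuration (which aggregates frames via a classification token).

```python
import torch
import torch.nn as nn

class FactorizedEncoder(nn.Module):
    """Minimal sketch of the Factorized Encoder idea: a spatial Transformer
    encodes each frame independently, then a temporal Transformer attends
    over the resulting per-frame embeddings. Layer counts, dimensions, and
    the mean-pooling step are illustrative assumptions."""

    def __init__(self, dim=192, heads=3, spatial_layers=4, temporal_layers=4):
        super().__init__()
        def make_layer():
            return nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.spatial = nn.TransformerEncoder(make_layer(), num_layers=spatial_layers)
        self.temporal = nn.TransformerEncoder(make_layer(), num_layers=temporal_layers)

    def forward(self, tokens):
        # tokens: (batch, frames, patches, dim), already patch-embedded
        b, t, p, d = tokens.shape
        x = tokens.reshape(b * t, p, d)      # treat each frame as its own sequence
        x = self.spatial(x)                  # spatial attention within a frame
        frame_repr = x.mean(dim=1)           # pool patches into one token per frame
        frame_repr = frame_repr.reshape(b, t, d)
        return self.temporal(frame_repr)     # temporal attention across frames


# Usage: 2 clips, 8 frames, 14 x 14 = 196 patches per frame, embedding dim 192.
video_tokens = torch.randn(2, 8, 196, 192)
out = FactorizedEncoder()(video_tokens)      # shape: (2, 8, 192)
```

The key saving is that no attention is ever computed over the full set of frames x patches tokens at once: the spatial stage sees at most one frame's patches, and the temporal stage sees only one token per frame.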
The Factorized Encoder is particularly suitable for larger inputs, longer videos, or settings where efficiency matters. It is designed for high-throughput inference, making it a good fit for performance-sensitive applications; the efficiency comes at only a slight cost in accuracy.
Beyond the Factorized Encoder, ViViT explores two further factorizations. Factorized Self-Attention keeps a single Transformer but factorizes attention within each layer: spatial self-attention over tokens from the same frame is followed by temporal self-attention over tokens at the same spatial location across frames. Factorized Dot-Product Attention factorizes the dot-product attention operation itself by splitting the attention heads, so that some heads attend spatially while others attend temporally, in parallel within the same layer; a minimal sketch of this head-split variant follows.
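The sketch below approximates the head-splitting idea with two separate PyTorch attention modules (one for the spatial heads, one for the temporal heads) fused by a linear layer; the dimensions and the fusion step are illustrative assumptions rather than a faithful reimplementation of ViViT's factorized dot-product attention.

```python
import torch
import torch.nn as nn

class FactorizedDotProductAttention(nn.Module):
    """Sketch of the head-split idea: half the heads attend over patches
    within a frame (spatial), the other half over the same patch position
    across frames (temporal). Dimensions and fusion are illustrative."""

    def __init__(self, dim=192, heads=6):
        super().__init__()
        assert heads % 2 == 0
        self.spatial = nn.MultiheadAttention(dim, heads // 2, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads // 2, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape

        # Spatial heads: one sequence of patches per frame.
        xs = x.reshape(b * t, p, d)
        s, _ = self.spatial(xs, xs, xs)
        s = s.reshape(b, t, p, d)

        # Temporal heads: one sequence of frames per patch position.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        tm, _ = self.temporal(xt, xt, xt)
        tm = tm.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Fuse the spatial and temporal head groups back into one token stream.
        return self.proj(torch.cat([s, tm], dim=-1))


attn = FactorizedDotProductAttention()
out = attn(torch.randn(2, 8, 196, 192))   # shape: (2, 8, 196, 192)
```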
Despite these efficiency gains, the challenge of managing resource usage (memory and compute) remains, because token counts multiply with the number of frames and patches. This is a problem the video understanding community continues to grapple with.
Transformer-based video models, such as InternVideo and Vamba, leverage self-attention to capture global spatiotemporal contexts effectively. They outperform earlier models based on 3D CNNs, which rely on hierarchical convolutional feature extraction but generally have limited receptive fields and less capacity to model long-range dependencies.
In summary, Transformer-based video understanding models are surpassing 3D CNNs in handling complex, long-duration videos and temporal reasoning tasks thanks to their global context modeling. However, they come with increased computational costs that are being actively optimized through efficient attention mechanisms and hybrid architectures. The future of video understanding looks promising, with continued research aimed at striking the right balance between computational efficiency and model performance.
Techniques such as Factorized Self-Attention and Factorized Dot-Product Attention, alongside the Factorized Encoder, show how Transformer-based models can improve video understanding efficiency while maintaining performance. That efficiency is what makes these designs practical for larger inputs, longer videos, and high-throughput, performance-sensitive applications.