Researchers Show Vision Transformers Don't Need Trained Registers to Curb Attention Spikes
In the realm of artificial intelligence, a new study sheds light on an intriguing problem that has been plaguing Vision Transformers (ViTs). Researchers have discovered that these models, which have become a popular choice for various vision tasks, often focus on background noise instead of the main subjects in images. This issue arises due to the emergence of high-norm tokens, leading to noisy attention maps, as identified in previous research.
The root cause lies in models like CLIP and DINOv2, where a sparse set of neurons concentrates high-norm activations on a handful of outlier tokens. These outliers produce irregular attention patterns that degrade downstream visual processing, pulling attention toward background noise rather than the main subjects.
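To make the phenomenon concrete, here is a minimal sketch (in PyTorch) of how one might flag high-norm outlier tokens in a ViT's intermediate features. The median-plus-MAD threshold is an illustrative heuristic, not the criterion used in the paper.

```python
import torch

def find_outlier_tokens(patch_tokens: torch.Tensor, k: float = 3.0) -> torch.Tensor:
    """Flag patch tokens whose L2 norm is far above the typical token norm.

    patch_tokens: (batch, num_patches, dim) activations from an intermediate ViT layer.
    Returns a boolean mask of shape (batch, num_patches) marking high-norm outliers.
    """
    norms = patch_tokens.norm(dim=-1)                                  # (batch, num_patches)
    median = norms.median(dim=-1, keepdim=True).values                 # robust per-image center
    mad = (norms - median).abs().median(dim=-1, keepdim=True).values   # robust spread
    return norms > median + k * mad                                    # unusually large norms


# Example: 4 images, 196 patches, 768-dim features, with one simulated outlier token each
feats = torch.randn(4, 196, 768)
feats[:, 0] *= 20
mask = find_outlier_tokens(feats)
print(mask.sum(dim=-1))   # roughly one flagged token per image
```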
The established remedy has been trained registers: the model is retrained from scratch with additional learned register tokens, which absorb the high-norm activations and redirect them away from the patch tokens used in the main processing pipeline.
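For orientation, the sketch below shows the trained-register idea in a toy encoder: learned register tokens are prepended to the patch sequence, participate in attention, and are discarded at the output. It is an illustrative stand-in, not the implementation from the original registers work.

```python
import torch
import torch.nn as nn

class ViTWithRegisters(nn.Module):
    """Toy encoder that prepends learned register tokens to the patch sequence."""

    def __init__(self, dim: int = 768, num_registers: int = 4, depth: int = 2):
        super().__init__()
        # Extra learned tokens that attention can use as "scratch space"
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        regs = self.registers.expand(b, -1, -1)
        x = torch.cat([regs, patch_tokens], dim=1)    # registers + patches
        x = self.encoder(x)
        return x[:, self.registers.shape[1]:]         # drop registers, keep patch outputs


model = ViTWithRegisters()
out = model(torch.randn(2, 196, 768))
print(out.shape)   # torch.Size([2, 196, 768])
```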
However, the researchers in the new paper, "Vision Transformers Don't Need Trained Registers," propose a novel training-free approach called test-time registers. This method involves shifting the high-norm activations from certain neurons into an additional untrained token at test time. This approach mimics the effect of register tokens on a model already trained without them, effectively enhancing performance and interpretability without needing retraining.
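A rough sketch of how a test-time register could be wired up is shown below: one extra zero-initialized token is appended at inference, and a forward hook moves the activations of a set of previously identified "register neurons" off the patch tokens and onto that token. The hooked module and neuron indices are hypothetical placeholders, and the paper's actual procedure for locating and shifting these activations differs in its details.

```python
import torch
import torch.nn as nn

def add_test_time_register(patch_tokens: torch.Tensor) -> torch.Tensor:
    """Append a single zero-initialized register token to the sequence."""
    b, _, d = patch_tokens.shape
    register = torch.zeros(b, 1, d, device=patch_tokens.device, dtype=patch_tokens.dtype)
    return torch.cat([patch_tokens, register], dim=1)

def make_shift_hook(register_neurons: torch.Tensor):
    """Forward hook: route the chosen neurons' activations from patch tokens to the register."""
    def hook(module, inputs, output):
        out = output.clone()
        # Accumulate the selected neurons' activations into the last (register) token...
        out[:, -1, register_neurons] += out[:, :-1, register_neurons].sum(dim=1)
        # ...and remove them from the patch tokens themselves.
        out[:, :-1, register_neurons] = 0.0
        return out
    return hook

# Usage sketch with a stand-in MLP block; in practice the hook would sit on the
# specific ViT MLP layer where the outlier-causing neurons were identified.
mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
register_neurons = torch.tensor([12, 407, 655])   # hypothetical neuron indices
mlp.register_forward_hook(make_shift_hook(register_neurons))

tokens = add_test_time_register(torch.randn(2, 196, 768))
out = mlp(tokens)
print(out.shape)   # torch.Size([2, 197, 768])
```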
By using either trained or test-time registers, researchers aim to improve the focus of ViTs on relevant image content, enhancing their overall performance in visual tasks. The authors of the paper hypothesize that ViT models learn to identify low-information patches and recycle them for temporary global information storage.
Notably, these outliers only appear when sufficiently large models are trained. Additional experiments show that removing the high-norm tokens improves performance and reduces attention spikes on uninformative background patches.
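As an illustration of that kind of ablation, the snippet below average-pools patch features while excluding flagged high-norm tokens. It is a simplified stand-in for the paper's experiments, with the outlier mask assumed to come from a detector like the one sketched earlier.

```python
import torch

def pool_without_outliers(patch_tokens: torch.Tensor, outlier_mask: torch.Tensor) -> torch.Tensor:
    """Average-pool patch features while excluding flagged high-norm tokens.

    patch_tokens: (batch, num_patches, dim); outlier_mask: (batch, num_patches) bool.
    """
    keep = (~outlier_mask).unsqueeze(-1).float()      # 1 for tokens to keep, 0 for outliers
    summed = (patch_tokens * keep).sum(dim=1)
    count = keep.sum(dim=1).clamp(min=1.0)            # avoid division by zero
    return summed / count                             # (batch, dim) pooled feature


feats = torch.randn(4, 196, 768)
mask = torch.zeros(4, 196, dtype=torch.bool)
mask[:, 0] = True                                     # pretend token 0 is an outlier
pooled = pool_without_outliers(feats, mask)
print(pooled.shape)   # torch.Size([4, 768])
```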
This recycling of patch embeddings has undesirable side effects: the original patch details are lost, attention maps become spiky, and the features become incompatible with object discovery methods. By redirecting the recycling into dedicated register tokens, researchers aim to restore the models' ability to discover objects effectively.
High L2 norms in a network's activations can signal problems such as overfitting, numerical instability, or poor generalization, so it is striking that such a simple fix improves both attention maps and downstream performance. It also suggests there are likely other undiscovered model artifacts worth investigating.
The new paper offers a promising solution to a problem that has been hindering the performance of ViTs. By understanding and addressing these issues, researchers can continue to improve the performance of these models, potentially leading to even better results in various vision tasks.