Learning to mask and permute visual tokens for vision transformer pre-training
Published in Computer Vision and Image Understanding, 2025
This paper presents a novel approach for pre-training Vision Transformers that learns strategies for masking and permuting visual tokens, improving the model's ability to capture relationships among image patches.
Recommended citation: L. Baraldi, R. Amoroso, M. Cornia, A. Pilzer, R. Cucchiara (2025). "Learning to mask and permute visual tokens for vision transformer pre-training." Computer Vision and Image Understanding, 104294.
Download Paper