Learning to mask and permute visual tokens for vision transformer pre-training

Published in Computer Vision and Image Understanding, 2025

This paper presents a novel approach to pre-training Vision Transformers that learns strategies for masking and permuting visual tokens, improving the model's ability to capture relationships among image regions.

Recommended citation: L. Baraldi, R. Amoroso, M. Cornia, A. Pilzer, R. Cucchiara (2025). "Learning to mask and permute visual tokens for vision transformer pre-training." Computer Vision and Image Understanding, 104294.