1. Joint 2D-3D-semantic data for indoor scene understanding;Armeni,2017
2. ViViT: A video vision transformer;Arnab,2021
3. Unified graph structured models for video understanding;Arnab,2021
4. Attentional mixtures of soft prompt tuning for parameter-efficient multi-task knowledge sharing;Asai,2022
5. Bringing image scene structure to video via frame-clip consistency of object tokens;Avraham,2022