1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6836–6846). IEEE.
2. Baldrati, A., Bertini, M., Uricchio, T., & Del Bimbo, A. (2022). Effective conditioned and composed image retrieval combining CLIP-based features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 21466–21474). IEEE Computer Society.
3. Bao, H., et al. (2022). VLMo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems.
4. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In ICML (Vol. 2, p. 4).
5. Bruce, X., et al. (2022). MMNet: A model-based multimodal network for human action recognition in RGB-D videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.