Cross-modal learning with multi-modal model for video action recognition based on adaptive weight training

Published:2024-03-27 Issue:1 Volume:36
ISSN:0954-0091
Container-title:Connection Science
language:en
Short-container-title:Connection Science

Author:

Zhou Qingguo¹, Hou Yufeng¹, Zhou Rui¹, Li Yan¹, Wang JinQiang¹, Wu Zhen¹, Li Hung-Wei², Weng Tien-Hsiung²

Affiliation:

1. School of Information Science and Engineering, Lanzhou University, Lanzhou, People's Republic of China

2. Department of Computer Science and Information Engineering, Providence University, Taichung City, Taiwan

Publisher:

Informa UK Limited

Link:

https://www.tandfonline.com/doi/pdf/10.1080/09540091.2024.2325474

References: 60 articles.

1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., & Schmid, C. (2021). ViViT: A Video Vision Transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6836–6846). IEEE.

2. Baldrati, A., Bertini, M., Uricchio, T., & Del Bimbo, A. (2022). Effective conditioned and composed image retrieval combining CLIP-based features. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 21466–21474). IEEE Computer Society.

3. Bao, H., et al. (2022). VLMo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems.

4. Bertasius, G., Wang, H., & Torresani, L. (2021). Is space-time attention all you need for video understanding? In ICML (Vol. 2, p. 4). ICML.

5. Yu, B. X. B., et al. (2022). MMNet: A model-based multimodal network for human action recognition in RGB-D videos. IEEE Transactions on Pattern Analysis and Machine Intelligence.