1. A survey on sentiment analysis and opinion mining for social multimedia
2. Chen, X. Li, Q. Jin, S. Zhang, and Y. Qin, "Video emotion recognition in the wild based on the fusion of multimodal features," in Proc. 18th ACM Int. Conf. Multimodal Interact., Oct. 2016, pp. 494–500.
3. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and R. Salakhutdinov, "Multimodal transformer for unaligned multimodal language sequences," in Proc. 57th Annu. Meeting Assoc. Comput. Linguistics, 2019, p. 6558.
4. L. Yuan, "VATT: Transformers for multimodal self-supervised learning from raw video, audio, and text," in Proc. Adv. Neural Inf. Process. Syst., 2021.
5. Dai, Z. Liu, T. Yu, and P. Fung, "Modality-transferable emotion embeddings for low-resource multimodal emotion recognition," 2020, arXiv:2009.09629.