1. Arevalo, J., Solorio, T., Montes-y-Gómez, M., & González, F. (2017). Gated Multimodal Units for Information Fusion. In Proceedings of the international conference on learning representations.
2. Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: The Long-Document Transformer. arXiv preprint arXiv:2004.05150.
3. Chang et al. (2021). Event-centric multi-modal fusion method for dense video captioning. Neural Networks.
4. Cui, P., & Hu, L. (2021). Sliding Selector Network with Dynamic Memory for Extractive Summarization of Long Documents. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: human language technologies (pp. 5881–5891).
5. Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the conference of the North American chapter of the association for computational linguistics: human language technologies, vol. 1 (pp. 4171–4186).