1. Bottom-up and top-down attention for image captioning and visual question answering;Anderson,2018
2. IMRAM: Iterative matching with recurrent attention memory for cross-modal image-text retrieval;Chen,2020
3. Learning the best pooling strategy for visual semantic embedding;Chen,2021
4. A simple framework for contrastive learning of visual representations;Chen,2020
5. UNITER: Universal image-text representation learning;Chen,2020