Affiliation:
1. School of Electronics and Information Technology, Sun Yat-Sen University, Guangdong, People's Republic of China
Abstract
In this article, we propose a novel Visual-Semantic Double Attention (VSDA) model for image captioning. VSDA consists of two parts: a modified visual attention model extracts sub-region image features, and a new SEmantic Attention (SEA) model distills semantic features. Traditional attribute-based models neglect the distinct importance of each attribute word and fuse all of them into recurrent neural networks, introducing abundant irrelevant semantic features. In contrast, at each timestep our model selects the attribute word most relevant to the current context. In other words, the real power of VSDA lies not only in leveraging semantic features but also in eliminating the influence of irrelevant attribute words, making the semantic guidance more precise. Furthermore, our approach addresses the problem that visual attention models cannot help generate non-visual words. Since visual and semantic features are complementary, our model leverages both to strengthen the generation of visual and non-visual words alike. Extensive experiments on the widely used MS COCO and Flickr30k datasets show that VSDA outperforms competing methods and achieves promising performance.
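As a rough illustration of the attribute-selection idea described in the abstract, the sketch below shows one way a semantic attention over predicted attribute-word embeddings could be conditioned on the decoder's hidden state at each timestep. It is not taken from the paper: the module name, dimensions, and the soft-selection (softmax-weighted) formulation are assumptions made for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticAttention(nn.Module):
    """Illustrative sketch (assumed design, not the paper's implementation):
    score each attribute-word embedding against the current decoder state and
    return a soft selection emphasizing the most relevant attribute word."""

    def __init__(self, embed_dim, hidden_dim, attn_dim=256):
        super().__init__()
        self.proj_h = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.proj_a = nn.Linear(embed_dim, attn_dim)   # project attribute embeddings
        self.score = nn.Linear(attn_dim, 1)            # scalar relevance score

    def forward(self, hidden, attr_embeds):
        # hidden:      (batch, hidden_dim)   decoder state at the current timestep
        # attr_embeds: (batch, K, embed_dim) embeddings of K predicted attribute words
        energy = torch.tanh(self.proj_h(hidden).unsqueeze(1) + self.proj_a(attr_embeds))
        weights = F.softmax(self.score(energy).squeeze(-1), dim=1)   # (batch, K)
        context = (weights.unsqueeze(-1) * attr_embeds).sum(dim=1)   # (batch, embed_dim)
        return context, weights
```

In such a setup, the returned semantic context would typically be concatenated with the visually attended image feature and fed to the recurrent decoder at each step, so that visual words draw on the visual branch while non-visual words can lean on the selected attribute word.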
Funder
Science and Technology Program of Guangzhou of China
Fundamental Research Funds for the Central Universities of China
National Natural Science Foundation of China
Natural Science Foundation of Guangdong Province
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture
Cited by
16 articles.