Affiliation:
1. Tianjin University, Tianjin, China
2. Guangxi University of Finance and Economics, Nanning, China
Abstract
As a representative type of user-generated content (UGC) on social platforms, micro-videos have become popular in daily life. Although micro-videos naturally exhibit multimodal features that are rich enough to support representation learning, the complex correlations across modalities make valuable information difficult to integrate. In this paper, we introduce a multimodal attentive representation network (MARNET) that learns complete and robust representations for micro-video multi-label classification. To address the common missing-modality issue, we present a multimodal information aggregation module to integrate information across modalities, in which latent common representations are obtained by modeling complementarity and consistency over visual-centered modality groupings rather than single modalities. To address label correlation, we design an attentive graph neural network module that adaptively learns the correlation matrix and the label representations for better compatibility with the training data. In addition, a cross-modal multi-head attention module makes the learned common representations label-aware for multi-label classification. Experiments on two micro-video datasets demonstrate the superior performance of MARNET compared with state-of-the-art methods.
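The idea of making common representations label-aware through multi-head attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `label_aware_attention`, the toy inputs, and the choice to omit learned projections are all assumptions made for illustration. Each label embedding acts as a query that attends over a set of modality-group features, and each head operates on its own slice of the feature dimension.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def label_aware_attention(label_embeddings, group_features, num_heads=2):
    """Hypothetical helper: each label embedding (query) attends over the
    modality-group features (keys and values), one head per slice of the
    feature dimension. Learned projections are omitted for brevity."""
    dim = len(label_embeddings[0])
    assert dim % num_heads == 0, "feature dimension must split evenly across heads"
    head_dim = dim // num_heads
    outputs = []
    for query in label_embeddings:
        attended = []
        for h in range(num_heads):
            lo, hi = h * head_dim, (h + 1) * head_dim
            q = query[lo:hi]
            # Scaled dot-product score between this label and every group.
            scores = [
                sum(a * b for a, b in zip(q, feat[lo:hi])) / math.sqrt(head_dim)
                for feat in group_features
            ]
            weights = softmax(scores)
            # Weighted sum of the group features within this head's slice.
            for j in range(lo, hi):
                attended.append(
                    sum(w * feat[j] for w, feat in zip(weights, group_features))
                )
        outputs.append(attended)
    return outputs

# Toy inputs (made up): 3 labels, 2 modality groups, 4-dimensional features.
labels = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [0.5, 0.5, 0.5, 0.5]]
groups = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
label_aware = label_aware_attention(labels, groups)
```

The result holds one feature vector per label, each a convex combination of the group features weighted by how well that label matches each group. A real model would add learned query, key, and value projections and would derive the label embeddings from a graph module rather than fixing them by hand.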
Funder
National Natural Science Foundation of China
Guangxi Key Laboratory of Big Data in Finance and Economics
Doctor Start-up Funds
Publisher
Association for Computing Machinery (ACM)