Multimodal Attentive Representation Learning for Micro-video Multi-label Classification

Published:2024-03-08 Issue:6 Volume:20 Page:1-23
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Jing Peiguang¹, Liu Xianyi¹, Zhang Lijuan¹, Li Yun², Liu Yu¹, Su Yuting¹

Affiliation:

1. Tianjin University, Tianjin, China

2. Guangxi University of Finance and Economics, Nanning, China

Abstract

As one of the representative types of user-generated content (UGC) on social platforms, micro-videos have become popular in daily life. Although micro-videos naturally exhibit multimodal features that are rich enough to support representation learning, the complex correlations across modalities make valuable information difficult to integrate. In this paper, we introduce a multimodal attentive representation network (MARNET) to learn complete and robust representations for micro-video multi-label classification. To address the common missing-modality issue, we present a multimodal information aggregation mechanism module to integrate multimodal information, where latent common representations are obtained by modeling complementarity and consistency over visual-centered modality groupings instead of single modalities. For the label correlation issue, we design an attentive graph neural network module that adaptively learns the correlation matrix and the representations of labels for better compatibility with the training data. In addition, a cross-modal multi-head attention module is developed to make the learned common representations label-aware for multi-label classification. Experiments conducted on two micro-video datasets demonstrate the superior performance of MARNET compared with state-of-the-art methods.
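
The abstract does not give implementation details, so the following is only a rough sketch of the general idea behind label-aware attention, not MARNET itself. It assumes each label embedding acts as a query over a small set of modality feature vectors, uses a single attention head with no learned projections (the paper describes a multi-head module), and scores each label independently with a sigmoid. All names (`label_aware_attention`, `label_scores`, the toy vectors) are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def label_aware_attention(label_embs, modality_feats):
    """Scaled dot-product attention with labels as queries.

    Assumption: modality features serve as both keys and values,
    and there are no learned projection matrices.
    Returns one context vector and one weight vector per label.
    """
    dim = len(modality_feats[0])
    scale = math.sqrt(dim)
    contexts, weights = [], []
    for query in label_embs:
        scores = [dot(query, key) / scale for key in modality_feats]
        w = softmax(scores)
        # Weighted sum of the modality features for this label.
        ctx = [
            sum(w[i] * modality_feats[i][j] for i in range(len(modality_feats)))
            for j in range(dim)
        ]
        contexts.append(ctx)
        weights.append(w)
    return contexts, weights


def label_scores(label_embs, contexts):
    """Independent per-label probabilities (multi-label, so sigmoid, not softmax)."""
    return [1.0 / (1.0 + math.exp(-dot(q, c))) for q, c in zip(label_embs, contexts)]


# Toy inputs (hypothetical): three modality vectors and two label embeddings.
modality_feats = [
    [1.0, 0.0, 0.0, 0.0],  # stand-in for a visual feature
    [0.0, 1.0, 0.0, 0.0],  # stand-in for an acoustic feature
    [0.0, 0.0, 1.0, 0.0],  # stand-in for a textual feature
]
label_embs = [
    [2.0, 0.0, 0.0, 0.0],  # label aligned with the first modality
    [0.0, 0.0, 2.0, 0.0],  # label aligned with the third modality
]

contexts, weights = label_aware_attention(label_embs, modality_feats)
probs = label_scores(label_embs, contexts)
```

In this toy setup each label places its largest attention weight on the modality vector it is aligned with, which is the sense in which the resulting representation is "label-aware". A fuller version would add learned query, key, and value projections and several heads.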

Funder

National Natural Science Foundation of China

Guangxi Key Laboratory of Big Data in Finance and Economics

Doctor Start-up Funds

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3643888
