Multi-Modal Representation via Contrastive Learning with Attention Bottleneck Fusion and Attentive Statistics Features

Published:2023-10-07 Issue:10 Volume:25 Page:1421
ISSN:1099-4300
Container-title:Entropy
language:en
Short-container-title:Entropy

Author:

Guo Qinglang¹², Liao Yong², Li Zhe³, Liang Shenglin⁴

Affiliation:

1. School of Cyber Science and Technology, University of Science and Technology of China, Hefei 230027, China

2. National Engineering Research Center for Public Safety Risk Perception and Control by Big Data (RPP), CETC Academy of Electronics and Information Technology Group Co., Ltd., China Academy of Electronics and Information Technology, Beijing 100041, China

3. Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China

4. School of Telecommunications Engineering, Xidian University, Xi’an 710071, China

Abstract

The integration of information from multiple modalities is a highly active area of research. Previous techniques have predominantly focused on fusing shallow features or high-level representations generated by deep unimodal networks, which capture only a subset of the hierarchical relationships across modalities. Moreover, previous methods are often limited in their ability to exploit the fine-grained statistical features inherent in multimodal data. This paper proposes an approach that densely integrates representations by computing the means and standard deviations of image features. These global statistics afford a holistic perspective, capturing the overall distribution and trends in the data and thereby supporting better characterization of multimodal data. We also leverage a Transformer-based fusion encoder to capture global variations in multimodal features. To further enhance learning, we incorporate a contrastive loss function that encourages the discovery of information shared across modalities. To validate the approach, we conduct experiments on three widely used multimodal sentiment analysis datasets. The results show significant performance improvements over existing approaches.
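As an illustration of the statistics step mentioned in the abstract, the following is a minimal sketch of pooling a sequence of feature vectors into a mean and a standard deviation. It is not the authors' implementation. The attention weighting is an assumption suggested by the phrase "Attentive Statistics Features" in the title, and the names `softmax`, `attentive_stats`, `features`, and `scores` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_stats(features, scores, eps=1e-8):
    """Pool T feature vectors into [weighted mean, weighted std].

    features: list of T vectors, each a list of D floats (hypothetical input).
    scores:   list of T unnormalised attention scores (assumption: a real
              model would produce these with a learned scoring layer).
    """
    weights = softmax(scores)
    dim = len(features[0])
    # Attention-weighted mean per feature dimension.
    mean = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    # Weighted variance as E[x^2] - (E[x])^2, clamped at zero against rounding.
    var = [
        max(sum(w * f[d] ** 2 for w, f in zip(weights, features)) - mean[d] ** 2, 0.0)
        for d in range(dim)
    ]
    std = [math.sqrt(v + eps) for v in var]
    # Concatenate into a single 2*D-dimensional statistics vector.
    return mean + std


# Two 2-dimensional feature vectors with equal attention scores.
pooled = attentive_stats([[1.0, 2.0], [3.0, 2.0]], [0.0, 0.0])
```

With equal scores the weights are uniform, so the result reduces to the plain mean and standard deviation of each dimension. Unequal scores would shift both statistics toward the more heavily weighted vectors.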

Funder

National Key Research and Development Program of China

Publisher

MDPI AG

Subject

General Physics and Astronomy

Link

https://www.mdpi.com/1099-4300/25/10/1421/pdf
