Deep Multimodal Data Fusion

Published:2024-04-24 Issue:9 Volume:56 Page:1-36
ISSN:0360-0300
Container-title:ACM Computing Surveys
language:en
Short-container-title:ACM Comput. Surv.

Author:

Zhao Fei¹, Zhang Chengcui¹, Geng Baocheng¹

Affiliation:

1. The University of Alabama at Birmingham, Birmingham, AL, USA

Abstract

Multimodal Artificial Intelligence (Multimodal AI), in general, involves various types of data (e.g., images, texts, or data collected from different sensors), feature engineering (e.g., extraction, combination/fusion), and decision-making (e.g., majority vote). As architectures become more sophisticated, multimodal neural networks can integrate feature extraction, feature fusion, and decision-making processes into a single model. The boundaries between those processes are increasingly blurred. The conventional multimodal data fusion taxonomy (e.g., early/late fusion), based on the stage at which the fusion occurs, is no longer suitable for the modern deep learning era. Therefore, based on the mainstream techniques used, we propose a new fine-grained taxonomy grouping the state-of-the-art (SOTA) models into five classes: Encoder-Decoder methods, Attention Mechanism methods, Graph Neural Network methods, Generative Neural Network methods, and other Constraint-based methods. Most existing surveys on multimodal data fusion focus on only one specific task with a combination of two specific modalities. Unlike those, this survey covers a broader combination of modalities, including Vision + Language (e.g., videos, texts), Vision + Sensors (e.g., images, LiDAR), and so on, and their corresponding tasks (e.g., video captioning, object detection). Moreover, a comparison among these methods is provided, as well as challenges and future directions in this area.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3649447

References: 244 articles.

1. SpeakingFaces: A Large-Scale Multimodal Dataset of Voice Commands with Visual and Thermal Video Streams

2. Multimodal Video Sentiment Analysis Using Deep Learning Approaches, a Survey

3. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text;Akbari Hassan;Advances in Neural Information Processing Systems,2021

4. Data Fusion and IoT for Smart Ubiquitous Environments: A Survey

5. Chris Alberti, Jeffrey Ling, Michael Collins, and David Reitter. 2019. Fusion of detected objects in text for visual question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2131–2140.

Cited by 4 articles.

1. A Contemporary Survey on Multisource Information Fusion for Smart Sustainable Cities: Emerging Trends and Persistent Challenges;Information Fusion;2025-02

2. Multimodal fusion network for ICU patient outcome prediction;Neural Networks;2024-12

3. Complementary information mutual learning for multimodality medical image segmentation;Neural Networks;2024-09

4. Multimodal data integration for oncology in the era of deep neural networks: a review;Frontiers in Artificial Intelligence;2024-07-25