Temporal-Spatial Redundancy Reduction in Video Sequences: A Motion-Based Entropy-Driven Attention Approach
Published: 2025-03-21
Volume: 10
Issue: 4
Page: 192
ISSN: 2313-7673
Container-title: Biomimetics
Language: en
Author:
Yuan Ye 1, Wu Baolei 1, Mo Zifan 2, Liu Weiye 1, Hong Ji 1, Li Zongdao 1, Liu Jian 1, Liu Na 1
Affiliation:
1. Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai 200093, China
2. School of Automation and Electronic Information, Xiangtan University, Xiangtan 411105, China
Abstract
Redundant video frames waste substantial computational resources in video-understanding tasks. Frame sampling is a crucial technique for improving resource utilization. However, existing sampling strategies typically adopt fixed-frame selection, which lacks flexibility in handling different action categories. In this paper, inspired by the neural mechanisms of the human visual pathway, we propose an effective and interpretable frame-sampling method, Entropy-Guided Motion Enhancement Sampling (EGMESampler), which removes redundant spatio-temporal information from videos. Our fundamental motivation is that motion information is an important signal for adaptively selecting frames from videos. EGMESampler therefore first performs motion modeling to separate motion information from irrelevant background. We then design an entropy-based dynamic sampling strategy driven by this motion information, ensuring that the sampled frames cover the important content of the video. Finally, we apply attention operations to the motion information and the sampled frames to enhance their motion expression and remove redundant spatial background information. EGMESampler can be embedded in existing video-processing algorithms, and experiments on five benchmark datasets demonstrate its effectiveness compared with previous fixed-sampling strategies, as well as its generalizability across different video models and datasets.
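The abstract describes a three-stage pipeline: motion modeling, entropy-guided adaptive sampling, and attention-based enhancement. The following minimal Python sketch illustrates only the general idea behind the middle stage, adaptive frame selection driven by a motion signal; the function name `motion_guided_sample`, the frame-difference motion proxy, and the cumulative-distribution quantile sampling are illustrative assumptions, not the authors' EGMESampler implementation.

```python
import numpy as np

def motion_guided_sample(frames: np.ndarray, num_samples: int) -> np.ndarray:
    """Pick frame indices adaptively where motion concentrates.

    frames: (T, H, W, C) array of decoded video frames.
    Returns num_samples indices into frames. Illustrative only; this is
    not the paper's EGMESampler, just the motion-driven sampling idea.
    """
    # Crude motion proxy: mean absolute difference between consecutive frames.
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0)).mean(axis=(1, 2, 3))
    diffs = np.concatenate([[diffs[0]], diffs])  # pad so every frame gets a score

    # Normalize motion scores into a probability distribution over frames.
    p = diffs / (diffs.sum() + 1e-8)

    # Cumulative motion distribution; sampling its evenly spaced quantiles
    # places more frames where motion accumulates fastest and fewer where
    # consecutive frames are nearly redundant.
    cdf = np.cumsum(p)
    targets = (np.arange(num_samples) + 0.5) / num_samples
    return np.clip(np.searchsorted(cdf, targets), 0, len(frames) - 1)

# Example: sample 8 frames from a 120-frame stand-in clip.
clip = np.random.rand(120, 32, 32, 3).astype(np.float32)
print(motion_guided_sample(clip, num_samples=8))
```

In the spirit of the abstract's entropy-based dynamic strategy, the Shannon entropy of the motion distribution p could further modulate how strongly sampling deviates from uniform (near-uniform motion implying near-uniform sampling), though that exact formulation is likewise an assumption here.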
Funder
National Natural Science Foundation of China; Pujiang Talents Plan of Shanghai; Artificial Intelligence Innovation and Development Special Fund of Shanghai