Discriminative Action Snippet Propagation Network for Weakly Supervised Temporal Action Localization


Dang Yuanjie¹, Huang Chunxia¹, Chen Peng¹, Zhao Dongdong¹, Gao Nan¹, Liang Ronghua¹, Huan Ruohong¹


1. College of Computer Science and Technology, Zhejiang University of Technology, Hangzhou, China


Weakly supervised temporal action localization (WTAL) aims to classify and localize actions in untrimmed videos using only video-level labels. Recent studies have attempted to obtain more accurate temporal boundaries by exploiting latent action instances in ambiguous snippets or by propagating representative action features. However, empirically handcrafted extraction of ambiguous snippets and imprecise alignment in representative snippet propagation make it difficult for these methods to model action completeness. In this paper, we propose a Discriminative Action Snippet Propagation Network (DASP-Net) that accurately discovers ambiguous snippets in videos and propagates discriminative instance-level features throughout the video to improve action completeness. Specifically, we introduce a novel discriminative feature propagation module that captures global contextual attention and propagates the action concept across the whole video by perceiving the discriminative action snippets, together with instance information, from the same video. Simultaneously, we incorporate denoised pseudo-labels as supervision: during training, we correct controversial predictions based on the feature space distribution, thereby alleviating false detections caused by noisy background features. Furthermore, we design an ambiguous feature mining module, which maximizes the feature affinity information of action and background in ambiguous snippets to generate more accurate latent action and background snippets, and which learns more precise action instance boundaries through contrastive learning of action and background snippets. Extensive experiments show that DASP-Net achieves state-of-the-art results on the THUMOS14 and ActivityNet1.2 datasets.
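
To make the idea of contrastive learning over action and background snippets more concrete, the following is a minimal sketch of a generic InfoNCE-style loss. It is not the DASP-Net formulation, which the abstract does not specify: the function names `l2_normalize` and `info_nce`, the cosine-similarity scoring, the temperature value, and the choice of one positive action snippet against several background negatives are all assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss for one query snippet feature (hypothetical setup).

    query:     (D,)   feature of a latent action snippet
    positive:  (D,)   feature of a discriminative action snippet
    negatives: (N, D) features of background snippets
    """
    q = l2_normalize(query)
    pos = l2_normalize(positive)
    neg = l2_normalize(negatives)

    # Cosine similarities scaled by the temperature.
    pos_logit = np.dot(q, pos) / temperature
    neg_logits = neg @ q / temperature
    logits = np.concatenate([[pos_logit], neg_logits])

    # Numerically stable log-sum-exp over the positive and all negatives.
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())

    # Negative log-probability of picking the positive among all candidates.
    return float(log_denominator - pos_logit)
```

Under these assumptions, the loss is small when the query lies close to the action feature and far from the background features, and it grows as the query drifts toward the background, which is the behaviour a boundary-sharpening contrastive objective relies on.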


Association for Computing Machinery (ACM)


Computer Networks and Communications, Hardware and Architecture








