HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group-Activity Scene Graph Generation in Videos
Authors:
Naga Venkata Sai Raviteja Chappa 1, Pha Nguyen 1, Thi Hoang Ngan Le 1, Page Daniel Dobbs 2, Khoa Luu 1
Affiliation:
1. Department of EECS, University of Arkansas, Fayetteville, AR 72701, USA; 2. Department of Health, Human Performance and Recreation, University of Arkansas, Fayetteville, AR 72701, USA
Abstract
Group-activity scene graph (GASG) generation is a challenging task in computer vision that aims to anticipate and describe relationships between subjects and objects in video sequences. Traditional video scene graph generation (VidSGG) methods focus on retrospective analysis, limiting their predictive capabilities. To enrich scene-understanding capabilities, we introduce a GASG dataset that extends the JRDB dataset with nuanced annotations covering appearance, interaction, position, relationship, and situation attributes. We also introduce a Hierarchical Attention-Flow (HAtt-Flow) mechanism, rooted in flow network theory, to enhance GASG performance. Flow-attention incorporates flow conservation principles, fostering competition for sources and allocation for sinks, which effectively prevents the generation of trivial attention. Our approach offers a unique perspective on attention mechanisms in which conventional “values” and “keys” are transformed into sources and sinks, respectively, creating a novel framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of the proposed flow-attention mechanism. This work represents a significant advancement in predictive video scene understanding, providing valuable insights and techniques for applications that require real-time relationship prediction in video data.
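To make the competition/allocation idea in the abstract concrete, the following PyTorch sketch implements a Flowformer-style flow-attention with source competition and sink allocation under flow conservation. It is a minimal illustration, not the authors' exact HAtt-Flow formulation: the function name flow_attention, the sigmoid feature map, the epsilon value, and the tensor shapes are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def flow_attention(q, k, v, eps=1e-6):
    # Illustrative flow-attention sketch (Flowformer-style), q/k/v: (batch, heads, tokens, dim).
    # Non-negative feature map so entries can be read as flow capacities.
    phi_q, phi_k = torch.sigmoid(q), torch.sigmoid(k)

    # Incoming flow of each sink (query token) and outgoing flow of each source (key token).
    incoming = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-1, -2)   # (b, h, n, 1)
    outgoing = phi_k @ phi_q.sum(dim=-2, keepdim=True).transpose(-1, -2)   # (b, h, m, 1)

    # Conserved flows: normalize so every token's default capacity is 1.
    conserved_in = phi_q @ (phi_k / (outgoing + eps)).sum(dim=-2, keepdim=True).transpose(-1, -2)
    conserved_out = phi_k @ (phi_q / (incoming + eps)).sum(dim=-2, keepdim=True).transpose(-1, -2)

    # Competition for sources: a softmax over conserved outgoing flow re-weights the values;
    # rescaling by the number of sources keeps magnitudes comparable after the softmax.
    competed_v = F.softmax(conserved_out, dim=-2) * v * v.shape[-2]

    # Aggregation normalized by incoming flow (conservation on the sink side, linear complexity).
    aggregated = (phi_q / (incoming + eps)) @ (phi_k.transpose(-1, -2) @ competed_v)

    # Allocation for sinks: a bounded gate discourages trivial, uniform attention.
    return torch.sigmoid(conserved_in) * aggregated


# Toy usage: 2 samples, 4 heads, 16 tokens, 32-dim heads.
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 32)
out = flow_attention(q, k, v)   # (2, 4, 16, 32)
```

The key design point the sketch tries to show is that the softmax (competition) and sigmoid (allocation) act on scalar flow totals per token rather than on a full attention matrix, which is how conservation discourages degenerate, uniform attention.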