An Accuracy Enhanced Vision Language Grounding Method Fused with Gaze Intention
-
Published: 2023-12-14
Volume: 12
Issue: 24
Page: 5007
-
ISSN: 2079-9292
-
Container-title: Electronics
-
Language: en
-
Author:
Zhang Junqian 1,2; Tu Long 2,3; Zhang Yakun 2,3; Xie Liang 2,3; Xu Minpeng 1; Ming Dong 1; Yan Ye 1,2,3; Yin Erwei 1,2,3
Affiliation:
1. Academy of Medical Engineering and Translational Medicine, Tianjin University, Tianjin 300072, China
2. Tianjin Artificial Intelligence Innovation Center (TAIIC), Tianjin 300450, China
3. Defense Innovation Institute, Academy of Military Sciences (AMS), Beijing 100071, China
Abstract
Visual grounding aims to recognize and locate a target in an image according to human intention, offering a new mode of intelligent interaction for augmented reality (AR) and virtual reality (VR) devices. However, existing vision language grounding relies on the language modality alone and performs poorly on images containing multiple similar objects. Gaze interaction is an important interaction mode in AR/VR devices, and it offers a promising remedy for these inaccurate vision language grounding cases. Motivated by these observations, a vision language grounding framework fused with gaze intention is proposed. First, we collect manual gaze annotations using an AR device and, together with the proposed data augmentation methods, construct a novel multi-modal dataset, RefCOCOg-Gaze. Second, we design an attention-based multi-modal feature fusion model, providing a baseline framework for vision language grounding with gaze intention (VLG-Gaze). Through a series of carefully designed experiments, we analyze the proposed dataset and framework qualitatively and quantitatively. Compared with the state-of-the-art vision language grounding model, our proposed scheme improves accuracy by 5.3%, which indicates the significance of gaze fusion in multi-modal grounding tasks.
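The abstract does not specify the internals of the attention-based fusion model. As a hedged sketch only (not the authors' implementation; all names, dimensions, and the use of random projection weights are hypothetical), a single-head cross-attention step that fuses a gaze feature with a set of visual region features might look like this, with the gaze vector acting as the query and the regions as keys/values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(gaze_feat, region_feats, d_k=64, seed=0):
    """Fuse one gaze feature (d,) with N region features (N, d) via
    single-head cross-attention. Projection weights are random here,
    standing in for learned parameters (illustration only)."""
    rng = np.random.default_rng(seed)
    d = gaze_feat.shape[0]
    W_q = rng.standard_normal((d, d_k)) / np.sqrt(d)  # hypothetical learned projections
    W_k = rng.standard_normal((d, d_k)) / np.sqrt(d)
    W_v = rng.standard_normal((d, d_k)) / np.sqrt(d)
    q = gaze_feat @ W_q                    # query from gaze, shape (d_k,)
    K = region_feats @ W_k                 # keys from regions, shape (N, d_k)
    V = region_feats @ W_v                 # values from regions, shape (N, d_k)
    scores = (K @ q) / np.sqrt(d_k)        # scaled dot-product scores, shape (N,)
    attn = softmax(scores)                 # attention weights over the N regions
    return attn, attn @ V                  # weights and the fused feature (d_k,)

# Usage: 5 candidate regions, 128-dim features
gaze = np.ones(128)
regions = np.random.default_rng(1).standard_normal((5, 128))
weights, fused = cross_attention_fuse(gaze, regions)
```

In such a scheme, the attention weights express how strongly the gaze cue points at each candidate region, which is one plausible way gaze could disambiguate among multiple similar objects.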
Funder
National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering